Bug #46847 » sess.log

Session log with annotations. - Frank Schilder, 08/11/2020 12:04 PM

 
The transcript below shows a test on a cluster that is rebalancing onto newly
added OSDs. The test shows the effect of

- stopping+starting a new OSD (osd-phy6, ID 289),
- stopping+starting an old OSD (osd-phy9, ID 74).

In each test, we wait for peering to complete before taking "ceph status".
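A way to poll for this (a sketch, not commands taken from the session):

    # wait until no PG reports a peering state in the summary
    while ceph pg stat | grep -q peering; do sleep 5; done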

We set noout and norebalance to avoid disturbances during the test and wait for
recovery to cease.
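A sketch of these steps (not commands taken from this session; the polling loop
is an illustration only):

    ceph osd set noout
    ceph osd set norebalance
    # poll until "ceph status" no longer reports a recovery I/O line
    while ceph status | grep -q 'recovery:'; do sleep 30; done

Status at this point: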

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_WARN
noout,norebalance flag(s) set
8235970/1498751416 objects misplaced (0.550%)
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 8235970/1498751416 objects misplaced (0.550%)
3163 active+clean
40 active+remapped+backfill_wait
6 active+remapped+backfilling
5 active+clean+scrubbing+deep
1 active+clean+snaptrim
io:
client: 74 MiB/s rd, 42 MiB/s wr, 1.19 kop/s rd, 889 op/s wr
# docker stop osd-phy6
osd-phy6

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_WARN
noout,norebalance flag(s) set
1 osds down
8342724/1498792326 objects misplaced (0.557%)
Degraded data redundancy: 5717609/1498792326 objects degraded (0.381%), 74 pgs degraded
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 271 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 5717609/1498792326 objects degraded (0.381%)
8342724/1498792326 objects misplaced (0.557%)
3089 active+clean
74 active+undersized+degraded
31 active+remapped+backfill_wait
11 active+remapped+backfilling
5 active+clean+scrubbing+deep
4 active+clean+remapped+snaptrim
1 active+clean+scrubbing
io:
client: 69 MiB/s rd, 45 MiB/s wr, 1.28 kop/s rd, 838 op/s wr
# ceph health detail
HEALTH_WARN noout,norebalance flag(s) set; 1 osds down; 8342692/1498794289 objects misplaced (0.557%); Degraded data redundancy: 5717610/1498794289 objects degraded (0.381%), 74 pgs degraded, 74 pgs undersized; 1 pools nearfull
OSDMAP_FLAGS noout,norebalance flag(s) set
OSD_DOWN 1 osds down
osd.289 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
OBJECT_MISPLACED 8342692/1498794289 objects misplaced (0.557%)
PG_DEGRADED Degraded data redundancy: 5717610/1498794289 objects degraded (0.381%), 74 pgs degraded, 74 pgs undersized
pg 11.2 is stuck undersized for 70.197385, current state active+undersized+degraded, last acting [87,292,2147483647,296,229,168,0,263]
pg 11.16 is stuck undersized for 70.178478, current state active+undersized+degraded, last acting [2147483647,181,60,233,237,294,293,292]
pg 11.1f is stuck undersized for 70.190040, current state active+undersized+degraded, last acting [230,238,182,292,84,2147483647,86,239]
pg 11.39 is stuck undersized for 70.193683, current state active+undersized+degraded, last acting [158,148,293,73,168,2,2147483647,236]
pg 11.3b is stuck undersized for 70.200823, current state active+undersized+degraded, last acting [2147483647,85,229,145,170,172,0,230]
pg 11.47 is stuck undersized for 70.196419, current state active+undersized+degraded, last acting [3,296,2147483647,0,233,84,182,238]
pg 11.59 is stuck undersized for 70.190002, current state active+undersized+degraded, last acting [2147483647,76,73,235,156,263,234,172]
pg 11.63 is stuck undersized for 70.160846, current state active+undersized+degraded, last acting [0,146,1,156,2147483647,228,172,238]
pg 11.66 is stuck undersized for 70.086237, current state active+undersized+degraded, last acting [291,159,296,233,2147483647,293,170,145]
pg 11.6d is stuck undersized for 70.210387, current state active+undersized+degraded, last acting [84,235,73,290,295,2147483647,0,183]
pg 11.7b is stuck undersized for 70.202578, current state active+undersized+degraded, last acting [2147483647,146,293,294,296,181,0,263]
pg 11.7d is stuck undersized for 70.178488, current state active+undersized+degraded, last acting [294,2,263,2147483647,170,237,292,235]
pg 11.7f is active+undersized+degraded, acting [148,232,2147483647,230,87,236,168,72]
pg 11.146 is stuck undersized for 70.197744, current state active+undersized+degraded, last acting [235,183,156,295,2147483647,294,146,260]
pg 11.155 is stuck undersized for 70.203091, current state active+undersized+degraded, last acting [73,72,170,259,260,63,84,2147483647]
pg 11.15d is stuck undersized for 70.135909, current state active+undersized+degraded, last acting [259,182,0,63,234,294,233,2147483647]
pg 11.171 is stuck undersized for 70.209391, current state active+undersized+degraded, last acting [170,168,232,72,231,172,2147483647,237]
pg 11.176 is stuck undersized for 70.202583, current state active+undersized+degraded, last acting [146,237,181,2147483647,294,72,236,293]
pg 11.177 is stuck undersized for 70.192564, current state active+undersized+degraded, last acting [156,146,236,235,63,2147483647,3,291]
pg 11.179 is stuck undersized for 70.190284, current state active+undersized+degraded, last acting [87,156,233,86,2147483647,172,259,158]
pg 11.17e is stuck undersized for 70.188938, current state active+undersized+degraded, last acting [3,231,290,260,76,183,2147483647,293]
pg 11.181 is stuck undersized for 70.175985, current state active+undersized+degraded, last acting [2147483647,290,239,148,1,228,145,2]
pg 11.188 is stuck undersized for 70.208638, current state active+undersized+degraded, last acting [2147483647,170,237,172,291,168,232,85]
pg 11.18b is stuck undersized for 70.186336, current state active+undersized+degraded, last acting [233,148,228,87,2147483647,182,235,0]
pg 11.18f is stuck undersized for 70.197416, current state active+undersized+degraded, last acting [73,237,238,2147483647,156,0,292,182]
pg 11.19d is stuck undersized for 70.083071, current state active+undersized+degraded, last acting [291,172,146,145,238,2147483647,296,231]
pg 11.1a5 is stuck undersized for 70.184859, current state active+undersized+degraded, last acting [293,145,2,230,159,239,85,2147483647]
pg 11.1a6 is stuck undersized for 70.209851, current state active+undersized+degraded, last acting [229,145,158,296,0,292,2147483647,239]
pg 11.1ac is stuck undersized for 70.192130, current state active+undersized+degraded, last acting [234,84,2147483647,86,239,183,294,232]
pg 11.1b0 is stuck undersized for 70.180993, current state active+undersized+degraded, last acting [168,293,290,2,2147483647,159,296,73]
pg 11.1b1 is stuck undersized for 70.175329, current state active+undersized+degraded, last acting [172,259,168,260,73,2147483647,146,263]
pg 11.1b7 is stuck undersized for 70.208713, current state active+undersized+degraded, last acting [263,172,2147483647,259,0,87,145,228]
pg 11.1c1 is stuck undersized for 70.170314, current state active+undersized+degraded, last acting [182,148,263,293,2,2147483647,228,294]
pg 11.1c3 is stuck undersized for 70.192088, current state active+undersized+degraded, last acting [234,290,63,239,85,156,76,2147483647]
pg 11.1c7 is stuck undersized for 70.192194, current state active+undersized+degraded, last acting [1,2147483647,263,232,86,234,84,172]
pg 11.1dd is stuck undersized for 70.183525, current state active+undersized+degraded, last acting [293,172,295,156,170,237,2147483647,86]
pg 11.1de is stuck undersized for 69.972952, current state active+undersized+degraded, last acting [296,293,76,63,231,146,2147483647,168]
pg 11.1e8 is stuck undersized for 70.172003, current state active+undersized+degraded, last acting [172,3,290,229,236,156,2147483647,228]
pg 11.1f2 is stuck undersized for 70.196870, current state active+undersized+degraded, last acting [234,0,159,2147483647,232,73,290,181]
pg 11.1f5 is stuck undersized for 70.190841, current state active+undersized+degraded, last acting [238,234,73,2147483647,158,291,172,168]
pg 11.1fc is stuck undersized for 70.181133, current state active+undersized+degraded, last acting [172,86,85,230,182,2147483647,238,233]
pg 11.1fd is stuck undersized for 70.221124, current state active+undersized+degraded, last acting [72,145,237,293,2147483647,60,87,172]
pg 11.203 is stuck undersized for 70.193700, current state active+undersized+degraded, last acting [2147483647,235,168,60,87,63,295,230]
pg 11.20d is stuck undersized for 70.197909, current state active+undersized+degraded, last acting [236,172,73,182,228,168,2147483647,293]
pg 11.20f is stuck undersized for 70.196571, current state active+undersized+degraded, last acting [85,84,76,60,238,233,159,2147483647]
pg 11.211 is stuck undersized for 70.197522, current state active+undersized+degraded, last acting [156,2147483647,170,234,0,238,1,231]
pg 11.212 is stuck undersized for 70.201683, current state active+undersized+degraded, last acting [148,2147483647,85,182,84,232,86,230]
pg 11.21e is stuck undersized for 70.202044, current state active+undersized+degraded, last acting [146,156,159,2147483647,230,238,239,2]
pg 11.224 is stuck undersized for 70.095494, current state active+undersized+degraded, last acting [291,148,237,2147483647,170,1,156,233]
pg 11.22d is stuck undersized for 70.195735, current state active+undersized+degraded, last acting [3,168,296,158,292,236,0,2147483647]
pg 11.22f is stuck undersized for 70.192480, current state active+undersized+degraded, last acting [1,2147483647,292,60,296,231,259,72]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

# ceph pg 11.2 query | jq ".acting,.up,.recovery_state"
[
87,
292,
2147483647,
296,
229,
168,
0,
263
]
[
87,
292,
2147483647,
296,
229,
168,
0,
263
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:35:49.109129",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:35:48.137595"
}
]

# docker start osd-phy6
osd-phy6

# After starting the OSD again, the cluster almost recovers. The PG showing
# up as backfill_toofull is due to a known bug (fixed in 13.2.10?). There are
# no degraded objects, only misplaced ones.
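# Sketch (not part of the session): that no PGs remain degraded can be
# cross-checked with, e.g.:
#
#   ceph pg dump_stuck degraded
#   ceph health detail | grep -i degraded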

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_ERR
noout,norebalance flag(s) set
8181843/1498795556 objects misplaced (0.546%)
Degraded data redundancy (low space): 1 pg backfill_toofull
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 8181843/1498795556 objects misplaced (0.546%)
3163 active+clean
39 active+remapped+backfill_wait
6 active+remapped+backfilling
5 active+clean+scrubbing+deep
1 active+remapped+backfill_toofull
1 active+clean+snaptrim
io:
client: 35 MiB/s rd, 23 MiB/s wr, 672 op/s rd, 686 op/s wr
# docker stop osd-phy9
osd-phy9

# After stopping an old OSD, we observe immediate degradation. The cluster
# seems to lose track of objects already at this point. In contrast to stopping
# a new OSD above, where recovery showed up only temporarily, the recovery
# operation does not stop here.
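# Sketch (not part of the session): the PGs with a missing shard (the
# 2147483647 placeholder in the acting set, i.e. CRUSH's "none" item) can be
# listed with, e.g.:
#
#   ceph pg dump pgs_brief 2>/dev/null | grep 2147483647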

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_WARN
noout,norebalance flag(s) set
1 osds down
7967641/1498798381 objects misplaced (0.532%)
Degraded data redundancy: 5763425/1498798381 objects degraded (0.385%), 75 pgs degraded
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 271 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 5763425/1498798381 objects degraded (0.385%)
7967641/1498798381 objects misplaced (0.532%)
3092 active+clean
70 active+undersized+degraded
41 active+remapped+backfill_wait
4 active+clean+scrubbing+deep
4 active+undersized+degraded+remapped+backfilling
2 active+clean+scrubbing
1 active+undersized+degraded+remapped+backfill_wait
1 active+clean+snaptrim
io:
client: 76 MiB/s rd, 76 MiB/s wr, 736 op/s rd, 881 op/s wr
recovery: 93 MiB/s, 23 objects/s
# ceph health detail
HEALTH_WARN noout,norebalance flag(s) set; 1 osds down; 7966306/1498798501 objects misplaced (0.532%); Degraded data redundancy: 5762977/1498798501 objects degraded (0.385%), 75 pgs degraded; 1 pools nearfull
OSDMAP_FLAGS noout,norebalance flag(s) set
OSD_DOWN 1 osds down
osd.74 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
OBJECT_MISPLACED 7966306/1498798501 objects misplaced (0.532%)
PG_DEGRADED Degraded data redundancy: 5762977/1498798501 objects degraded (0.385%), 75 pgs degraded
pg 11.4 is active+undersized+degraded, acting [86,2147483647,237,235,182,63,231,84]
pg 11.5 is active+undersized+degraded, acting [1,2147483647,183,0,239,145,293,170]
pg 11.a is active+undersized+degraded+remapped+backfilling, acting [170,156,148,2147483647,234,86,236,232]
pg 11.b is active+undersized+degraded, acting [3,228,0,146,239,292,145,2147483647]
pg 11.e is active+undersized+degraded, acting [237,181,73,183,72,290,2147483647,295]
pg 11.1b is active+undersized+degraded, acting [85,183,72,2147483647,156,232,263,146]
pg 11.1d is active+undersized+degraded, acting [290,296,183,86,293,2147483647,236,3]
pg 11.37 is active+undersized+degraded, acting [60,233,181,183,2147483647,296,87,86]
pg 11.50 is active+undersized+degraded+remapped+backfill_wait, acting [231,259,228,87,182,156,2147483647,172]
pg 11.52 is active+undersized+degraded, acting [237,60,1,2147483647,233,232,292,86]
pg 11.57 is active+undersized+degraded, acting [231,259,230,170,72,87,181,2147483647]
pg 11.5a is active+undersized+degraded, acting [290,2147483647,237,183,293,84,295,1]
pg 11.5c is active+undersized+degraded, acting [84,2147483647,259,0,85,234,146,148]
pg 11.62 is active+undersized+degraded, acting [182,294,293,2147483647,63,234,181,0]
pg 11.68 is active+undersized+degraded, acting [158,296,229,168,76,3,159,2147483647]
pg 11.6a is active+undersized+degraded, acting [2147483647,288,238,172,1,237,0,290]
pg 11.73 is active+undersized+degraded, acting [236,234,259,2147483647,170,63,3,0]
pg 11.77 is active+undersized+degraded, acting [2147483647,72,87,183,236,156,290,293]
pg 11.78 is active+undersized+degraded, acting [172,230,236,156,294,60,2147483647,76]
pg 11.84 is active+undersized+degraded, acting [86,84,239,296,294,182,2147483647,293]
pg 11.87 is active+undersized+degraded+remapped+backfilling, acting [148,60,231,260,235,87,2147483647,181]
pg 11.8b is active+undersized+degraded, acting [263,170,2147483647,259,296,172,73,76]
pg 11.15a is active+undersized+degraded, acting [2147483647,260,182,0,263,73,159,288]
pg 11.15f is active+undersized+degraded+remapped+backfilling, acting [146,233,2147483647,76,234,172,181,229]
pg 11.162 is active+undersized+degraded, acting [84,294,230,2,293,290,2147483647,295]
pg 11.16d is active+undersized+degraded, acting [236,230,2147483647,183,0,1,235,181]
pg 11.172 is active+undersized+degraded, acting [181,148,237,3,231,293,76,2147483647]
pg 11.185 is active+undersized+degraded, acting [296,0,236,238,2147483647,294,181,146]
pg 11.18a is active+undersized+degraded, acting [0,2147483647,159,145,293,233,85,146]
pg 11.192 is active+undersized+degraded, acting [148,76,170,296,295,2147483647,3,235]
pg 11.193 is active+undersized+degraded, acting [2147483647,148,295,230,232,168,76,290]
pg 11.198 is active+undersized+degraded, acting [260,76,87,2147483647,145,183,229,239]
pg 11.19a is active+undersized+degraded, acting [146,294,230,238,2147483647,0,295,288]
pg 11.1a1 is active+undersized+degraded, acting [84,183,294,2147483647,234,170,263,238]
pg 11.1a7 is active+undersized+degraded, acting [63,236,158,84,86,237,87,2147483647]
pg 11.1ae is active+undersized+degraded, acting [296,172,238,2147483647,170,288,294,295]
pg 11.1c5 is active+undersized+degraded, acting [76,172,236,232,2147483647,296,288,170]
pg 11.1c6 is active+undersized+degraded, acting [236,72,230,170,2147483647,238,181,148]
pg 11.1d1 is active+undersized+degraded, acting [259,170,291,3,156,2147483647,292,296]
pg 11.1d4 is active+undersized+degraded, acting [263,228,182,84,2,2147483647,259,87]
pg 11.1e1 is active+undersized+degraded, acting [158,145,233,1,259,296,2,2147483647]
pg 11.1ea is active+undersized+degraded, acting [84,183,260,259,85,60,2,2147483647]
pg 11.1ec is active+undersized+degraded, acting [292,293,233,2,2147483647,85,288,146]
pg 11.1ed is active+undersized+degraded, acting [156,237,293,233,148,2147483647,291,85]
pg 11.1ee is active+undersized+degraded, acting [1,229,0,63,228,2147483647,233,156]
pg 11.201 is active+undersized+degraded, acting [229,239,296,63,76,294,182,2147483647]
pg 11.206 is active+undersized+degraded, acting [235,288,76,158,296,263,85,2147483647]
pg 11.20a is active+undersized+degraded, acting [158,1,263,232,0,230,292,2147483647]
pg 11.218 is active+undersized+degraded, acting [0,296,87,2147483647,263,148,156,232]
pg 11.21a is active+undersized+degraded, acting [2147483647,230,159,231,60,235,73,291]
pg 11.21b is active+undersized+degraded, acting [84,159,238,87,291,230,2147483647,182]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
86,
2147483647,
237,
235,
182,
63,
231,
84
]
[
86,
2147483647,
237,
235,
182,
63,
231,
84
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:41:55.983304",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": [
{
"tid": 8418509,
"to_read": "{11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head=read_request_t(to_read=[1646592,24576,0], need={63(5)=[0,1],86(0)=[0,1],182(4)=[0,1],231(6)=[0,1],235(3)=[0,1],237(2)=[0,1]}, want_attrs=0)}",
"complete": "{11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head=read_result_t(r=0, errors={}, noattrs, returned=(1646592, 24576, [63(5),4096, 86(0),4096, 231(6),4096, 235(3),4096, 237(2),4096]))}",
"priority": 127,
"obj_to_source": "{11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head=63(5),86(0),182(4),231(6),235(3),237(2)}",
"source_to_obj": "{63(5)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,86(0)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,182(4)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,231(6)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,235(3)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,237(2)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head}",
"in_progress": "182(4)"
}
]
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:41:55.202575"
}
]

# docker start osd-phy9
osd-phy9

# After starting the old OSD, a lot of objects remain degraded. It looks like PGs that were
# in state "...+remapped+backfilling" are affected. All others seem to recover, see below.
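# Sketch (not part of the session): the remaining degraded PGs and their
# acting/up sets can be listed with the commands used throughout this log,
# e.g. (assuming the first column of the dump_stuck table is the PG id):
#
#   for pg in $(ceph pg dump_stuck degraded 2>/dev/null | awk 'NR>1 {print $1}'); do
#       ceph pg "$pg" query | jq -c '.acting,.up'
#   done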

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_ERR
noout,norebalance flag(s) set
7954306/1498800854 objects misplaced (0.531%)
Degraded data redundancy: 208493/1498800854 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized
Degraded data redundancy (low space): 4 pgs backfill_toofull
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 208493/1498800854 objects degraded (0.014%)
7954306/1498800854 objects misplaced (0.531%)
3162 active+clean
39 active+remapped+backfill_wait
6 active+clean+scrubbing+deep
4 active+remapped+backfill_toofull
3 active+undersized+degraded+remapped+backfilling
1 active+clean+snaptrim
io:
client: 111 MiB/s rd, 42 MiB/s wr, 763 op/s rd, 750 op/s wr
recovery: 66 MiB/s, 16 objects/s
# ceph health detail
HEALTH_ERR noout,norebalance flag(s) set; 7953632/1498800881 objects misplaced (0.531%); Degraded data redundancy: 208184/1498800881 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized; Degraded data redundancy (low space): 4 pgs backfill_toofull; 1 pools nearfull
OSDMAP_FLAGS noout,norebalance flag(s) set
OBJECT_MISPLACED 7953632/1498800881 objects misplaced (0.531%)
PG_DEGRADED Degraded data redundancy: 208184/1498800881 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized
pg 11.a is stuck undersized for 311.488352, current state active+undersized+degraded+remapped+backfilling, last acting [170,156,148,2147483647,234,86,236,232]
pg 11.87 is stuck undersized for 311.487625, current state active+undersized+degraded+remapped+backfilling, last acting [148,60,231,260,235,87,2147483647,181]
pg 11.ed is stuck undersized for 311.465765, current state active+undersized+degraded+remapped+backfilling, last acting [233,2147483647,156,259,159,182,230,85]
PG_DEGRADED_FULL Degraded data redundancy (low space): 4 pgs backfill_toofull
pg 11.8 is active+remapped+backfill_wait+backfill_toofull, acting [86,158,237,85,159,259,144,263]
pg 11.5d is active+remapped+backfill_wait+backfill_toofull, acting [263,158,230,73,183,84,2,169]
pg 11.165 is active+remapped+backfill_wait+backfill_toofull, acting [60,148,234,73,2,229,84,180]
pg 11.1f0 is active+remapped+backfill_wait+backfill_toofull, acting [237,148,2,238,169,231,60,87]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
86,
74,
237,
235,
182,
63,
231,
84
]
[
86,
74,
237,
235,
182,
63,
231,
84
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:46:20.882683",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:46:19.736862"
}
]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state"
[
170,
156,
148,
2147483647,
234,
86,
236,
232
]
[
170,
156,
292,
289,
234,
86,
236,
232
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:41:55.982261",
"might_have_unfound": [
{
"osd": "74(3)",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [
"289(3)",
"292(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head",
"backfill_info": {
"begin": "11:5003695e:::rbd_data.1.318f016b8b4567.000000000004aa48:head",
"end": "11:5003e7ba:::rbd_data.1.ac314b6b8b4567.00000000000a3032:head",
"objects": [
{
"object": "11:5003695e:::rbd_data.1.318f016b8b4567.000000000004aa48:head",
"version": "66191'195037"
},
[... many many similar entries removed ...]
{
"object": "11:5003e76b:::rbd_data.1.b023c26b8b4567.0000000000b8efd1:head",
"version": "181835'908372"
}
]
},
"peer_backfill_info": [
"289(3)",
{
"begin": "MAX",
"end": "MAX",
"objects": []
},
"292(2)",
{
"begin": "MAX",
"end": "MAX",
"objects": []
}
],
"backfills_in_flight": [
"11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head"
],
"recovering": [
"11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head"
],
"pg_backend": {
"recovery_ops": [
{
"hoid": "11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head",
"v": "178993'819609",
"missing_on": "289(3),292(2)",
"missing_on_shards": "2,3",
"recovery_info": "ObjectRecoveryInfo(11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head@178993'819609, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:{})",
"recovery_progress": "ObjectRecoveryProgress(!first, data_recovered_to:4202496, data_complete:true, omap_recovered_to:, omap_complete:true, error:false)",
"state": "WRITING",
"waiting_on_pushes": "289(3),292(2)",
"extent_requested": "0,8404992"
}
],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:41:55.090085"
}
]


# Below we temporarily change the CRUSH placement of the new OSDs and then
# move them back. The peering this initiates leads to re-discovery of all
# objects.

# ceph osd crush move osd.288 host=bb-04
moved item id 288 name 'osd.288' to location {host=bb-04} in crush map
# ceph osd crush move osd.289 host=bb-05
moved item id 289 name 'osd.289' to location {host=bb-05} in crush map
# ceph osd crush move osd.290 host=bb-06
moved item id 290 name 'osd.290' to location {host=bb-06} in crush map
# ceph osd crush move osd.291 host=bb-21
moved item id 291 name 'osd.291' to location {host=bb-21} in crush map
# ceph osd crush move osd.292 host=bb-07
moved item id 292 name 'osd.292' to location {host=bb-07} in crush map
# ceph osd crush move osd.293 host=bb-18
moved item id 293 name 'osd.293' to location {host=bb-18} in crush map
# ceph osd crush move osd.295 host=bb-19
moved item id 295 name 'osd.295' to location {host=bb-19} in crush map
# ceph osd crush move osd.294 host=bb-22
moved item id 294 name 'osd.294' to location {host=bb-22} in crush map
# ceph osd crush move osd.296 host=bb-20
moved item id 296 name 'osd.296' to location {host=bb-20} in crush map
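# The same moves can also be scripted; a sketch (temporary bucket names as used
# above, moving back afterwards works the same way with the original hosts):
#
#   declare -A tmp_host=( [288]=bb-04 [289]=bb-05 [290]=bb-06 [291]=bb-21 \
#                         [292]=bb-07 [293]=bb-18 [294]=bb-22 [295]=bb-19 [296]=bb-20 )
#   for id in "${!tmp_host[@]}"; do
#       ceph osd crush move "osd.$id" "host=${tmp_host[$id]}"
#   done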

# All objects are found at this point. Notice that a slow op shows up for
# one of the mons (see the very end). It does not clear by itself; a mon
# restart is required.

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_WARN
noout,norebalance flag(s) set
59942033/1498816658 objects misplaced (3.999%)
1 pools nearfull
1 slow ops, oldest one blocked for 62 sec, mon.ceph-03 has slow ops
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 419 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 59942033/1498816658 objects misplaced (3.999%)
2747 active+clean
348 active+remapped+backfill_wait
71 active+remapped+backfilling
34 active+clean+snaptrim
12 active+clean+snaptrim_wait
3 active+clean+scrubbing+deep
io:
client: 130 MiB/s rd, 113 MiB/s wr, 1.38 kop/s rd, 1.53 kop/s wr
# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
86,
74,
237,
235,
182,
63,
231,
84
]
[
86,
74,
237,
235,
182,
63,
231,
84
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:50:26.411524",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:50:20.931555"
}
]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state"
[
170,
156,
148,
74,
234,
86,
236,
232
]
[
170,
156,
148,
74,
234,
86,
236,
232
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:50:18.781335",
"might_have_unfound": [
{
"osd": "74(3)",
"status": "already probed"
},
{
"osd": "86(5)",
"status": "already probed"
},
{
"osd": "148(2)",
"status": "already probed"
},
{
"osd": "156(1)",
"status": "already probed"
},
{
"osd": "232(7)",
"status": "already probed"
},
{
"osd": "234(4)",
"status": "already probed"
},
{
"osd": "236(6)",
"status": "already probed"
},
{
"osd": "289(3)",
"status": "not queried"
},
{
"osd": "292(2)",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:50:18.043630"
}
]

# ceph osd crush move osd.288 host=ceph-04
moved item id 288 name 'osd.288' to location {host=ceph-04} in crush map
# ceph osd crush move osd.289 host=ceph-05
moved item id 289 name 'osd.289' to location {host=ceph-05} in crush map
# ceph osd crush move osd.290 host=ceph-06
moved item id 290 name 'osd.290' to location {host=ceph-06} in crush map
# ceph osd crush move osd.291 host=ceph-21
moved item id 291 name 'osd.291' to location {host=ceph-21} in crush map
# ceph osd crush move osd.292 host=ceph-07
moved item id 292 name 'osd.292' to location {host=ceph-07} in crush map
# ceph osd crush move osd.293 host=ceph-18
moved item id 293 name 'osd.293' to location {host=ceph-18} in crush map
# ceph osd crush move osd.295 host=ceph-19
moved item id 295 name 'osd.295' to location {host=ceph-19} in crush map
# ceph osd crush move osd.294 host=ceph-22
moved item id 294 name 'osd.294' to location {host=ceph-22} in crush map
# ceph osd crush move osd.296 host=ceph-20
moved item id 296 name 'osd.296' to location {host=ceph-20} in crush map


# After these placement operations, we start observing slow ops. Not sure what
# is going on here, but something does not seem to work the way it should. We
# recorded two "ceph status" reports to show the transition. In between these
# two, a PG went down as well; it was shown as "1 pg inactive" for a short
# time. We didn't manage to catch that for the record.
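# Sketch (not part of the session): briefly inactive PGs can be caught with,
# e.g.:
#
#   ceph pg dump_stuck inactive
#   watch -n 2 "ceph status | grep -E 'activating|peering|not active'"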

# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_WARN
noout,norebalance flag(s) set
8630330/1498837232 objects misplaced (0.576%)
1 pools nearfull
8 slow ops, oldest one blocked for 212 sec, daemons [osd.169,osd.234,osd.288,osd.63,mon.ceph-03] have slow ops.
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 0.156% pgs not active
8630330/1498837232 objects misplaced (0.576%)
3158 active+clean
41 active+remapped+backfill_wait
6 active+clean+scrubbing+deep
4 active+remapped+backfilling
4 activating
1 activating+remapped
1 active+clean+snaptrim
io:
client: 85 MiB/s rd, 127 MiB/s wr, 534 op/s rd, 844 op/s wr
# ceph status
cluster:
id: xxx-x-xxx
health: HEALTH_WARN
noout,norebalance flag(s) set
8630330/1498844491 objects misplaced (0.576%)
1 pools nearfull
1 slow ops, oldest one blocked for 247 sec, mon.ceph-03 has slow ops
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 46 remapped pgs
flags noout,norebalance
data:
pools: 11 pools, 3215 pgs
objects: 178.0 M objects, 491 TiB
usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 8630330/1498844491 objects misplaced (0.576%)
3162 active+clean
42 active+remapped+backfill_wait
6 active+clean+scrubbing+deep
4 active+remapped+backfilling
1 active+clean+snaptrim
io:
client: 51 MiB/s rd, 66 MiB/s wr, 1.22 kop/s rd, 1.18 kop/s wr

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
86,
74,
237,
235,
182,
63,
231,
84
]
[
86,
74,
237,
235,
182,
63,
231,
84
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:53:20.059512",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:53:08.947343"
}
]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state"
[
170,
156,
148,
74,
234,
86,
236,
232
]
[
170,
156,
292,
289,
234,
86,
236,
232
]
[
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 10:53:08.842425",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [
"289(3)",
"292(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 10:53:04.962749"
}
]


# This operation got stuck. In other experiments I saw hundreds of stuck ops
# and always needed a mon restart.

# ceph daemon mon.ceph-03 ops
{
"ops": [
{
"description": "osd_pgtemp(e193916 {11.43=[236,168,85,86,228,169,60,148]} v193916)",
"initiated_at": "2020-08-11 10:50:20.712996",
"age": 310.065493,
"duration": 310.065506,
"type_data": {
"events": [
{
"time": "2020-08-11 10:50:20.712996",
"event": "initiated"
},
{
"time": "2020-08-11 10:50:20.712996",
"event": "header_read"
},
{
"time": "2020-08-11 10:50:20.713012",
"event": "throttled"
},
{
"time": "2020-08-11 10:50:20.713015",
"event": "all_read"
},
{
"time": "2020-08-11 10:50:20.713110",
"event": "dispatched"
},
{
"time": "2020-08-11 10:50:20.713113",
"event": "mon:_ms_dispatch"
},
{
"time": "2020-08-11 10:50:20.713113",
"event": "mon:dispatch_op"
},
{
"time": "2020-08-11 10:50:20.713113",
"event": "psvc:dispatch"
},
{
"time": "2020-08-11 10:50:20.713125",
"event": "osdmap:preprocess_query"
},
{
"time": "2020-08-11 10:50:20.713155",
"event": "forward_request_leader"
},
{
"time": "2020-08-11 10:50:20.713184",
"event": "forwarded"
}
],
"info": {
"seq": 56785507,
"src_is_mon": false,
"source": "osd.85 192.168.32.69:6820/2538607",
"forwarded_to_leader": true
}
}
}
],
"num_ops": 1
}
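
# Sketch (not part of the session): clearing stuck ops like the one above
# required restarting the affected mon. With this containerized deployment that
# would be something like the command below; the container name "mon-ceph-03"
# is a guess, not taken from this session:
#
#   docker restart mon-ceph-03
#   # afterwards, re-check with: ceph daemon mon.ceph-03 ops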
