Bug #62811
PGs stuck in backfilling state after their primary OSD is removed by setting its crush weight to 0.
Description
I am pasting the problem description from the original BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2233777
Description of problem:
We are observing that the OSDs get stuck in the draining process, and not all the PGs are removed from the OSD. It looks like the drain is stuck on PGs where the OSD being removed is the primary OSD of that PG.

We had these configs set on the cluster during testing:
mon  advanced  mon_osd_down_out_subtree_limit  host
osd  advanced  osd_async_recovery_min_cost     1099511627776

Procedure:

1. Deploy a 4 node RHCS cluster with a 2+2 EC pool and fill data.

# ceph -s
  cluster:
    id:     66070a80-2f84-11ee-bc2c-0cc47af3ea56
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum argo012,argo013,argo014 (age 8d)
    mgr: argo013.akdhka(active, since 12d), standbys: argo012.odttqx, argo014.xfhnzv
    osd: 35 osds: 35 up (since 67m), 35 in (since 30h)
    rgw: 4 daemons active (4 hosts, 1 zones)
  data:
    pools:   7 pools, 673 pgs
    objects: 4.28M objects, 3.0 TiB
    usage:   6.6 TiB used, 7.6 TiB / 14 TiB avail
    pgs:     665 active+clean
             7   active+clean+scrubbing+deep
             1   active+clean+scrubbing
  io:
    client: 7.2 KiB/s rd, 0 B/s wr, 7 op/s rd, 4 op/s wr

[ceph: root@argo012 /]# ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL    USED     RAW USED  %RAW USED
hdd    14 TiB  7.6 TiB  6.6 TiB  6.6 TiB   46.42
TOTAL  14 TiB  7.6 TiB  6.6 TiB  6.6 TiB   46.42

--- POOLS ---
POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                       1   1    23 MiB   5        68 MiB   0      1.8 TiB
.rgw.root                  2   32   1.3 KiB  4        48 KiB   0      1.8 TiB
default.rgw.log            3   32   3.6 KiB  209      408 KiB  0      1.8 TiB
default.rgw.control        4   32   0 B      8        0 B      0      1.8 TiB
default.rgw.meta           5   32   8.9 KiB  18       207 KiB  0      1.8 TiB
ec22-pool                  6   512  3.0 TiB  4.28M    6.1 TiB  52.35  2.8 TiB
default.rgw.buckets.index  7   32   1.1 GiB  90       3.2 GiB  0.06   1.8 TiB

# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         14.21661         -   14 TiB  6.6 TiB  6.1 TiB  3.2 GiB   61 GiB  7.6 TiB  46.42  1.00    -          root default
-7          3.65570         -  3.7 TiB  1.7 TiB  1.5 TiB  642 MiB   15 GiB  2.0 TiB  45.23  0.97    -              host argo012
 3    hdd   0.40619   1.00000  416 GiB  183 GiB  167 GiB  111 MiB  1.6 GiB  233 GiB  43.89  0.95   74      up          osd.3
 6    hdd   0.40619   1.00000  416 GiB  194 GiB  178 GiB   57 MiB  1.4 GiB  222 GiB  46.68  1.01   64      up          osd.6
10    hdd   0.40619   1.00000  416 GiB  176 GiB  160 GiB   23 MiB  1.4 GiB  240 GiB  42.32  0.91   61      up          osd.10
15    hdd   0.40619   1.00000  416 GiB  191 GiB  175 GiB   26 MiB  2.0 GiB  225 GiB  45.95  0.99   71      up          osd.15
19    hdd   0.40619   1.00000  416 GiB  189 GiB  173 GiB  180 MiB  1.5 GiB  227 GiB  45.32  0.98   69      up          osd.19
23    hdd   0.40619   1.00000  416 GiB  177 GiB  161 GiB   33 KiB  1.5 GiB  239 GiB  42.43  0.91   67      up          osd.23
27    hdd   0.40619   1.00000  416 GiB  185 GiB  169 GiB  2.9 MiB  1.4 GiB  231 GiB  44.58  0.96   69      up          osd.27
31    hdd   0.40619   1.00000  416 GiB  204 GiB  188 GiB  233 MiB  1.6 GiB  212 GiB  49.06  1.06   81      up          osd.31
35    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB  9.8 MiB  2.2 GiB  221 GiB  46.87  1.01   75      up          osd.35
-3          3.65570         -  3.7 TiB  1.7 TiB  1.5 TiB  979 MiB   15 GiB  2.0 TiB  45.23  0.97    -              host argo013
 0    hdd   0.40619   1.00000  416 GiB  179 GiB  163 GiB  312 MiB  1.8 GiB  237 GiB  43.05  0.93   67      up          osd.0
 4    hdd   0.40619   1.00000  416 GiB  191 GiB  175 GiB  149 MiB  1.8 GiB  225 GiB  46.01  0.99   77      up          osd.4
 8    hdd   0.40619   1.00000  416 GiB  186 GiB  170 GiB  134 MiB  1.3 GiB  230 GiB  44.63  0.96   72      up          osd.8
12    hdd   0.40619   1.00000  416 GiB  192 GiB  176 GiB    1 KiB  1.5 GiB  224 GiB  46.04  0.99   67      up          osd.12
16    hdd   0.40619   1.00000  416 GiB  188 GiB  172 GiB   10 MiB  1.9 GiB  228 GiB  45.27  0.98   71      up          osd.16
20    hdd   0.40619   1.00000  416 GiB  186 GiB  170 GiB   46 MiB  1.7 GiB  230 GiB  44.70  0.96   71      up          osd.20
24    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB   77 MiB  1.5 GiB  221 GiB  46.78  1.01   72      up          osd.24
28    hdd   0.40619   1.00000  416 GiB  179 GiB  163 GiB   90 MiB  1.1 GiB  237 GiB  43.06  0.93   58      up          osd.28
32    hdd   0.40619   1.00000  416 GiB  198 GiB  182 GiB  161 MiB  1.9 GiB  218 GiB  47.56  1.02   76      up          osd.32
-5          3.65570         -  3.7 TiB  1.7 TiB  1.5 TiB  683 MiB   16 GiB  2.0 TiB  45.23  0.97    -              host argo014
 2    hdd   0.40619   1.00000  416 GiB  183 GiB  167 GiB  136 MiB  1.8 GiB  233 GiB  44.02  0.95   72      up          osd.2
 7    hdd   0.40619   1.00000  416 GiB  185 GiB  169 GiB  9.7 MiB  1.6 GiB  231 GiB  44.54  0.96   69      up          osd.7
11    hdd   0.40619   1.00000  416 GiB  206 GiB  190 GiB    1 KiB  2.1 GiB  210 GiB  49.59  1.07   78      up          osd.11
14    hdd   0.40619   1.00000  416 GiB  188 GiB  172 GiB  5.2 MiB  1.8 GiB  228 GiB  45.25  0.97   61      up          osd.14
18    hdd   0.40619   1.00000  416 GiB  188 GiB  172 GiB  176 MiB  1.4 GiB  228 GiB  45.27  0.98   77      up          osd.18
22    hdd   0.40619   1.00000  416 GiB  182 GiB  166 GiB  152 MiB  1.8 GiB  234 GiB  43.79  0.94   70      up          osd.22
26    hdd   0.40619   1.00000  416 GiB  194 GiB  178 GiB   25 MiB  1.9 GiB  222 GiB  46.66  1.01   66      up          osd.26
30    hdd   0.40619   1.00000  416 GiB  192 GiB  176 GiB  107 MiB  1.8 GiB  224 GiB  46.20  1.00   73      up          osd.30
34    hdd   0.40619   1.00000  416 GiB  174 GiB  158 GiB   72 MiB  1.5 GiB  242 GiB  41.77  0.90   68      up          osd.34
-9          3.24951         -  3.2 TiB  1.6 TiB  1.5 TiB  934 MiB   16 GiB  1.6 TiB  50.41  1.09    -              host argo017
 5    hdd   0.40619   1.00000  416 GiB  234 GiB  218 GiB  162 MiB  2.2 GiB  182 GiB  56.26  1.21   86      up          osd.5
 9    hdd   0.40619   1.00000  416 GiB  207 GiB  191 GiB  124 MiB  2.3 GiB  209 GiB  49.75  1.07   83      up          osd.9
13    hdd   0.40619   1.00000  416 GiB  194 GiB  178 GiB   45 MiB  1.9 GiB  222 GiB  46.61  1.00   72      up          osd.13
17    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB   48 MiB  1.6 GiB  221 GiB  46.86  1.01   73      up          osd.17
21    hdd   0.40619   1.00000  416 GiB  219 GiB  203 GiB  112 MiB  2.1 GiB  197 GiB  52.60  1.13   84      up          osd.21
25    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB   36 MiB  1.7 GiB  221 GiB  46.78  1.01   72      up          osd.25
29    hdd   0.40619   1.00000  416 GiB  203 GiB  187 GiB  221 MiB  2.2 GiB  213 GiB  48.92  1.05   81      up          osd.29
33    hdd   0.40619   1.00000  416 GiB  231 GiB  215 GiB  185 MiB  2.0 GiB  185 GiB  55.48  1.20   84      up          osd.33
                       TOTAL    14 TiB  6.6 TiB  6.1 TiB  3.2 GiB   61 GiB  7.6 TiB  46.42
MIN/MAX VAR: 0.90/1.21  STDDEV: 3.24

2. In a healthy cluster, where all PGs are in active+clean state, try removing an OSD from one host. In the example below, OSD 0 from argo013 was being removed.

# date
Tue Aug 22 12:10:27 UTC 2023
# ceph orch osd rm 0 --force --zap
Scheduled OSD(s) for removal.
[ceph: root@argo012 /]# date
Tue Aug 22 12:11:28 UTC 2023
[ceph: root@argo012 /]# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining  58   False    True   True  2023-08-22 12:11:27.848424

3. Initially, the PGs start draining, but we observe that the drain stops after around 75% of the PGs have drained. When I checked the PG dump at this point, it looked like the drain is stuck on PGs where the OSD being removed is the primary OSD of that PG.

# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining  22   False    True   True  2023-08-22 12:11:27.848424
[root@argo012 ~]# date
Wed Aug 23 04:07:22 AM UTC 2023
# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining  22   False    True   True  2023-08-22 12:11:27.848424
[root@argo012 ~]# date
Wed Aug 23 05:38:27 AM UTC 2023
[root@argo012 ~]# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining  22   False    True   True  2023-08-22 12:11:27.848424
[root@argo012 ~]# date
Wed Aug 23 11:01:42 AM UTC 2023

At this point, we also do not see any recovery activity in the ceph -s output.
# ceph -s
  cluster:
    id:     66070a80-2f84-11ee-bc2c-0cc47af3ea56
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum argo012,argo013,argo014 (age 9d)
    mgr: argo013.akdhka(active, since 13d), standbys: argo012.odttqx, argo014.xfhnzv
    osd: 35 osds: 35 up (since 21h), 35 in (since 2d); 22 remapped pgs
    rgw: 4 daemons active (4 hosts, 1 zones)
  data:
    pools:   7 pools, 673 pgs
    objects: 4.42M objects, 3.0 TiB
    usage:   6.6 TiB used, 7.6 TiB / 14 TiB avail
    pgs:     171950/17688441 objects misplaced (0.972%)
             651 active+clean
             16  active+remapped+backfilling
             6   active+remapped+backfill_wait
  io:
    client: 10 KiB/s rd, 143 KiB/s wr, 14 op/s rd, 27 op/s wr

[root@argo012 ~]# ceph pg dump | grep backfill
6.1bd 8679 0 0 8679 0 6519914496 0 0 7605 7605 active+remapped+backfilling 2023-08-22T12:11:40.514461+0000 19194'35959 19194:222764 [4,3,9,18] 4 [0,3,9,18] 0 18421'35707 2023-08-20T18:45:07.674753+0000 18421'35707 2023-08-18T14:56:15.799803+0000 0 36 queued for scrub 8427 0
6.1b8 8517 0 0 17034 0 6200819712 0 0 7393 7393 active+remapped+backfilling 2023-08-22T12:11:41.238886+0000 19194'35586 19194:218781 [20,23,26,13] 20 [0,23,26,25] 0 18422'35313 2023-08-20T17:33:29.237283+0000 18422'35313 2023-08-20T17:33:29.237283+0000 0 175 queued for scrub 8244 0
6.1af 8479 0 0 8479 0 6370885632 0 0 7261 7261 active+remapped+backfilling 2023-08-22T12:11:40.475878+0000 19194'35370 19194:192817 [12,18,3,9] 12 [0,18,3,9] 0 18422'35131 2023-08-22T12:09:30.568001+0000 18422'35131 2023-08-22T12:09:30.568001+0000 0 560 periodic scrub scheduled @ 2023-08-23T15:52:04.077250+0000 8240 0
6.1aa 8662 0 0 8662 0 6562643968 0 0 7425 7425 active+remapped+backfilling 2023-08-22T12:11:41.434841+0000 19194'35518 19194:193262 [12,11,6,13] 12 [0,11,6,13] 0 18422'35250 2023-08-20T16:57:40.112433+0000 18337'33987 2023-08-14T17:18:34.663615+0000 0 36 queued for deep scrub 8394 0
6.19d 8627 0 0 8627 0 6525550592 0 0 7314 7314 active+remapped+backfilling 2023-08-22T12:11:39.263735+0000 19194'35434 19194:197745 [16,25,26,6] 16 [0,25,26,6] 0 18422'35151 2023-08-20T09:25:48.183947+0000 18396'34264 2023-08-16T11:51:33.787266+0000 0 35 queued for scrub 8344 0
6.180 8826 0 0 8826 0 6708264960 0 0 6784 6784 active+remapped+backfilling 2023-08-22T12:11:40.553126+0000 19194'34345 19194:182555 [32,30,17,31] 32 [0,30,17,31] 0 18421'34084 2023-08-20T01:22:52.567792+0000 18421'34084 2023-08-20T01:22:52.567792+0000 0 176 queued for scrub 8565 0
6.59 8661 0 0 8661 0 6445924352 0 0 7796 7796 active+remapped+backfilling 2023-08-22T14:53:10.041447+0000 19194'36355 19194:117635 [28,18,13,15] 28 [0,18,13,15] 0 18421'36082 2023-08-22T11:26:01.536220+0000 18421'36082 2023-08-22T11:26:01.536220+0000 0 382 periodic scrub scheduled @ 2023-08-23T22:36:35.973627+0000 8388 0
6.4d 8510 0 0 8510 0 6429343744 0 0 7680 7680 active+remapped+backfill_wait 2023-08-22T14:48:22.748410+0000 19194'35958 19194:159698 [12,25,18,19] 12 [0,25,18,19] 0 18422'35685 2023-08-22T11:38:43.676947+0000 18411'35524 2023-08-17T05:34:58.535235+0000 0 35 periodic scrub scheduled @ 2023-08-23T11:46:45.368400+0000 8237 0
6.4a 8603 0 0 17206 0 6389039104 0 0 7624 7624 active+remapped+backfill_wait 2023-08-22T14:52:58.531620+0000 19194'35895 19194:145778 [12,22,25,27] 12 [0,22,33,27] 0 18422'35640 2023-08-22T11:29:58.371470+0000 18422'35640 2023-08-19T23:11:14.163755+0000 0 35 periodic scrub scheduled @ 2023-08-23T21:00:28.665562+0000 8348 0
7.1b 3 0 0 3 0 0 41666444 189764 10048 10048 active+remapped+backfilling 2023-08-22T12:11:41.250313+0000 19194'812502 19194:916015 [28,3,13] 28 [0,3,13] 0 18951'800381 2023-08-22T11:26:05.828515+0000 18951'800381 2023-08-22T11:26:05.828515+0000 0 4 periodic scrub scheduled @ 2023-08-23T23:25:02.991379+0000 3 0
3.6 6 0 0 12 0 0 0 0 9495 9495 active+remapped+backfill_wait 2023-08-22T14:53:10.040326+0000 19194'153995 19194:252905 [26,13,27] 26 [0,13,2] 0 18951'148821 2023-08-22T11:37:46.197728+0000 18510'134991 2023-08-20T11:05:55.008644+0000 0 1 periodic scrub scheduled @ 2023-08-23T19:10:36.035866+0000 6 0
7.4 4 0 0 4 0 0 31724167 138831 10068 10068 active+remapped+backfilling 2023-08-22T12:11:43.275135+0000 19194'416071 19194:486357 [8,9,2] 8 [0,9,2] 0 18422'403982 2023-08-20T14:21:33.629927+0000 18396'351102 2023-08-16T12:45:54.473320+0000 0 1 queued for scrub 4 0
7.6 7 0 0 7 0 0 67901442 315891 4422 4422 active+remapped+backfilling 2023-08-22T12:11:41.434578+0000 18508'472722 19193:565304 [12,22,5] 12 [0,22,5] 0 18508'472722 2023-08-22T11:30:09.680199+0000 18508'472722 2023-08-20T09:18:04.840742+0000 0 1 periodic scrub scheduled @ 2023-08-23T20:39:20.602754+0000 7 0
6.16 8668 0 0 8668 0 6560940032 0 0 6920 6920 active+remapped+backfill_wait 2023-08-22T14:48:33.793550+0000 19194'34730 19194:69433 [28,7,23,29] 28 [0,7,23,29] 0 18422'34477 2023-08-21T02:38:11.482845+0000 18333'33321 2023-08-14T10:17:48.909163+0000 0 36 queued for deep scrub 8415 0
6.b3 8588 0 0 8588 0 6560677888 0 0 7315 7315 active+remapped+backfill_wait 2023-08-22T14:48:41.968101+0000 19194'35707 19194:159836 [4,3,17,34] 4 [0,3,17,34] 0 18422'35433 2023-08-21T00:43:44.822688+0000 18422'35433 2023-08-21T00:43:44.822688+0000 0 151 queued for scrub 8314 0
6.c2 8542 0 0 8542 0 6335627264 0 0 7566 7566 active+remapped+backfill_wait 2023-08-22T14:52:39.323031+0000 19194'35591 19194:187451 [16,3,2,13] 16 [0,3,2,13] 0 18422'35321 2023-08-21T00:15:30.088884+0000 18406'34902 2023-08-16T22:56:36.951863+0000 0 35 queued for scrub 8272 0
6.c8 8385 0 0 8385 0 6342639616 0 0 7740 7740 active+remapped+backfilling 2023-08-22T13:53:02.847391+0000 19194'36287 19194:155784 [8,9,3,22] 8 [0,9,3,22] 0 18422'36007 2023-08-20T18:07:00.914425+0000 18337'34831 2023-08-14T15:39:09.546106+0000 0 35 queued for deep scrub 8105 0
6.ea 8700 0 0 8700 0 6701318144 0 0 8400 8400 active+remapped+backfilling 2023-08-22T12:37:57.250097+0000 19194'36507 19194:215173 [12,2,29,27] 12 [0,2,29,27] 0 18422'36214 2023-08-20T08:30:19.565009+0000 18422'36214 2023-08-19T02:49:58.782983+0000 0 36 queued for scrub 8407 0
6.f5 8692 0 0 8692 0 6698369024 0 0 8315 8315 active+remapped+backfilling 2023-08-22T12:11:40.501888+0000 19194'36206 19194:305781 [12,30,31,17] 12 [0,30,31,17] 0 18422'35937 2023-08-20T17:30:28.074025+0000 18422'35937 2023-08-20T17:30:28.074025+0000 0 170 queued for scrub 8423 0
6.f7 8521 0 0 8521 0 6457851904 0 0 8230 8230 active+remapped+backfilling 2023-08-22T12:11:40.120079+0000 19194'36752 19194:206013 [4,23,2,5] 4 [0,23,2,5] 0 18422'36454 2023-08-19T23:33:43.864135+0000 17969'35091 2023-08-13T21:55:02.307873+0000 0 35 queued for deep scrub 8223 0
dumped all
6.173 8556 0 0 8556 0 6451757056 0 0 8113 8113 active+remapped+backfilling 2023-08-22T12:11:40.476004+0000 19194'36649 19194:163973 [24,31,5,22] 24 [0,31,5,22] 0 18422'36378 2023-08-20T06:28:28.826979+0000 18422'36378 2023-08-19T00:02:15.118486+0000 0 35 queued for scrub 8285 0
6.177 8588 0 0 8588 0 6445072384 0 0 8132 8132 active+remapped+backfilling 2023-08-22T12:11:40.491822+0000 19194'36654 19194:177980 [8,22,29,3] 8 [0,22,29,3] 0 18422'36388 2023-08-20T05:53:14.247617+0000 18422'36388 2023-08-20T05:53:14.247617+0000 0 155 queued for scrub 8322 0
4. The PGs would remain in the same state, and there would be no improvement in the drain and OSD removal. The OSD service is running on the target host during the entire procedure.

Version-Release number of selected component (if applicable):
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)

How reproducible:
2/2 OSDs on 1 cluster

Steps to Reproduce:
1. Deploy a 4 node RHCS cluster, deploy an EC 2+2 pool, and fill it with data.
2. Remove a running OSD from any host, cmd: ceph orch osd rm <id> --force
3. Observe that the drain does not complete and is stuck for a few PGs on the OSD.

Actual results:
PGs stuck draining and removal does not go through.

Expected results:
PGs drained and OSD removed.

Additional info:
Observed that if I stop the OSD service, the drain completes within a few hours, and the OSD is removed from the cluster.
Files
Updated by Sridhar Seshasayee 8 months ago
- File stuck_pgs_pg_query_6.15a_acting_primary_osd34.txt stuck_pgs_pg_query_6.15a_acting_primary_osd34.txt added
- File bz2233777_osd.34_data.tgz bz2233777_osd.34_data.tgz added
- Priority changed from Normal to High
Analysis of the issue was performed by taking a single PG (6.15a) on osd.34 for which backfill didn't start.
I have attached a couple of log files that show:
1. PGs stuck in backfilling whose acting primary is osd.34.
2. Ceph command dumps from osd.34. (Copied from the BZ)
ANALYSIS
1. On pg 6.15as0, the first recovery item was enqueued at:
2023-09-01T01:55:35.036+0000 7fa3d9a44640 20 osd.34 op_wq(1) _enqueue OpSchedulerItem(6.15as0 PGRecovery(pgid=6.15as0 epoch_queued=21584 reserved_pushes=1) class_id 1 prio 5 cost 573165 e21584 reserved_pushes 1)
At this point, the op_wq(1) state is shown below:
2023-09-01T01:55:35.037+0000 7fa3d9a44640 30 dequeue status: ... "OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":28,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue:: { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } } ...
The immediate queue size was 28. These are mainly peering events and SubOpReads.
Note that the scheduler queue size is 1, indicating that the PGRecovery item was enqueued.
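To make the queue interaction easier to follow, below is a minimal sketch (an assumed simplification, not verbatim Ceph source) of the two-queue behavior the "dequeue status" dumps imply: peering events and sub-op reads land in the immediate list and are always drained first, while the PGRecovery item sits in the mClock-managed scheduler queue and is only considered once the immediate list is empty.

#include <deque>
#include <optional>
#include <utility>

struct OpSchedulerItemSketch {          // hypothetical stand-in for OpSchedulerItem
  bool is_immediate_class;              // peering events / sub-ops -> true
};

class MClockSchedulerSketch {
  std::deque<OpSchedulerItemSketch> immediate;   // the "immediate" count in the dump
  std::deque<OpSchedulerItemSketch> scheduler;   // stand-in for the dmClock-based queue

public:
  void enqueue(OpSchedulerItemSketch item) {
    if (item.is_immediate_class) {
      immediate.push_back(std::move(item));      // never competes under mClock
    } else {
      scheduler.push_back(std::move(item));      // PGRecovery lands here
    }
  }

  std::optional<OpSchedulerItemSketch> dequeue() {
    // Immediate items are always drained first; as long as peering events and
    // sub-op reads keep arriving, the lone PGRecovery item is never pulled.
    if (!immediate.empty()) {
      auto item = immediate.front();
      immediate.pop_front();
      return item;
    }
    if (!scheduler.empty()) {
      auto item = scheduler.front();             // real code: dmClock tag-based pull
      scheduler.pop_front();
      return item;
    }
    return std::nullopt;
  }
};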
2. Later, apart from peering events, a SubOpRead was enqueued on another PG (6.3ds2) and on op_wq(1):
2023-09-01T01:55:35.041+0000 7fa3f52ae640 15 osd.34 21584 enqueue_op 0x55c794546840 prio 5 type 110 cost 0 latency 0.000026 epoch 21584 MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3 2023-09-01T01:55:35.041+0000 7fa3f52ae640 20 osd.34 op_wq(1) _enqueue OpSchedulerItem(6.3ds2 PGOpItem(op=MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3) class_id 2 prio 5 cost 0 e21584)
3. Before the above SubOpRead was dequeued, a bunch of miscellaneous peering events were enqueued and dequeued.
The above SubOpRead item was eventually dequeued after around 50 ms:
2023-09-01T01:55:35.095+0000 7fa3d9a44640 20 osd.34 op_wq(1) _process 6.3ds2 to_process <OpSchedulerItem(6.3ds2 PGOpItem(op=MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3) class_id 2 prio 5 cost 0 e21584)> waiting <> waiting_peering {} 2023-09-01T01:55:35.095+0000 7fa3d9a44640 20 osd.34 op_wq(1) _process OpSchedulerItem(6.3ds2 PGOpItem(op=MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3) class_id 2 prio 5 cost 0 e21584) pg 0x55c7a03dc000 ... 2023-09-01T01:55:35.096+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c794546840 prio 5 cost 0 latency 0.055139 MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.3ds2( v 21580'39113 (18397'34762,21580'39113] local-lis/les=21582/21583 n=11743 ec=916/247 lis/c=21582/20930 les/c/f=21583/20931/0 sis=21582) [28,6,7,25]/[28,6,34,25]p28(0) r=2 lpr=21582 pi=[20930,21582)/1 luod=0'0 crt=21580'39113 mlcod 0'0 active+remapped mbc={}]
At this stage, the op_wq(1) looked like the following:
2023-09-01T01:55:35.096+0000 7fa3d9243640 30 dequeue status: ... "OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":3,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue:: { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } } ...
4. Note that the mClock scheduler size is still 1 with the PGRecovery item yet to be picked up.
(Time in mClock Queue: ~59 ms)
The above SubOpRead completed at:
2023-09-01T01:55:35.211+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c794546840 finish
Therefore, the SubOpRead op took approximately 170 ms to complete from the time it was enqueued.
Nothing else was processed on op_wq(1) during this period, while a few more items were added to the immediate queue.
This is indicated by the queue state at this point:
2023-09-01T01:55:35.211+0000 7fa3d9a44640 30 dequeue status: ... "OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":6,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue:: { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } } ...
The above shows that the PGRecovery item is still in the scheduler queue.
(Time spent in the scheduler queue: 175 ms)
5. Another SubOpRead is dequeued:
2023-09-01T01:55:35.215+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7f0413600 prio 5 cost 0 latency 0.101390 MOSDECSubOpRead(6.182s2 21584/21582 ECSubRead(tid=124582, to_read={6:4180003a:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-273272:head=0,4194304,0}, subchunks={6:4180003a:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-273272:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.182s2( v 21580'37910 (18416'34408,21580'37910] local-lis/les=21582/21583 n=11783 ec=16227/247 lis/c=21582/21560 les/c/f=21583/21561/0 sis=21582) [23,20,2,5]/[23,20,34,5]p23(0) r=2 lpr=21582 pi=[21560,21582)/1 luod=0'0 crt=21580'37910 mlcod 0'0 active+remapped mbc={}] | | Time to complete -> 70 ms | 2023-09-01T01:55:35.285+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7f0413600 finish
Another example:
2023-09-01T01:55:35.317+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7d21af080 prio 5 cost 0 latency 0.160248 MOSDECSubOpRead(6.191s3 21584/21582 ECSubRead(tid=77137, to_read={6:89800139:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-816949:head=0,4194304,0}, subchunks={6:89800139:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-816949:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.191s3( v 21580'38412 (18409'34614,21580'38412] local-lis/les=21582/21583 n=11915 ec=16301/247 lis/c=21582/20929 les/c/f=21583/20930/0 sis=21582) [31,16,13,18]/[31,16,13,34]p31(0) r=3 lpr=21582 pi=[20929,21582)/1 luod=0'0 crt=21580'38412 mlcod 0'0 active+remapped mbc={}] | | Time to complete -> 210 ms | 2023-09-01T01:55:35.527+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7d21af080 finish
6. During the above times, additional peering and SubOpRead items get enqueued on op_wq(1) and the state is:
2023-09-01T01:55:35.722+0000 7fa3da245640 30 dequeue status: ... "OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":8,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue:: { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } } ...
7. The above shows that SubOpReads are taking a significant amount of time to complete (200 ms - 500 ms).
At this point another SubOpRead gets dequeued:
2023-09-01T01:55:35.692+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c8277f7340 prio 5 cost 0 latency 0.455817 MOSDECSubOpRead(6.1cds2 21584/21582 ECSubRead(tid=241069, to_read={6:b38002c2:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-1669911:head=0,4194304,0}, subchunks={6:b38002c2:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-1669911:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.1cds2( v 21580'39273 (18396'34976,21580'39273] local-lis/les=21582/21583 n=11710 ec=16613/247 lis/c=21582/21558 les/c/f=21583/21559/0 sis=21582) [27,5,22,4]/[27,5,34,4]p27(0) r=2 lpr=21582 pi=[21558,21582)/1 luod=0'0 crt=21580'39273 mlcod 0'0 active+remapped mbc={}]
8. At the same time, the osd_map is advanced as a result of which a bunch of peering events (NullEvt) are generated:
2023-09-01T01:55:35.827+0000 7fa3e725f640 20 osd.34 21584 OSD::ms_dispatch: osd_map(21584..21585 src has 21043..21585) v4 2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 do_waiters -- start 2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 do_waiters -- finish 2023-09-01T01:55:35.827+0000 7fa3e725f640 20 osd.34 21584 _dispatch 0x55c7e2c288c0 osd_map(21584..21585 src has 21043..21585) v4 2023-09-01T01:55:35.827+0000 7fa3e725f640 3 osd.34 21584 handle_osd_map epochs [21584,21585], i have 21584, src has [21043,21585] 2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 handle_osd_map got inc map for epoch 21585 2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 add_map_bl 21584 27223 bytes 2023-09-01T01:55:35.827+0000 7fa3e725f640 20 osd.34 21584 got_full_map 21585, nothing requested 2023-09-01T01:55:35.828+0000 7fa3e725f640 30 osd.34 21584 get_map 21584 -cached 2023-09-01T01:55:35.828+0000 7fa3e725f640 20 osd.34 21584 handle_osd_map pg_num_history pg_num_history(e21585 pg_nums {1={55=1},2={109=32},3={109=32},4={111=32},5={111=32},6={16915=512},7={416=32}} deleted_pools ) ... 2023-09-01T01:55:35.829+0000 7fa3e3257640 10 osd.34 21584 advance to epoch 21585 (<= last 21585 <= newest_map 21585) 2023-09-01T01:55:35.829+0000 7fa3e3257640 30 osd.34 21584 get_map 21585 -cached 2023-09-01T01:55:35.829+0000 7fa3e3257640 7 osd.34 21585 consume_map version 21585 ... 2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.18fs3 2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.199s2 2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.1ads0 2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.2cs0 2023-09-01T01:55:35.829+0000 7fa3e3257640 10 osd.34 21585 release_reserved_pushes(0), recovery_ops_reserved 3 -> 3 --> recovery_op_reserved is at the max level 2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 _recover_now active 0 + reserved 3 >= max 3 2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 enqueue_peering_evt 6.1c2s1 epoch_sent: 21585 epoch_requested: 21585 NullEvt 2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34 op_wq(0) _enqueue OpSchedulerItem(6.1c2s1 PGPeeringEvent(epoch_sent: 21585 epoch_requested: 21585 NullEvt) class_id 2 prio 255 cost 10 e21585) 2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 enqueue_peering_evt 6.1eas2 epoch_sent: 21585 epoch_requested: 21585 NullEvt 2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34 op_wq(0) _enqueue OpSchedulerItem(6.1eas2 PGPeeringEvent(epoch_sent: 21585 epoch_requested: 21585 NullEvt) class_id 2 prio 255 cost 10 e21585) 2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 enqueue_peering_evt 6.1d6s1 epoch_sent: 21585 epoch_requested: 21585 NullEvt
Due to the above, the immediate queue size increased significantly (to 19) while recovery_ops_reserved remained at 3:
2023-09-01T01:55:35.966+0000 7fa3d8a42640 30 dequeue status: ... "OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":19,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue:: { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } } ...
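The item-8 log also shows release_reserved_pushes(0) leaving recovery_ops_reserved at 3 and _recover_now reporting "active 0 + reserved 3 >= max 3". The sketch below paraphrases that admission check (names and the exact condition are approximations based on the log line, not the actual Ceph source); it illustrates why no additional recovery work can be admitted until the queued PGRecovery item finally runs and its reserved pushes are released.

#include <cstdint>
#include <iostream>

struct RecoveryGateSketch {
  uint64_t recovery_ops_active   = 0;  // recoveries currently running
  uint64_t recovery_ops_reserved = 3;  // pushes reserved by queued PGRecovery items
  uint64_t max_active            = 3;  // the configured maximum (approximation)

  bool recover_now() const {
    // No new recovery work is admitted while the reserved pushes are not
    // released -- and they are only released when the queued PGRecovery item
    // is dequeued and run, which is exactly what never happens here.
    return recovery_ops_active + recovery_ops_reserved < max_active;
  }
};

int main() {
  RecoveryGateSketch gate;
  std::cout << std::boolalpha << gate.recover_now() << '\n';  // prints "false"
}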
9. The SubOpRead that was dequeued before the AdvMap completed around this time (time to finish: ~340 ms):
2023-09-01T01:55:36.035+0000 7fa3d9a44640 10 osd.34 21585 dequeue_op 0x55c8277f7340 finish
Note that the PGRecovery item is still in the scheduler queue. The above cycle repeats and the
recovery item is never dequeued, resulting in backfill operations getting stuck indefinitely.
Even towards the end of the logs on osd.34, the PGRecovery item can still be seen in the scheduler queue
as shown below:
2023-09-01T01:56:11.125+0000 7fa3d8a42640 30 dequeue status: ... "OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":10,"scheduler":2},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue:: { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:2 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } } ...
This is ~36 secs since the PGRecovery item was enqueued!
The above would result in timeouts on the client and slow ops reported in the cluster logs, since IOs
cannot be serviced while the primary is not backfilled.
ISSUE SUMMARY:
This is a case where backfill is stuck when:
- The target being backfilled is the primary OSD, i.e. the new up OSD that will replace the acting primary.
- There are slow SubOpReads issued by other PGs, during which the OSD shard waits.
- The PGRecovery item is continuously preempted by peering events generated as part of AdvMap, in addition to the SubOpReads issued by other PGs. This is a normal sequence, but it resulted in starvation of the items in the mClock queue.
POSSIBLE SOLUTION:
In this situation, where the primary itself is being removed and has a backfill target, the PGRecovery item should probably be accorded high priority so that the mClock scheduler can schedule it as early as possible.
- Pull in the changes made for tracker https://tracker.ceph.com/issues/62293. This is because the op queue dump in some cases shows more than the expected number of mClock clients (3). The fix ensures that requests of a given service type (client, background recovery or best effort) with different client_ids are bucketed into the relevant scheduler class by ignoring the client_ids. This guarantees that a scheduler class doesn't use more than the allocated OSD IOPS capacity limit.
- Backport the high priority queue implementation already available in upstream reef.
- Come up with a solution that considers the case where the primary itself has a backfill target and assigns a high priority to the PGRecovery item. This will ensure that the recovery item gets picked up as soon as possible, thus allowing backfill to progress according to the mClock profile settings (an illustrative sketch follows at the end of this note).
The above requires further discussion.
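As an illustration of the third option only, the following is a hypothetical sketch (the types and the predicate are stand-ins, not actual Ceph interfaces): when the PG's up primary is not part of the acting set, i.e. the future primary still has to be backfilled, the corresponding PGRecovery item could be enqueued in a class that the scheduler services ahead of ordinary background recovery, so it cannot be starved by immediate items from other PGs.

#include <algorithm>
#include <vector>

enum class SchedClass { background_recovery, high_priority };

struct PGInfoSketch {
  int up_primary;                       // primary in the "up" set
  std::vector<int> acting;              // current acting set
};

// One possible way to express the condition described in this tracker: the up
// primary is not in the acting set, so the data for the future primary still
// has to be backfilled.
inline bool primary_is_backfill_target(const PGInfoSketch& pg) {
  return std::find(pg.acting.begin(), pg.acting.end(), pg.up_primary) ==
         pg.acting.end();
}

// The PGRecovery item for such a PG would be classified so that the scheduler
// services it ahead of (or independently of) ordinary background recovery.
inline SchedClass classify_pg_recovery(const PGInfoSketch& pg) {
  return primary_is_backfill_target(pg) ? SchedClass::high_priority
                                        : SchedClass::background_recovery;
}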