Bug #58498
openceph: pgs stuck backfilling
Status: New
Priority: Normal
Assignee: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
Observed on the LRC cluster after resolving https://tracker.ceph.com/issues/58460
  cluster:
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_WARN
            Degraded data redundancy: 145548/540755983 objects degraded (0.027%), 2 pgs degraded, 2 pgs undersized
            1470 pgs not deep-scrubbed in time
            1470 pgs not scrubbed in time

  services:
    mon:         5 daemons, quorum reesi003,reesi002,reesi001,ivan02,ivan01 (age 3d)
    mgr:         reesi006.erytot(active, since 3w), standbys: reesi005.xxyjcw, reesi004.tplfrt
    mds:         4/4 daemons up, 5 standby, 1 hot standby
    osd:         166 osds: 166 up (since 5h), 166 in (since 13d); 16 remapped pgs
    rgw:         2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 4 portals active (4 hosts)

  data:
    volumes: 4/4 healthy
    pools:   24 pools, 2965 pgs
    objects: 112.97M objects, 127 TiB
    usage:   224 TiB used, 836 TiB / 1.0 PiB avail
    pgs:     145548/540755983 objects degraded (0.027%)
             665241/540755983 objects misplaced (0.123%)
             2942 active+clean
             14   active+remapped+backfilling
             7    active+clean+scrubbing+deep
             2    active+undersized+degraded+remapped+backfilling

  io:
    client:   177 KiB/s rd, 4.7 MiB/s wr, 6 op/s rd, 28 op/s wr
    recovery: 0 B/s, 6 objects/s

  progress:
    Global Recovery Event (2w)
      [===========================.] (remaining: 2h)
For most of the pgs, recovery is not proceeding at all.
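A quick way to confirm that backfill is actually stalled, rather than just slow, is to watch the recovery counters and the set of backfilling pgs over a few minutes. The commands below are only a sketch, assuming they are run from a node with an admin keyring:

# The io section of ceph -s sat at "recovery: 0 B/s, 6 objects/s" for most of this window.
watch -n 60 'ceph -s'

# List only the pgs that are still backfilling.
sudo ceph pg dump pgs_brief 2>/dev/null | grep backfilling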
sjust@reesi002:~/2023-01-18-lrc-investigation/reesi002$ sudo ceph pg dump | grep backfilling | grep -v 'dumped all'
0.24d 2540 0 0 2540 0 9184352048 0 0 2664 2664 active+remapped+backfilling 2023-01-02T16:08:14.354424+0000 10675336'5920294 10682782:18541558 [44,98,58] 44 [41,98,58] 41 10675336'5920294 2023-01-02T11:11:45.415900+0000 10675336'5920294 2022-12-31T04:42:51.776041+0000 0 1 queued for deep scrub 2540 0
119.25d 65450 0 0 60524 0 118040914986 0 0 7635 7635 active+remapped+backfilling 2023-01-02T16:08:14.161260+0000 10682783'3606255 10682783:10902266 [27,161,89,120,72,125] 27 [41,161,89,120,72,125] 41 10676278'3594937 2022-12-31T01:03:38.315674+0000 10675860'3589907 2022-12-27T05:27:21.909004+0000 0 13 queued for deep scrub 59930 0
0.1a7 2547 0 0 2277 0 9240477039 0 0 2708 2708 active+remapped+backfilling 2023-01-02T16:08:17.214871+0000 10675299'6355360 10682782:17752489 [42,134,131] 42 [41,134,131] 41 10675299'6355360 2023-01-02T05:19:35.110378+0000 10675299'6355360 2023-01-02T05:19:35.110378+0000 0 254 queued for deep scrub 2547 0
0.19d 2698 0 0 2418 0 9774354741 0 0 2658 2658 active+remapped+backfilling 2023-01-02T16:08:14.289081+0000 10675706'6186906 10682782:17607234 [42,152,110] 42 [41,152,110] 41 10675706'6186906 2023-01-01T17:59:01.366870+0000 10675706'6186906 2022-12-31T14:38:58.383181+0000 0 1 queued for deep scrub 2698 0
119.d3 65969 0 0 193815 0 117703625513 0 0 7607 7607 active+remapped+backfilling 2023-01-02T16:08:23.978465+0000 10682783'3611895 10682783:11308049 [44,118,94,111,89,14] 44 [41,49,94,111,89,118] 41 10676765'3601252 2023-01-01T21:58:55.209069+0000 10676765'3601252 2023-01-01T21:58:55.209069+0000 0 1345 queued for deep scrub 60475 0
114.d6 152158 0 136190 0 0 21296461 0 0 3197 3197 active+undersized+degraded+remapped+backfilling 2023-01-18T23:08:22.508697+0000 10682783'8022177 10682783:16372025 [59,118,128] 59 [128,118] 128 10676768'8007280 2023-01-01T22:12:37.218704+0000 10676100'8002894 2022-12-28T20:25:52.849234+0000 0 21 queued for deep scrub 138310 0
119.1b 65967 0 0 65967 0 119233083117 0 0 7638 7638 active+remapped+backfilling 2023-01-02T16:08:18.617519+0000 10682782'3597673 10682782:10560296 [27,161,114,29,134,56] 27 [41,161,114,29,134,56] 41 10676772'3586960 2023-01-01T23:15:29.597688+0000 10676242'3586378 2022-12-30T01:35:48.857076+0000 0 11 queued for deep scrub 60571 0
119.1c 65933 0 0 65626 0 119149682806 0 0 7545 7545 active+remapped+backfilling 2023-01-02T16:08:14.309859+0000 10682782'3598669 10682782:11055329 [24,59,69,16,127,68] 24 [41,59,69,16,127,68] 41 10676872'3587860 2023-01-02T15:23:00.227516+0000 10676872'3587860 2023-01-02T15:23:00.227516+0000 0 1960 queued for deep scrub 60925 0
114.70 152214 0 9019 0 0 50474801 0 0 2552 2552 active+undersized+degraded+remapped+backfilling 2023-01-18T23:08:22.515880+0000 10682783'7970909 10682783:17076347 [49,161,119] 49 [119,161] 119 10676589'7955714 2023-01-01T10:06:25.077020+0000 10676452'7955711 2022-12-31T09:54:07.920177+0000 0 21 queued for deep scrub 137979 0
119.a2 65720 0 0 65720 0 118804757552 0 0 7590 7590 active+remapped+backfilling 2023-01-02T16:08:24.588335+0000 10682782'3613140 10682782:9884637 [44,117,63,18,30,77] 44 [41,117,63,18,30,77] 41 10676728'3602120 2023-01-01T16:44:11.572865+0000 10676728'3602120 2023-01-01T16:44:11.572865+0000 0 1969 queued for deep scrub 60092 0
0.ee 2700 0 0 4900 0 9716980648 0 0 2711 2711 active+remapped+backfilling 2023-01-02T16:08:26.105168+0000 10675306'5989142 10682782:17648413 [90,71,160] 90 [41,71,51] 41 10675306'5989142 2023-01-02T03:57:15.992196+0000 10675306'5989142 2023-01-02T03:57:15.992196+0000 0 203 queued for deep scrub 2700 0
119.13e 65851 0 0 65573 0 118382459569 0 0 7567 7567 active+remapped+backfilling 2023-01-02T16:08:14.310947+0000 10682783'3599482 10682783:11087707 [44,61,123,12,48,28] 44 [41,61,123,12,48,28] 41 10676278'3588377 2022-12-31T04:50:40.590459+0000 10676237'3588314 2022-12-29T21:44:43.925143+0000 0 11 queued for deep scrub 60305 0
0.310 2534 0 0 2534 0 9257925379 0 0 2680 2680 active+remapped+backfilling 2023-01-02T16:08:14.354476+0000 10675347'6226589 10682782:17549076 [24,76,155] 24 [41,76,155] 41 10675347'6226589 2023-01-02T02:36:57.124264+0000 10675347'6226589 2022-12-29T13:17:02.127406+0000 0 1 queued for deep scrub 2534 0
119.31a 65464 0 0 65464 0 117938540621 0 0 7589 7589 active+remapped+backfilling 2023-01-02T16:08:14.354365+0000 10682783'3605046 10682783:11453191 [43,160,5,89,16,72] 43 [41,160,5,89,16,72] 41 10676772'3594695 2023-01-01T23:23:55.938149+0000 10676278'3594272 2022-12-31T21:12:35.209263+0000 0 11 queued for deep scrub 60198 0
dumped all
119.304 65530 0 0 65530 0 119050260451 0 0 7634 7634 active+remapped+backfilling 2023-01-02T16:08:20.108832+0000 10682783'3618838 10682783:10395163 [26,21,68,3,7,84] 26 [41,21,68,3,7,84] 41 10676768'3607766 2023-01-01T22:12:15.129272+0000 10675906'3603297 2022-12-28T06:31:23.333029+0000 0 11 queued for deep scrub 60027 0
0.3c8 2614 0 0 2354 0 9288911888 0 0 2667 2667 active+remapped+backfilling 2023-01-02T16:08:14.400845+0000 10675405'6245090 10682782:18518475 [42,61,90] 42 [41,61,90] 41 10675405'6245090 2023-01-01T03:27:31.597761+0000 10675405'6245090 2022-12-31T01:35:47.818437+0000 0 1 queued for deep scrub
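To pick out just the up/acting primaries for each backfilling pg without reading the full dump, the JSON form of pg dump can be filtered. A minimal sketch, assuming jq is available and that the JSON layout matches recent Ceph releases (pg_map.pg_stats entries with pgid, state, up_primary and acting_primary fields):

sudo ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[]
           | select(.state | contains("backfilling"))
           | "\(.pgid) up_primary=\(.up_primary) acting_primary=\(.acting_primary)"'

Applied to the output above, this collapses to one line per pg and makes the pattern described below easy to spot: the acting primary is 41 for every backfilling pg except the two undersized ones (114.d6 and 114.70).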
Note that all but 2 of these have osd.41 as the acting primary. Earlier in the day there were another 16 with osd.52 as the primary; marking osd.52 down resolved those stuck pgs. ceph pg <pgid> query hung for the pgs that had osd.52 as the primary at the time, whereas queries against the pgs with osd.41 as the primary do not hang. Attached is a file with the pg query results for those pgs.
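For reference, the per-pg check and the earlier workaround described above correspond to commands along these lines (the specific pgid and the exact down invocation are illustrative assumptions, not copied from the session):

# Query one of the stuck pgs. This returned normally for the pgs with osd.41
# as the acting primary, but hung earlier for the pgs whose primary was osd.52.
ceph pg 0.24d query

# Marking the suspect primary down forces the affected pgs to re-peer; this is
# what cleared the earlier batch that had osd.52 as the primary.
ceph osd down 52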