Bug #59513
Scrubbing PGs from device_health_metrics takes suspiciously long
Status: Open
Description
I have in "ceph df":
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1  512  257 MiB       36  772 MiB      0     17 TiB
This pool stores extremely little data. I'd expect that scrubbing it is instantaneous.
Yet, in "ceph pg ls" I can see that:
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE                        SINCE  VERSION    REPORTED      UP             ACTING         SCRUB_STAMP                      DEEP_SCRUB_STAMP                 ...
1.18   0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:409663  [28,16,6]p28   [28,16,6]p28   2023-04-14T21:38:21.593846+0000  2023-04-14T21:38:21.593846+0000
1.19   0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:483155  [29,9,16]p29   [29,9,16]p29   2023-04-14T21:42:16.403577+0000  2023-04-14T21:42:16.403577+0000
1.1a   0        0         0          0        0      0            0           0    active+clean+scrubbing+deep  78m    0'0        16802:394661  [26,10,16]p26  [26,10,16]p26  2023-04-07T20:11:51.880176+0000  2023-04-07T20:11:51.880176+0000
1.1b   0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:645676  [16,5,26]p16   [16,5,26]p16   2023-04-15T04:01:17.971270+0000  2023-04-15T04:01:17.971270+0000
1.1c   0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:376623  [18,31,10]p18  [18,31,10]p18  2023-04-20T18:47:32.289124+0000  2023-04-20T18:47:32.289124+0000
...
1.1b9  0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:462744  [18,5,33]p18   [18,5,33]p18   2023-04-12T13:31:38.817328+0000  2023-04-12T13:31:38.817328+0000
1.1ba  1        0         0          0        0      8067648      180         539  active+clean                 2d     16770'539  16802:378419  [14,6,30]p14   [14,6,30]p14   2023-04-19T07:27:41.530105+0000  2023-04-19T07:27:41.530105+0000
1.1bb  0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:351857  [32,8,17]p32   [32,8,17]p32   2023-04-13T21:08:25.449731+0000  2023-04-13T21:08:25.449731+0000
1.1bc  0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:468128  [31,3,23]p31   [31,3,23]p31   2023-04-15T03:55:38.432160+0000  2023-04-15T03:55:38.432160+0000
1.1bd  0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:254273  [23,31,10]p23  [23,31,10]p23  2023-04-14T15:04:21.328568+0000  2023-04-14T15:04:21.328568+0000
1.1be  0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:442864  [26,16,5]p26   [26,16,5]p26   2023-04-11T05:57:27.832297+0000  2023-04-11T05:57:27.832297+0000
1.1bf  0        0         0          0        0      0            0           0    active+clean+scrubbing+deep  78m    0'0        16802:5378    [3,17,27]p3    [3,17,27]p3    2023-04-11T00:22:41.764823+0000  2023-04-11T00:22:41.764823+0000
1.1c0  0        0         0          0        0      0            0           0    active+clean                 2d     0'0        16802:424055  [28,23,3]p28   [28,23,3]p28   2023-04-19T01:42:34.012205+0000  2023-04-19T01:42:34.012205+0000
It seems suspicious to me that PG 1.1bf has been scrubbing for "78m" even though it contains "0 BYTES".
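A quick way to spot any PGs stuck like this is to filter the "ceph pg ls" output for the scrubbing state. A minimal sketch; the field positions ($1 = PG, $10 = STATE, $11 = SINCE) are taken from the listing above and may shift between Ceph releases:

```shell
# Print PG id, state, and how long the PG has been in that state,
# for every PG whose state includes "scrubbing".
ceph pg ls | awk '$10 ~ /scrubbing/ {print $1, $10, $11}'
```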
Updated by Niklas Hambuechen about 1 year ago
I suspect this might be https://tracker.ceph.com/issues/54172#note-14:
"In short, this happens whenever a deep scrub is started while the noscrub flag is set. The stuck scrub can be cleared by restarting the primary OSD associated with the PG."
I had set `noscrub`, then restarted all OSDs, and then removed `noscrub`. That brought me into the situation described above.
Doing another full restart of all OSDs seems to have fixed the issue, so the workaround provided there very likely worked.
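Per the note in #54172, restarting only the primary OSD of the stuck PG should suffice instead of a full restart. A sketch, assuming a non-containerized systemd deployment; the primary is the "pN" suffix of the ACTING column in "ceph pg ls" (e.g. "[3,17,27]p3" means osd.3 is primary):

```shell
pg=1.1bf  # the stuck PG
# The PG's row contains "[up]pN [acting]pN"; the second bracketed
# group is ACTING, and its trailing "pN" names the primary OSD.
primary=$(ceph pg ls | awk -v pg="$pg" '$1 == pg' \
  | grep -o '\[[0-9,]*\]p[0-9]*' | sed -n '2s/.*p//p')
systemctl restart "ceph-osd@${primary}"
```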