Bug #54172
ceph version 16.2.7 PG scrubs not progressing
Description
A week ago I upgraded a 16.2.4 cluster (3 nodes, 33 OSDs) to 16.2.7 using cephadm, and since then we're experiencing hanging PG scrubs. Because other scrubs are delayed, the cluster is now in HEALTH_WARN with PG_NOT_DEEP_SCRUBBED. The information below was gathered on 7 Feb 02:32am UTC.
The 4 PGs that are stuck scrubbing were started at 2022-02-06T05:20 UTC.
Things I've tried so far:
- Restarting the affected OSDs (and tried restarting all OSDs).
- Set/unset the nodeep-scrub and noscrub flags.
This is the information I have, if there's anything missing feel free to let me know:
```
$ sudo ceph pg dump pgs_brief | grep deep
dumped pgs_brief
4.6b    active+clean+scrubbing+deep  [12,24,23]  12  [12,24,23]  12
4.4e    active+clean+scrubbing+deep  [21,14,22]  21  [21,14,22]  21
4.3d    active+clean+scrubbing+deep  [4,25,26]    4  [4,25,26]    4
4.12b   active+clean+scrubbing+deep  [29,19,13]  29  [29,19,13]  29

$ sudo ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 33
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 42
    }
}
```
Scrubber information from the PGs:
```
$ sudo ceph pg 4.12b query | jq .scrubber
{
  "epoch_start": "11332",
  "active": true,
  "start": "4:d4800000::::0",
  "end": "4:d4800000::::0",
  "m_max_end": "MIN",
  "subset_last_update": "0'0",
  "deep": true,
  "must_scrub": false,
  "must_deep_scrub": false,
  "must_repair": false,
  "need_auto": false,
  "req_scrub": false,
  "time_for_deep": true,
  "auto_repair": false,
  "check_repair": false,
  "deep_scrub_on_error": false,
  "scrub_reg_stamp": "2022-01-30T07:37:41.422075+0000",
  "priority": 5,
  "shallow_errors": 0,
  "deep_errors": 0,
  "fixed": 0,
  "waiting_on_whom": []
}
```
```
$ sudo ceph pg 4.4e query | jq .scrubber
{
  "epoch_start": "11321",
  "active": true,
  "start": "4:72000000::::0",
  "end": "4:72000000::::0",
  "m_max_end": "MIN",
  "subset_last_update": "0'0",
  "deep": true,
  "must_scrub": false,
  "must_deep_scrub": false,
  "must_repair": false,
  "need_auto": false,
  "req_scrub": false,
  "time_for_deep": true,
  "auto_repair": false,
  "check_repair": false,
  "deep_scrub_on_error": false,
  "scrub_reg_stamp": "2022-01-29T12:47:32.977259+0000",
  "priority": 5,
  "shallow_errors": 0,
  "deep_errors": 0,
  "fixed": 0,
  "waiting_on_whom": []
}
```
```
$ sudo ceph pg 4.3d query | jq .scrubber
{
  "epoch_start": "11330",
  "active": true,
  "start": "4:bc000000::::0",
  "end": "4:bc000000::::0",
  "m_max_end": "MIN",
  "subset_last_update": "0'0",
  "deep": true,
  "must_scrub": false,
  "must_deep_scrub": false,
  "must_repair": false,
  "need_auto": false,
  "req_scrub": false,
  "time_for_deep": true,
  "auto_repair": false,
  "check_repair": false,
  "deep_scrub_on_error": false,
  "scrub_reg_stamp": "2022-01-30T07:34:25.060565+0000",
  "priority": 5,
  "shallow_errors": 0,
  "deep_errors": 0,
  "fixed": 0,
  "waiting_on_whom": []
}
```
An active+clean PG looks like this:
```
$ sudo ceph pg 4.bd query | jq .scrubber
{
  "epoch_start": "0",
  "active": false
}
```
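For anyone triaging a similar cluster, the stuck pattern above can be recognized programmatically from `ceph pg <pgid> query` output. A minimal sketch (the `scrub_looks_stuck` helper is hypothetical, not part of Ceph): the scrubber is `active`, its chunk boundaries `start` and `end` are identical, and `waiting_on_whom` is empty, i.e. it is not waiting on any replica yet makes no progress.

```python
import json

def scrub_looks_stuck(scrubber: dict) -> bool:
    """Heuristic (not a Ceph API): an active scrubber whose chunk
    range is empty (start == end) and which is not waiting on any
    replica appears to be making no progress."""
    return (
        scrubber.get("active", False)
        and scrubber.get("start") == scrubber.get("end")
        and not scrubber.get("waiting_on_whom")
    )

# Abbreviated sample output from this ticket (pg 4.12b vs healthy 4.bd):
stuck = json.loads('{"epoch_start": "11332", "active": true, '
                   '"start": "4:d4800000::::0", "end": "4:d4800000::::0", '
                   '"waiting_on_whom": []}')
healthy = json.loads('{"epoch_start": "0", "active": false}')

print(scrub_looks_stuck(stuck))    # True
print(scrub_looks_stuck(healthy))  # False
```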
I tried searching the bug tracker for a similar problem but did not find any. If this is already a known problem, feel free to close this ticket.
History
#1 Updated by Daan van Gorkum about 2 years ago
Additional information:
I tried disabling all client I/O, and after that there is zero I/O on the devices hosting the OSDs, so nothing seems to be happening on the disks.
#2 Updated by Neha Ojha about 2 years ago
- Project changed from Ceph to RADOS
#3 Updated by Neha Ojha about 2 years ago
- Assignee set to Ronen Friedman
What is osd_scrub_sleep set to?
Ronen, this sounds similar to one of the issues you were looking into, though here it involves deep scrubs. Can we use your script to rule out a slow peer in this case as well?
#4 Updated by Daan van Gorkum about 2 years ago
This is set to 0.0. I think we did this to speed up recovery after we did some CRUSH tuning.
```
$ sudo ceph config get osd osd_scrub_sleep
0.000000
```
#5 Updated by Daan van Gorkum about 2 years ago
Correction. osd_recovery_sleep_hdd was set to 0.0 from the original 0.1. osd_scrub_sleep has been untouched.
#6 Updated by Daan van Gorkum about 2 years ago
One more thing to add: when I set the noscrub and nodeep-scrub flags, the PGs don't actually stop scrubbing either (the flags had been set for 30+ minutes at this point):
```
$ sudo ceph health detail | grep noscrub
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 43 pgs not deep-scrubbed in time; 15 pgs not scrubbed in time
[WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set

$ sudo ceph pg dump pgs_brief | grep deep
dumped pgs_brief
4.6b    active+clean+scrubbing+deep  [12,24,23]  12  [12,24,23]  12
4.4e    active+clean+scrubbing+deep  [21,14,22]  21  [21,14,22]  21
4.3d    active+clean+scrubbing+deep  [4,25,26]    4  [4,25,26]    4
4.12b   active+clean+scrubbing+deep  [29,19,13]  29  [29,19,13]  29
```
The number of PGs not (deep-)scrubbed in time is increasing rapidly:
```
$ sudo ceph health detail | grep PG_NOT
[WRN] PG_NOT_DEEP_SCRUBBED: 43 pgs not deep-scrubbed in time
[WRN] PG_NOT_SCRUBBED: 15 pgs not scrubbed in time
```
#7 Updated by Daan van Gorkum about 2 years ago
Some more information on the non-default config settings (excluding MGR):
```
$ sudo ceph config dump | grep -v mgr
WHO             MASK  LEVEL     OPTION                                 VALUE                                                                            RO
global                basic     container_image                        quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54  *
global                advanced  objecter_inflight_ops                  819200
global                basic     osd_memory_target                      8589934592
mon                   advanced  auth_allow_insecure_global_id_reclaim  false
mon                   advanced  public_network                         10.0.3.0/24                                                                      *
osd                   dev       bluestore_cache_size_hdd               4294967296
osd                   advanced  bluestore_default_buffered_write       true
osd                   dev       osd_memory_cache_min                   4294967296
mds                   basic     mds_cache_memory_limit                 17179869184
mds.production        basic     mds_join_fs                            production
```
#8 Updated by Ronen Friedman about 2 years ago
There was a bug (I'll have to check in which Ceph versions) where setting noscrub or nodeep-scrub at the "wrong" time could cause the scrub to hang forever.
Daan - Can you share the osd logs with me?
#9 Updated by Daan van Gorkum about 2 years ago
Added the logs for OSDs 12, 23, and 24, which are part of pg 4.6b. The logs don't go back to when the OSDs booted, and debug logging was not enabled. Let me know if you need logs at a certain log level.
#10 Updated by Daan van Gorkum about 2 years ago
I would just like to add that the scrubs suddenly started progressing and the cluster is HEALTH_OK again.
#11 Updated by André Cruz about 2 years ago
I'm also seeing the same issue on 16.2.7, but it's been going on for almost two weeks. Already set and unset noscrub/nodeep-scrub, and had to restart OSDs to stop some scrubs.
I don't see any disk IO related to these scrubs...
#12 Updated by Wout van Heeswijk almost 2 years ago
We've seen this at a customer's cluster as well. A simple repeer of the PG gets it unstuck. We've not investigated the root cause of it yet.
```
ceph pg repeer x.x
```
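To apply that workaround to every stuck PG at once, one could extract the PG IDs from `ceph pg dump pgs_brief` and emit one repeer command per PG. A hedged sketch (`repeer_commands` is a hypothetical helper; the parsing assumes the column layout shown earlier in this ticket):

```python
def repeer_commands(pgs_brief: str) -> list[str]:
    """Extract PG ids stuck in deep scrub from `ceph pg dump pgs_brief`
    output and build the corresponding `ceph pg repeer` commands."""
    cmds = []
    for line in pgs_brief.splitlines():
        fields = line.split()
        # Column 0 is the PG id, column 1 the PG state string.
        if len(fields) >= 2 and "scrubbing+deep" in fields[1]:
            cmds.append(f"ceph pg repeer {fields[0]}")
    return cmds

sample = """\
4.6b  active+clean+scrubbing+deep [12,24,23] 12 [12,24,23] 12
4.4e  active+clean+scrubbing+deep [21,14,22] 21 [21,14,22] 21"""

print(repeer_commands(sample))
# ['ceph pg repeer 4.6b', 'ceph pg repeer 4.4e']
```

The generated commands would still need to be reviewed and run by hand against the cluster.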
#13 Updated by Neha Ojha almost 2 years ago
- Priority changed from Normal to High
#14 Updated by Cory Snyder almost 2 years ago
We've experienced this issue as well, on both 16.2.6 and 16.2.7, and I've identified the cause. Here's the scenario:
- Start with a cluster that has both the nodeep-scrub and noscrub flags set.
- The nodeep-scrub flag is unset (noscrub remains set).
- Deep scrubs are started and progress to the ActiveScrubbing,PendingTimer state.
- The deep scrubs are aborted because [PgScrubber::should_abort](https://github.com/ceph/ceph/blob/v16.2.6/src/osd/pg_scrubber.cc#L128) returns true for deep scrubs if either nodeep-scrub or noscrub is set (and noscrub is still set). This usually happens while checking whether the PGScrubResched op is still relevant in [PgScrubber::send_scrub_resched](https://github.com/ceph/ceph/blob/v16.2.6/src/osd/pg_scrubber.cc#L211). The scrub state is cleared [up the call stack](https://github.com/ceph/ceph/blob/v16.2.6/src/osd/pg_scrubber.cc#L122).
- The PG reschedules the deep scrub on the next tick. It again progresses to the ActiveScrubbing,PendingTimer state and is aborted again for the same reason. This time, however, the scrub state is not cleared because the epoch has not changed and [this condition](https://github.com/ceph/ceph/blob/v16.2.6/src/osd/pg_scrubber.cc#L121) is false. The abort now only causes the scrub reschedule op to be [deemed irrelevant](https://github.com/ceph/ceph/blob/v16.2.6/src/osd/pg_scrubber.cc#L211), which prevents the InternalSchedScrub FSM event from being sent. At this point, the scrub machine is stuck in the ActiveScrubbing,PendingTimer state indefinitely (until the primary OSD is restarted).
Here are some OSD log snippets (with irrelevant details omitted) that support the explanation above:
```
Jun 16 16:41:52 ceph20 scrub_send_scrub_resched queued at: 108117
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) scrubber event -->> send_scrub_resched epoch: 108117
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) noscrub set, aborting
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) verify_against_abort aborting. incoming epoch: 108117 vs last-aborted: 108109
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) scrub_clear_state
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) clear_pgscrub_state
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) clear_scrub_reservations
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) reset_internal_state
Jun 16 16:41:52 ceph20 scrubberFSM pg(2.7ebs0) ActiveScrubbing::react(const FullReset&)
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) unreserve_replicas
Jun 16 16:41:52 ceph20 scrubberFSM pg(2.7ebs0) -- state -->> NotActive
Jun 16 16:41:52 ceph20 scrubber pg(2.7ebs0) scrubber event --<< send_scrub_resched
...
Jun 16 16:41:56 ceph20 scrub_send_scrub_resched queued at: 108117
Jun 16 16:41:56 ceph20 scrubber pg(2.7ebs0) scrubber event -->> send_scrub_resched epoch: 108117
Jun 16 16:41:56 ceph20 scrubber pg(2.7ebs0) noscrub set, aborting
Jun 16 16:41:56 ceph20 scrubber pg(2.7ebs0) verify_against_abort aborting. incoming epoch: 108117 vs last-aborted: 108117
Jun 16 16:41:56 ceph20 scrubber pg(2.7ebs0) scrubber event --<< send_scrub_resched
```
In short, this happens whenever a deep scrub is started while the noscrub flag is set. The stuck scrub can be cleared by restarting the primary OSD associated with the PG, or may be cleared randomly if the epoch advances and an update_applied_notification triggered by client I/O causes another abort.
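The mechanism can be illustrated with a toy model. This is not Ceph code; the class and method names are simplified stand-ins for the logic in pg_scrubber.cc. The key point is that an abort only clears scrub state when the interval epoch has advanced past the last-aborted epoch, so a second abort within the same epoch is a no-op and the FSM stays parked:

```python
class ToyScrubber:
    """Toy model (not Ceph code) of the abort logic described above,
    loosely based on the verify_against_abort behavior in the logs."""

    def __init__(self):
        self.state = "NotActive"
        self.last_aborted_epoch = 0

    def start_deep_scrub(self):
        self.state = "ActiveScrubbing,PendingTimer"

    def on_resched(self, epoch: int, noscrub: bool):
        if noscrub:  # should_abort(): noscrub also aborts deep scrubs
            if epoch > self.last_aborted_epoch:
                # First abort in this interval: scrub state is cleared.
                self.last_aborted_epoch = epoch
                self.state = "NotActive"
            # Otherwise the abort clears nothing; the resched op is
            # simply dropped, so no further FSM event is ever sent.
            return
        self.state = "ScrubbingChunks"  # scrub would proceed normally

s = ToyScrubber()
s.start_deep_scrub()
s.on_resched(epoch=108117, noscrub=True)   # epoch 108117 > 108109: clears
first = s.state                            # "NotActive"
s.start_deep_scrub()                       # rescheduled on the next tick
s.on_resched(epoch=108117, noscrub=True)   # same epoch: abort is a no-op
print(first, "->", s.state)                # stuck in ActiveScrubbing,PendingTimer
```

This matches the two log excerpts above: the 16:41:52 abort (epoch 108117 vs last-aborted 108109) clears state, while the 16:41:56 abort (108117 vs 108117) does not.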
I believe that this PR should resolve the issue: https://github.com/ceph/ceph/pull/46727
#15 Updated by Ronen Friedman almost 2 years ago
@Cory Snyder: thanks, your analysis is spot on. The suggested fix solves one issue. There is another problem related to noscrub, but I will separate the solutions and help push this one forward.
#16 Updated by Neha Ojha almost 2 years ago
- Backport set to quincy,pacific
- Pull request ID set to 46727
#17 Updated by Neha Ojha almost 2 years ago
- Status changed from New to Fix Under Review
#18 Updated by Cory Snyder over 1 year ago
- Status changed from Fix Under Review to Pending Backport
#19 Updated by Backport Bot over 1 year ago
- Copied to Backport #56408: quincy: ceph version 16.2.7 PG scrubs not progressing added
#20 Updated by Backport Bot over 1 year ago
- Copied to Backport #56409: pacific: ceph version 16.2.7 PG scrubs not progressing added
#21 Updated by Yuri Weinstein over 1 year ago
#22 Updated by Backport Bot over 1 year ago
- Tags set to backport_processed
#23 Updated by Konstantin Shalygin over 1 year ago
- Status changed from Pending Backport to Resolved
- Tags deleted (backport_processed)