Bug #5238
closed
osd: slow recovery (uselessly dirtying pg logs during peering)
Added by Sage Weil almost 11 years ago.
Updated almost 11 years ago.
Description
Seeing several failures due to slow recovery. It looks like the health checks stop, and teuthology continues on for ages.
- Priority changed from Urgent to Immediate
I think this might be a teuthology problem: I can't find any ceph process running on the cluster when it hangs. Trying again with some extra debugging around raw_cluster_cmd()...
- Subject changed from osd: slow recovery / hung health checks to osd: slow recovery
The health checks were a red herring. wait_for_recovery calls assert, but the other thread(s) finish before we see the exception appear (or something like that). Recovery really is slow.
- Priority changed from Immediate to Urgent
Looking more closely, it appears that for the QA job the problem is just that recovery gets very low priority due to a large number of small object writes.
- Assignee changed from Sage Weil to Samuel Just
For the slow peering case, I think the first problem is that we unconditionally dirty the log in activate(). Since merge_log and friends already take care of that, we should be able to just not do that. The more complicated solution is to try to track dirty key ranges in the log object, but hopefully that won't need to be backported.
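To illustrate the first fix described above, here is a minimal sketch in C++. The types and function names (PGLog, merge_log, activate) mirror the discussion but are a hypothetical simplified model, not the real Ceph implementation: the point is only that merge_log marks the log dirty when it actually changes, so activate() need not dirty it unconditionally.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the PG log with a dirty flag.
struct PGLog {
  std::vector<int> entries;
  bool dirty = false;  // set when the on-disk log must be rewritten

  // Merging incoming entries genuinely changes the log, so it marks dirty.
  void merge_log(const std::vector<int>& incoming) {
    if (!incoming.empty()) {
      entries.insert(entries.end(), incoming.begin(), incoming.end());
      dirty = true;
    }
  }
};

// Before the fix: activate() unconditionally dirtied the log, forcing a
// full log rewrite on every peering pass even when nothing changed.
void activate_old(PGLog& log) {
  log.dirty = true;  // unconditional: the reported bug
}

// After the fix: activate() leaves the flag alone; merge_log and friends
// already set it whenever the log actually changes.
void activate_new(PGLog& /*log*/) {
  // no unconditional dirtying
}
```

With this change, a peering pass that merges nothing leaves the log clean and skips the rewrite; the more involved follow-up (tracking dirty key ranges) would narrow the rewrite further rather than avoiding it entirely.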
This one is missing in upstream/cuttlefish? It helps a lot.
We are going to test it a bit more in master before putting it in the cuttlefish branch. Good to know this is helping, thanks!
sam is also working on a more involved fix for the log rewrites.
For what it's worth, I also tried it (wip_5238_cuttlefish specifically) per Sam's suggestion while troubleshooting #5084 and it made no significant difference.
- Subject changed from osd: slow recovery to osd: slow recovery (uselessly dirtying pg logs during peering)
- Status changed from New to Pending Backport
- Status changed from Pending Backport to Resolved