Bug #5238
closed
osd: slow recovery (uselessly dirtying pg logs during peering)
Description
Seeing several failures due to slow recovery. It looks like the health checks stop, and teuthology continues on for ages.
Updated by Sage Weil almost 11 years ago
- Priority changed from Urgent to Immediate
Updated by Sage Weil almost 11 years ago
I think this might be a teuthology problem: I can't find any ceph process running on the cluster when it hangs. Trying again with some debug crap surrounding raw_cluster_cmd()...
Updated by Sage Weil almost 11 years ago
- Subject changed from osd: slow recovery / hung health checks to osd: slow recovery
The health checks were a red herring. wait_for_recovery calls assert, but the other thread(s) finish before we see the exception appear (or something like that). Recovery really is slow.
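A minimal sketch of how a failed assertion in one thread can stay invisible until the other threads finish. It uses Python's standard `concurrent.futures` as a stand-in; the threading mechanism teuthology actually uses is not stated in this ticket, and the bodies of `wait_for_recovery`, `other_task`, and `run_job` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_for_recovery(timeout):
    """Hypothetical stand-in: pretend recovery never finishes in time."""
    time.sleep(timeout)
    raise AssertionError("failed to recover before timeout expired")


def other_task(duration):
    """Hypothetical stand-in for another thread of the same job."""
    time.sleep(duration)
    return "done"


def run_job():
    """Run both tasks and record the order in which results become visible."""
    events = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        recovery = pool.submit(wait_for_recovery, 0.05)
        other = pool.submit(other_task, 0.2)
        # The assertion fires inside the worker after 0.05s, but it is only
        # stored on the future. Nothing reports it while we wait on `other`.
        events.append(other.result())
        try:
            # The stored exception is re-raised only when we ask for the result.
            recovery.result()
        except AssertionError as exc:
            events.append(f"recovery failed: {exc}")
    return events


if __name__ == "__main__":
    for event in run_job():
        print(event)
```

In this toy model the failure is reported after the other task completes, even though it happened first, which is one way a job can appear to carry on after recovery has already failed.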
Updated by Stefan Priebe almost 11 years ago
Hi Sage, is this related to mine? http://tracker.ceph.com/issues/5232
Updated by Sage Weil almost 11 years ago
Stefan Priebe wrote:
Hi Sage, is this related to mine? http://tracker.ceph.com/issues/5232
Only sort of. One is about peering, the other is about object recovery.
Updated by Sage Weil almost 11 years ago
- Priority changed from Immediate to Urgent
Looking more closely, it appears that for the qa job the problem is just that recovery gets very low priority due to a large number of small object writes.
Updated by Samuel Just almost 11 years ago
- Assignee changed from Sage Weil to Samuel Just
For the slow peering case, I think the first problem is that we unconditionally dirty the log in activate(). Since merge_log and friends already dirty the log when they change it, we should be able to just drop the unconditional dirtying. The more complicated solution is to try to track dirty key ranges in the log object, but hopefully that won't need to be backported.
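A toy model of the idea, not Ceph code: the real OSD is C++, and only the names `activate` and `merge_log` come from this ticket. `ToyPGLog`, `mark_all_dirty`, `flush`, and `peer_once` are hypothetical, and the per-key dirty set is only a rough stand-in for the "dirty key ranges" idea. The sketch compares how many log keys get rewritten when activation dirties the whole log versus when it relies on the merge step to dirty what changed.

```python
class ToyPGLog:
    """Toy stand-in for a PG log persisted as key/value pairs (not Ceph code)."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})  # version -> log entry
        self.dirty_keys = set()             # keys changed since the last flush

    def merge_log(self, incoming):
        """Merge entries from a peer, dirtying only keys that actually change."""
        for version, entry in incoming.items():
            if version not in self.entries or self.entries[version] != entry:
                self.entries[version] = entry
                self.dirty_keys.add(version)

    def mark_all_dirty(self):
        """Model the old behaviour: force a rewrite of every key in the log."""
        self.dirty_keys.update(self.entries)

    def flush(self):
        """Write only the dirty keys and return how many were written."""
        written = len(self.dirty_keys)
        self.dirty_keys.clear()
        return written


def activate(log, unconditional_dirty):
    """Hypothetical activate(): optionally dirty the whole log, then flush."""
    if unconditional_dirty:
        log.mark_all_dirty()
    return log.flush()


def peer_once(unconditional_dirty, log_size=1000):
    """Peer a log of `log_size` entries with a peer that has one extra entry."""
    log = ToyPGLog({v: f"op-{v}" for v in range(log_size)})
    log.merge_log({log_size: f"op-{log_size}"})
    return activate(log, unconditional_dirty)


if __name__ == "__main__":
    print("unconditional dirty:", peer_once(True), "keys written")
    print("dirty only on change:", peer_once(False), "keys written")
```

Under these assumptions, one new entry costs a rewrite of all 1001 keys in the unconditional case and a single key write otherwise.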
Updated by Stefan Priebe almost 11 years ago
Is this one missing from upstream/cuttlefish? It helps a lot.
Updated by Sage Weil almost 11 years ago
We are going to test it a bit more in master before putting it in the cuttlefish branch. Good to know this is helping, thanks!
Sam is also working on a more involved fix for the log rewrites.
Updated by Faidon Liambotis almost 11 years ago
For what it's worth, I also tried it (wip_5238_cuttlefish specifically) per Sam's suggestion while troubleshooting #5084 and it made no significant difference.
Updated by Stefan Priebe almost 11 years ago
Maybe it's something different. I have this one:
http://tracker.ceph.com/issues/5232
and it makes a HUGE difference regarding that one ;-)
Updated by Sage Weil almost 11 years ago
- Subject changed from osd: slow recovery to osd: slow recovery (uselessly dirtying pg logs during peering)
- Status changed from New to Pending Backport
Updated by Sage Weil almost 11 years ago
- Status changed from Pending Backport to Resolved