Bug #5238

osd: slow recovery (uselessly dirtying pg logs during peering)

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seeing several failures due to slow recovery. It looks like the health checks stop, and teuthology continues on for ages.


Related issues

Related to Ceph - Fix #5232: osd: slow peering due to pg log rewrites Resolved 06/02/2013

Associated revisions

Revision 5deece1d (diff)
Added by Samuel Just over 10 years ago

PG: don't dirty log unconditionally in activate()

merge_log and friends all take care of dirtying the log
as necessary.

Fixes: #5238
Signed-off-by: Samuel Just <>

Revision eace9987 (diff)
Added by Samuel Just over 10 years ago

PG: don't dirty log unconditionally in activate()

merge_log and friends all take care of dirtying the log
as necessary.

Fixes: #5238
Signed-off-by: Samuel Just <>
(cherry picked from commit 5deece1d034749bf72b7bd04e4e9c5d97e5ad6ce)

History

#1 Updated by Sage Weil over 10 years ago

  • Priority changed from Urgent to Immediate

#2 Updated by Sage Weil over 10 years ago

I think this might be a teuthology problem: I can't find any ceph process running on the cluster when it hangs. Trying again with some debug output surrounding raw_cluster_cmd()...

#3 Updated by Sage Weil over 10 years ago

  • Subject changed from osd: slow recovery / hung health checks to osd: slow recovery

The health checks were a red herring. wait_for_recovery calls assert, but the other thread(s) finish before we see the exception appear (or something like that). Recovery really is slow.

#4 Updated by Stefan Priebe over 10 years ago

Hi Sage, is this related to my one? http://tracker.ceph.com/issues/5232

#5 Updated by Sage Weil over 10 years ago

Stefan Priebe wrote:

Hi Sage, is this related to my one? http://tracker.ceph.com/issues/5232

Only sort of: one is about peering, the other is about object recovery.

#6 Updated by Sage Weil over 10 years ago

  • Priority changed from Immediate to Urgent

Looking more closely, it appears that for the QA job the problem is simply that recovery gets very low priority due to a large number of small object writes.
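The starvation effect described above can be sketched with a plain priority queue: if client ops are enqueued with a much higher priority than recovery ops, a steady stream of small writes keeps recovery at the back of the line. This is a hypothetical illustration only; the priority values and names here are invented and are not Ceph's actual op-queue implementation.

```cpp
#include <queue>
#include <string>
#include <utility>

// (priority, description) -- std::priority_queue pops the largest pair
// first, so higher-priority ops always run before lower-priority ones.
using Op = std::pair<int, std::string>;

// Pop and return the description of the highest-priority op.
std::string next_op(std::priority_queue<Op>& q) {
  Op op = q.top();
  q.pop();
  return op.second;
}
```

With, say, client writes at priority 63 and recovery at priority 10 (made-up numbers), recovery only makes progress once the client-op backlog drains.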

#7 Updated by Samuel Just over 10 years ago

  • Assignee changed from Sage Weil to Samuel Just

For the slow peering case, I think the first problem is that we unconditionally dirty the log in activate(). Since merge_log and friends already take care of that, we should be able to just not do that. The more complicated solution is to try to track dirty key ranges in the log object, but hopefully that won't need to be backported.
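The fix described above can be sketched as follows. This is a deliberately simplified, hypothetical model of the pattern (the real Ceph PG and PGLog classes are far more involved): activate() used to set the dirty flag unconditionally, forcing a pg log rewrite on every peering round, whereas merge_log and friends already set it whenever they actually change the log.

```cpp
#include <list>
#include <string>

// Simplified stand-in for Ceph's pg log; names are illustrative only.
struct PGLog {
  std::list<std::string> entries;
  bool dirty = false;  // when true, the on-disk log must be rewritten

  // merge_log and friends mark the log dirty only when they change it.
  void merge(const std::list<std::string>& incoming) {
    if (incoming.empty())
      return;  // nothing merged, so no rewrite is needed
    entries.insert(entries.end(), incoming.begin(), incoming.end());
    dirty = true;  // dirtied because content actually changed
  }
};

struct PG {
  PGLog log;

  // Before the fix: activate() dirtied the log unconditionally,
  // triggering a rewrite even when peering changed nothing.
  void activate_old() { log.dirty = true; }

  // After the fix: activate() leaves dirtying to merge_log and friends.
  void activate_fixed() { /* no unconditional log.dirty = true */ }
};
```

Under this model, a peering round that merges no log entries leaves the dirty flag clear, so no log rewrite happens.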

#8 Updated by Stefan Priebe over 10 years ago

This one is missing in upstream/cuttlefish? It helps a lot.

#9 Updated by Sage Weil over 10 years ago

We are going to test it a bit more in master before putting it in the cuttlefish branch. Good to know this is helping, thanks!

sam is also working on a more involved fix for the log rewrites.

#10 Updated by Faidon Liambotis over 10 years ago

For what it's worth, I also tried it (wip_5238_cuttlefish specifically) per Sam's suggestion while troubleshooting #5084 and it made no significant difference.

#11 Updated by Stefan Priebe over 10 years ago

Maybe something different; I have this one:
http://tracker.ceph.com/issues/5232

and it makes a HUGE difference regarding that one ;-)

#12 Updated by Sage Weil over 10 years ago

  • Subject changed from osd: slow recovery to osd: slow recovery (uselessly dirtying pg logs during peering)
  • Status changed from New to Pending Backport

#13 Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved
