Bug #5238


osd: slow recovery (uselessly dirtying pg logs during peering)

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Samuel Just
Category: -
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

Seeing several failures due to slow recovery. It looks like the health checks stop, and teuthology continues on for ages.


Related issues (1 closed)

Related to Ceph - Fix #5232: osd: slow peering due to pg log rewrites (Resolved, Samuel Just, 06/02/2013)

Actions #1

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to Immediate
Actions #2

Updated by Sage Weil almost 11 years ago

I think this might be a teuthology problem: I can't find any ceph process running on the cluster when it hangs. Trying again with some debug crap surrounding raw_cluster_cmd()...

Actions #3

Updated by Sage Weil almost 11 years ago

  • Subject changed from osd: slow recovery / hung health checks to osd: slow recovery

The health checks were a red herring. wait_for_recovery calls assert, but the other thread(s) finish before we see the exception appear (or something like that). Recovery really is slow.

Actions #4

Updated by Stefan Priebe almost 11 years ago

Hi Sage, is this related to mine? http://tracker.ceph.com/issues/5232

Actions #5

Updated by Sage Weil almost 11 years ago

Stefan Priebe wrote:

Hi Sage, is this related to mine? http://tracker.ceph.com/issues/5232

Only sort of... one is about peering, the other is about object recovery.

Actions #6

Updated by Sage Weil almost 11 years ago

  • Priority changed from Immediate to Urgent

Looking more closely, it appears that for the QA job the problem is simply that recovery gets very low priority due to a large number of small object writes.

Actions #7

Updated by Samuel Just almost 11 years ago

  • Assignee changed from Sage Weil to Samuel Just

For the slow peering case, I think the first problem is that we unconditionally dirty the log in activate(). Since merge_log and friends already take care of that, we should be able to just not do that. The more complicated solution is to try to track dirty key ranges in the log object, but hopefully that won't need to be backported.
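The idea above can be sketched in a few lines. This is a minimal, hypothetical model (the struct names, fields, and methods are illustrative stand-ins, not the real Ceph PGLog/PG API): merge_log marks the log dirty only when it actually changes entries, activate() no longer dirties it unconditionally, and the persist step writes only when the dirty flag is set, so an idle peering pass does zero log writes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal model of a pg log with a dirty flag.
struct PGLog {
  std::vector<std::string> entries;
  bool dirty = false;  // set only when entries actually change

  // merge_log "and friends" take care of dirtying the log themselves.
  void merge_log(const std::vector<std::string>& incoming) {
    for (const auto& e : incoming) {
      entries.push_back(e);
      dirty = true;  // dirtied because we really modified the log
    }
  }
};

struct PG {
  PGLog log;
  int log_writes = 0;  // counts persisted log rewrites (stand-in)

  void activate() {
    // Before the fix (conceptually): `log.dirty = true;` here,
    // unconditionally, forcing a full pg log rewrite on every peering pass.
    // After the fix: do nothing; merge_log already set the flag
    // if and only if the log changed.
  }

  void write_if_dirty() {
    if (log.dirty) {
      ++log_writes;  // stand-in for persisting the log to disk
      log.dirty = false;
    }
  }
};
```

With this shape, a peering pass that merges nothing triggers no log write, while a pass that merged entries still persists exactly once; the more involved follow-up (tracking dirty key ranges) would shrink that one write further.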

Actions #8

Updated by Stefan Priebe almost 11 years ago

This one is missing in upstream/cuttlefish? It helps a lot.

Actions #9

Updated by Sage Weil almost 11 years ago

We are going to test it a bit more in master before putting it in the cuttlefish branch. Good to know this is helping, thanks!

Sam is also working on a more involved fix for the log rewrites.

Actions #10

Updated by Faidon Liambotis almost 11 years ago

For what it's worth, I also tried it (wip_5238_cuttlefish specifically) per Sam's suggestion while troubleshooting #5084 and it made no significant difference.

Actions #11

Updated by Stefan Priebe almost 11 years ago

Maybe something different; I have this one:
http://tracker.ceph.com/issues/5232

and it makes a HUGE difference regarding that one ;-)

Actions #12

Updated by Sage Weil almost 11 years ago

  • Subject changed from osd: slow recovery to osd: slow recovery (uselessly dirtying pg logs during peering)
  • Status changed from New to Pending Backport
Actions #13

Updated by Sage Weil almost 11 years ago

  • Status changed from Pending Backport to Resolved