Bug #5238
closed
osd: slow recovery (uselessly dirtying pg logs during peering)
Added by Sage Weil almost 11 years ago.
Updated almost 11 years ago.
Description
Seeing several failures due to slow recovery. It looks like the health checks stop, and teuthology continues on for ages.
- Priority changed from Urgent to Immediate
I think this might be a teuthology problem: I can't find any ceph process running on the cluster when it hangs. Trying again with some extra debugging around raw_cluster_cmd()...
- Subject changed from osd: slow recovery / hung health checks to osd: slow recovery
The health checks were a red herring. wait_for_recovery calls assert, but the other thread(s) finish before we see the exception appear (or something like that). Recovery really is slow.
- Priority changed from Immediate to Urgent
Looking more closely, it appears that for the QA job the problem is just that recovery gets very low priority due to a large number of small object writes.
- Assignee changed from Sage Weil to Samuel Just
For the slow peering case, I think the first problem is that we unconditionally dirty the log in activate(). Since merge_log and friends already take care of that, we should be able to just not do that. The more complicated solution is to try to track dirty key ranges in the log object, but hopefully that won't need to be backported.
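To illustrate the first fix described above, here is a minimal sketch in C++. The types and function names (PGLog, merge_log, activate) mirror the discussion but are a hypothetical simplified model, not the real Ceph implementation: the point is only that merge_log marks the log dirty when it actually changes, so activate() need not dirty it unconditionally.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the PG log with a dirty flag.
struct PGLog {
  std::vector<int> entries;
  bool dirty = false;  // set when the on-disk log must be rewritten

  // Merging incoming entries genuinely changes the log, so it marks dirty.
  void merge_log(const std::vector<int>& incoming) {
    if (!incoming.empty()) {
      entries.insert(entries.end(), incoming.begin(), incoming.end());
      dirty = true;
    }
  }
};

// Before the fix: activate() unconditionally dirtied the log, forcing a
// full log rewrite on every peering pass even when nothing changed.
void activate_old(PGLog& log) {
  log.dirty = true;  // unconditional: the reported bug
}

// After the fix: activate() leaves the flag alone; merge_log and friends
// already set it whenever the log actually changes.
void activate_new(PGLog& /*log*/) {
  // no unconditional dirtying
}
```

With this change, a peering pass that merges nothing leaves the log clean and skips the rewrite; the more involved follow-up (tracking dirty key ranges) would narrow the rewrite further rather than avoiding it entirely.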
This one is missing in upstream/cuttlefish? It helps a lot.
We are going to test it a bit more in master before putting it in the cuttlefish branch. Good to know this is helping, thanks!
sam is also working on a more involved fix for the log rewrites.
For what it's worth, I also tried it (wip_5238_cuttlefish specifically) per Sam's suggestion while troubleshooting #5084 and it made no significant difference.
- Subject changed from osd: slow recovery to osd: slow recovery (uselessly dirtying pg logs during peering)
- Status changed from New to Pending Backport
- Status changed from Pending Backport to Resolved