Bug #759

closed

osd: pgs spend a long time peering when marking osds out

Added by Sage Weil about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: High
% Done: 0%

Description

On the playground (with lots of data), I see that some PGs spend a long time in peering state after marking an OSD as out. This isn't supposed to happen...


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #793: osd: avoid blocking in scrub_wq (Resolved, Samuel Just, 02/09/2011)

Actions #1

Updated by Sage Weil about 13 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Sage Weil about 13 years ago

this appears to be scrubbing related:

- we get a new osdmap. handle_osd_map tries to pause the op threadpool.
- the pause has to wait for a long-running scrub op, which takes forever to complete.
- handle_osd_map finally continues.

during that whole time the main dispatch thread is blocked up, and peering gets backed up as a result.
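
The sequence above can be reproduced with a toy model. This is only a sketch, not the OSD code: `PausablePool` and `time_map_handling` are hypothetical names, the pool has a single worker, and `time.sleep` stands in for the long-running scrub op. It shows that a pause which waits for in-flight work blocks its caller for about as long as that work takes.

```python
import threading
import time
from collections import deque


class PausablePool:
    """Toy single-worker pool; pause() waits for the in-flight item to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._queue = deque()
        self._paused = False
        self._busy = False
        self._stop = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, fn):
        with self._cond:
            self._queue.append(fn)
            self._cond.notify_all()

    def pause(self):
        # The caller blocks here until the item being processed completes.
        with self._cond:
            self._paused = True
            while self._busy:
                self._cond.wait()

    def unpause(self):
        with self._cond:
            self._paused = False
            self._cond.notify_all()

    def shutdown(self):
        with self._cond:
            self._stop = True
            self._cond.notify_all()
        self._worker.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._stop and (self._paused or not self._queue):
                    self._cond.wait()
                if self._stop:
                    return
                fn = self._queue.popleft()
                self._busy = True
            fn()  # run the work item without holding the lock
            with self._cond:
                self._busy = False
                self._cond.notify_all()


def time_map_handling(scrub_seconds):
    """Return how long a simulated map handler spends waiting in pause()."""
    pool = PausablePool()
    started = threading.Event()

    def long_scrub():
        started.set()
        time.sleep(scrub_seconds)  # stand-in for a slow scrub op

    pool.submit(long_scrub)
    started.wait()  # make sure the scrub is in flight before pausing
    t0 = time.monotonic()
    pool.pause()  # stand-in for the pause done while handling a new map
    waited = time.monotonic() - t0
    pool.unpause()
    pool.shutdown()
    return waited
```

Calling `time_map_handling(0.3)` returns roughly 0.3 seconds: the thread that calls `pause()` makes no progress until the scrub stand-in returns, which is the stall the dispatch thread sees.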

Actions #3

Updated by Sage Weil about 13 years ago

  • Assignee changed from Sage Weil to Samuel Just

the replica scrub needs to go in a different work queue (not op_wq): scrub_wq, or something else that's assigned to the disk threadpool (disk_tp).

Actions #4

Updated by Samuel Just about 13 years ago

1a01e5ee1b88a217547873296e0371858be13f37 merged in a branch moving replica scrubbing to rep_scrub_wq, with a new non-osdop message for initiating a replica scrub. Scrub still blocks in the disk_tp while waiting for replicas to scrub, though; working on that now.
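
A rough sketch of the routing idea, for illustration only and not the code from that commit. The message type strings (`"osd_op"`, `"rep_scrub"`), the pool name `op_tp`, and the helper functions are assumptions; only `op_wq`, `rep_scrub_wq`, and `disk_tp` come from this ticket. The point is that once replica scrub requests arrive as their own message type, they land on a queue served by the disk threadpool, so pausing the op threadpool no longer has to wait on them.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkQueue:
    name: str
    pool: str  # thread pool that services this queue
    items: deque = field(default_factory=deque)


# Hypothetical message type names, used only for this illustration.
ROUTES = {
    "osd_op": "op_wq",
    "rep_scrub": "rep_scrub_wq",
}


def make_queues():
    # "op_tp" is an assumed name for the op threadpool.
    return {
        "op_wq": WorkQueue("op_wq", pool="op_tp"),
        "rep_scrub_wq": WorkQueue("rep_scrub_wq", pool="disk_tp"),
    }


def dispatch(queues, msg_type, payload):
    """Queue a message on the work queue chosen by its type."""
    wq = queues[ROUTES[msg_type]]
    wq.items.append(payload)
    return wq


def work_pausing_waits_on(queues, pool):
    """Items that a pause of `pool` could have to wait for."""
    return [item for wq in queues.values() if wq.pool == pool for item in wq.items]
```

For example, after dispatching one `"osd_op"` and one `"rep_scrub"` message, `work_pausing_waits_on(queues, "op_tp")` lists only the client op; the scrub request shows up under `"disk_tp"` instead.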

Actions #5

Updated by Sage Weil about 13 years ago

  • Status changed from In Progress to Resolved