Bug #7520

Lock contention during scrubbing which could potentially hang the OSD for a couple of seconds

Added by Guang Yang about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are using Ceph as an object store (via radosgw), and each time the cluster starts scrubbing, performance degrades (e.g. latency increases by roughly 20%).

With some investigation, one pattern we found that slows the cluster down is lock contention during scrubbing. Here is the data flow:
1. The replica OSD receives an MOSDRepScrub message for a PG and locks that PG to process it (https://github.com/ceph/ceph/blob/master/src/osd/OSD.h#L1799)
2. The OSD tick thread:
2.1 locks the OSD and runs the scheduled scrub pass (holding the OSD lock throughout)
2.2 for each pg in OSDService::last_scrub_pg:
2.2.1 locks the pg // if this pg happens to be the PG from step 1, the tick thread blocks here while still holding the OSD lock (in our cluster, up to several seconds)
2.2.2 tries to get the local / remote scrub reserver and queues the scrub
2.2.3 unlocks the pg
2.3 unlocks the OSD
The contention happens at step 2.2.1: the tick thread tries to acquire the PG lock already held in step 1 and blocks, and because it is still holding the OSD lock, the messenger dispatch threads cannot dispatch and enqueue ops. As a result, the whole OSD hangs for a couple of seconds.
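To make the interaction concrete, here is a minimal, self-contained sketch of the stall using standard C++ threads and mutexes. This is not Ceph code; the two mutexes only mirror the OSD lock and the PG lock from the steps above, and the sleeps stand in for replica scrub work:

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex osd_lock;   // stands in for the global OSD lock
std::mutex pg_lock;    // stands in for the lock of the PG from step 1

int main() {
  using namespace std::chrono_literals;

  // Step 1: a dispatch thread handles MOSDRepScrub and holds the PG lock
  // for a while (scrubbing the replica PG can take seconds on a busy OSD).
  std::thread rep_scrub([] {
    std::lock_guard<std::mutex> pg(pg_lock);
    std::this_thread::sleep_for(2s);              // replica scrub work
  });
  std::this_thread::sleep_for(100ms);             // let step 1 win the PG lock

  // Step 2: the tick thread takes the OSD lock, then blocks on the PG lock.
  std::thread tick([] {
    std::lock_guard<std::mutex> osd(osd_lock);    // step 2.1
    std::lock_guard<std::mutex> pg(pg_lock);      // step 2.2.1: blocks ~2s here
  });
  std::this_thread::sleep_for(100ms);             // let the tick thread win the OSD lock

  // Any other thread that needs the OSD lock (e.g. messenger dispatch)
  // is now stuck behind the tick thread, so the whole OSD appears to hang.
  auto t0 = std::chrono::steady_clock::now();
  { std::lock_guard<std::mutex> osd(osd_lock); }  // simulated dispatch
  auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - t0);
  std::cout << "dispatch waited " << waited.count() << " ms for the OSD lock\n";

  rep_scrub.join();
  tick.join();
}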

An easy way to fix this is, at step 2.2, to check whether the PG is primary before locking it; if it is not primary, just skip it.
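A minimal sketch of that idea follows. The names here (SimplePG, sched_scrub_pass) are illustrative stand-ins rather than Ceph's actual types; the point is only that the primary check happens before the per-PG lock is taken:

#include <iostream>
#include <mutex>
#include <vector>

// Illustrative stand-in for a PG: a lock, a primary flag, and an id.
struct SimplePG {
  std::mutex lock;        // stands in for the per-PG lock
  bool primary = false;   // stands in for pg->is_primary()
  int id = 0;

  void queue_scrub() {
    // In the real code this would try to get the local/remote scrub
    // reservations and queue the scrub; here it is just a placeholder.
    std::cout << "pg " << id << ": scrub queued\n";
  }
};

std::mutex osd_lock;      // stands in for the global OSD lock

// One pass of the tick thread's scrub scheduling (step 2.2 above).
void sched_scrub_pass(std::vector<SimplePG*>& last_scrub_pgs) {
  std::lock_guard<std::mutex> osd(osd_lock);        // step 2.1
  for (SimplePG* pg : last_scrub_pgs) {             // step 2.2
    if (!pg->primary) {
      // Proposed fix: we only scrub PGs we are primary for, so skip a
      // non-primary PG *before* taking its lock.  This way a PG lock held
      // by MOSDRepScrub handling (step 1) can no longer stall the thread
      // that is holding the OSD lock.
      continue;
    }
    std::lock_guard<std::mutex> pg_guard(pg->lock); // step 2.2.1
    pg->queue_scrub();                              // step 2.2.2
  }                                                 // locks released (2.2.3, 2.3)
}

int main() {
  SimplePG replica;   replica.id = 1;   replica.primary = false;
  SimplePG prim;      prim.id = 2;      prim.primary = true;
  std::vector<SimplePG*> pgs{&replica, &prim};
  sched_scrub_pass(pgs);  // only pg 2 is locked and queued
}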

#2

Updated by Sage Weil about 10 years ago

  • Status changed from New to Resolved