Bug #12096: Tail latency during deep scrubbing - RADOS - Ceph

Added by Guang Yang almost 9 years ago. Updated almost 7 years ago.

Status: New
Priority: Normal
Assignee:
Category: Scrub/Repair
Target version:
% Done:
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We saw a large number of timeouts (with a 5-second timeout on the client side) when deep scrubbing was enabled. Investigation shows that the timeouts happen because the op thread fails to acquire the PG lock, which is held by the disk thread doing the scrubbing. The most time-consuming part of the work on the disk thread is building the scrub map. With the default configuration, it reads up to 25 objects to build the local scrub map, and that can take up to several seconds.
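A rough back-of-the-envelope sketch of how one scrub chunk could account for the client timeouts. The 25-object chunk size and the 5-second client timeout come from the description; the per-object read time of 0.2 s is a made-up value for illustration, not a measurement, and the function names are hypothetical.

```python
def lock_hold_seconds(chunk_objects, per_object_read_s):
    """Rough time the PG lock is held while one chunk's scrub map is built,
    assuming objects are read one after another with the lock held throughout."""
    return chunk_objects * per_object_read_s


def op_times_out(lock_hold_s, client_timeout_s):
    """Worst case: a client op arrives just as the scrub chunk starts and
    must wait for the whole hold time before it can take the PG lock."""
    return lock_hold_s >= client_timeout_s


# 0.2 s per object is a hypothetical figure, chosen only for illustration.
hold = lock_hold_seconds(chunk_objects=25, per_object_read_s=0.2)
print(f"lock held for about {hold:.1f} s, client op times out: {op_times_out(hold, 5.0)}")
```

Under these assumptions, the hold time scales linearly with the chunk size, so a smaller chunk would shorten the worst-case wait at the cost of more scrub rounds.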

Do we need to hold the PG lock for the entire life cycle of each round of scrubbing? As I understand it, the purpose is to make sure the object range being scrubbed is not updated in the meantime, and we already have something like write_block_by_scrub for that purpose.
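The idea in the question can be sketched with a toy model: take the lock only briefly to mark the range under scrub, build the scrub map without the lock, and let only writes that touch that range wait. This is a minimal illustration, not Ceph code. `ToyPG`, `scrub_chunk`, and `slow_io` are hypothetical names, the CRC stands in for whatever the real scrub map records, and the sketch says nothing about whether releasing the PG lock is actually safe in the OSD.

```python
import threading
import zlib


class ToyPG:
    """Toy stand-in for a placement group. Hypothetical; not Ceph's PG class."""

    def __init__(self, objects):
        self.lock = threading.Lock()  # plays the role of the PG lock
        self.unblocked = threading.Condition(self.lock)
        self.objects = dict(objects)  # object name -> bytes
        self.scrub_range = None       # (start, end) being scrubbed, or None

    def _in_scrub_range(self, name):
        if self.scrub_range is None:
            return False
        start, end = self.scrub_range
        return start <= name < end

    def read(self, name):
        with self.lock:
            return self.objects[name]

    def write(self, name, data):
        with self.unblocked:  # acquires self.lock
            # Only writes that touch the range under scrub have to wait.
            while self._in_scrub_range(name):
                self.unblocked.wait()
            self.objects[name] = data

    def scrub_chunk(self, start, end, slow_io=lambda name: None):
        # Step 1: hold the lock just long enough to block writes to the range.
        with self.lock:
            self.scrub_range = (start, end)
            names = sorted(n for n in self.objects if start <= n < end)

        # Step 2: build the scrub map without holding the lock across the
        # slow part; the lock is taken only per object inside read().
        scrub_map = {}
        for name in names:
            slow_io(name)  # stand-in for the slow disk read
            scrub_map[name] = zlib.crc32(self.read(name))

        # Step 3: retake the lock briefly and wake any blocked writers.
        with self.unblocked:
            self.scrub_range = None
            self.unblocked.notify_all()
        return scrub_map


# A write to "z" (outside the scrubbed range "a".."c") goes through
# while the chunk is still being scrubbed.
pg = ToyPG({"a": b"1", "b": b"2", "z": b"3"})
scrub_map = pg.scrub_chunk("a", "c", slow_io=lambda name: pg.write("z", b"new"))
print(sorted(scrub_map), pg.read("z"))
```

In this model, reads and out-of-range writes wait only for the short critical sections, while in-range writes wait until the chunk is finished.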

Please correct me if I am wrong here...

------
Ceph version: v0.87

Updated by Greg Farnum almost 7 years ago

Project changed from Ceph to RADOS
Category set to Scrub/Repair
Component(RADOS) OSD added
