Bug #12096

Tail latency during deep scrubbing

Added by Guang Yang about 7 years ago. Updated about 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Scrub/Repair
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We saw a large number of timeouts (with a 5-second timeout on the client side) after enabling deep scrubbing. Investigation shows the timeouts happen because the op thread fails to acquire the PG lock, which is held by the disk thread doing the scrubbing; the most time-consuming part on the disk thread is building the scrub map. With the default configuration it reads up to 25 objects to build the local scrub map, which can take up to several seconds.
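The contention described above can be modelled with a small sketch. This is not Ceph code; it is a hypothetical two-thread simulation (with made-up timings and helper names) of a "disk thread" holding a single PG lock for a whole 25-object chunk while an "op thread" times out waiting for it:

```python
import threading
import time

pg_lock = threading.Lock()

def scrub_thread(objects_per_chunk=25, read_cost=0.01):
    # Hypothetical model: the disk thread holds the PG lock for the
    # entire chunk while reading each object to build the scrub map.
    with pg_lock:
        for _ in range(objects_per_chunk):
            time.sleep(read_cost)  # stand-in for a per-object disk read

def op_thread(timeout=0.05):
    # The op thread gives up if it cannot take the PG lock in time,
    # standing in for the 5-second client-side timeout.
    acquired = pg_lock.acquire(timeout=timeout)
    if acquired:
        pg_lock.release()
    return acquired

scrub = threading.Thread(target=scrub_thread)
scrub.start()
time.sleep(0.02)             # let the scrub thread grab the lock first
timed_out = not op_thread()  # op blocked for the duration of the chunk
scrub.join()
```

Under these toy numbers the op thread always times out, because the lock is held for the full chunk (25 reads) rather than per object.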

Do we need to hold the PG lock for the entire life-cycle of each round of scrubbing? As I understand it, the purpose is to make sure the object range being scrubbed is not updated during that time, and we already have something like write_blocked_by_scrub for that purpose.
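The alternative being suggested can be sketched as follows. This is a minimal, hypothetical illustration (the class, method names, and object-range representation are invented for the example, not taken from Ceph): instead of holding the PG lock across the whole chunk, record the range under scrub and block only the writes that fall inside it, keeping each critical section short:

```python
import threading

class RangeScrubber:
    """Hypothetical sketch: block only writes that target the object
    range currently being scrubbed, rather than holding the PG lock
    for the whole chunk."""

    def __init__(self):
        self.lock = threading.Lock()  # guards scrub_range; held briefly
        self.scrub_range = None       # (start, end) or None

    def begin_chunk(self, start, end):
        with self.lock:
            self.scrub_range = (start, end)

    def end_chunk(self):
        with self.lock:
            self.scrub_range = None

    def write_blocked(self, obj):
        # A write is deferred only if it lands in the scrubbed range.
        with self.lock:
            if self.scrub_range is None:
                return False
            start, end = self.scrub_range
            return start <= obj < end

s = RangeScrubber()
s.begin_chunk(100, 125)           # a 25-object chunk under scrub
blocked = s.write_blocked(110)    # inside the chunk: write must wait
allowed = not s.write_blocked(10) # outside the chunk: write proceeds
s.end_chunk()
```

With this shape, ops touching objects outside the chunk never contend with the scrub at all, which is the property the question is asking about.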

Please correct me if I am wrong here...

------
Ceph version: v0.87

History

#1 Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to RADOS
  • Category set to Scrub/Repair
  • Component(RADOS) OSD added
