Bug #4772

closed

(deep?) scrubbing scheduling misses PGs

Added by Faidon Liambotis about 11 years ago. Updated about 9 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

I have a 144 OSD (135 in) cluster, partitioned into ~10 pools with 16760 PGs in total. The cluster runs Ceph 0.56.4 on Ubuntu 12.04.

As part of bug #4743, I had to run "for o in $osds; do ceph osd deep-scrub $o; done" on my cluster. I've set --osd-max-scrubs to 4, so about 140-150 pgs are getting scrubbed concurrently. This takes about 2 days, after which there are no pgs in deep scrubbing.

It turns out, though, that while most PGs did in fact get deep scrubbed, some were missed in at least the first two rounds, and possibly the third too. I've basically run the three rounds too close to each other, so grepping for more concrete numbers would be hard. The number of PGs being scrubbed is definitely in the right ballpark though, so I'm guessing we're talking about a few dozen PGs being missed.

This is obviously not very important, but I'm filing it here so that it doesn't get completely forgotten :)
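
One way to pin down exactly which PGs were skipped would be to compare each PG's last deep-scrub timestamp against the time the round was started. The following is only a rough sketch: the function name `missed_deep_scrubs` and the input layout (one `pgid date time` record per line on stdin) are assumptions made for illustration, not something Ceph provides. The records would first have to be extracted from `ceph pg dump` output, assuming the release in use reports a deep-scrub timestamp, and the column positions differ between releases.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): print the ids of PGs whose last
# deep-scrub timestamp is older than a cutoff.
# Assumed input on stdin: one "pgid YYYY-MM-DD HH:MM:SS" record per line.
missed_deep_scrubs() {
    cutoff="$1"   # time the deep-scrub round was started
    awk -v cutoff="$cutoff" '
        NF >= 3 {
            # Rebuild the timestamp from its date and time fields.
            stamp = $2 " " $3
            # Timestamps in this layout sort lexically, so a plain string
            # comparison is enough to tell older from newer.
            if (stamp < cutoff) print $1
        }'
}
```

Called as `missed_deep_scrubs '2013-04-18 00:00:00' < stamps.txt` (with `stamps.txt` being a hypothetical file of such records), it prints one PG id per line for every PG that has not been deep scrubbed since the cutoff.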

Actions #1

Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info

Scrubbing skips pgs that are degraded... was the cluster active+clean when you did the scheduling?

Actions #2

Updated by Faidon Liambotis about 11 years ago

Yes, it was, and there are no indications of flapping OSDs that I can see.

I think I found the same PGs being scrubbed (not deep scrubbed) though; could it be that a normal scrub was fired by periodic scheduling and cancelled the deep scrub command?

Actions #3

Updated by David Zafman almost 11 years ago

  • Assignee set to David Zafman
Actions #4

Updated by Sage Weil about 9 years ago

  • Status changed from Need More Info to Can't reproduce