Bug #41526
Closed
Choosing the next PG for a deep scrub is wrong.
Description
I have a Ceph cluster in this state:
# ceph health detail
HEALTH_WARN 27 pgs not deep-scrubbed in time;
PG_NOT_DEEP_SCRUBBED 27 pgs not deep-scrubbed in time
    pg 15.e4 not deep-scrubbed since 2019-08-14 19:15:30.699834
    pg 2.db not deep-scrubbed since 2019-08-14 14:14:25.173851
    pg 15.d8 not deep-scrubbed since 2019-08-14 20:19:03.937229
    pg 9.d9 not deep-scrubbed since 2019-08-14 19:40:16.157361
    pg 9.db not deep-scrubbed since 2019-08-15 04:24:14.865325
    pg 2.5c not deep-scrubbed since 2019-08-14 20:22:04.605225
    pg 17.45 not deep-scrubbed since 2019-08-15 01:43:06.099446
    pg 6.51 not deep-scrubbed since 2019-08-14 13:22:51.959783
    pg 6.4a not deep-scrubbed since 2019-08-15 03:39:44.701350
    pg 2.4d not deep-scrubbed since 2019-08-14 23:54:40.245206
    pg 7.33 not deep-scrubbed since 2019-08-15 03:32:24.927287
    pg 17.27 not deep-scrubbed since 2019-08-14 14:15:42.543503
    pg 17.17 not deep-scrubbed since 2019-08-14 23:09:42.728755
    pg 6.3 not deep-scrubbed since 2019-08-14 14:10:45.757717
    pg 10.12 not deep-scrubbed since 2019-08-15 03:11:14.487778
    pg 17.c not deep-scrubbed since 2019-08-14 22:43:25.739869
    pg 9.2f not deep-scrubbed since 2019-08-15 01:20:00.682925
    pg 10.22 not deep-scrubbed since 2019-08-14 19:50:51.555694
    pg 17.67 not deep-scrubbed since 2019-08-14 06:07:33.732999
    pg 6.79 not deep-scrubbed since 2019-08-15 03:28:21.388018
    pg 10.86 not deep-scrubbed since 2019-08-14 23:56:32.100535
    pg 2.92 not deep-scrubbed since 2019-08-15 04:20:55.337923
    pg 9.93 not deep-scrubbed since 2019-08-15 01:06:08.958664
    pg 2.9b not deep-scrubbed since 2019-08-15 04:28:54.398524
    pg 6.a7 not deep-scrubbed since 2019-08-15 04:05:45.886923
    pg 10.a7 not deep-scrubbed since 2019-08-14 12:34:06.675859
    pg 17.a4 not deep-scrubbed since 2019-08-14 22:21:07.467654
But Ceph selects a strange (random?) PG for deep scrubbing.
I run the following command:
ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $25, $26, $1}' | sort > b1
After some time, I run it again:
ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $25, $26, $1}' | sort > b2
And now the diff between the two:
diff b1 b2
558d557
< 2019-08-17 03:24:30.058471 17.18
2821a2821
> 2019-08-27 11:21:32.744345 17.18
As you can see, PG 17.18, which is not that old, was selected for deep scrubbing although there are older PGs.
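The same comparison can be sketched in Python, which makes it easier to list the PGs that were older than the one that got scrubbed. This is only an illustration: the function names are hypothetical, the input is assumed to use the "date time pgid" layout produced by the awk command above, and the sample combines the 17.18 lines from the diff with two stamps from the health output, assuming those PGs still carried the same stamps when b1 was taken.

```python
from datetime import datetime


def parse_snapshot(text):
    """Parse lines of the form '<date> <time> <pgid>' into {pgid: datetime}."""
    stamps = {}
    for line in text.strip().splitlines():
        date, time, pgid = line.split()
        stamps[pgid] = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")
    return stamps


def scrubbed_between(before, after):
    """Map each PG whose deep-scrub stamp advanced to the PGs that were older than it."""
    result = {}
    for pgid, new_stamp in after.items():
        old_stamp = before.get(pgid)
        if old_stamp is not None and new_stamp > old_stamp:
            # PGs whose last deep scrub predates this PG's previous stamp
            result[pgid] = sorted(p for p, s in before.items() if s < old_stamp)
    return result


# Sample data (see the assumptions stated in the text above this block).
b1 = """
2019-08-14 06:07:33.732999 17.67
2019-08-14 13:22:51.959783 6.51
2019-08-17 03:24:30.058471 17.18
"""
b2 = """
2019-08-14 06:07:33.732999 17.67
2019-08-14 13:22:51.959783 6.51
2019-08-27 11:21:32.744345 17.18
"""

print(scrubbed_between(parse_snapshot(b1), parse_snapshot(b2)))
# {'17.18': ['17.67', '6.51']}
```

A non-empty list for a PG means it was deep-scrubbed while older PGs were still waiting.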
Updated by David Zafman over 4 years ago
- Assignee set to David Zafman
You can never be sure which scrubs are able to run under osd_max_scrubs (especially with its default of 1). Without looking at which OSDs are involved in each PG and what other scrubbing is already in progress, it is hard to tell why scrubs run in the order they do. I believe that on a test cluster, as you raise osd_max_scrubs, the cluster will tend to start scrubbing the most out-of-date PGs first. Remember that each OSD, for each of its primary PGs, will try to scrub the most out-of-date one first, but it is competing with all other OSDs for the osd_max_scrubs slots available on ALL replicas. So even though processing of scrubs starts from the most out-of-date PG, priority goes to spreading scrubs across different nodes rather than bunching them up and slowing those nodes down for client access.
So with osd_max_scrubs == 1
pg 1.0 [0,1,2] - most out of date
pg 1.1 [1,2,3]
pg 1.2 [2,3,4]
pg 1.3 [5,6,7] - least out of date
In this fictitious set of PGs, we can never scrub pg 1.0, 1.1 and 1.2 at the same time because their OSD sets overlap (all three include OSD 2). So 1.3 could scrub at the same time as 1.0 even though 1.1 and 1.2 are more out of date than 1.3.
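The example can be reproduced with a small simulation. This is a deliberately simplified, centralized model and not Ceph's actual scheduler: in a real cluster each primary OSD tries to reserve scrub slots on its replicas independently. The function name pick_scrubs is hypothetical; the PG layout is the fictitious one above.

```python
def pick_scrubs(pgs, osd_max_scrubs=1):
    """Greedy model: walk PGs from most to least out of date and start a scrub
    only if every OSD in the PG's set still has a free scrub slot."""
    in_use = {}   # OSD id -> number of scrub slots currently reserved
    started = []
    for pgid, osds in pgs:
        if all(in_use.get(osd, 0) < osd_max_scrubs for osd in osds):
            for osd in osds:
                in_use[osd] = in_use.get(osd, 0) + 1
            started.append(pgid)
    return started


# PGs ordered from most to least out of date, with their OSD sets.
pgs = [
    ("1.0", [0, 1, 2]),
    ("1.1", [1, 2, 3]),
    ("1.2", [2, 3, 4]),
    ("1.3", [5, 6, 7]),
]

print(pick_scrubs(pgs, osd_max_scrubs=1))  # ['1.0', '1.3']
print(pick_scrubs(pgs, osd_max_scrubs=2))  # ['1.0', '1.1', '1.3']
print(pick_scrubs(pgs, osd_max_scrubs=3))  # ['1.0', '1.1', '1.2', '1.3']
```

In this model, raising osd_max_scrubs lets more of the older PGs start together, which matches the expectation that a higher limit brings the order closer to strictly oldest-first.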