Bug #49104
crush weirdness: degraded PGs not marked as such, and choose_total_tries = 50 is too low for this cluster
Description
On a 14.2.11 cluster, I found a strange case where, with the nautilus optimal tunables, the default choose_total_tries is not enough to find 3 OSDs for all PGs.
And some PGs are degraded but not shown as such!
Here is the tree:
# ceph osd tree
 ID CLASS WEIGHT    TYPE NAME                          STATUS REWEIGHT PRI-AFF
-13             0   root drain
 -1     694.59943   root default
 -4     163.73932       room 0513-R-0050
-18      54.57977           rack RJ47
-17      54.57977               host p05151113613837
 76   hdd   2.72899                 osd.76                 up        0 1.00000
 85   hdd   2.72899                 osd.85                 up        0 1.00000
 86   hdd   2.72899                 osd.86                 up        0 1.00000
 87   hdd   2.72899                 osd.87                 up        0 1.00000
 93   hdd   2.72899                 osd.93                 up        0 1.00000
 94   hdd   2.72899                 osd.94                 up        0 1.00000
 95   hdd   2.72899                 osd.95                 up        0 1.00000
 96   hdd   2.72899                 osd.96                 up        0 1.00000
 97   hdd   2.72899                 osd.97                 up        0 1.00000
101   hdd   2.72899                 osd.101                up        0 1.00000
102   hdd   2.72899                 osd.102                up        0 1.00000
103   hdd   2.72899                 osd.103                up        0 1.00000
104   hdd   2.72899                 osd.104                up        0 1.00000
105   hdd   2.72899                 osd.105                up        0 1.00000
106   hdd   2.72899                 osd.106                up        0 1.00000
107   hdd   2.72899                 osd.107                up        0 1.00000
108   hdd   2.72899                 osd.108                up        0 1.00000
109   hdd   2.72899                 osd.109                up        0 1.00000
110   hdd   2.72899                 osd.110                up        0 1.00000
111   hdd   2.72899                 osd.111                up        0 1.00000
-12      54.57977           rack RJ55
-11      54.57977               host p05151113748698
  0   hdd   2.72899                 osd.0                  up  1.00000 1.00000
  8   hdd   2.72899                 osd.8                  up  1.00000 1.00000
 14   hdd   2.72899                 osd.14                 up  1.00000 1.00000
 17   hdd   2.72899                 osd.17                 up  1.00000 1.00000
 19   hdd   2.72899                 osd.19                 up  1.00000 1.00000
 20   hdd   2.72899                 osd.20                 up  1.00000 1.00000
 21   hdd   2.72899                 osd.21                 up  1.00000 1.00000
 26   hdd   2.72899                 osd.26                 up  1.00000 1.00000
 30   hdd   2.72899                 osd.30                 up  1.00000 1.00000
 33   hdd   2.72899                 osd.33                 up  1.00000 1.00000
 34   hdd   2.72899                 osd.34                 up  1.00000 1.00000
 37   hdd   2.72899                 osd.37                 up  1.00000 1.00000
 42   hdd   2.72899                 osd.42                 up  1.00000 1.00000
 44   hdd   2.72899                 osd.44                 up  1.00000 1.00000
 45   hdd   2.72899                 osd.45                 up  1.00000 1.00000
 47   hdd   2.72899                 osd.47                 up  1.00000 1.00000
 52   hdd   2.72899                 osd.52                 up  1.00000 1.00000
 69   hdd   2.72899                 osd.69                 up  1.00000 1.00000
 72   hdd   2.72899                 osd.72                 up  1.00000 1.00000
 73   hdd   2.72899                 osd.73                 up  1.00000 1.00000
-22      54.57977           rack RJ57
-21      54.57977               host p05151113782262
112   hdd   2.72899                 osd.112                up  1.00000 1.00000
113   hdd   2.72899                 osd.113                up  1.00000 1.00000
114   hdd   2.72899                 osd.114                up  1.00000 1.00000
115   hdd   2.72899                 osd.115                up  1.00000 1.00000
116   hdd   2.72899                 osd.116                up  1.00000 1.00000
117   hdd   2.72899                 osd.117                up  1.00000 1.00000
118   hdd   2.72899                 osd.118                up  1.00000 1.00000
119   hdd   2.72899                 osd.119                up  1.00000 1.00000
120   hdd   2.72899                 osd.120                up  1.00000 1.00000
121   hdd   2.72899                 osd.121                up  1.00000 1.00000
122   hdd   2.72899                 osd.122                up  1.00000 1.00000
123   hdd   2.72899                 osd.123                up  1.00000 1.00000
124   hdd   2.72899                 osd.124                up  1.00000 1.00000
125   hdd   2.72899                 osd.125                up  1.00000 1.00000
126   hdd   2.72899                 osd.126                up  1.00000 1.00000
127   hdd   2.72899                 osd.127                up  1.00000 1.00000
128   hdd   2.72899                 osd.128                up  1.00000 1.00000
129   hdd   2.72899                 osd.129                up  1.00000 1.00000
130   hdd   2.72899                 osd.130                up  1.00000 1.00000
131   hdd   2.72899                 osd.131                up  1.00000 1.00000
 -5     530.86011       room 0513-S-0034
 -3     530.86011           rack SJ07
 -2     530.86011               host cephdata20b-106a771b77
  1   hdd  11.05959                 osd.1                  up  1.00000 1.00000
  2   hdd  11.05959                 osd.2                  up  1.00000 1.00000
  3   hdd  11.05959                 osd.3                  up  1.00000 1.00000
  4   hdd  11.05959                 osd.4                  up  1.00000 1.00000
  5   hdd  11.05959                 osd.5                  up  1.00000 1.00000
  6   hdd  11.05959                 osd.6                  up  1.00000 1.00000
  7   hdd  11.05959                 osd.7                  up  1.00000 1.00000
  9   hdd  11.05959                 osd.9                  up  1.00000 1.00000
 10   hdd  11.05959                 osd.10                 up  1.00000 1.00000
 11   hdd  11.05959                 osd.11                 up  1.00000 1.00000
 12   hdd  11.05959                 osd.12                 up  1.00000 1.00000
 13   hdd  11.05959                 osd.13                 up  1.00000 1.00000
 15   hdd  11.05959                 osd.15                 up  1.00000 1.00000
 16   hdd  11.05959                 osd.16                 up  1.00000 1.00000
 18   hdd  11.05959                 osd.18                 up  1.00000 1.00000
 22   hdd  11.05959                 osd.22                 up  1.00000 1.00000
 23   hdd  11.05959                 osd.23                 up  1.00000 1.00000
 24   hdd  11.05959                 osd.24                 up  1.00000 1.00000
 25   hdd  11.05959                 osd.25                 up  1.00000 1.00000
 27   hdd  11.05959                 osd.27                 up  1.00000 1.00000
 28   hdd  11.05959                 osd.28                 up  1.00000 1.00000
 29   hdd  11.05959                 osd.29                 up  1.00000 1.00000
 31   hdd  11.05959                 osd.31                 up  1.00000 1.00000
 32   hdd  11.05959                 osd.32                 up  1.00000 1.00000
 35   hdd  11.05959                 osd.35                 up  1.00000 1.00000
 36   hdd  11.05959                 osd.36                 up  1.00000 1.00000
 38   hdd  11.05959                 osd.38                 up  1.00000 1.00000
 39   hdd  11.05959                 osd.39                 up  1.00000 1.00000
 40   hdd  11.05959                 osd.40                 up  1.00000 1.00000
 41   hdd  11.05959                 osd.41                 up  1.00000 1.00000
 43   hdd  11.05959                 osd.43                 up  1.00000 1.00000
 46   hdd  11.05959                 osd.46                 up  1.00000 1.00000
 48   hdd  11.05959                 osd.48                 up  1.00000 1.00000
 49   hdd  11.05959                 osd.49                 up  1.00000 1.00000
 50   hdd  11.05959                 osd.50                 up  1.00000 1.00000
 51   hdd  11.05959                 osd.51                 up  1.00000 1.00000
 53   hdd  11.05959                 osd.53                 up  1.00000 1.00000
 54   hdd  11.05959                 osd.54                 up  1.00000 1.00000
 55   hdd  11.05959                 osd.55                 up  1.00000 1.00000
 56   hdd  11.05959                 osd.56                 up  1.00000 1.00000
 57   hdd  11.05959                 osd.57                 up  1.00000 1.00000
 58   hdd  11.05959                 osd.58                 up  1.00000 1.00000
 59   hdd  11.05959                 osd.59                 up  1.00000 1.00000
 60   hdd  11.05959                 osd.60                 up  1.00000 1.00000
 61   hdd  11.05959                 osd.61                 up  1.00000 1.00000
 62   hdd  11.05959                 osd.62                 up  1.00000 1.00000
 63   hdd  11.05959                 osd.63                 up  1.00000 1.00000
 64   hdd  11.05959                 osd.64                 up  1.00000 1.00000
All of the osds in host p05151113613837 were marked out days ago and most PGs backfilled elsewhere.
And the OSDs in p05151113613837 can be up or down -- starting or stopping those osds doesn't affect what I'll write below.
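(For context, the REWEIGHT of 0 in the tree is what `ceph osd out` produces; the drain was presumably done with something along these lines -- a sketch only, the exact commands aren't part of this report:

# hypothetical reconstruction: mark every OSD under the host out so its PGs backfill elsewhere
ceph osd out $(ceph osd ls-tree p05151113613837)
)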
Pools are 3x replicated with a simple rule to replicate across hosts from default root:
pool 173 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 342059 flags hashpspool,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 0 compression_algorithm zstd compression_mode none application cephfs
pool 174 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 342060 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 deep_scrub_interval 604800 pg_autoscale_bias 100 scrub_max_interval 604800 scrub_min_interval 86400 application cephfs
pool 179 'tapetest' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 342062 lfor 0/303646/311385 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 application cta
pool 183 'eospps' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 342063 lfor 0/300253/311387 flags hashpspool,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 0 application eos
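For reference, a "simple rule to replicate across hosts from default root" typically decompiles (via `crushtool -d`) to the stock replicated rule below; this is a sketch of what crush_rule 0 presumably looks like, not a dump from this cluster:

rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}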
Problem A:
Several PGs are stuck active+undersized+degraded and several other PGs are stuck active+clean+remapped:
# ceph pg ls undersized
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
173.11 128172 128172 0 0 35015262966 0 0 3099 active+undersized+degraded 2m 348705'44714700 348705:69588008 [16,45]p16 [16,45]p16 2021-02-01 23:14:07.895966 2021-01-30 18:40:12.078605
173.54 127846 127846 0 0 34763230317 0 0 3074 active+undersized+degraded 2m 348705'47023180 348705:68195726 [29,37]p29 [29,37]p29 2021-02-01 23:41:57.834821 2021-02-01 23:41:57.834821
173.da 127112 127112 0 0 35179867232 0 0 3070 active+undersized+degraded 2m 348705'45398574 348705:66752382 [15,69]p15 [15,69]p15 2021-02-01 15:51:20.605883 2021-01-29 18:54:59.173094
173.e1 127513 127513 0 0 35736987192 0 0 3022 active+undersized+degraded 2m 348705'46732368 348705:65452108 [62,123]p62 [62,123]p62 2021-02-01 19:38:59.001676 2021-01-26 15:45:15.061754
174.4b 18219 18219 0 0 12582912 37185174 75873 3058 active+undersized+degraded 2m 348660'13700480 348704:14796045 [62,127]p62 [62,127]p62 2021-02-01 11:00:17.618501 2021-01-26 00:50:47.631000
174.54 18161 18161 0 0 4194304 40213710 81595 3030 active+undersized+degraded 2m 348524'13523550 348704:14671548 [53,45]p53 [53,45]p53 2021-02-01 21:17:53.439965 2021-02-01 21:17:53.439965
174.8a 18174 18174 0 0 8388608 36977086 74759 3029 active+undersized+degraded 2m 348524'13526179 348704:14534279 [53,130]p53 [53,130]p53 2021-02-01 13:43:11.968673 2021-01-30 02:52:39.605898
# ceph pg ls remapped
...
179.6 1 0 1 0 22 0 0 0 active+clean+remapped 8m 250376'293285 348710:1627158 [62,121]p62 [62,121,26]p62 2021-02-01 13:57:31.492398 2021-01-29 09:58:50.208557
179.e 4 0 4 0 608 0 0 0 active+clean+remapped 8m 250376'293285 348710:1627170 [40,113]p40 [40,113,17]p40 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548
183.12 14 0 14 0 54542336 0 0 0 active+clean+remapped 8m 248986'46016 348710:85761 [49,129]p49 [49,129,0]p49 2021-02-01 15:51:49.357897 2021-01-26 01:36:48.023650
Let's look at those active+clean+remapped PGs first. The peer_info for 179.e includes osd.17: https://termbin.com/oa1k
I don't know how to query which PGs an OSD thinks it owns (via the asok).
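(There doesn't seem to be an asok command for exactly that, but two mon-side approximations are sketched below, using this cluster's ids as examples; jq is only for readability:

# OSDs the osdmap currently maps pg 179.e to, plus the peers the primary knows about
ceph pg 179.e query | jq '.up, .acting, [.peer_info[].peer]'
# PGs that map to osd.17 according to the osdmap
ceph pg ls-by-osd 17
)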
But if I restart osd.17 then this PG becomes degraded:
# ceph osd down 17
marked down osd.17.
# sleep 30
# ceph pg ls | grep 179.e
179.e 4 4 0 0 608 0 0 0 active+undersized+degraded 18s 250376'293285 348774:1627177 [5,113]p5 [5,113]p5 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548
So it looks like the peer info is stale for that PG? I thought maybe the peer info is stale because the pools are idle -- but I ran some rados bench runs on the relevant pools and the PGs stay active+clean+remapped until the OSD in the stale peer info is restarted.
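(The benches were along these lines; the parameters here are illustrative, not the exact ones used:

rados bench -p tapetest 60 write --no-cleanup
rados bench -p tapetest 60 rand
rados -p tapetest cleanup
)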
But then I restart osd.5 (the primary for PG 179.e), and the PG switches back to active+clean+remapped!!
# ceph pg ls | grep 179.e
179.e 4 4 0 0 608 0 0 0 active+undersized+degraded 6m 250376'293285 348780:1627183 [5,113]p5 [5,113]p5 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548
# ceph osd down 5
marked down osd.5.
# sleep 30
# ceph pg ls | grep 179.e
179.e 4 0 4 0 608 0 0 0 active+clean+remapped 26s 250376'293285 348787:1627195 [5,113]p5 [5,113,17]p5 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548
So that represents some very strange peering issue, doesn't it?
Problem B:
Those PGs really are undersized because crush can't find 3 OSDs.
But if I inject a crush map with `tunable choose_total_tries 200` then all of those strange PGs start backfilling, e.g.:
174.8a 18174 0 16830 0 8388608 36977086 74759 3029 active+remapped+backfilling 63s 348524'13526179 348725:14537005 [53,130,33]p53 [53,130,102]p53 2021-02-01 13:43:11.968673 2021-01-30 02:52:39.605898
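For reference, the tunable change was presumably done by round-tripping the crushmap through crushtool, roughly as follows (file names are illustrative):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: set "tunable choose_total_tries 200" in the tunables section
crushtool -c crush.txt -o crush.map.200
ceph osd setcrushmap -i crush.map.200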
Should we increase the default choose_total_tries?
History
#1 Updated by Neha Ojha almost 3 years ago
I don't see an issue with increasing choose_total_tries as long as it is not a very high value.
Regarding Problem A, will it be possible for you to share osd logs with debug_osd=20 to demonstrate the issue?
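(One way to do that cluster-wide, as a sketch -- debug 20 is very verbose, so revert it afterwards:

ceph config set osd debug_osd 20/20
# ...reproduce the issue, then collect /var/log/ceph/ceph-osd.*.log...
ceph config rm osd debug_osd
)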
#2 Updated by Dan van der Ster almost 3 years ago
Neha Ojha wrote:
Regarding Problem A, will it be possible for you to share osd logs with debug_osd=20 to demonstrate the issue?
OK, starting with the current choose_total_tries = 50 I have:
pgs: 1064935/79517613 objects misplaced (1.339%)
     975 active+clean
      26 active+clean+remapped
      19 active+clean+scrubbing
       4 active+clean+scrubbing+deep+repair
       1 active+clean+remapped+scrubbing
Here is the first remapped PG:
38.17 343 0 343 0 2840652338 126 6 3016 active+clean+remapped 2m 350262'46683 350353:411428 [36,26]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288
Then I enabled debug_osd 20 everywhere and downed osd.36:
# ceph osd down 36
marked down osd.36.
# ceph pg ls remapped | head
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
38.17 343 0 343 0 2840652338 126 6 3016 active+clean+remapped 37s 350262'46683 350359:411440 [36,26]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288
The PG is not yet undersized ^^.
Now I down osd.116, and the PG becomes undersized:
# ceph osd down 116
marked down osd.116.
# ceph pg ls undersized | head
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
38.17 343 343 0 0 2840652338 126 6 3016 active+undersized+degraded 31s 350262'46683 350363:411452 [36,26]p36 [36,26]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288
And I down osd.36 again and the PG is no longer undersized:
# ceph osd down 36
marked down osd.36.
# ceph pg ls remapped | head
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
38.17 343 0 343 0 2840652338 126 6 3016 active+clean+remapped 17s 350262'46683 350369:411463 [36,26]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288
Finally I injected the crushmap with choose_total_tries = 200 and the PG is no longer remapped:
# ceph osd setcrushmap -i crush.map.200
77959
# ceph pg ls | grep 38.17
38.17 343 0 0 0 2840652338 126 6 3016 active+clean 45s 350262'46683 350372:411476 [36,26,116]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288
The logs for those three osds, plus the crushmaps, plus ceph.log are posted: ceph-post-file: 3c5ec76e-6f8c-4fbf-997f-de4a0af2897e
debug_osd=20 begins at `2021-02-19 14:39:08` in those logs.
Thanks!
#3 Updated by Neha Ojha almost 3 years ago
- Status changed from New to Triaged
Thanks for the detailed logs!
Firstly, the pg dump output can sometimes be a little laggy, so I am basing my assessment on the logs you've provided.
The log line including "handle_advance_map" shows us the UP/ACTING osds associated with a particular osdmap change. Note that the acting set may change after peering.
This corresponds to the time when osd.36 was brought down. Until this time, UP/ACTING = [36,26]/[36,26,116] and STATE = active+clean+remapped. This is because the acting set has 3 osds, and hence the PG isn't undersized:
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350354 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] handle_advance_map [26]/[26,116] -- 26/26
2021-02-19 14:40:27.869 7fc96e97c700 20 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] new interval newup [26] newacting [26,116]
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] state<Started/Primary/Active>: Active advmap interval change, fast return
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] state<Started>: Started advmap
2021-02-19 14:40:27.869 7fc96e97c700 20 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] new interval newup [26] newacting [26,116]
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] state<Started>: should_restart_peering, transitioning to Reset
2021-02-19 14:40:27.869 7fc96e97c700 5 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] exit Started/Primary/Active/Clean 291.539228 16 0.000182
Since osd.36 is down, osd.26 becomes the primary and UP/ACTING = [26]/[26,116], STATE = active+undersized+remapped (due to one less peer in the acting set):
2021-02-19 14:40:23.845 7f3a9e1fa700 10 osd.26 pg_epoch: 350354 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=1 lpr=350348 luod=0'0 crt=350262'46683 lcod 0'0 active+remapped mbc={}] handle_advance_map [26]/[26,116] -- 26/26
2021-02-19 14:40:23.846 7f3a9e1fa700 10 osd.26 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350355/350355/350355) [26]/[26,116] r=0 lpr=350355 pi=[350348,350355)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] build_prior prior osd.36 is down
2021-02-19 14:40:24.668 7f3a9e1fa700 20 osd.26 pg_epoch: 350356 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350355/350355/350355) [26]/[26,116] r=0 lpr=350355 pi=[350348,350355)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 active+undersized+remapped mbc={}] _update_calc_stats actingset 26,116 upset 26 acting_recovery_backfill 26,116
osd.36 is brought back:
2021-02-19 14:40:28.751 7f3a9e1fa700 10 osd.26 pg_epoch: 350356 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350355/350355/350355) [26]/[26,116] r=0 lpr=350355 pi=[350348,350355)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 active+undersized+remapped mbc={}] handle_advance_map [36,26]/[26,116] -- 36/26
2021-02-19 14:40:28.751 7f3a9e1fa700 1 osd.26 pg_epoch: 350357 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350357/350357/350355) [36,26]/[26,116] r=0 lpr=350357 pi=[350348,350357)/1 luod=0'0 crt=350262'46683 lcod 0'0 mlcod 0'0 active+remapped mbc={}] start_peering_interval up [26] -> [36,26], acting [26,116] -> [26,116], acting_primary 26 -> 26, up_primary 26 -> 36, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2021-02-19 14:40:29.754 7f3a9e1fa700 20 osd.26 pg_epoch: 350358 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350357/350357/350355) [36,26]/[26,116] r=0 lpr=350357 pi=[350348,350357)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 remapped mbc={}] new interval newup [36,26] newacting [36,26,116]
osd.36 becomes the primary again and STATE = active+clean+remapped:
2021-02-19 14:40:30.870 7fc96e97c700 20 osd.36 pg_epoch: 350359 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350358/350359 n=343 ec=311391/19547 lis/c 350358/350358 les/c/f 350359/350359/0 350357/350358/350358) [36,26]/[36,26,116] r=0 lpr=350358 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] _update_calc_stats actingset 26,36,116 upset 26,36 acting_recovery_backfill 26,36,116
Similarly, we'll see the pg go to undersized when osd.116 is brought down.
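(The same up/acting transitions can also be spot-checked from the CLI without log diving, e.g. with something like the following, which prints the current up and acting sets for the pg straight from the osdmap:

# output is roughly "osdmap eN pg 38.17 (38.17) -> up [36,26] acting [36,26,116]"
ceph pg map 38.17
)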
We can clearly see that choose_total_tries has a part to play here, since when you changed it to 200 we see osd.116 in the up set:
2021-02-19 14:44:21.086 7fc96e97c700 10 osd.36 pg_epoch: 350369 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350368/350369 n=343 ec=311391/19547 lis/c 350368/350368 les/c/f 350369/350369/0 350367/350368/350368) [36,26]/[36,26,116] r=0 lpr=350368 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] handle_advance_map [36,26,116]/[36,26,116] -- 36/36
2021-02-19 14:44:21.086 7fc96e97c700 20 osd.36 pg_epoch: 350370 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350368/350369 n=343 ec=311391/19547 lis/c 350368/350368 les/c/f 350369/350369/0 350367/350368/350368) [36,26]/[36,26,116] r=0 lpr=350368 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] new interval newup [36,26,116] newacting [36,26,116]
There have been other reports about the need to increase the value of choose_total_tries for crush to try harder to find placement targets (when pgs are in active+clean+remapped). In the past we increased the default from 19 to 50; perhaps we can make it even higher now, maybe 100.
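One way to evaluate candidate values offline is to replay the mapping with crushtool against a compiled map. A sketch, assuming `crush.map` is the current map and `crush.map.200` the modified one from above, with rule id and replica count per the pools in this report:

# any line printed here is an input that could not be mapped to 3 OSDs with the current tunables
crushtool -i crush.map --test --rule 0 --num-rep 3 --show-bad-mappings
# after raising choose_total_tries the bad mappings should disappear
crushtool -i crush.map.200 --test --rule 0 --num-rep 3 --show-bad-mappings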
#4 Updated by Dan van der Ster almost 3 years ago
Thanks for the analysis Neha.
Something that perhaps wasn't clear in comment 2 -- in each case where I print the `ceph pg ls` output, this is the steady state after all three osds are again up and all pgs have re-peered. To me it is strange that the pg stays "undersized" even long after all the needed OSDs are back up.
Down/up of a non-primary, e.g. osd.116, makes the PG go undersized and it stays that way.
Down/up of the primary osd.36 then "fixes" the pg: it is active+clean after this, without even needing any recovery.
Do you understand why the PG remains undersized after we down/up osd.116?
To me it looks like osd.116 isn't correctly advertising that it has the pg contents when it boots. And that somehow reminds me of #46847.
Down/up of osd.116 was triggered here, in case it helps with finding the relevant sections of the osd logs:
2021-02-19 14:42:09.952790 mon.cephdwight-mon-1633994557 (mon.0) 5275077 : cluster [INF] osd.116 [v2:128.142.23.102:6802/2682786,v1:128.142.23.102:6811/2682786] boot
2021-02-19 14:42:09.830519 osd.116 (osd.116) 1 : cluster [WRN] Monitor daemon marked osd.116 down, but it is still running
2021-02-19 14:42:09.830536 osd.116 (osd.116) 2 : cluster [DBG] map e350361 wrongly marked me down at e350360
I checked osd.36's log at that time, and indeed I don't see 116 joining the up set even though it is up!
#5 Updated by Dan van der Ster over 2 years ago
"Problem A" is probably a dupe of #46847