Bug #49104

crush weirdness: degraded PGs not marked as such, and choose_total_tries = 50 is too low for this cluster

Added by Dan van der Ster almost 3 years ago. Updated over 2 years ago.

Status:
Triaged
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With a 14.2.11 cluster using the nautilus optimal tunables, I found a strange case where the default choose_total_tries is not enough to find 3 OSDs for all PGs.
And some PGs are degraded but not shown as such!

Here is the tree:

# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                               STATUS REWEIGHT PRI-AFF
-13               0 root drain
 -1       694.59943 root default
 -4       163.73932     room 0513-R-0050
-18        54.57977         rack RJ47
-17        54.57977             host p05151113613837
 76   hdd   2.72899                 osd.76                      up        0 1.00000
 85   hdd   2.72899                 osd.85                      up        0 1.00000
 86   hdd   2.72899                 osd.86                      up        0 1.00000
 87   hdd   2.72899                 osd.87                      up        0 1.00000
 93   hdd   2.72899                 osd.93                      up        0 1.00000
 94   hdd   2.72899                 osd.94                      up        0 1.00000
 95   hdd   2.72899                 osd.95                      up        0 1.00000
 96   hdd   2.72899                 osd.96                      up        0 1.00000
 97   hdd   2.72899                 osd.97                      up        0 1.00000
101   hdd   2.72899                 osd.101                     up        0 1.00000
102   hdd   2.72899                 osd.102                     up        0 1.00000
103   hdd   2.72899                 osd.103                     up        0 1.00000
104   hdd   2.72899                 osd.104                     up        0 1.00000
105   hdd   2.72899                 osd.105                     up        0 1.00000
106   hdd   2.72899                 osd.106                     up        0 1.00000
107   hdd   2.72899                 osd.107                     up        0 1.00000
108   hdd   2.72899                 osd.108                     up        0 1.00000
109   hdd   2.72899                 osd.109                     up        0 1.00000
110   hdd   2.72899                 osd.110                     up        0 1.00000
111   hdd   2.72899                 osd.111                     up        0 1.00000
-12        54.57977         rack RJ55
-11        54.57977             host p05151113748698
  0   hdd   2.72899                 osd.0                       up  1.00000 1.00000
  8   hdd   2.72899                 osd.8                       up  1.00000 1.00000
 14   hdd   2.72899                 osd.14                      up  1.00000 1.00000
 17   hdd   2.72899                 osd.17                      up  1.00000 1.00000
 19   hdd   2.72899                 osd.19                      up  1.00000 1.00000
 20   hdd   2.72899                 osd.20                      up  1.00000 1.00000
 21   hdd   2.72899                 osd.21                      up  1.00000 1.00000
 26   hdd   2.72899                 osd.26                      up  1.00000 1.00000
 30   hdd   2.72899                 osd.30                      up  1.00000 1.00000
 33   hdd   2.72899                 osd.33                      up  1.00000 1.00000
 34   hdd   2.72899                 osd.34                      up  1.00000 1.00000
 37   hdd   2.72899                 osd.37                      up  1.00000 1.00000
 42   hdd   2.72899                 osd.42                      up  1.00000 1.00000
 44   hdd   2.72899                 osd.44                      up  1.00000 1.00000
 45   hdd   2.72899                 osd.45                      up  1.00000 1.00000
 47   hdd   2.72899                 osd.47                      up  1.00000 1.00000
 52   hdd   2.72899                 osd.52                      up  1.00000 1.00000
 69   hdd   2.72899                 osd.69                      up  1.00000 1.00000
 72   hdd   2.72899                 osd.72                      up  1.00000 1.00000
 73   hdd   2.72899                 osd.73                      up  1.00000 1.00000
-22        54.57977         rack RJ57
-21        54.57977             host p05151113782262
112   hdd   2.72899                 osd.112                     up  1.00000 1.00000
113   hdd   2.72899                 osd.113                     up  1.00000 1.00000
114   hdd   2.72899                 osd.114                     up  1.00000 1.00000
115   hdd   2.72899                 osd.115                     up  1.00000 1.00000
116   hdd   2.72899                 osd.116                     up  1.00000 1.00000
117   hdd   2.72899                 osd.117                     up  1.00000 1.00000
118   hdd   2.72899                 osd.118                     up  1.00000 1.00000
119   hdd   2.72899                 osd.119                     up  1.00000 1.00000
120   hdd   2.72899                 osd.120                     up  1.00000 1.00000
121   hdd   2.72899                 osd.121                     up  1.00000 1.00000
122   hdd   2.72899                 osd.122                     up  1.00000 1.00000
123   hdd   2.72899                 osd.123                     up  1.00000 1.00000
124   hdd   2.72899                 osd.124                     up  1.00000 1.00000
125   hdd   2.72899                 osd.125                     up  1.00000 1.00000
126   hdd   2.72899                 osd.126                     up  1.00000 1.00000
127   hdd   2.72899                 osd.127                     up  1.00000 1.00000
128   hdd   2.72899                 osd.128                     up  1.00000 1.00000
129   hdd   2.72899                 osd.129                     up  1.00000 1.00000
130   hdd   2.72899                 osd.130                     up  1.00000 1.00000
131   hdd   2.72899                 osd.131                     up  1.00000 1.00000
 -5       530.86011     room 0513-S-0034
 -3       530.86011         rack SJ07
 -2       530.86011             host cephdata20b-106a771b77
  1   hdd  11.05959                 osd.1                       up  1.00000 1.00000
  2   hdd  11.05959                 osd.2                       up  1.00000 1.00000
  3   hdd  11.05959                 osd.3                       up  1.00000 1.00000
  4   hdd  11.05959                 osd.4                       up  1.00000 1.00000
  5   hdd  11.05959                 osd.5                       up  1.00000 1.00000
  6   hdd  11.05959                 osd.6                       up  1.00000 1.00000
  7   hdd  11.05959                 osd.7                       up  1.00000 1.00000
  9   hdd  11.05959                 osd.9                       up  1.00000 1.00000
 10   hdd  11.05959                 osd.10                      up  1.00000 1.00000
 11   hdd  11.05959                 osd.11                      up  1.00000 1.00000
 12   hdd  11.05959                 osd.12                      up  1.00000 1.00000
 13   hdd  11.05959                 osd.13                      up  1.00000 1.00000
 15   hdd  11.05959                 osd.15                      up  1.00000 1.00000
 16   hdd  11.05959                 osd.16                      up  1.00000 1.00000
 18   hdd  11.05959                 osd.18                      up  1.00000 1.00000
 22   hdd  11.05959                 osd.22                      up  1.00000 1.00000
 23   hdd  11.05959                 osd.23                      up  1.00000 1.00000
 24   hdd  11.05959                 osd.24                      up  1.00000 1.00000
 25   hdd  11.05959                 osd.25                      up  1.00000 1.00000
 27   hdd  11.05959                 osd.27                      up  1.00000 1.00000
 28   hdd  11.05959                 osd.28                      up  1.00000 1.00000
 29   hdd  11.05959                 osd.29                      up  1.00000 1.00000
 31   hdd  11.05959                 osd.31                      up  1.00000 1.00000
 32   hdd  11.05959                 osd.32                      up  1.00000 1.00000
 35   hdd  11.05959                 osd.35                      up  1.00000 1.00000
 36   hdd  11.05959                 osd.36                      up  1.00000 1.00000
 38   hdd  11.05959                 osd.38                      up  1.00000 1.00000
 39   hdd  11.05959                 osd.39                      up  1.00000 1.00000
 40   hdd  11.05959                 osd.40                      up  1.00000 1.00000
 41   hdd  11.05959                 osd.41                      up  1.00000 1.00000
 43   hdd  11.05959                 osd.43                      up  1.00000 1.00000
 46   hdd  11.05959                 osd.46                      up  1.00000 1.00000
 48   hdd  11.05959                 osd.48                      up  1.00000 1.00000
 49   hdd  11.05959                 osd.49                      up  1.00000 1.00000
 50   hdd  11.05959                 osd.50                      up  1.00000 1.00000
 51   hdd  11.05959                 osd.51                      up  1.00000 1.00000
 53   hdd  11.05959                 osd.53                      up  1.00000 1.00000
 54   hdd  11.05959                 osd.54                      up  1.00000 1.00000
 55   hdd  11.05959                 osd.55                      up  1.00000 1.00000
 56   hdd  11.05959                 osd.56                      up  1.00000 1.00000
 57   hdd  11.05959                 osd.57                      up  1.00000 1.00000
 58   hdd  11.05959                 osd.58                      up  1.00000 1.00000
 59   hdd  11.05959                 osd.59                      up  1.00000 1.00000
 60   hdd  11.05959                 osd.60                      up  1.00000 1.00000
 61   hdd  11.05959                 osd.61                      up  1.00000 1.00000
 62   hdd  11.05959                 osd.62                      up  1.00000 1.00000
 63   hdd  11.05959                 osd.63                      up  1.00000 1.00000
 64   hdd  11.05959                 osd.64                      up  1.00000 1.00000

All of the OSDs in host p05151113613837 were marked out days ago, and most PGs have backfilled elsewhere.
Whether the OSDs in p05151113613837 are up or down doesn't matter -- starting or stopping them doesn't affect anything described below.

Pools are 3x replicated with a simple rule that replicates across hosts under the default root:

pool 173 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 342059 flags hashpspool,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 0 compression_algorithm zstd compression_mode none application cephfs
pool 174 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 342060 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 deep_scrub_interval 604800 pg_autoscale_bias 100 scrub_max_interval 604800 scrub_min_interval 86400 application cephfs
pool 179 'tapetest' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 342062 lfor 0/303646/311385 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 application cta
pool 183 'eospps' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 342063 lfor 0/300253/311387 flags hashpspool,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 0 application eos
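
The rule text itself isn't pasted here; a rule matching that description would look roughly like this in the decompiled crushmap (a sketch of the stock replicated rule, not copied from this cluster's map):

rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}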

Problem A:

Several PGs are stuck active+undersized+degraded and several other PGs are stuck active+clean+remapped:

# ceph pg ls undersized
PG     OBJECTS DEGRADED MISPLACED UNFOUND BYTES       OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION         REPORTED        UP          ACTING      SCRUB_STAMP                DEEP_SCRUB_STAMP
173.11  128172   128172         0       0 35015262966           0          0 3099 active+undersized+degraded    2m 348705'44714700 348705:69588008  [16,45]p16  [16,45]p16 2021-02-01 23:14:07.895966 2021-01-30 18:40:12.078605
173.54  127846   127846         0       0 34763230317           0          0 3074 active+undersized+degraded    2m 348705'47023180 348705:68195726  [29,37]p29  [29,37]p29 2021-02-01 23:41:57.834821 2021-02-01 23:41:57.834821
173.da  127112   127112         0       0 35179867232           0          0 3070 active+undersized+degraded    2m 348705'45398574 348705:66752382  [15,69]p15  [15,69]p15 2021-02-01 15:51:20.605883 2021-01-29 18:54:59.173094
173.e1  127513   127513         0       0 35736987192           0          0 3022 active+undersized+degraded    2m 348705'46732368 348705:65452108 [62,123]p62 [62,123]p62 2021-02-01 19:38:59.001676 2021-01-26 15:45:15.061754
174.4b   18219    18219         0       0    12582912    37185174      75873 3058 active+undersized+degraded    2m 348660'13700480 348704:14796045 [62,127]p62 [62,127]p62 2021-02-01 11:00:17.618501 2021-01-26 00:50:47.631000
174.54   18161    18161         0       0     4194304    40213710      81595 3030 active+undersized+degraded    2m 348524'13523550 348704:14671548  [53,45]p53  [53,45]p53 2021-02-01 21:17:53.439965 2021-02-01 21:17:53.439965
174.8a   18174    18174         0       0     8388608    36977086      74759 3029 active+undersized+degraded    2m 348524'13526179 348704:14534279 [53,130]p53 [53,130]p53 2021-02-01 13:43:11.968673 2021-01-30 02:52:39.605898
# ceph pg ls remapped
...
179.6        1        0         1       0          22           0          0    0 active+clean+remapped    8m   250376'293285  348710:1627158 [62,121]p62  [62,121,26]p62 2021-02-01 13:57:31.492398 2021-01-29 09:58:50.208557
179.e        4        0         4       0         608           0          0    0 active+clean+remapped    8m   250376'293285  348710:1627170 [40,113]p40  [40,113,17]p40 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548
183.12      14        0        14       0    54542336           0          0    0 active+clean+remapped    8m    248986'46016    348710:85761 [49,129]p49   [49,129,0]p49 2021-02-01 15:51:49.357897 2021-01-26 01:36:48.023650

Let's look at those active+clean+remapped PGs first. The peer_info for 179.e includes osd.17: https://termbin.com/oa1k
I don't know how to query which PGs an OSD thinks it owns (via the asok).
But if I restart osd.17 then this PG becomes degraded:

# ceph osd down 17
marked down osd.17.
# sleep 30
# ceph pg ls | grep 179.e
179.e        4        4         0       0         608           0          0    0         active+undersized+degraded   18s    250376'293285   348774:1627177       [5,113]p5       [5,113]p5 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548

So it looks like the peer info is stale for that PG? I thought maybe the peer info was stale because the pools are idle -- but I ran some rados bench runs on the relevant pools and the PGs stay active+clean+remapped until the OSD in the stale peer info is restarted.

But if I then restart osd.5 (the primary for PG 179.e), the PG switches back to active+clean+remapped!!

# ceph pg ls | grep 179.e
179.e        4        4         0       0         608           0          0    0         active+undersized+degraded    6m    250376'293285   348780:1627183       [5,113]p5       [5,113]p5 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548
# ceph osd down 5
marked down osd.5.
# sleep 30
# ceph pg ls | grep 179.e
179.e        4        0         4       0         608           0          0    0 active+clean+remapped   26s    250376'293285   348787:1627195       [5,113]p5    [5,113,17]p5 2021-02-01 17:42:14.190548 2021-02-01 17:42:14.190548

So this looks like a very strange peering issue, doesn't it?
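
As an aside, on the question above about listing which PGs an OSD holds: the mon can at least report which PGs are currently mapped to a given OSD, e.g.:

ceph pg ls-by-osd osd.17        # PGs currently mapped to osd.17
ceph pg ls-by-primary osd.17    # PGs for which osd.17 is the primary

That reflects the cluster's PG map rather than what the OSD daemon itself thinks it stores, though, so it doesn't fully answer the asok question.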


Problem B:

Those PGs really are undersized because CRUSH can't find 3 OSDs.
But if I inject a crush map with `tunable choose_total_tries 200`, then all of those strange PGs start backfilling, e.g.:

174.8a   18174        0     16830       0     8388608    36977086      74759 3029 active+remapped+backfilling   63s 348524'13526179 348725:14537005 [53,130,33]p53 [53,130,102]p53 2021-02-01 13:43:11.968673 2021-01-30 02:52:39.605898
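
For reference, injecting a map with a modified choose_total_tries is the usual decompile/edit/recompile cycle -- a sketch, with placeholder filenames:

ceph osd getcrushmap -o crush.map                 # dump the current compiled crushmap
crushtool -d crush.map -o crush.txt               # decompile to text
sed -i 's/^tunable choose_total_tries .*/tunable choose_total_tries 200/' crush.txt
crushtool -c crush.txt -o crush.map.200           # recompile
ceph osd setcrushmap -i crush.map.200             # inject the new map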

Should we increase the default choose_total_tries?

History

#1 Updated by Neha Ojha almost 3 years ago

I don't see an issue with increasing choose_total_tries as long as it is not a very high value.
Regarding Problem A, will it be possible for you to share osd logs with debug_osd=20 to demonstrate the issue?

#2 Updated by Dan van der Ster almost 3 years ago

Neha Ojha wrote:

Regarding Problem A, will it be possible for you to share osd logs with debug_osd=20 to demonstrate the issue?

OK, starting with choose_total_tries = 50 I have:

    pgs:     1064935/79517613 objects misplaced (1.339%)
             975 active+clean
             26  active+clean+remapped
             19  active+clean+scrubbing
             4   active+clean+scrubbing+deep+repair
             1   active+clean+remapped+scrubbing

Here is the first remapped PG:

38.17      343        0       343       0  2840652338         126          6 3016 active+clean+remapped    2m    350262'46683   350353:411428  [36,26]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288

Then I enabled debug_osd=20 everywhere and downed osd.36:

# ceph osd down 36
marked down osd.36.
# ceph pg ls remapped | head
PG     OBJECTS DEGRADED MISPLACED UNFOUND BYTES       OMAP_BYTES* OMAP_KEYS* LOG  STATE                           SINCE VERSION         REPORTED        UP          ACTING         SCRUB_STAMP                DEEP_SCRUB_STAMP
38.17      343        0       343       0  2840652338         126          6 3016 active+clean+remapped   37s    350262'46683   350359:411440  [36,26]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288

The PG is not yet undersized ^^.

Now I down osd.116 and the PG becomes undersized:

# ceph osd down 116
marked down osd.116.
# ceph pg ls undersized | head
PG    OBJECTS DEGRADED MISPLACED UNFOUND BYTES       OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION         REPORTED        UP         ACTING     SCRUB_STAMP                DEEP_SCRUB_STAMP
38.17     343      343         0       0  2840652338         126          6 3016 active+undersized+degraded   31s    350262'46683   350363:411452 [36,26]p36 [36,26]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288

And I down osd.36 again and the PG is no longer undersized:

# ceph osd down 36
marked down osd.36.
# ceph pg ls remapped | head
PG     OBJECTS DEGRADED MISPLACED UNFOUND BYTES       OMAP_BYTES* OMAP_KEYS* LOG  STATE                           SINCE VERSION         REPORTED        UP          ACTING         SCRUB_STAMP                DEEP_SCRUB_STAMP
38.17      343        0       343       0  2840652338         126          6 3016 active+clean+remapped   17s    350262'46683   350369:411463  [36,26]p36 [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288

Finally I injected the crushmap with choose_total_tries = 200 and the PG is no longer remapped:

# ceph osd setcrushmap -i crush.map.200
77959
# ceph pg ls | grep 38.17
38.17      343        0         0       0  2840652338         126          6 3016 active+clean   45s     350262'46683    350372:411476  [36,26,116]p36  [36,26,116]p36 2021-02-19 01:37:34.661932 2021-02-16 15:59:54.108288

The logs for those three osds, plus the crushmaps, plus ceph.log are posted: ceph-post-file: 3c5ec76e-6f8c-4fbf-997f-de4a0af2897e

debug_osd=20 begins at `2021-02-19 14:39:08` in those logs.
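
For completeness, raising the OSD debug level cluster-wide can be done along these lines (a sketch, not necessarily the exact commands used here):

ceph tell 'osd.*' injectargs '--debug_osd 20/20'    # raise OSD debug everywhere
ceph tell 'osd.*' injectargs '--debug_osd 1/5'      # restore the default afterwards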

Thanks!

#3 Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Triaged

Thanks for the detailed logs!

Firstly, the pg dump output can sometimes be a little laggy, so I am basing my assessment on the logs you've provided.

The log line including "handle_advance_map" shows us the UP/ACTING osds associated with a particular osdmap change. Note that the acting set may change after peering.

This corresponds to the time when osd.36 was brought down. Until this point, UP/ACTING = [36,26]/[36,26,116] and STATE = active+clean+remapped: the acting set has 3 OSDs, so the PG isn't undersized.

2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350354 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] handle_advance_map [26]/[26,116] -- 26/26
2021-02-19 14:40:27.869 7fc96e97c700 20 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] new interval newup [26] newacting [26,116]
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] state<Started/Primary/Active>: Active advmap interval change, fast return
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] state<Started>: Started advmap
2021-02-19 14:40:27.869 7fc96e97c700 20 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] new interval newup [26] newacting [26,116]
2021-02-19 14:40:27.869 7fc96e97c700 10 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] state<Started>: should_restart_peering, transitioning to Reset
2021-02-19 14:40:27.869 7fc96e97c700  5 osd.36 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=0 lpr=350348 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] exit Started/Primary/Active/Clean 291.539228 16 0.000182

Since osd.36 is down, osd.26 becomes the primary and UP/ACTING = [26]/[26,116], STATE = active+undersized+remapped (due to one less peer in the acting set).

2021-02-19 14:40:23.845 7f3a9e1fa700 10 osd.26 pg_epoch: 350354 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350347/350348/350305) [36,26]/[36,26,116] r=1 lpr=350348 luod=0'0 crt=350262'46683 lcod 0'0 active+remapped mbc={}] handle_advance_map [26]/[26,116] -- 26/26
2021-02-19 14:40:23.846 7f3a9e1fa700 10 osd.26 pg_epoch: 350355 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350348/350349 n=343 ec=311391/19547 lis/c 350348/350348 les/c/f 350349/350349/0 350355/350355/350355) [26]/[26,116] r=0 lpr=350355 pi=[350348,350355)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] build_prior  prior osd.36 is down
2021-02-19 14:40:24.668 7f3a9e1fa700 20 osd.26 pg_epoch: 350356 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350355/350355/350355) [26]/[26,116] r=0 lpr=350355 pi=[350348,350355)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 active+undersized+remapped mbc={}] _update_calc_stats actingset 26,116 upset 26 acting_recovery_backfill 26,116

osd.36 is brought back up:

2021-02-19 14:40:28.751 7f3a9e1fa700 10 osd.26 pg_epoch: 350356 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350355/350355/350355) [26]/[26,116] r=0 lpr=350355 pi=[350348,350355)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 active+undersized+remapped mbc={}] handle_advance_map [36,26]/[26,116] -- 36/26
2021-02-19 14:40:28.751 7f3a9e1fa700  1 osd.26 pg_epoch: 350357 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350357/350357/350355) [36,26]/[26,116] r=0 lpr=350357 pi=[350348,350357)/1 luod=0'0 crt=350262'46683 lcod 0'0 mlcod 0'0 active+remapped mbc={}] start_peering_interval up [26] -> [36,26], acting [26,116] -> [26,116], acting_primary 26 -> 26, up_primary 26 -> 36, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2021-02-19 14:40:29.754 7f3a9e1fa700 20 osd.26 pg_epoch: 350358 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350355/350356 n=343 ec=311391/19547 lis/c 350355/350348 les/c/f 350356/350349/0 350357/350357/350355) [36,26]/[26,116] r=0 lpr=350357 pi=[350348,350357)/1 crt=350262'46683 lcod 0'0 mlcod 0'0 remapped mbc={}] new interval newup [36,26] newacting [36,26,116]

osd.36 becomes the primary again and STATE = active+clean+remapped.

2021-02-19 14:40:30.870 7fc96e97c700 20 osd.36 pg_epoch: 350359 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350358/350359 n=343 ec=311391/19547 lis/c 350358/350358 les/c/f 350359/350359/0 350357/350358/350358) [36,26]/[36,26,116] r=0 lpr=350358 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] _update_calc_stats actingset 26,36,116 upset 26,36 acting_recovery_backfill 26,36,116

Similarly, we'll see the pg go to undersized when osd.116 is brought down.

We can clearly see that choose_total_tries plays a part here, since once you changed it to 200 we see osd.116 in the up set.

2021-02-19 14:44:21.086 7fc96e97c700 10 osd.36 pg_epoch: 350369 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350368/350369 n=343 ec=311391/19547 lis/c 350368/350368 les/c/f 350369/350369/0 350367/350368/350368) [36,26]/[36,26,116] r=0 lpr=350368 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] handle_advance_map [36,26,116]/[36,26,116] -- 36/36
2021-02-19 14:44:21.086 7fc96e97c700 20 osd.36 pg_epoch: 350370 pg[38.17( v 350262'46683 (344642'43667,350262'46683] local-lis/les=350368/350369 n=343 ec=311391/19547 lis/c 350368/350368 les/c/f 350369/350369/0 350367/350368/350368) [36,26]/[36,26,116] r=0 lpr=350368 crt=350262'46683 lcod 0'0 mlcod 0'0 active+clean+remapped] new interval newup [36,26,116] newacting [36,26,116]

There have been other reports about the need to increase choose_total_tries so that CRUSH tries harder to find placement targets (when PGs are stuck active+clean+remapped). In the past we increased the default from 19 to 50; perhaps we can make it even higher now, maybe 100.
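
One way to gauge how high the value needs to be, without touching a live cluster, is to test the compiled map offline -- a rough sketch (an offline test only sees the crushmap, not the osdmap's reweight/out state, so it only approximates the live situation):

crushtool -i crush.map --test --rule 0 --num-rep 3 --show-bad-mappings
crushtool -i crush.map --set-choose-total-tries 100 -o crush.map.100   # bump the tunable on the compiled map
crushtool -i crush.map.100 --test --rule 0 --num-rep 3 --show-bad-mappings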

#4 Updated by Dan van der Ster almost 3 years ago

Thanks for the analysis Neha.

Something that perhaps wasn't clear in comment 2: in each case where I print the `ceph pg ls` output, that is the steady state after all three OSDs are up again and all PGs have re-peered. To me it is strange that the PG stays "undersized" even long after all the needed OSDs are back up.

Down/up of a non-primary, e.g. osd.116, makes the PG go undersized, and it stays that way.
Down/up of the primary (osd.36) then "fixes" the PG: it is active+clean after this, without even needing any recovery.

Do you understand why the PG remains undersized after we down/up osd.116?
To me it looks like osd.116 isn't correctly advertising that it has the PG contents when it boots. And that somehow reminds me of #46847.
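
One way to check that would be to look at the primary's view right after osd.116 boots -- e.g. a pg query and its peer_info section, similar to the dump linked in the description:

ceph pg 38.17 query > pg-38.17-after-116-boot.json   # the peer_info section shows what the primary knows about each peer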

Down/up of osd.116 was triggered here, in case it helps with finding the relevant sections of the osd logs:

2021-02-19 14:42:09.952790 mon.cephdwight-mon-1633994557 (mon.0) 5275077 : cluster [INF] osd.116 [v2:128.142.23.102:6802/2682786,v1:128.142.23.102:6811/2682786] boot
2021-02-19 14:42:09.830519 osd.116 (osd.116) 1 : cluster [WRN] Monitor daemon marked osd.116 down, but it is still running
2021-02-19 14:42:09.830536 osd.116 (osd.116) 2 : cluster [DBG] map e350361 wrongly marked me down at e350360

I checked osd.36's log at that time, and indeed I don't see osd.116 joining the up set even though it is up!
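
For anyone digging into the posted logs, the peering transitions for this PG can be pulled out with something like the following (the log filename is a guess at the default naming):

grep 'pg\[38\.17' ceph-osd.36.log | grep -E 'handle_advance_map|start_peering_interval'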

#5 Updated by Dan van der Ster over 2 years ago

"Problem A" is probably a dupe of #46847
