Bug #56995
PGs go inactive after failed OSD comes up and is marked as in
Description
I observe a problem with peering after an OSD goes down and comes back up again. A varying number of PGs end up inactive.
Expected behaviour is that all PGs become active.
This is observed on an Octopus test cluster [ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)] with CephFS as the only application. Here is a full session showing the problem and how to reproduce it:
# On a ceph fs client run a benchmark that produces high IOPs load. Then:

[root@tceph-01 ~]# ceph osd set noup
noup is set
[root@tceph-01 ~]# ceph osd down 6
marked down osd.6.
[root@tceph-01 ~]# ceph osd out 6
marked out osd.6.
[root@tceph-01 ~]# ceph osd unset noup
noup is unset
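For repeated runs it is convenient to script these steps. A rough sketch of the part of the session above (osd.6 is just the OSD used here; the client benchmark is assumed to be running already):

#!/bin/bash
# Sketch of the reproduction steps shown above (osd.6 as in this session).
# A high-IOPs client benchmark against the file system is assumed to be running.
OSD=6

ceph osd set noup        # keep the OSD from coming straight back up
ceph osd down "$OSD"
ceph osd out "$OSD"
ceph osd unset noup      # now the OSD boots and rejoins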
At this point in time the OSD comes up again and all PGs become active.
[root@tceph-01 ~]# ceph osd in 6
marked in osd.6.
At this point things go wrong. PGs peer, but a random number of PGs remains stuck inactive (random in the sense that repeating this procedure leaves a different number of PGs inactive each time). A "ceph pg repeer" does not help. Forcing recovery at least schedules the inactive PGs for recovery sooner, but the "slow OPS" problems remain until recovery completes. Here is the last part of the session:
[root@tceph-01 ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Degraded data redundancy: 63459/4563189 objects degraded (1.391%), 138 pgs degraded

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 5d)
    mgr: tceph-01(active, since 6d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 65s), 9 in (since 43s); 16 remapped pgs

  data:
    pools:   4 pools, 321 pgs
    objects: 1.07M objects, 49 GiB
    usage:   232 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     4.050% pgs not active
             63459/4563189 objects degraded (1.391%)
             10/4563189 objects misplaced (0.000%)
             183 active+clean
             118 active+recovery_wait+degraded
             13  recovery_wait+undersized+degraded+remapped+peered
             4   active+recovering+degraded
             2   active+recovery_wait+undersized+degraded+remapped
             1   active+undersized+degraded+remapped+backfill_wait

  io:
    client:   0 B/s rd, 19 MiB/s wr, 11 op/s rd, 619 op/s wr
    recovery: 2.0 MiB/s, 529 keys/s, 65 objects/s

[root@tceph-01 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 13 pgs inactive; Degraded data redundancy: 62454/4594533 objects degraded (1.359%), 130 pgs degraded, 15 pgs undersized
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.tceph-03(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 65 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive
    pg 4.1 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,1,5,0,2147483647,2]
    pg 4.2 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,5,2147483647,0,3,2]
    pg 4.14 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,4,2147483647,2,5,1]
    pg 4.1c is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,4,2,3,0]
    pg 4.33 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,4,2147483647,1,2,0]
    pg 4.3e is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,2147483647,2147483647,1,0,3]
    pg 4.4c is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,2147483647,1,2147483647,2,3]
    pg 4.4d is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,1,3,0]
    pg 4.53 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,4,2147483647,2147483647,1,5]
    pg 4.54 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,3,4,2147483647,5]
    pg 4.66 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2,5,0,4,2147483647]
    pg 4.68 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,0,4,3]
    pg 4.70 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,2147483647,2,4,3]
[WRN] PG_DEGRADED: Degraded data redundancy: 62454/4594533 objects degraded (1.359%), 130 pgs degraded, 15 pgs undersized
    pg 3.50 is active+recovery_wait+degraded, acting [6,5,8]
    pg 3.51 is active+recovery_wait+degraded, acting [6,4,1]
    pg 3.53 is active+recovery_wait+degraded, acting [5,6,4]
    pg 3.55 is active+recovery_wait+degraded, acting [8,1,6]
    pg 3.59 is active+recovery_wait+degraded, acting [5,6,8]
    pg 3.5b is active+recovery_wait+degraded, acting [7,6,8]
    pg 3.5e is active+recovery_wait+degraded, acting [5,6,4]
    pg 3.5f is active+recovery_wait+degraded, acting [2,6,1]
    pg 3.64 is active+recovery_wait+degraded, acting [4,6,1]
    pg 3.66 is active+recovery_wait+degraded, acting [5,6,8]
    pg 3.68 is active+recovery_wait+degraded, acting [4,7,6]
    pg 3.6b is active+recovery_wait+degraded, acting [7,8,6]
    pg 3.70 is active+recovery_wait+degraded, acting [5,4,6]
    pg 3.71 is active+recovery_wait+degraded, acting [5,4,6]
    pg 3.72 is active+recovery_wait+degraded, acting [8,7,6]
    pg 3.7c is active+recovery_wait+degraded, acting [6,5,2]
    pg 3.7e is active+recovery_wait+degraded, acting [6,2,7]
    pg 4.51 is active+recovery_wait+degraded, acting [7,6,1,5,4,2]
    pg 4.53 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,4,2147483647,2147483647,1,5]
    pg 4.54 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,3,4,2147483647,5]
    pg 4.55 is active+recovery_wait+degraded, acting [3,8,5,4,7,6]
    pg 4.56 is active+recovery_wait+degraded, acting [4,8,1,0,6,3]
    pg 4.57 is active+recovery_wait+degraded, acting [3,7,6,5,8,0]
    pg 4.59 is active+recovery_wait+degraded, acting [4,6,2,1,3,7]
    pg 4.5a is active+recovery_wait+degraded, acting [5,7,4,3,0,6]
    pg 4.5b is active+recovery_wait+degraded, acting [3,7,6,5,1,2]
    pg 4.5c is active+recovery_wait+degraded, acting [0,5,4,7,6,1]
    pg 4.5d is active+recovery_wait+degraded, acting [1,8,5,6,2,4]
    pg 4.5f is active+recovery_wait+degraded, acting [2,6,1,7,5,3]
    pg 4.63 is active+recovery_wait+degraded, acting [7,5,6,0,1,2]
    pg 4.66 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2,5,0,4,2147483647]
    pg 4.67 is active+recovery_wait+degraded, acting [8,0,7,6,3,4]
    pg 4.68 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,0,4,3]
    pg 4.69 is active+recovery_wait+degraded, acting [3,1,6,4,8,2]
    pg 4.6a is active+recovery_wait+degraded, acting [4,0,2,1,3,6]
    pg 4.6b is active+recovery_wait+degraded, acting [2,4,7,6,0,3]
    pg 4.6c is active+recovery_wait+degraded, acting [0,1,5,6,4,3]
    pg 4.6d is active+recovery_wait+degraded, acting [0,1,6,8,2,4]
    pg 4.6f is active+recovery_wait+degraded, acting [2,6,1,7,8,0]
    pg 4.70 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,2147483647,2,4,3]
    pg 4.71 is active+recovery_wait+degraded, acting [5,6,2,8,4,7]
    pg 4.72 is active+recovery_wait+degraded, acting [5,0,1,6,2,7]
    pg 4.74 is active+recovery_wait+degraded, acting [4,2,6,1,8,7]
    pg 4.76 is active+recovery_wait+degraded, acting [0,4,6,7,1,8]
    pg 4.77 is active+recovery_wait+degraded, acting [3,5,1,4,7,6]
    pg 4.78 is active+recovery_wait+degraded, acting [7,1,3,4,6,0]
    pg 4.7a is active+recovery_wait+degraded, acting [0,1,6,8,2,4]
    pg 4.7c is active+recovery_wait+degraded, acting [4,8,1,7,3,6]
    pg 4.7d is active+recovery_wait+degraded, acting [5,4,1,0,6,8]
    pg 4.7e is active+recovery_wait+degraded, acting [1,8,3,6,7,2]
    pg 4.7f is active+recovery_wait+degraded, acting [7,0,6,3,2,4]

[root@tceph-01 ~]# ceph pg force-recovery 4.1 4.2 4.14 4.1c 4.33
instructing pg(s) [4.1s1] on osd.1 to force-recovery;
instructing pg(s) [4.14s1,4.1cs2,4.33s1] on osd.4 to force-recovery;
instructing pg(s) [4.2s1] on osd.5 to force-recovery;

[root@tceph-01 ~]# ceph pg force-recovery 4.3e 4.4c 4.4d 4.53 4.54 4.66 4.68 4.70
instructing pg(s) [4.54s1,4.70s1] on osd.0 to force-recovery;
instructing pg(s) [4.66s1] on osd.2 to force-recovery;
instructing pg(s) [4.4ds2,4.68s2] on osd.5 to force-recovery;
instructing pg(s) [4.3es0,4.4cs0,4.53s0] on osd.6 to force-recovery;

[root@tceph-01 ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs behind on trimming
            Reduced data availability: 11 pgs inactive
            Degraded data redundancy: 58914/4613532 objects degraded (1.277%), 123 pgs degraded, 12 pgs undersized

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 6d)
    mgr: tceph-01(active, since 6d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 2m), 9 in (since 2m); 11 remapped pgs

  data:
    pools:   4 pools, 321 pgs
    objects: 1.08M objects, 50 GiB
    usage:   235 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     3.427% pgs not active
             58914/4613532 objects degraded (1.277%)
             197 active+clean
             112 active+recovery_wait+degraded
             6   recovery_wait+forced_recovery+undersized+degraded+remapped+peered
             4   recovering+forced_recovery+undersized+degraded+remapped+peered
             1   active+undersized+degraded+remapped+backfill_wait
             1   undersized+remapped+peered

  io:
    client:   1.2 KiB/s rd, 7.7 MiB/s wr, 66 op/s rd, 1.26k op/s wr
    recovery: 17 MiB/s, 46 objects/s
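Collecting the stuck PG ids by hand is tedious; something like the following would force recovery for all of them at once (a sketch, assuming the plain-text output of "ceph pg dump_stuck inactive" where the PG id is the first column):

# Force recovery for every PG that is currently stuck inactive.
# Assumes the first column of 'ceph pg dump_stuck inactive' is the PG id.
PGS=$(ceph pg dump_stuck inactive 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}')
[ -n "$PGS" ] && ceph pg force-recovery $PGS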
After all inactive PGs have recovered, operation returns to normal and the "slow OPS" warnings disappear. What is striking is that the inactive PGs show 2 OSDs as missing instead of just 1 (and therefore don't accept writes). It is also worrying that these more heavily degraded PGs are not scheduled for recovery with high priority right away; I have to force that by hand.
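To dig into why two shards are reported missing, querying one of the stuck PGs shows its up/acting sets and recovery state; a minimal example (pg 4.1 is taken from the listing above, and jq is optional, only used to trim the JSON output):

# Inspect the peering/recovery details of one of the stuck PGs (pg 4.1 from above).
ceph pg 4.1 query | jq '{state, up, acting, recovery_state}'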
History
#1 Updated by Frank Schilder over 1 year ago
Some additional info: the test cluster has 3 nodes with 3 OSDs, 1 MON and 1 MGR each. Network: 1G public and 1G replication network. Plenty of RAM and CPU.
[root@rit-tceph bench]# ceph osd crush tree
ID  CLASS  WEIGHT   TYPE NAME
-1         2.44707  root default
-7         0.81569      host tceph-01
 0    hdd  0.27190          osd.0
 3    hdd  0.27190          osd.3
 6    hdd  0.27190          osd.6
-3         0.81569      host tceph-02
 2    hdd  0.27190          osd.2
 4    hdd  0.27190          osd.4
 8    hdd  0.27190          osd.8
-5         0.81569      host tceph-03
 1    hdd  0.27190          osd.1
 5    hdd  0.27190          osd.5
 7    hdd  0.27190          osd.7
[root@rit-tceph bench]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 555 flags hashpspool,nodelete stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'fs-meta1' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 592 lfor 0/458/589 flags hashpspool,nodelete max_bytes 53687091200 stripe_width 0 expected_num_objects 3 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'fs-meta2' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 613 lfor 0/553/610 flags hashpspool,nodelete max_bytes 10737418240 stripe_width 0 expected_num_objects 3 application cephfs
pool 4 'fs-data' erasure profile ec-4-2 size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 619 lfor 0/586/616 flags hashpspool,ec_overwrites,nodelete max_bytes 2199023255552 stripe_width 16384 fast_read 1 compression_mode aggressive application cephfs
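Pool 4 (fs-data) is the 4+2 EC data pool with size 6 / min_size 5, which is why a PG missing two shards drops below min_size and cannot serve writes. These values can be double-checked with:

# Confirm the EC data pool's size/min_size.
ceph osd pool get fs-data size
ceph osd pool get fs-data min_size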
[root@rit-tceph bench]# ceph osd crush rule dump fs-data
{
    "rule_id": 1,
    "rule_name": "fs-data",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_indep",
            "num": 0,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}
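The erasure-code profile itself is not included above. Judging from the pool (size 6, min_size 5) and the rule (failure domain osd), a profile along these lines would reproduce the layout; the plugin and technique values are assumptions, not taken from the cluster:

# Hypothetical reconstruction of the 'ec-4-2' profile behind pool 'fs-data';
# the actual profile is not shown in this report, so plugin/technique are guesses.
ceph osd erasure-code-profile set ec-4-2 \
    k=4 m=2 \
    plugin=jerasure technique=reed_sol_van \
    crush-failure-domain=osd
ceph osd erasure-code-profile get ec-4-2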
#2 Updated by Frank Schilder over 1 year ago
I cannot reproduce this with mimic-13.2.10.
#3 Updated by Frank Schilder about 1 year ago
The problem is still present in octopus 15.2.17. Almost certainly all newer versions are affected.