Support #20108 (closed)
PGs are not remapped correctly when one host fails
Description
I have run into the following problem:
In a 6-node cluster we have 2 nodes per chassis, and the CRUSH rule is set to distribute PGs across chassis.
One node failed, and the cluster ended up with a lot of PGs stuck in the active+remapped and active+undersized+degraded states.
The Ceph version is 0.94.10 (hammer).
We were also able to reproduce the issue in a virtual environment. We created a small cluster and built the following CRUSH map:
apt-get purge ceph ceph-common
ceph osd crush add-bucket c1 chassis
ceph osd crush add-bucket c2 chassis
ceph osd crush add-bucket tceph2 host
ceph osd crush add-bucket tceph3 host
ceph osd crush add-bucket tceph4 host
ceph osd crush set 0 0.02139 host=tceph1
ceph osd crush set 1 0.02139 host=tceph2
ceph osd crush set 2 0.02139 host=tceph3
ceph osd crush set 3 0.02139 host=tceph4
ceph osd crush move tceph1 chassis=c1
ceph osd crush move tceph2 chassis=c1
ceph osd crush move tceph3 chassis=c2
ceph osd crush move tceph4 chassis=c2
ceph osd crush move c1 root=default
ceph osd crush move c2 root=default
ceph osd crush rule create-simple test-chassis default chassis firstn
ceph osd pool set rbd crush_ruleset 1
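For reference, the rule generated by create-simple here should decompile to roughly the following (a sketch; the exact ruleset id and min/max size may differ on a given cluster):

rule test-chassis {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type chassis
        step emit
}

With only two chassis buckets, every replica has to land under a different chassis, which is the decision point discussed later in this thread.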
Then we killed one of the OSD processes, and we got:
root@tceph1:~ # ceph -s
    cluster 50d51171-2015-4722-99b3-acfd7ce25cd7
     health HEALTH_WARN
            21 pgs degraded
            26 pgs stuck unclean
            21 pgs undersized
     monmap e1: 1 mons at {tceph1=192.168.178.213:6789/0}
            election epoch 1, quorum 0 tceph1
     osdmap e131: 4 osds: 3 up, 3 in; 5 remapped pgs
      pgmap v256: 64 pgs, 1 pools, 0 bytes data, 0 objects
            106 MB used, 61303 MB / 61409 MB avail
                  38 active+clean
                  21 active+undersized+degraded
                   5 active+remapped

root@tceph1:~ # ceph osd tree
ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.08551 root default
-6 0.04276     chassis c1
-2 0.02138         host tceph1
 0 0.02138             osd.0        up  1.00000          1.00000
-4 0.02138         host tceph2
 1 0.02138             osd.1        up  1.00000          1.00000
-7 0.04276     chassis c2
-5 0.02138         host tceph3
 2 0.02138             osd.2      down        0          1.00000
-3 0.02138         host tceph4
 3 0.02138             osd.3        up  1.00000          1.00000

root@tceph1:~ # ceph health detail
HEALTH_WARN 21 pgs degraded; 26 pgs stuck unclean; 21 pgs undersized
pg 0.29 is stuck unclean for 1183.502405, current state active+undersized+degraded, last acting [1]
pg 0.27 is stuck unclean for 1183.497616, current state active+undersized+degraded, last acting [0]
pg 0.23 is stuck unclean for 1191.397627, current state active+remapped, last acting [0,1]
pg 0.21 is stuck unclean for 1590.011580, current state active+remapped, last acting [0,1]
pg 0.1f is stuck unclean for 1183.509698, current state active+undersized+degraded, last acting [1]
pg 0.1d is stuck unclean for 1183.510103, current state active+undersized+degraded, last acting [1]
pg 0.18 is stuck unclean for 1183.505489, current state active+undersized+degraded, last acting [1]
pg 0.17 is stuck unclean for 1183.494996, current state active+undersized+degraded, last acting [0]
pg 0.15 is stuck unclean for 1104.965325, current state active+undersized+degraded, last acting [0]
pg 0.14 is stuck unclean for 1104.965260, current state active+undersized+degraded, last acting [0]
pg 0.13 is stuck unclean for 1588.792094, current state active+remapped, last acting [0,1]
pg 0.12 is stuck unclean for 1183.496727, current state active+undersized+degraded, last acting [1]
pg 0.3f is stuck unclean for 1588.792440, current state active+remapped, last acting [0,1]
pg 0.3e is stuck unclean for 1183.497452, current state active+undersized+degraded, last acting [1]
pg 0.e is stuck unclean for 1104.967052, current state active+undersized+degraded, last acting [0]
pg 0.c is stuck unclean for 1183.501277, current state active+undersized+degraded, last acting [0]
pg 0.3a is stuck unclean for 1104.962649, current state active+undersized+degraded, last acting [0]
pg 0.a is stuck unclean for 1104.966701, current state active+undersized+degraded, last acting [1]
pg 0.5 is stuck unclean for 1183.495265, current state active+undersized+degraded, last acting [1]
pg 0.33 is stuck unclean for 1104.965930, current state active+undersized+degraded, last acting [0]
pg 0.32 is stuck unclean for 1104.963028, current state active+undersized+degraded, last acting [0]
pg 0.31 is stuck unclean for 1588.790718, current state active+remapped, last acting [0,1]
pg 0.1 is stuck unclean for 1104.966219, current state active+undersized+degraded, last acting [0]
pg 0.30 is stuck unclean for 1183.496656, current state active+undersized+degraded, last acting [1]
pg 0.2f is stuck unclean for 1775.011740, current state active+undersized+degraded, last acting [1]
pg 0.2d is stuck unclean for 1183.497381, current state active+undersized+degraded, last acting [1]
pg 0.1f is active+undersized+degraded, acting [1]
pg 0.1d is active+undersized+degraded, acting [1]
pg 0.18 is active+undersized+degraded, acting [1]
pg 0.17 is active+undersized+degraded, acting [0]
pg 0.15 is active+undersized+degraded, acting [0]
pg 0.14 is active+undersized+degraded, acting [0]
pg 0.12 is active+undersized+degraded, acting [1]
pg 0.e is active+undersized+degraded, acting [0]
pg 0.c is active+undersized+degraded, acting [0]
pg 0.a is active+undersized+degraded, acting [1]
pg 0.5 is active+undersized+degraded, acting [1]
pg 0.1 is active+undersized+degraded, acting [0]
pg 0.3e is active+undersized+degraded, acting [1]
pg 0.3a is active+undersized+degraded, acting [0]
pg 0.33 is active+undersized+degraded, acting [0]
pg 0.32 is active+undersized+degraded, acting [0]
pg 0.30 is active+undersized+degraded, acting [1]
pg 0.2f is active+undersized+degraded, acting [1]
pg 0.2d is active+undersized+degraded, acting [1]
pg 0.29 is active+undersized+degraded, acting [1]
pg 0.27 is active+undersized+degraded, acting [0]
No peering has happened.
Doing the same thing on jewel, the cluster recovered properly.
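One way to see what differs between the two environments is to compare the effective CRUSH tunables on each cluster, e.g. (a suggested check, not part of the original report; the fields shown vary by release):

ceph osd crush show-tunables

A cluster created on an older release may still be running an older tunables profile, while a freshly installed jewel cluster is typically created with a newer one.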
Updated by Greg Farnum almost 7 years ago
- Tracker changed from Bug to Support
- Project changed from Ceph to RADOS
- Category changed from 10 to Peering
- Status changed from New to Resolved
- Component(RADOS) CRUSH added
Okay, as described (and especially since it's better in jewel) this is almost certainly about CRUSH max_retries. I'm a bit surprised it's a problem in the larger cluster, but the 2 chassis decision point is probably enough to constrict things.
Note that you can use newer crush tunables (than the default) on hammer, which can also probably resolve the issue.
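For anyone hitting the same thing, this is roughly what that suggestion looks like in practice (a sketch only; the profile choice and the choose_total_tries value are illustrative, changing tunables will trigger data movement, and clients must be new enough to understand the chosen profile):

# move the cluster to a newer tunables profile (supported on hammer)
ceph osd crush tunables hammer

# or edit the crush map by hand and raise the retry limit
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
#   ... increase "tunable choose_total_tries 50" to e.g. 100 in crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new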
Updated by Laszlo Budai over 6 years ago
Hello,
I'm sorry, I missed your message. Can you please give me some clues about the "newer crush tunables" that you had in mind?
Kind regards,
Laszlo
Greg Farnum wrote:
Okay, as described (and especially since it's better in jewel) this is almost certainly about CRUSH max_retries. I'm a bit surprised it's a problem in the larger cluster, but the 2 chassis decision point is probably enough to constrict things.
Note that you can use newer crush tunables (than the default) on hammer, which can also probably resolve the issue.
Updated by Laszlo Budai over 6 years ago
Hi,
Thank you for your answer!
I've seen that page before, but which tunable are you suggesting for the problem I encountered?
Kind regards,
Laszlo