Support #20108 (closed)
PGs are not remapped correctly when one host fails
Description
I have run into the following problem:
In a 6-node cluster we have 2 nodes per chassis, and the CRUSH rule is set to distribute PGs across chassis.
One node failed, and the cluster ended up with a lot of PGs stuck in the active+remapped and active+undersized+degraded states.
The Ceph version is 0.94.10 (hammer).
We were also able to reproduce the issue in a virtual environment. We created a small cluster and built the following CRUSH map:
apt-get purge ceph ceph-common
ceph osd crush add-bucket c1 chassis
ceph osd crush add-bucket c2 chassis
ceph osd crush add-bucket tceph2 host
ceph osd crush add-bucket tceph3 host
ceph osd crush add-bucket tceph4 host
ceph osd crush set 0 0.02139 host=tceph1
ceph osd crush set 1 0.02139 host=tceph2
ceph osd crush set 2 0.02139 host=tceph3
ceph osd crush set 3 0.02139 host=tceph4
ceph osd crush move tceph1 chassis=c1
ceph osd crush move tceph2 chassis=c1
ceph osd crush move tceph3 chassis=c2
ceph osd crush move tceph4 chassis=c2
ceph osd crush move c1 root=default
ceph osd crush move c2 root=default
ceph osd crush rule create-simple test-chassis default chassis firstn
ceph osd pool set rbd crush_ruleset 1
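For reference, the rule generated by create-simple here should decompile to roughly the following (a sketch; the exact ruleset id and min/max size may differ on a given cluster):

rule test-chassis {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type chassis
        step emit
}

With only two chassis buckets, every replica has to land under a different chassis, which is the decision point discussed later in this thread.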
Then we killed one of the OSD processes, and we got:
root@tceph1:~ # ceph -s
    cluster 50d51171-2015-4722-99b3-acfd7ce25cd7
     health HEALTH_WARN
            21 pgs degraded
            26 pgs stuck unclean
            21 pgs undersized
     monmap e1: 1 mons at {tceph1=192.168.178.213:6789/0}
            election epoch 1, quorum 0 tceph1
     osdmap e131: 4 osds: 3 up, 3 in; 5 remapped pgs
      pgmap v256: 64 pgs, 1 pools, 0 bytes data, 0 objects
            106 MB used, 61303 MB / 61409 MB avail
                  38 active+clean
                  21 active+undersized+degraded
                   5 active+remapped

root@tceph1:~ # ceph osd tree
ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.08551 root default
-6 0.04276     chassis c1
-2 0.02138         host tceph1
 0 0.02138             osd.0        up  1.00000          1.00000
-4 0.02138         host tceph2
 1 0.02138             osd.1        up  1.00000          1.00000
-7 0.04276     chassis c2
-5 0.02138         host tceph3
 2 0.02138             osd.2      down        0          1.00000
-3 0.02138         host tceph4
 3 0.02138             osd.3        up  1.00000          1.00000

root@tceph1:~ # ceph health detail
HEALTH_WARN 21 pgs degraded; 26 pgs stuck unclean; 21 pgs undersized
pg 0.29 is stuck unclean for 1183.502405, current state active+undersized+degraded, last acting [1]
pg 0.27 is stuck unclean for 1183.497616, current state active+undersized+degraded, last acting [0]
pg 0.23 is stuck unclean for 1191.397627, current state active+remapped, last acting [0,1]
pg 0.21 is stuck unclean for 1590.011580, current state active+remapped, last acting [0,1]
pg 0.1f is stuck unclean for 1183.509698, current state active+undersized+degraded, last acting [1]
pg 0.1d is stuck unclean for 1183.510103, current state active+undersized+degraded, last acting [1]
pg 0.18 is stuck unclean for 1183.505489, current state active+undersized+degraded, last acting [1]
pg 0.17 is stuck unclean for 1183.494996, current state active+undersized+degraded, last acting [0]
pg 0.15 is stuck unclean for 1104.965325, current state active+undersized+degraded, last acting [0]
pg 0.14 is stuck unclean for 1104.965260, current state active+undersized+degraded, last acting [0]
pg 0.13 is stuck unclean for 1588.792094, current state active+remapped, last acting [0,1]
pg 0.12 is stuck unclean for 1183.496727, current state active+undersized+degraded, last acting [1]
pg 0.3f is stuck unclean for 1588.792440, current state active+remapped, last acting [0,1]
pg 0.3e is stuck unclean for 1183.497452, current state active+undersized+degraded, last acting [1]
pg 0.e is stuck unclean for 1104.967052, current state active+undersized+degraded, last acting [0]
pg 0.c is stuck unclean for 1183.501277, current state active+undersized+degraded, last acting [0]
pg 0.3a is stuck unclean for 1104.962649, current state active+undersized+degraded, last acting [0]
pg 0.a is stuck unclean for 1104.966701, current state active+undersized+degraded, last acting [1]
pg 0.5 is stuck unclean for 1183.495265, current state active+undersized+degraded, last acting [1]
pg 0.33 is stuck unclean for 1104.965930, current state active+undersized+degraded, last acting [0]
pg 0.32 is stuck unclean for 1104.963028, current state active+undersized+degraded, last acting [0]
pg 0.31 is stuck unclean for 1588.790718, current state active+remapped, last acting [0,1]
pg 0.1 is stuck unclean for 1104.966219, current state active+undersized+degraded, last acting [0]
pg 0.30 is stuck unclean for 1183.496656, current state active+undersized+degraded, last acting [1]
pg 0.2f is stuck unclean for 1775.011740, current state active+undersized+degraded, last acting [1]
pg 0.2d is stuck unclean for 1183.497381, current state active+undersized+degraded, last acting [1]
pg 0.1f is active+undersized+degraded, acting [1]
pg 0.1d is active+undersized+degraded, acting [1]
pg 0.18 is active+undersized+degraded, acting [1]
pg 0.17 is active+undersized+degraded, acting [0]
pg 0.15 is active+undersized+degraded, acting [0]
pg 0.14 is active+undersized+degraded, acting [0]
pg 0.12 is active+undersized+degraded, acting [1]
pg 0.e is active+undersized+degraded, acting [0]
pg 0.c is active+undersized+degraded, acting [0]
pg 0.a is active+undersized+degraded, acting [1]
pg 0.5 is active+undersized+degraded, acting [1]
pg 0.1 is active+undersized+degraded, acting [0]
pg 0.3e is active+undersized+degraded, acting [1]
pg 0.3a is active+undersized+degraded, acting [0]
pg 0.33 is active+undersized+degraded, acting [0]
pg 0.32 is active+undersized+degraded, acting [0]
pg 0.30 is active+undersized+degraded, acting [1]
pg 0.2f is active+undersized+degraded, acting [1]
pg 0.2d is active+undersized+degraded, acting [1]
pg 0.29 is active+undersized+degraded, acting [1]
pg 0.27 is active+undersized+degraded, acting [0]
No peering has happened.
Doing the same thing on jewel, the cluster recovered properly.
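One way to see what differs between the two environments is to compare the effective CRUSH tunables on each cluster, e.g. (a suggested check, not part of the original report; the fields shown vary by release):

ceph osd crush show-tunables

A cluster created on an older release may still be running an older tunables profile, while a freshly installed jewel cluster is typically created with a newer one.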
Updated by Greg Farnum almost 7 years ago
- Tracker changed from Bug to Support
- Project changed from Ceph to RADOS
- Category changed from 10 to Peering
- Status changed from New to Resolved
- Component(RADOS) CRUSH added
Okay, as described (and especially since it's better in jewel) this is almost certainly about CRUSH max_retries. I'm a bit surprised it's a problem in the larger cluster, but the 2 chassis decision point is probably enough to constrict things.
Note that you can use newer crush tunables (than the default) on hammer, which can also probably resolve the issue.
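For anyone hitting the same thing, this is roughly what that suggestion looks like in practice (a sketch only; the profile choice and the choose_total_tries value are illustrative, changing tunables will trigger data movement, and clients must be new enough to understand the chosen profile):

# move the cluster to a newer tunables profile (supported on hammer)
ceph osd crush tunables hammer

# or edit the crush map by hand and raise the retry limit
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
#   ... increase "tunable choose_total_tries 50" to e.g. 100 in crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new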
Updated by Laszlo Budai over 6 years ago
Hello,
I'm sorry, I missed your message. Can you please give me some clues about the "newer crush tunables" that you had in mind?
Kind regards,
Laszlo
Greg Farnum wrote:
Okay, as described (and especially since it's better in jewel) this is almost certainly about CRUSH max_retries. I'm a bit surprised it's a problem in the larger cluster, but the 2 chassis decision point is probably enough to constrict things.
Note that you can use newer crush tunables (than the default) on hammer, which can also probably resolve the issue.
Updated by Laszlo Budai over 6 years ago
Hi,
Thank you for your answer!
I've seen that page before, but which tunable are you suggesting for the problem I encountered?
Kind regards,
Laszlo