Bug #23051
PGs stuck in down state
Description
Hello,
We see PGs stuck in the down state even after the respective OSDs have been started and have recovered from the failure scenario.
Environment: 3-node cluster
Erasure coding: 2+1
Ceph Luminous
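For reference, the pool setup would have been along these lines (a sketch, not taken verbatim from this report; the profile/pool names and failure domain are assumptions, while k=2, m=1, 1024 PGs, and min_size 2 match what is described below):

  # hypothetical names; k=2/m=1 corresponds to the 2+1 profile above
  ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
  ceph osd pool create ecpool 1024 1024 erasure ec21
  ceph osd pool set ecpool min_size 2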
Steps to reproduce:
1. Stop ceph-osd.target on one node. Wait until the cluster status reflects the reduced OSD count.
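(A sketch of how the OSDs on a node are stopped; the exact invocation is not shown in this report:)

  systemctl stop ceph-osd.target   # stops all OSD daemons on the current host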
2. Stop ceph-osd.target on another node. All PGs are now listed as down, since the pool's min_size is 2.
  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            24 osds down
            2 hosts (24 osds) down
            Reduced data availability: 1024 pgs inactive, 1024 pgs down
            Degraded data redundancy: 1024 pgs unclean

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 12 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   41393 MB used, 196 TB / 196 TB avail
    pgs:     100.000% pgs not active
             527 stale+down
             497 down
Example PG:

[root@pl12-cn1 ~]# ceph pg dump | grep 17.29
dumped all
17.29 0 0 0 0 0 0 0 0 down 2018-02-20 12:18:45.705993 0'0 2308:89 [NONE,8,NONE] 8 [NONE,8,NONE] 8 0'0 2018-02-20 10:36:17.676335 0'0 2018-02-20 10:36:17.676335

At this point the down state is expected: the acting set [NONE,8,NONE] has only one of the three 2+1 shards available, which is below min_size.
3. Start ceph-osd.target on any one node. The expected behavior is that no PGs should remain down, since we configured a 2+1 erasure profile with min_size 2 and two of the three hosts are now up. But in our case, all PGs still show as down.
[root@pl12-cn1 ~]# ceph -s
  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            12 osds down
            1 host (12 osds) down
            Reduced data availability: 1024 pgs inactive, 1024 pgs down
            Degraded data redundancy: 1024 pgs unclean

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 24 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   41393 MB used, 196 TB / 196 TB avail
    pgs:     100.000% pgs not active
             1024 down
[root@pl12-cn1 ~]# ceph pg dump | grep 17.29
dumped all
17.29 0 0 0 0 0 0 0 0 down 2018-02-20 12:21:46.969702 0'0 2310:85 [20,8,NONE] 20 [20,8,NONE] 20 0'0 2018-02-20 10:36:17.676335 0'0 2018-02-20 10:36:17.676335

Note that the acting set is now [20,8,NONE]: two of the three shards are up, which satisfies min_size=2, yet the PG is still reported down.
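To enumerate all affected PGs, something like the following can be used (our suggestion; these commands are not shown in the capture above):

  ceph health detail            # lists the down PGs in the health output
  ceph pg dump_stuck inactive   # dumps PGs stuck in an inactive state (including down)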
This issue is reproducible with these steps. Please let me know if any other info/logs are required.
Updated by Josh Durgin about 6 years ago
- Project changed from Ceph to RADOS
Can you post the results of 'ceph pg $PGID query' for some of the down pgs?
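For example, with PG 17.29 from the report above, that would be:

  ceph pg 17.29 query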