Support #23050


PG doesn't move to down state in replica pool

Added by Nokia ceph-users about 6 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Hello,

Environment used: 3-node cluster
Replication: 3

#ceph osd pool ls detail
pool 16 'cdvr_ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 2119 flags hashpspool stripe_width 0 application freeform
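
For reference (not part of the original report), the pool's replication parameters can also be read back individually; a minimal check, using the pool name from the output above:

ceph osd pool get cdvr_ec size        # replicated size (3 copies)
ceph osd pool get cdvr_ec min_size    # minimum copies required to serve I/O (2)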

Scenario:

1. For a specific PG (for example, 16.29e), I stopped all of its OSDs; a sketch of how the acting set can be identified and stopped follows the status output below.

[root@pl12-cn1 ~]# ceph pg dump | grep 16.29e
dumped all
16.29e        0                  0        0         0       0     0   0        0 active+clean 2018-02-20 10:06:54.392885     0'0  2241:89  [15,6,28]         15  [15,6,28]             15        0'0 2018-02-20 06:02:53.117922             0'0 2018-02-20 06:02:53.117922
[root@pl12-cn1 ~]#

  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            3 osds down
            Reduced data availability: 4 pgs inactive
            Degraded data redundancy: 140 pgs unclean, 230 pgs degraded

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 33 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   41063 MB used, 196 TB / 196 TB avail
    pgs:     1.465% pgs not active
             794 active+clean
             215 active+undersized+degraded
             13  undersized+degraded+peered
             2   stale+undersized+degraded+peered
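
For completeness, a minimal sketch of how the PG's acting set can be identified and its OSDs stopped; it assumes systemd-managed OSD daemons and that each systemctl command is run on the host carrying the given OSD (the OSD ids 15, 6 and 28 are taken from the pg dump above):

# Look up the PG's up/acting set from the monitors
ceph pg map 16.29e

# Stop the corresponding OSD daemons on their respective hosts (systemd deployment assumed)
systemctl stop ceph-osd@15
systemctl stop ceph-osd@6
systemctl stop ceph-osd@28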

2. After stopping all 3 OSDs (replica 3), I can see that the respective PG is marked as stale; no PG is marked as down. osd.28 was stopped last. (A quick way to list stale PGs is sketched after the output below.)

[root@pl12-cn1 ~]# ceph pg dump | grep 16.29e
dumped all
16.29e        0                  0        0         0       0     0   0        0 stale+undersized+degraded+peered 2018-02-20 10:00:44.999756     0'0  2233:80       [28]         28       [28]             28        0'0 2018-02-20 06:02:53.117922             0'0 2018-02-20 06:02:53.117922
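
A hedged way to cross-check which PGs are currently stale (not necessarily what was run here):

# List PGs stuck in the stale state
ceph pg dump_stuck stale

# Health detail also lists the affected PGs per warning
ceph health detail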

3. I stopped more OSDs across all nodes and saw the same behavior: PGs are marked as stale but not down (see the cross-check sketch after the status output below).

  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            18 osds down
            Reduced data availability: 431 pgs inactive, 72 pgs stale
            Degraded data redundancy: 861 pgs unclean, 889 pgs degraded, 708 pgs undersized

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 18 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   40739 MB used, 196 TB / 196 TB avail
    pgs:     47.363% pgs not active
             404 active+undersized+degraded
             360 undersized+degraded+peered
             135 active+clean
             125 stale+undersized+degraded+peered
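
As a further cross-check (an illustration, not output from this cluster), the down OSDs and the inactive PGs can be compared side by side:

# OSDs currently marked down
ceph osd tree | grep down

# PGs that are not active (peered, stale and down PGs would all show up here)
ceph pg dump_stuck inactive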

For a replicated pool, don't we expect to see PGs in the down state?

Actions #1

Updated by Nokia ceph-users about 6 years ago

Please let me know if any additional logs or info are required.

Actions #2

Updated by Josh Durgin about 6 years ago

  • Tracker changed from Bug to Support
  • Project changed from Ceph to RADOS
  • Status changed from New to Closed

'stale' means there haven't been any reports from the primary in a while. Since there is no OSD left to report the status of a PG, these PGs stay stale.
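
To illustrate the point (an addition, not part of the original reply): with the entire acting set stopped there is no daemon left to answer for the PG, so only the monitors' last known mapping remains visible:

# Served by the PG's primary OSD, so with all of [15,6,28] stopped this will hang/fail
# rather than report a state
ceph pg 16.29e query

# The mapping itself is still computed from the CRUSH/OSD maps by the monitors
ceph pg map 16.29e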
