Support #23050


PG doesn't move to down state in replica pool

Added by Nokia ceph-users about 6 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Hello,

Environment used: 3-node cluster
Replication: 3

#ceph osd pool ls detail
pool 16 'cdvr_ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 2119 flags hashpspool stripe_width 0 application freeform
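
For reference (not part of the original report), the pool's replication parameters can also be read back individually; a minimal check, using the pool name from the output above:

ceph osd pool get cdvr_ec size        # replicated size (3 copies)
ceph osd pool get cdvr_ec min_size    # minimum copies required to serve I/O (2)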

Scenario:

1. For a specific PG (for example, 16.29e), I stopped all of its OSDs; a sketch of how the acting set can be identified and stopped follows the status output below.

[root@pl12-cn1 ~]# ceph pg dump | grep 16.29e
dumped all
16.29e        0                  0        0         0       0     0   0        0 active+clean 2018-02-20 10:06:54.392885     0'0  2241:89  [15,6,28]         15  [15,6,28]             15        0'0 2018-02-20 06:02:53.117922             0'0 2018-02-20 06:02:53.117922
[root@pl12-cn1 ~]#

  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            3 osds down
            Reduced data availability: 4 pgs inactive
            Degraded data redundancy: 140 pgs unclean, 230 pgs degraded

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 33 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   41063 MB used, 196 TB / 196 TB avail
    pgs:     1.465% pgs not active
             794 active+clean
             215 active+undersized+degraded
             13  undersized+degraded+peered
             2   stale+undersized+degraded+peered
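
For completeness, a minimal sketch of how the PG's acting set can be identified and its OSDs stopped; it assumes systemd-managed OSD daemons and that each systemctl command is run on the host carrying the given OSD (the OSD ids 15, 6 and 28 are taken from the pg dump above):

# Look up the PG's up/acting set from the monitors
ceph pg map 16.29e

# Stop the corresponding OSD daemons on their respective hosts (systemd deployment assumed)
systemctl stop ceph-osd@15
systemctl stop ceph-osd@6
systemctl stop ceph-osd@28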

2. After stopping all 3 OSDs (replica 3), I can see that the respective PG is marked as stale; no PG is marked as down. osd.28 was stopped last. (A quick way to list stale PGs is sketched after the output below.)

[root@pl12-cn1 ~]# ceph pg dump | grep 16.29e
dumped all
16.29e        0                  0        0         0       0     0   0        0 stale+undersized+degraded+peered 2018-02-20 10:00:44.999756     0'0  2233:80       [28]         28       [28]             28        0'0 2018-02-20 06:02:53.117922             0'0 2018-02-20 06:02:53.117922
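
A hedged way to cross-check which PGs are currently stale (not necessarily what was run here):

# List PGs stuck in the stale state
ceph pg dump_stuck stale

# Health detail also lists the affected PGs per warning
ceph health detail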

3. I stopped more OSDs across all nodes and saw the same behavior: PGs are marked as stale but not down (see the cross-check sketch after the status output below).

  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            18 osds down
            Reduced data availability: 431 pgs inactive, 72 pgs stale
            Degraded data redundancy: 861 pgs unclean, 889 pgs degraded, 708 pgs undersized

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 18 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   40739 MB used, 196 TB / 196 TB avail
    pgs:     47.363% pgs not active
             404 active+undersized+degraded
             360 undersized+degraded+peered
             135 active+clean
             125 stale+undersized+degraded+peered
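
As a further cross-check (an illustration, not output from this cluster), the down OSDs and the inactive PGs can be compared side by side:

# OSDs currently marked down
ceph osd tree | grep down

# PGs that are not active (peered, stale and down PGs would all show up here)
ceph pg dump_stuck inactive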

For a replicated pool, don't we expect to see PGs in the down state?

Actions #1

Updated by Nokia ceph-users about 6 years ago

Please let me know if any additional logs or info are required.

Actions #2

Updated by Josh Durgin about 6 years ago

  • Tracker changed from Bug to Support
  • Project changed from Ceph to RADOS
  • Status changed from New to Closed

'stale' means there haven't been any reports from the primary in a while. Since there is no OSD left to report the status of a PG, these PGs stay stale.
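
To illustrate the point (an addition, not part of the original reply): with the entire acting set stopped there is no daemon left to answer for the PG, so only the monitors' last known mapping remains visible:

# Served by the PG's primary OSD, so with all of [15,6,28] stopped this will hang/fail
# rather than report a state
ceph pg 16.29e query

# The mapping itself is still computed from the CRUSH/OSD maps by the monitors
ceph pg map 16.29e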
