Bug #23051 (open): PGs stuck in down state

Added by Nokia ceph-users about 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

We see PGs stuck in the down state even after the respective OSDs have been started and have recovered from the failure scenario.

Environment: 3-node cluster
Erasure coding: 2+1
Ceph Luminous
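
For context, a 2+1 erasure-coded pool of this kind would typically be created along the following lines on Luminous; the profile and pool names (ec21, ecpool) are illustrative, not taken from this cluster:

ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
ceph osd pool create ecpool 1024 1024 erasure ec21
# The report states min_size is 2 for the affected pool; it can be
# checked or set explicitly:
ceph osd pool get ecpool min_size
ceph osd pool set ecpool min_size 2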

Steps to reproduce:

1. Stop ceph-osd.target on one node. Wait until the cluster status reflects the reduced OSD count.
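
For reference, on a systemd-managed Luminous node, step 1 amounts to the following (ceph-osd.target stops every ceph-osd daemon on that host):

systemctl stop ceph-osd.target
# then, from a monitor node, confirm the OSD count dropped:
ceph -s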

2. Stop ceph-osd.target on another node. All PGs are now listed as down, since the pool's min_size is 2.

  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            24 osds down
            2 hosts (24 osds) down
            Reduced data availability: 1024 pgs inactive, 1024 pgs down
            Degraded data redundancy: 1024 pgs unclean

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 12 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   41393 MB used, 196 TB / 196 TB avail
    pgs:     100.000% pgs not active
             527 stale+down
             497 down

Example PG:

[root@pl12-cn1 ~]# ceph pg dump | grep 17.29
dumped all
17.29         0                  0        0         0       0     0   0        0  down 2018-02-20 12:18:45.705993     0'0  2308:89  [NONE,8,NONE]          8  [NONE,8,NONE]              8        0'0 2018-02-20 10:36:17.676335             0'0 2018-02-20 10:36:17.676335

3. Start ceph-osd.target on either of the stopped nodes. The expected behavior is that no PGs remain down, since the pool uses a 2+1 erasure profile with min_size 2. In our case, however, all PGs are still shown as down (see the peering note after the outputs below).

[root@pl12-cn1 ~]# ceph -s
  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            12 osds down
            1 host (12 osds) down
            Reduced data availability: 1024 pgs inactive, 1024 pgs down
            Degraded data redundancy: 1024 pgs unclean

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn1, pl12-cn2
    osd: 36 osds: 24 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   41393 MB used, 196 TB / 196 TB avail
    pgs:     100.000% pgs not active
             1024 down

[root@pl12-cn1 ~]# ceph pg dump | grep 17.29
dumped all
17.29         0                  0        0         0       0     0   0        0  down 2018-02-20 12:21:46.969702     0'0  2310:85  [20,8,NONE]         20  [20,8,NONE]             20        0'0 2018-02-20 10:36:17.676335             0'0 2018-02-20 10:36:17.676335
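
As a side note, not from the original report: a common way to nudge a PG stuck like this is to mark its current primary down in the osdmap (osd.20 per the dump above); a daemon that is actually running immediately re-asserts itself to the monitors, which restarts peering:

ceph osd down 20
# re-check the PG state after the peering retry:
ceph pg dump | grep 17.29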

This issue is reproducible with the steps above. Please let me know if any other info/logs are required.

#1

Updated by Josh Durgin about 6 years ago

  • Project changed from Ceph to RADOS

Can you post the results of 'ceph pg $PGID query' for some of the down pgs?
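
For the example PG above, that query would be:

ceph pg 17.29 query
# prints the PG's peering state, past intervals, and peer info as JSON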

