Bug #23430


PGs are stuck in 'creating+incomplete' status on vstart cluster

Added by Tatjana Dehler about 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

The PGs are stuck in 'creating+incomplete' status after creating an erasure coded pool on a vstart cluster.

I tested this on the master branch at commit https://github.com/ceph/ceph/commit/820dac980e9416fe05998d50cac633c81a87b9e3 and have been observing this behavior for about 12 days now.

Steps to reproduce:

1. Create a new vstart cluster

2. Create an erasure coded pool:

ceph-dev /ceph/build # bin/ceph osd pool create ecpool 12 12 erasure
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-03-20 09:22:20.589 7f0717be2700 -1 WARNING: all dangerous and experimental features are enabled.
2018-03-20 09:22:20.609 7f0717be2700 -1 WARNING: all dangerous and experimental features are enabled.
pool 'ecpool' created

3. After that my cluster is stuck in the following status:

ceph-dev /ceph/build # bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-03-20 09:22:52.885 7f706d0cb700 -1 WARNING: all dangerous and experimental features are enabled.
2018-03-20 09:22:52.897 7f706d0cb700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     78384e20-ab50-458e-b7b0-5248f7d26a20
    health: HEALTH_WARN
            Reduced data availability: 12 pgs incomplete

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up  {0=c=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   3 pools, 28 pgs
    objects: 21 objects, 2.19K
    usage:   3.01G used, 27.0G / 30G avail
    pgs:     42.857% pgs not active
             16 active+clean
             12 creating+incomplete

Please find the pg dump output attached.

I'm not really sure which log files would be helpful here, but I can attach them afterwards. Just let me know what you need.
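
In case it helps with triage, the erasure code profile the pool ended up with could be inspected with something like the following (I have not captured that output here):

bin/ceph osd pool get ecpool erasure_code_profile
bin/ceph osd erasure-code-profile get default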


Files

pg_dump (9.16 KB) - Tatjana Dehler, 03/21/2018 08:26 AM
#1

Updated by Tatjana Dehler about 6 years ago

I did some further investigation and figured out that this issue occurs due to the "special" situation of my vstart environment, where all three OSDs are located on one host. As I've learned so far (thanks to Nathan Cutler for the hint): when creating an erasure coded pool on a cluster where all OSDs are located on one host, the related erasure code profile needs the 'crush-failure-domain' parameter set to 'osd', which is not the case for the profile that is used by default. I resolved the issue by creating a separate profile containing the needed parameter (from the documentation):

$ ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
$ ceph osd pool create ecpool 12 12 erasure myprofile

(Note: there is currently a separate issue with creating erasure code profiles: http://tracker.ceph.com/issues/23345)

After that the creation of the erasure coded pool worked without any issues.
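
For completeness, the new profile and the pool's PG states can be double-checked afterwards, for example (output omitted here):

$ ceph osd erasure-code-profile get myprofile
$ ceph pg ls-by-pool ecpool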

Sorry, this was a misunderstanding on my side. I guess this issue can be closed, thanks.

#2

Updated by Mykola Golub about 6 years ago

I think it is still worth investigating.

Previously the default profile just worked on vstart clusters, and now it does not.

I see that previously we had this in ceph.conf:

  osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd

Commit e76a189 changed it to use the mon config instead:
$CEPH_BIN/ceph config set mon osd_pool_default_erasure_code_profile 'plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd'

So the default profile is expected to work for you but apparently it does not.
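
For reference, what the monitors currently have stored for this option can be checked with the centralized config command (assuming a build that already includes the get counterpart):

$CEPH_BIN/ceph config get mon osd_pool_default_erasure_code_profile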

#3

Updated by Mykola Golub about 6 years ago

I think the problem is that `ceph config` sets osd_pool_default_erasure_code_profile too late: the cluster is already built and started by the time it runs, while this parameter is used when building the initial osd map.

So osd_pool_default_erasure_code_profile should be present in the initial ceph.conf.
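
For example, vstart.sh could write the old line back into the generated ceph.conf (exact section placement aside) so that it is already in effect when the initial osd map is built:

  osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd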

#4

Updated by Mykola Golub about 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Mykola Golub
#5

Updated by Mykola Golub about 6 years ago

  • Status changed from In Progress to Fix Under Review
#6

Updated by Mykola Golub about 6 years ago

  • Status changed from Fix Under Review to Resolved
#7

Updated by Joao Eduardo Luis about 6 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Source set to Development
