Bug #35808

ceph osd ok-to-stop result doesn't match the real situation

Added by frank lin about 1 year ago. Updated 7 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
Start date:
04/04/2019
Due date:
04/04/2019
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

The cluster is in a healthy state, but when I try to run ceph osd ok-to-stop 0 it returns:

Error EBUSY: 4 PGs are already degraded or might become unavailable
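
(As a pre-stop check this is easy to script; a minimal sketch in Python, assuming the ceph CLI is on PATH, where a non-zero exit status carries the EBUSY error above:)

    import subprocess

    def osd_ok_to_stop(osd_id):
        """Return True if the cluster reports the OSD as safe to stop.

        ceph osd ok-to-stop exits 0 when stopping the OSD keeps all PGs
        available, and exits non-zero (EBUSY) otherwise.
        """
        result = subprocess.run(
            ["ceph", "osd", "ok-to-stop", str(osd_id)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # e.g. "Error EBUSY: 4 PGs are already degraded or might become unavailable"
            print(result.stderr.strip())
        return result.returncode == 0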

ceph -s shows there is no PG in degraded status:
  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 44 up, 44 in
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     1472 active+clean

After I set osd.0 out, I get this result:

  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            6835800/302432789 objects misplaced (2.260%)
            Reduced data availability: 5 pgs incomplete
            Degraded data redundancy: 6889753/302432789 objects degraded (2.278%), 162 pgs degraded

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 43 up, 43 in; 142 remapped pgs
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     0.340% pgs not active
             6889753/302432789 objects degraded (2.278%)
             6835800/302432789 objects misplaced (2.260%)
             1156 active+clean
             160  active+undersized+degraded
             124  active+clean+remapped
             18   active+remapped+backfilling
             7    active+undersized
             5    incomplete
             2    active+recovering+degraded

  io:
    client:   2831 B/s rd, 2 op/s rd, 0 op/s wr
    recovery: 181 MB/s, 53 objects/s

ceph health detail shows the 5 incomplete PGs are:

    pg 5.1 is incomplete, acting [33,2147483647,35] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.4 is incomplete, acting [45,2147483647,18] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.5 is incomplete, acting [2147483647,10,29] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.16 is incomplete, acting [15,9,2147483647] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.1e is incomplete, acting [2147483647,34,11] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
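
(The 2147483647 entries are 0x7fffffff, i.e. CRUSH_ITEM_NONE, which Ceph prints for an acting-set slot that has no OSD mapped, so each of these PGs is missing one of its three replicas. A minimal sketch of listing such PGs from ceph pg dump; the exact JSON shape, a bare list vs. a pg_stats wrapper, varies by release:)

    import json
    import subprocess

    CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647: an unfilled slot in an acting set

    def pgs_with_missing_replicas():
        """Return (pgid, acting) for PGs whose acting set has an unfilled slot."""
        out = subprocess.run(
            ["ceph", "pg", "dump", "pgs_brief", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        dump = json.loads(out)
        # Some releases return a bare list, others wrap it in {"pg_stats": [...]}.
        stats = dump["pg_stats"] if isinstance(dump, dict) else dump
        return [(pg["pgid"], pg["acting"])
                for pg in stats if CRUSH_ITEM_NONE in pg["acting"]]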

Is this a bug or expected?


Related issues

Follows: RADOS - Bug #39099: Give recovery for inactive PGs a higher priority (Resolved, 04/03/2019)

History

#1 Updated by xie xingguo about 1 year ago

I see you are using a pool min_size of 3, so no replica is allowed to be offline, and hence the result is expected?
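
(For illustration: a replicated PG can serve I/O only while at least min_size replicas in its acting set remain up, so with size=3 and min_size=3 any single OSD loss makes the PG unavailable. A hypothetical sketch of that rule, not the actual OSDMonitor check:)

    def pg_available(acting, down_osds, min_size):
        """A PG stays active only while >= min_size acting replicas remain up."""
        remaining = [osd for osd in acting if osd not in down_osds]
        return len(remaining) >= min_size

    # Pool with size=3, min_size=3: losing any single replica drops the PG
    # below min_size, matching the incomplete PGs in the report.
    print(pg_available([33, 2, 35], down_osds=set(), min_size=3))  # True
    print(pg_available([33, 2, 35], down_osds={2}, min_size=3))    # False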

#2 Updated by John Spray about 1 year ago

It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.

Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?

#3 Updated by John Spray about 1 year ago

  • Project changed from Ceph to RADOS
  • Category deleted (ceph cli)

#4 Updated by frank lin about 1 year ago

John Spray wrote:

It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.

Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?

The "already degraded" part is kind of confusing.
The pg number ok-to-stop predict to be "unavailable" doesn't match the actual incompleted pg number.

#5 Updated by frank lin about 1 year ago

xie xingguo wrote:

I see you are using a pool min_size of 3, so no replica is allowed to be offline, and hence the result is expected?

The number of PGs that ok-to-stop predicts to be "unavailable" doesn't match the actual number of incomplete PGs.

#6 Updated by David Zafman 7 months ago

#7 Updated by David Zafman 7 months ago

  • Due date set to 04/04/2019
  • Start date changed from 09/06/2018 to 04/04/2019
  • Follows Bug #39099: Give recovery for inactive PGs a higher priority added

#8 Updated by David Zafman 7 months ago

  • Status changed from New to Need More Info

Can the reporter test this with the change in https://github.com/ceph/ceph/pull/27503 and report back?

#9 Updated by David Zafman 7 months ago

  • Assignee set to David Zafman
