Bug #35808

ceph osd ok-to-stop result doesn't match the real situation

Added by frank lin about 1 year ago. Updated 7 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
Start date:
04/04/2019
Due date:
04/04/2019
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

The cluster is in a healthy state, but when I try to run ceph osd ok-to-stop 0 it returns:

Error EBUSY: 4 PGs are already degraded or might become unavailable
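
(As a pre-stop check this is easy to script; a minimal sketch in Python, assuming the ceph CLI is on PATH, where a non-zero exit status carries the EBUSY error above:)

    import subprocess

    def osd_ok_to_stop(osd_id):
        """Return True if the cluster reports the OSD as safe to stop.

        ceph osd ok-to-stop exits 0 when stopping the OSD keeps all PGs
        available, and exits non-zero (EBUSY) otherwise.
        """
        result = subprocess.run(
            ["ceph", "osd", "ok-to-stop", str(osd_id)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # e.g. "Error EBUSY: 4 PGs are already degraded or might become unavailable"
            print(result.stderr.strip())
        return result.returncode == 0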

ceph -s shows there is no PG in degraded status:
  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 44 up, 44 in
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     1472 active+clean

After I set osd.0 out, I get this result:

  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            6835800/302432789 objects misplaced (2.260%)
            Reduced data availability: 5 pgs incomplete
            Degraded data redundancy: 6889753/302432789 objects degraded (2.278%), 162 pgs degraded

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 43 up, 43 in; 142 remapped pgs
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     0.340% pgs not active
             6889753/302432789 objects degraded (2.278%)
             6835800/302432789 objects misplaced (2.260%)
             1156 active+clean
             160  active+undersized+degraded
             124  active+clean+remapped
             18   active+remapped+backfilling
             7    active+undersized
             5    incomplete
             2    active+recovering+degraded

  io:
    client:   2831 B/s rd, 2 op/s rd, 0 op/s wr
    recovery: 181 MB/s, 53 objects/s

ceph health detail shows the 5 incomplete PGs are:

    pg 5.1 is incomplete, acting [33,2147483647,35] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.4 is incomplete, acting [45,2147483647,18] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.5 is incomplete, acting [2147483647,10,29] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.16 is incomplete, acting [15,9,2147483647] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.1e is incomplete, acting [2147483647,34,11] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
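
(The 2147483647 entries are 0x7fffffff, i.e. CRUSH_ITEM_NONE, which Ceph prints for an acting-set slot that has no OSD mapped, so each of these PGs is missing one of its three replicas. A minimal sketch of listing such PGs from ceph pg dump; the exact JSON shape, a bare list vs. a pg_stats wrapper, varies by release:)

    import json
    import subprocess

    CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647: an unfilled slot in an acting set

    def pgs_with_missing_replicas():
        """Return (pgid, acting) for PGs whose acting set has an unfilled slot."""
        out = subprocess.run(
            ["ceph", "pg", "dump", "pgs_brief", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        dump = json.loads(out)
        # Some releases return a bare list, others wrap it in {"pg_stats": [...]}.
        stats = dump["pg_stats"] if isinstance(dump, dict) else dump
        return [(pg["pgid"], pg["acting"])
                for pg in stats if CRUSH_ITEM_NONE in pg["acting"]]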

Is this a bug or expected?


Related issues

Follows: RADOS - Bug #39099: Give recovery for inactive PGs a higher priority (Resolved, 04/03/2019)

History

#1 Updated by xie xingguo about 1 year ago

I see you are using a pool min_size of 3, so no replica is allowed to be offline, and hence the result is expected?
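
(For illustration: a replicated PG can serve I/O only while at least min_size replicas in its acting set remain up, so with size=3 and min_size=3 any single OSD loss makes the PG unavailable. A hypothetical sketch of that rule, not the actual OSDMonitor check:)

    def pg_available(acting, down_osds, min_size):
        """A PG stays active only while >= min_size acting replicas remain up."""
        remaining = [osd for osd in acting if osd not in down_osds]
        return len(remaining) >= min_size

    # Pool with size=3, min_size=3: losing any single replica drops the PG
    # below min_size, matching the incomplete PGs in the report.
    print(pg_available([33, 2, 35], down_osds=set(), min_size=3))  # True
    print(pg_available([33, 2, 35], down_osds={2}, min_size=3))    # False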

#2 Updated by John Spray about 1 year ago

It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.

Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?

#3 Updated by John Spray about 1 year ago

  • Project changed from Ceph to RADOS
  • Category deleted (ceph cli)

#4 Updated by frank lin about 1 year ago

John Spray wrote:

It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.

Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?

The "already degraded" part is kind of confusing.
The pg number ok-to-stop predict to be "unavailable" doesn't match the actual incompleted pg number.

#5 Updated by frank lin about 1 year ago

xie xingguo wrote:

I see you are using a pool min_size of 3, so no replica is allowed to be offline, and hence the result is expected?

The number of PGs that ok-to-stop predicts to be "unavailable" doesn't match the actual number of incomplete PGs.

#6 Updated by David Zafman 7 months ago

#7 Updated by David Zafman 7 months ago

  • Due date set to 04/04/2019
  • Start date changed from 09/06/2018 to 04/04/2019
  • Follows Bug #39099: Give recovery for inactive PGs a higher priority added

#8 Updated by David Zafman 7 months ago

  • Status changed from New to Need More Info

Can the reporter test this with the change in https://github.com/ceph/ceph/pull/27503 and report back?

#9 Updated by David Zafman 7 months ago

  • Assignee set to David Zafman
