Bug #35808

closed

ceph osd ok-to-stop result doesn't match the real situation

Added by frank lin over 5 years ago. Updated over 3 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The cluster is in a healthy state. When I try to run ceph osd ok-to-stop 0 it returns:

Error EBUSY: 4 PGs are already degraded or might become unavailable

ceph -s shows there is no PG in degraded status:
 cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 44 up, 44 in
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     1472 active+clean

After I set osd.0 out I get this result:

  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            6835800/302432789 objects misplaced (2.260%)
            Reduced data availability: 5 pgs incomplete
            Degraded data redundancy: 6889753/302432789 objects degraded (2.278%), 162 pgs degraded

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 43 up, 43 in; 142 remapped pgs
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     0.340% pgs not active
             6889753/302432789 objects degraded (2.278%)
             6835800/302432789 objects misplaced (2.260%)
             1156 active+clean
             160  active+undersized+degraded
             124  active+clean+remapped
             18   active+remapped+backfilling
             7    active+undersized
             5    incomplete
             2    active+recovering+degraded

  io:
    client:   2831 B/s rd, 2 op/s rd, 0 op/s wr
    recovery: 181 MB/s, 53 objects/s

ceph health detail shows the 5 incomplete PGs are:

    pg 5.1 is incomplete, acting [33,2147483647,35] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.4 is incomplete, acting [45,2147483647,18] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.5 is incomplete, acting [2147483647,10,29] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.16 is incomplete, acting [15,9,2147483647] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 5.1e is incomplete, acting [2147483647,34,11] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
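The hint about min_size in those messages refers to the pool's minimum replica count for serving I/O. A minimal sketch of inspecting and, only if the reduced redundancy is acceptable, lowering it with the standard pool commands (rbd_pool_test is simply the pool named above):

    # Show the pool's replica count and the minimum replicas required to serve I/O
    ceph osd pool get rbd_pool_test size
    ceph osd pool get rbd_pool_test min_size
    # Lowering min_size from 3 to 2 lets PGs stay active with one replica missing,
    # at the cost of accepting I/O with less redundancy
    ceph osd pool set rbd_pool_test min_size 2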

Is this a bug or expected?


Related issues 1 (0 open, 1 closed)

Follows RADOS - Bug #39099: Give recovery for inactive PGs a higher priority (Resolved, David Zafman, 04/03/2019)

Actions #1

Updated by xie xingguo over 5 years ago

I see you are using a pool min_size of 3, so no replica is allowed to be offline and hence the result is expected?

Actions #2

Updated by John Spray over 5 years ago

It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.

Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?
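For readers hitting the same message: ok-to-stop is a pre-flight check, and its exit status can be used to gate maintenance work. A minimal sketch, assuming a systemd deployment and osd.0 as in this report:

    #!/bin/sh
    # Ask the monitors whether osd.0 can be stopped without making any PG unavailable.
    # The command exits non-zero (EBUSY here) when stopping it is not considered safe.
    if ceph osd ok-to-stop 0; then
        echo "osd.0 is ok to stop, proceeding"
        systemctl stop ceph-osd@0
    else
        echo "osd.0 is not ok to stop, aborting" >&2
        exit 1
    fi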

Actions #3

Updated by John Spray over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (ceph cli)
Actions #4

Updated by frank lin over 5 years ago

John Spray wrote:

It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.

Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?

The "already degraded" part is kind of confusing.
The pg number ok-to-stop predict to be "unavailable" doesn't match the actual incompleted pg number.

Actions #5

Updated by frank lin over 5 years ago

xie xingguo wrote:

I see you are using a pool min_size of 3, so no replica is allowed to be offline and hence the result is expected?

The number of PGs that ok-to-stop predicts will become "unavailable" doesn't match the actual number of incomplete PGs.

Actions #6

Updated by David Zafman almost 5 years ago

Actions #7

Updated by David Zafman almost 5 years ago

  • Due date set to 04/04/2019
  • Start date changed from 09/06/2018 to 04/04/2019
  • Follows Bug #39099: Give recovery for inactive PGs a higher priority added
Actions #8

Updated by David Zafman almost 5 years ago

  • Status changed from New to Need More Info

Can the reporter test this with the change in https://github.com/ceph/ceph/pull/27503 and report back?

Actions #9

Updated by David Zafman almost 5 years ago

  • Assignee set to David Zafman
Actions #10

Updated by David Zafman over 3 years ago

  • Status changed from Need More Info to Rejected

Marking rejected because the reporter hasn't responded to the request.
