Bug #35808 (Closed): ceph osd ok-to-stop result doesn't match the real situation
Description
The cluster is in a healthy state. When I tried to run ceph osd ok-to-stop 0, it returned:
Error EBUSY: 4 PGs are already degraded or might become unavailable
However, ceph -s shows there is no PG in degraded status:
  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 44 up, 44 in
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     1472 active+clean
After I set osd.0 out, I got this result:
  cluster:
    id:     6b204640-60fb-4ed6-bb06-fe67e3c2ac1f
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            6835800/302432789 objects misplaced (2.260%)
            Reduced data availability: 5 pgs incomplete
            Degraded data redundancy: 6889753/302432789 objects degraded (2.278%), 162 pgs degraded

  services:
    mon:         1 daemons, quorum mnv001
    mgr:         mnv001(active)
    osd:         44 osds: 43 up, 43 in; 142 remapped pgs
                 flags noout
    tcmu-runner: 5 daemons active

  data:
    pools:   5 pools, 1472 pgs
    objects: 49241k objects, 192 TB
    usage:   288 TB used, 111 TB / 400 TB avail
    pgs:     0.340% pgs not active
             6889753/302432789 objects degraded (2.278%)
             6835800/302432789 objects misplaced (2.260%)
             1156 active+clean
             160  active+undersized+degraded
             124  active+clean+remapped
             18   active+remapped+backfilling
             7    active+undersized
             5    incomplete
             2    active+recovering+degraded

  io:
    client:   2831 B/s rd, 2 op/s rd, 0 op/s wr
    recovery: 181 MB/s, 53 objects/s
ceph health detail shows that the 5 incomplete PGs are:
pg 5.1 is incomplete, acting [33,2147483647,35] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
pg 5.4 is incomplete, acting [45,2147483647,18] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
pg 5.5 is incomplete, acting [2147483647,10,29] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
pg 5.16 is incomplete, acting [15,9,2147483647] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
pg 5.1e is incomplete, acting [2147483647,34,11] (reducing pool rbd_pool_test min_size from 3 may help; search ceph.com/docs for 'incomplete')
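The value 2147483647 (0x7fffffff, i.e. INT32_MAX) appearing in these acting sets is CRUSH_ITEM_NONE, Ceph's placeholder for a slot with no OSD mapped. A minimal sketch (hypothetical helper, not Ceph source code) showing why each of these PGs is incomplete with a 3-replica pool whose min_size is also 3:

```python
# CRUSH_ITEM_NONE (0x7fffffff) marks an acting-set slot with no OSD mapped.
CRUSH_ITEM_NONE = 2147483647  # 2**31 - 1

def live_replicas(acting):
    """Return the OSD ids actually present in an acting set."""
    return [osd for osd in acting if osd != CRUSH_ITEM_NONE]

# Acting sets copied from the `ceph health detail` output above:
acting_sets = {
    "5.1":  [33, 2147483647, 35],
    "5.4":  [45, 2147483647, 18],
    "5.5":  [2147483647, 10, 29],
    "5.16": [15, 9, 2147483647],
    "5.1e": [2147483647, 34, 11],
}

# Each of these PGs has only 2 live replicas; with min_size = 3 that is
# not enough to serve I/O, so the PG is reported incomplete.
for pg, acting in acting_sets.items():
    assert len(live_replicas(acting)) == 2
```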
Is this a bug or expected?
Updated by xie xingguo over 5 years ago
I see you are using a pool min_size of 3, so no replica is allowed to be offline, and hence the result is expected?
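The min_size reasoning can be sketched as a toy model (an illustration only, not Ceph's actual ok-to-stop implementation): a PG would become unavailable if stopping the given OSDs leaves it with fewer than min_size live replicas. With size == min_size == 3, losing any one replica puts a PG below min_size, so every PG hosted on the stopped OSD gets flagged.

```python
def ok_to_stop(osds_to_stop, pg_acting, pool_min_size):
    """Toy model of the ok-to-stop check: return the PGs that would
    fall below min_size if the given OSDs were stopped."""
    unavailable = []
    for pg, acting in pg_acting.items():
        remaining = [osd for osd in acting if osd not in osds_to_stop]
        if len(remaining) < pool_min_size:
            unavailable.append(pg)
    return unavailable

# Hypothetical acting sets: with min_size == 3, stopping osd.0 pushes
# every PG that includes it below min_size.
pgs = {"5.1": [0, 33, 35], "5.4": [0, 45, 18], "5.5": [10, 29, 44]}
assert ok_to_stop({0}, pgs, pool_min_size=3) == ["5.1", "5.4"]
```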
Updated by John Spray over 5 years ago
It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.
Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?
Updated by John Spray over 5 years ago
- Project changed from Ceph to RADOS
- Category deleted (ceph cli)
Updated by frank lin over 5 years ago
John Spray wrote:
It's a little odd that the ok-to-stop command said 4 PGs while you actually had 5 PGs go incomplete, but basically yes: the ok-to-stop command was correctly advising you that if you took out that OSD, your system would be degraded.
Was it the "are already degraded or might become unavailable" part that caused confusion (since there was nothing already degraded)?
The "already degraded" part is kind of confusing.
Also, the number of PGs ok-to-stop predicts might become "unavailable" doesn't match the actual number of incomplete PGs.
Updated by frank lin over 5 years ago
xie xingguo wrote:
I see you are using a pool min_size of 3, so no replicas is allowed to be offline and hence the result is expected?
The number of PGs ok-to-stop predicts might become "unavailable" doesn't match the actual number of incomplete PGs.
Updated by David Zafman almost 5 years ago
This may be fixed by https://github.com/ceph/ceph/pull/27503
Updated by David Zafman almost 5 years ago
- Due date set to 04/04/2019
- Start date changed from 09/06/2018 to 04/04/2019
- Follows Bug #39099: Give recovery for inactive PGs a higher priority added
Updated by David Zafman almost 5 years ago
- Status changed from New to Need More Info
Can the reporter test this with the change in https://github.com/ceph/ceph/pull/27503 and report back?
Updated by David Zafman over 3 years ago
- Status changed from Need More Info to Rejected
Marking as rejected because the reporter hasn't responded to the request.