Support #22520: nearfull threshold is not cleared when osd really is not nearfull. - RADOS - Ceph

Added by Konstantin Shalygin over 6 years ago. Updated over 6 years ago.

Status:

Closed

Priority:

Normal

Target version:

Ceph - v12.2.2

Affected Versions:

Ceph - v12.2.2

Description

Today one of my OSDs reached the nearfull ratio (mon_osd_nearfull_ratio: '.85'), so I increased mon_osd_nearfull_ratio to '0.9'.

I rebalanced data by increasing the weights of other OSDs in this root. While I was looking for the right weights, some other OSDs reached nearfull as well. In the end, though, all of these OSDs should have cleared the nearfull flag, because their %USE is lower than mon_osd_nearfull_ratio. The OSDs in this root are used by a pool with size 2, min_size 1 (idontcareaboutmydata).

ID  CLASS WEIGHT    REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME                        
-12         6.29997        -  5716G  4719G   997G 82.56 8.71   - root solid                       
-14         6.29997        -  5716G  4719G   997G 82.56 8.71   -     datacenter xxx_solid  
-15         2.09999        -  1905G  1455G   450G 76.37 8.06   -         rack rack2-solid         
-13         1.00000        -   952G   686G   266G 72.03 7.60   -             host ceph-osd0-solid 
 24  nvme   1.00000  1.00000   952G   686G   266G 72.03 7.60  74                 osd.24           
-19         1.09999        -   952G   768G   183G 80.70 8.51   -             host ceph-osd2-solid 
 26  nvme   1.09999  1.00000   952G   768G   183G 80.70 8.51  83                 osd.26           
-16         2.09999        -  1905G  1590G   314G 83.49 8.81   -         rack rack3-solid         
-20         1.09999        -   952G   775G   177G 81.40 8.59   -             host ceph-osd3-solid 
 30  nvme   1.09999  1.00000   952G   775G   177G 81.40 8.59  84                 osd.30           
-22         1.00000        -   952G   815G   137G 85.58 9.03   -             host ceph-osd5-solid 
 29  nvme   1.00000  1.00000   952G   815G   137G 85.58 9.03  89                 osd.29           
-17         2.09999        -  1905G  1673G   232G 87.82 9.27   -         rack rack4-solid         
-18         1.09999        -   952G   835G   117G 87.72 9.25   -             host ceph-osd1-solid 
 25  nvme   1.09999  1.00000   952G   835G   117G 87.72 9.25  91                 osd.25           
-21         1.00000        -   952G   837G   115G 87.92 9.28   -             host ceph-osd4-solid 
 28  nvme   1.00000  1.00000   952G   837G   115G 87.92 9.28  91                 osd.28

HEALTH:

[root@ceph-mon0 ceph]# ceph health detail
HEALTH_WARN 3 nearfull osd(s); 1 pool(s) nearfull
OSD_NEARFULL 3 nearfull osd(s)
    osd.25 is near full
    osd.28 is near full
    osd.29 is near full
POOL_NEARFULL 1 pool(s) nearfull
    pool 'solid_rbd' is nearfull

OSD DF:

[root@ceph-mon0 ceph]# ceph osd df | grep nvme | grep -E '(25|28|29)'
29  nvme 1.00000  1.00000  952G   815G  137G 85.58 9.03  89 
25  nvme 1.09999  1.00000  952G   835G  117G 87.72 9.25  91 
28  nvme 1.00000  1.00000  952G   837G  115G 87.92 9.28  91
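
As a sanity check on the numbers above, here is a small sketch (not part of the original report) that filters the three rows by %USE. The helper names `over_ratio` and `osd_rows` are made up for illustration, and column 8 is assumed to be %USE, as in the `ceph osd df` header shown earlier.

```shell
# Hypothetical helper: print the IDs of OSDs whose %USE exceeds the given
# ratio (0..1). Input on stdin: rows in `ceph osd df` layout, where
# column 1 is the OSD ID and column 8 is %USE (assumed from the header).
over_ratio() {
    awk -v ratio="$1" '$8 + 0 > ratio * 100 { printf "osd.%s\n", $1 }'
}

# The three rows copied from the `ceph osd df` output in this report.
osd_rows() {
cat <<'EOF'
29  nvme 1.00000  1.00000  952G   815G  137G 85.58 9.03  89
25  nvme 1.09999  1.00000  952G   835G  117G 87.72 9.25  91
28  nvme 1.00000  1.00000  952G   837G  115G 87.92 9.28  91
EOF
}

echo "over 0.85:"
osd_rows | over_ratio 0.85
echo "over 0.90:"
osd_rows | over_ratio 0.90
```

With 0.85 this lists osd.29, osd.25 and osd.28, the same three OSDs that `ceph health detail` reports; with 0.90 it lists none. That is consistent with the cluster still evaluating usage against the old 0.85 value.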

MONs settings:

[root@ceph-mon0 ceph]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon0.asok config show | grep nearfull
    "mon_osd_nearfull_ratio": "0.900000",

OSDs settings:

[root@ceph-osd4 ceph]# ceph daemon osd.28 config get mon_osd_nearfull_ratio
{
    "mon_osd_nearfull_ratio": "0.900000" 
}

When I found out that 'ceph tell' was not working, I deployed ceph.conf with the new settings:

[root@ceph-osd4 ceph]# grep full ceph.conf 
mon_osd_full_ratio = .91
mon_osd_nearfull_ratio = .90

[root@ceph-mon0 ceph]# grep full ceph.conf 
mon_osd_full_ratio = .91
mon_osd_nearfull_ratio = .90

Then I restarted these OSDs. It did not help.

Possibly the same behavior was reported on the ceph-users ML: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023397.html

Is this a bug, or do I need to do some magic?

Updated by Konstantin Shalygin over 6 years ago

When I deleted some data from these OSDs, the nearfull flag was cleared as well.

2017-12-21 18:29:15.653156 [INF]  Cluster is now healthy
2017-12-21 18:29:15.653145 [INF]  Health check cleared: POOL_NEARFULL (was: 1 pool(s) nearfull)
2017-12-21 18:29:15.653125 [INF]  Health check cleared: OSD_NEARFULL (was: 1 nearfull osd(s))
2017-12-21 18:29:11.649585 [WRN]  Health check update: 1 nearfull osd(s) (OSD_NEARFULL)
2017-12-21 18:28:52.239743 [WRN]  Health check update: 2 nearfull osd(s) (OSD_NEARFULL)

29  nvme 1.00000  1.00000  952G   779G  172G 81.85 8.72  89 
25  nvme 1.09999  1.00000  952G   799G  153G 83.90 8.93  91 
28  nvme 1.00000  1.00000  952G   801G  151G 84.12 8.96  91

This shows that the OSD nearfull flag is not cleared by raising mon_osd_nearfull_ratio in the config. This can be a big problem if the threshold is accidentally set several times lower than necessary (e.g. 0.2 instead of 0.8).

Updated by Greg Farnum over 6 years ago

Tracker changed from Bug to Support
Project changed from Ceph to RADOS
Category deleted (~~OSD~~)
Status changed from New to Closed

You need to change this in the osd map, not the config. "ceph osd set-nearfull-ratio" or something similar.
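
For reference, a hedged sketch of what this could look like on Luminous. To my knowledge the command is `ceph osd set-nearfull-ratio <ratio>`, and `ceph osd dump` prints the ratios stored in the OSD map as lines such as `nearfull_ratio 0.85`; please check both against your release. The two function names are hypothetical helpers, not part of Ceph.

```shell
# Hypothetical helper: read `ceph osd dump` output on stdin and print the
# nearfull ratio stored in the OSD map.
# Assumes the dump contains a line of the form "nearfull_ratio <value>".
osdmap_nearfull_ratio() {
    awk '$1 == "nearfull_ratio" { print $2 }'
}

# Hypothetical helper: set the ratio in the OSD map, then print the value
# the map now holds. Needs the ceph CLI and admin credentials.
# It is only defined here, not called.
set_nearfull_ratio() {
    ceph osd set-nearfull-ratio "$1" && ceph osd dump | osdmap_nearfull_ratio
}
```

On a monitor host, `set_nearfull_ratio 0.90` should print the new value, and the OSD_NEARFULL warning should clear once no OSD exceeds it. Running `ceph osd dump | osdmap_nearfull_ratio` beforehand shows which value the cluster is actually using, regardless of what ceph.conf or `config show` report.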
