Bug #16878

Updated by Nathan Cutler over 7 years ago

It is easy to fill up a Ceph cluster (FileStore) by running "rados bench write". 

 Assuming the full and nearfull failsafe ratios have not been changed from their defaults, the expected behavior of such a test is that the cluster will fill up to 96, 97, 98% but not more.  

 On systems with abnormally large journals, however, it is possible to fill OSDs to 100%, with disastrous consequences. This happened on one cluster with 24 OSDs, all on 1TB spinners with external journals on SSDs. The journal partitions are abnormally large (87 GiB). 

 There is a configuration parameter called osd_failsafe_nearfull_ratio which defaults to 0.90. When the filestore disk utilization ratio reaches this point, the OSD state is changed to "near full". However, the conditional used to determine whether osd_failsafe_nearfull_ratio has been exceeded does not take the journal size into account. 
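
 For illustration, the shape of such a check might look like the following. This is a minimal sketch with assumed names, not the actual Ceph source; the 0.97 full default mirrors the osd_failsafe_full_ratio default of the same era.

<pre><code class="cpp">
#include <cstdint>

// Sketch of the failsafe check as described above -- illustrative
// names, not the actual Ceph code.
enum class FullState { NONE, NEAR, FULL };

struct OsdStat {
  uint64_t kb_total;  // total capacity of the filestore disk
  uint64_t kb_used;   // space currently consumed by the filestore
};

// The check looks only at current filestore usage; data still queued
// in the journal, which will eventually be flushed to the same disk,
// is invisible to it.
FullState check_failsafe(const OsdStat& st,
                         double nearfull_ratio = 0.90,
                         double full_ratio = 0.97) {
  double ratio = static_cast<double>(st.kb_used) / st.kb_total;
  if (ratio >= full_ratio)     return FullState::FULL;
  if (ratio >= nearfull_ratio) return FullState::NEAR;
  return FullState::NONE;
}
</code></pre>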

 So, here is what might be happening: 

 1. the journal is periodically flushed to the underlying filestore; 
 2. the OSD stats (including "cur_state", which can be "FULL", "NEAR", or "NONE") are updated only before and after the journal flush operation - not during it; 
 3. when cur_state is "NEAR" or "FULL", the journal flush operation is careful not to fill up the disk, but if it is "NONE", it writes blindly for maximum performance (modeled in the sketch below). 

 This can be reproduced by creating journals that are 10% of the OSD size and then bombarding the cluster with writes (e.g. <code>rados bench --no-cleanup</code>). The OSDs will fill up to 100% and then crash. Furthermore, they cannot be restarted in this state. 
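
 To make steps 1-3 concrete, here is a toy model of the suspected race. It is an illustrative assumption about the mechanism, not Ceph code: the failsafe state is sampled only around the flush, so a large journal flushed while the state is still "NONE" can push the disk straight to 100%.

<pre><code class="cpp">
#include <algorithm>
#include <cstdint>

// Toy model of the flush behavior described in steps 1-3 above.
struct Osd {
  uint64_t disk_total = 1ull << 40;  // toy 1 TiB filestore disk
  uint64_t disk_used = 0;
  uint64_t journal_pending = 0;      // bytes queued in the journal
};

bool near_or_full(const Osd& o, double nearfull_ratio = 0.90) {
  return static_cast<double>(o.disk_used) / o.disk_total >= nearfull_ratio;
}

void flush_journal(Osd& o) {
  // The state is sampled once, before the flush (step 2)...
  bool careful = near_or_full(o);
  if (careful) {
    // "NEAR"/"FULL": flush only as much as still fits (step 3).
    o.disk_used += std::min(o.journal_pending, o.disk_total - o.disk_used);
  } else {
    // "NONE": write blindly for maximum performance. With an 87 GiB
    // journal, a single flush can jump straight past the failsafe
    // ratios and fill the disk to 100%.
    o.disk_used = std::min(o.disk_total, o.disk_used + o.journal_pending);
  }
  o.journal_pending = 0;
  // ...and is only refreshed after the flush completes -- too late.
}
</code></pre>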

 The documentation at http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ describes a technique for addressing this 100% full OSD situation, involving deletion of PGs to free up space. 

 However, in this case, even when some space is created by deleting PGs, the OSD still refuses to start, because the journal is full of queued writes. These get applied as the OSD is coming up, causing it to crash again. 

 Hence Kefu's suggested fix (see comments below), which is to assume the worst case (a full journal) when checking whether the nearfull failsafe ratio has been reached, as part of updating the OSD stats. In other words, the proposed solution is to take the journal size into account when calculating the utilization ratio. 
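
 Revisiting the earlier sketch with this idea applied (again an illustration of the approach, not Kefu's actual patch):

<pre><code class="cpp">
#include <cstdint>

enum class FullState { NONE, NEAR, FULL };

// Worst-case accounting: treat the entire journal partition as if it
// had already been flushed to the filestore disk.
FullState check_failsafe(uint64_t disk_total, uint64_t disk_used,
                         uint64_t journal_size,  // full partition size
                         double nearfull_ratio = 0.90,
                         double full_ratio = 0.97) {
  uint64_t worst_case_used = disk_used + journal_size;
  if (worst_case_used > disk_total)
    worst_case_used = disk_total;  // cap at the disk capacity
  double ratio = static_cast<double>(worst_case_used) / disk_total;
  if (ratio >= full_ratio)     return FullState::FULL;
  if (ratio >= nearfull_ratio) return FullState::NEAR;
  return FullState::NONE;
}
</code></pre>

 With an 87 GiB journal on a 1TB spinner, this shifts the effective nearfull threshold down by roughly 9 percentage points, so even a worst-case journal flush can no longer blindly write past the failsafe limits.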

 Perhaps also add some journal sizing guidance to the documentation. When SSDs are available, there is a temptation to use up all their available space by allocating it to journals.
