
Bug #16878

filestore: utilization ratio calculation does not take journal size into account

Added by Nathan Cutler about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 08/01/2016
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

It is easy to fill up a Ceph cluster (FileStore) by running "rados bench write".

Assuming the full and nearfull failsafe ratios have not been changed from their defaults, the expected behavior of such a test is that the cluster will fill up to 96, 97, or 98%, but no further.

On one cluster, however, it is possible to fill OSDs to 100%, with disastrous consequences. This cluster has 24 OSDs, all on 1TB spinners with external journals on SSDs. The journal partitions are abnormally large (87 GiB).

There is a configuration parameter called osd_failsafe_nearfull_ratio which defaults to 0.90. When the filestore disk usage ratio reaches this point, the OSD state is changed to "near full". The conditional used to determine whether osd_failsafe_nearfull_ratio has been exceeded does not take the journal size into account.

So, here is what might be happening:

1. the journal is periodically flushed to the underlying filestore;
2. the OSD stats (including "cur_state", which can be "FULL", "NEAR", or "NONE") are updated only before and after the journal flush operation - not during it;
3. when cur_state is "NEAR" or "FULL", the journal flush operation is careful not to fill up the disk, but if it is "NONE", it writes blindly for maximum performance.

Hence Kefu's suggested fix (see comments below), which is to assume the worst case (full journal) when checking whether the nearfull failsafe ratio has been reached, as part of updating the OSD stats.
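As a rough Python sketch of that suggested check (illustrative names and units only, not the actual Ceph code; osd_failsafe_nearfull_ratio defaults to 0.90 as noted above):

```python
# Illustrative sketch of the suggested worst-case nearfull check.
# Names are hypothetical; the real check lives in the OSD code.

def nearfull_checks(kb_total, kb_avail, journal_kb, ratio=0.90):
    kb_used = kb_total - kb_avail
    # Current behavior: pending journal contents are ignored.
    naive = kb_used / kb_total > ratio
    # Suggested fix: assume the journal is completely full, since it may
    # be flushed to the filestore before the stats are updated again.
    conservative = (kb_used + journal_kb) / kb_total > ratio
    return naive, conservative

# 10 GiB OSD with a 1 GiB external journal and 8.5 GiB already used:
# the naive check says "fine", the conservative one says "near full".
print(nearfull_checks(10 * 1024**2, int(1.5 * 1024**2), 1024**2))  # (False, True)
```

With the conservative check, the OSD would be marked nearfull before a journal flush can push it past the failsafe ratio.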

ceph-12.0-2-client.log - log of rados bench (75.2 KB) Nathan Cutler, 03/03/2017 12:32 PM

ceph-12.0-2.log - log of ceph -w and the output of ceph -s and ceph osd tree following execution of rados bench (26.1 KB) Nathan Cutler, 03/03/2017 12:32 PM

ceph-osd.1.log.bz2 - log of the OSD that crashed (197 KB) Nathan Cutler, 03/03/2017 12:32 PM


Related issues

Related to Ceph - Bug #18153: No space cause osd crash (bluestore) Resolved 12/06/2016
Related to Ceph - Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed Resolved 05/17/2016

History

#1 Updated by Nathan Cutler about 1 year ago

  • Description updated (diff)

#2 Updated by Nathan Cutler about 1 year ago

  • Description updated (diff)

#3 Updated by Nathan Cutler 11 months ago

  • Description updated (diff)
  • Needs Doc set to No

#4 Updated by Nathan Cutler 11 months ago

  • Status changed from New to In Progress

My first attempt to reproduce the bug was a partial success.

The test cluster was 11.0.2: single node, one MON, and two OSDs. Each OSD had a raw size of 10 GB and an external journal 1 GB in size. The journal partitions were somewhat irregular:

ubuntu@target167114235145:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): C44659B7-3254-4D88-AA39-5EC24B9BB86F
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16781177 sectors (8.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2097185   1023.0 MiB  FFFF  
   2         2099200         4194337   1023.0 MiB  FFFF

The command used to fill up the cluster is "sudo rados -p rbd bench 60000 write --no-cleanup", which starts 16 concurrent threads, each writing 4MB objects to the cluster ("Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60000 seconds or 0 objects").

When I run the rados bench command on this cluster, the OSDs fill up to 97% before the cluster realizes that something is amiss (first appearance of "[ERR] : OSD full dropping all updates" in the log). Even after this message appears, the usage ratio rises further, to 98%. After that, nothing more is written to the disk.

Closer analysis of the "heartbeat: osd_stat" messages shows that the OSD fills up at a linear rate until it hits just over 95% full. At this point, the behavior is strange: for a few heartbeats the usage level actually goes down (!).

2016-10-25 08:51:17.538793 7f210cc63700  5 osd.3 22 heartbeat: osd_stat(8402 MB used, 1826 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,1,6,8,1])
NOTE: 8402/10228 = 82.14%
2016-10-25 08:51:21.639633 7f210cc63700  5 osd.3 22 heartbeat: osd_stat(8714 MB used, 1514 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,2,6,4,0,4])
NOTE: 8714/10228 = 85.19%
2016-10-25 08:51:23.939854 7f210cc63700  5 osd.3 22 heartbeat: osd_stat(8962 MB used, 1266 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,2,0,3,5,6])
NOTE: 8962/10228 = 87.62%
2016-10-25 08:51:27.440389 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9210 MB used, 1018 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,3,0,0,0,3,10])
NOTE: 9210/10228 = 90.04%
2016-10-25 08:51:30.940709 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9515 MB used, 713 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,0,0,0,0,16,1])
NOTE: 9515/10228 = 93.02%

(i.e. 82.14%, 85.19%, 87.62%, 90.04%, 93.02% - linear increase per heartbeat)
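The NOTE percentages are just "used"/"total" from each osd_stat line, truncated to two decimal places (a quick check, not part of the logs):

```python
# Recomputing the NOTE percentages from the heartbeat lines above:
# "used" MB divided by "total" MB, truncated (not rounded) to 2 decimals.
used_mb = [8402, 8714, 8962, 9210, 9515]
total_mb = 10228

for used in used_mb:
    pct = used * 10000 // total_mb / 100  # truncate to two decimal places
    print(f"{used}/{total_mb} = {pct}%")  # 82.14%, 85.19%, 87.62%, 90.04%, 93.02%
```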

Then something changes:

2016-10-25 08:51:32.641183 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9670 MB used, 558 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,0,0,0,0,6,8])
2016-10-25 08:51:36.141500 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9726 MB used, 502 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:37.241852 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9725 MB used, 503 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:37.742268 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9725 MB used, 503 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:38.842620 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9724 MB used, 504 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:42.343043 7f210cc63700  5 osd.3 23 heartbeat: osd_stat(9724 MB used, 504 MB avail, 10228 MB total, peers [2]/[] op hist [])

The usage actually goes down (!) and then levels off at 9724 MB, which is 95.07%.

But then, we see 87 (eighty-seven) messages like this one:

08:52:04.414782 filestore(/var/lib/ceph/osd/ceph-3) queue_op 0x7f2140f74900 seq 2431 osr(1.2 0x7f2139474dc0) 4195666 bytes (queue has 16 ops and 67129672 bytes)

These messages indicate operations being queued. Each operation has a "seq" number, from "seq 2431" to "seq 25128".

The messages are very similar to each other. I believe each of these messages indicates a ~4MB write operation taking place.
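The queue figures quoted in the log line are consistent with that reading: 16 in-flight ops of roughly 4 MB each amount to the ~67 MB reported (quick arithmetic, not from the logs):

```python
# Sanity check on "queue has 16 ops and 67129672 bytes" from the log line:
# 16 queued ops of roughly 4 MiB each should total about 64-67 MB.
ops_in_queue = 16
queue_bytes = 67129672

print(queue_bytes / ops_in_queue)  # ~4.2 MB average per queued op
print(queue_bytes / 4194304)       # ~16 writes of 4 MiB each
```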

Then there is another heartbeat:

2016-10-25 08:52:06.454980 7f210cc63700  5 osd.3 25 heartbeat: osd_stat(9813 MB used, 415 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,1,0,4,3,6,3,4])

Already we see OSD disk usage has exceeded 95%. But the ops keep coming, and further heartbeats show usage continues to increase:

2016-10-25 08:52:12.355231 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10024 MB used, 204 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,4,18])
2016-10-25 08:52:18.255550 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:20.555869 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:21.656170 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:26.356491 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:27.456790 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:30.957166 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:33.857496 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:37.357791 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:40.258233 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:41.958618 7f210cc63700  5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])

And, finally:

2016-10-25 08:52:44.258986 7f210cc63700 -1 log_channel(cluster) log [ERR] : OSD full dropping all updates 98% full

Log excerpt (~7MB) can be viewed here: https://gist.githubusercontent.com/smithfarm/13d8c959243b604f63c5573219f4a72b/raw/fd93f769200da5c60f020254e471c9b352e94150/wip-16878-test-1.log

Next I will try with 2GB journals to see if that makes any difference.

#5 Updated by Nathan Cutler 11 months ago

The second test was on an identical cluster (11.0.2, 2 OSDs, 10GB each, external journals) with the only difference that the journal partitions were 2GB in size.

ubuntu@target167114248085:/var/log/ceph$ sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 8A044A84-965A-4B8E-BA16-378034D31E89
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 12582845 sectors (6.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         4196351   2.0 GiB     FFFF  
   2         4196352         8390655   2.0 GiB     FFFF

In the first test, by contrast, the journal partitions were 1023.0 MiB in size.

Interestingly, in this test run, cur_state is never set to "FULL" even though the journal partitions are twice as big as in the first test. Also, it isn't changed to "NEAR" until the usage ratio reaches 92%. Even after that, usage continues to rise to 95% before levelling off:

smithfarm@wilbur:~/Downloads> grep 'near' ceph-osd.2.log
2016-10-25 11:42:35.178483 7f70cdfe5700  0 log_channel(cluster) log [WRN] : OSD near full (92%)
2016-10-25 11:43:07.380567 7f70cdfe5700  0 log_channel(cluster) log [WRN] : OSD near full (95%)
2016-10-25 11:43:37.983834 7f70cdfe5700  0 log_channel(cluster) log [WRN] : OSD near full (95%)
2016-10-25 11:44:07.887403 7f70cdfe5700  0 log_channel(cluster) log [WRN] : OSD near full (95%)
2016-10-25 11:44:42.090935 7f70cdfe5700  0 log_channel(cluster) log [WRN] : OSD near full (95%)

Actually, as the heartbeats show, the usage ratio reaches a peak of 95.99% (9818/10228) before falling to 95.95%, where it levels off:

2016-10-25 11:42:28.177795 7f70cdfe5700  5 osd.2 19 heartbeat: osd_stat(8876 MB used, 1352 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,16])
2016-10-25 11:42:29.278027 7f70cdfe5700  5 osd.2 19 heartbeat: osd_stat(9074 MB used, 1154 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,5])
2016-10-25 11:42:35.178499 7f70cdfe5700  5 osd.2 20 heartbeat: osd_stat(9415 MB used, 813 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,14,2])
2016-10-25 11:42:38.078735 7f70cdfe5700  5 osd.2 20 heartbeat: osd_stat(9674 MB used, 554 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,1,4,7,0,2])
2016-10-25 11:42:40.979012 7f70cdfe5700  5 osd.2 20 heartbeat: osd_stat(9727 MB used, 501 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,9,7])
2016-10-25 11:42:45.679254 7f70cdfe5700  5 osd.2 20 heartbeat: osd_stat(9815 MB used, 413 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,0,3,4,1])
2016-10-25 11:42:51.579456 7f70cdfe5700  5 osd.2 20 heartbeat: osd_stat(9818 MB used, 410 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:42:53.879772 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:42:56.780082 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:02.680321 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:07.380583 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:12.680811 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:13.781036 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:14.881330 7f70cdfe5700  5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])

Later, after "rados bench" is terminated and then restarted, usage goes down to 9751 MB.

I have no explanation for the usage going down, since the only cluster traffic is "sudo rados -p rbd bench 60000 write --no-cleanup".

Also, the "OSD full dropping all updates" message does not appear in the log.

Full ceph logs for this test are available here: http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_11%3a24%3a30-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/25/remote/target167114248085/log/

#6 Updated by Nathan Cutler 11 months ago

The "heartbeat:" message comes from here: https://github.com/ceph/ceph/blob/v11.0.2/src/osd/OSD.cc#L4247

#7 Updated by Nathan Cutler 11 months ago

Third test has 1.0GB journals, precisely sized - all other parameters are the same.

ubuntu@target167114225016:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 3FA69EBF-29A0-45E3-82EE-DA415B1D6FC8
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16777149 sectors (8.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2099199   1024.0 MiB  FFFF  
   2         2099200         4196351   1024.0 MiB  FFFF

The cluster fills up linearly until "9870 MB used, 358 MB avail, 10228 MB total" (96.49%) and does not go up (or down) from there, until I Ctrl-C the "rados bench" command and then start it again after some minutes. After restarting it, the heartbeat reports a lower usage figure (96.19%):

2016-10-25 13:34:41.714955 7fd1637c7700  5 osd.2 21 heartbeat: osd_stat(9839 MB used, 389 MB avail, 10228 MB total, peers [3]/[] op hist [])

Logs: http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_12%3a52%3a12-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/26/remote/target167114225016/log/

#8 Updated by Nathan Cutler 11 months ago

For the fourth test, I return to the same parameters as the first test. The journal partitions are 1023.0 MiB in size and do not end on a 2048-sector boundary:

ubuntu@target167114235145:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): C44659B7-3254-4D88-AA39-5EC24B9BB86F
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16781177 sectors (8.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2097185   1023.0 MiB  FFFF  
   2         2099200         4194337   1023.0 MiB  FFFF

I do not let "rados bench" run continuously - I stop and start it periodically.

Excerpt from "ceph -w":

2016-10-25 13:58:29.456965 mon.0 [INF] pgmap v92: 8 pgs: 8 active+clean; 9036 MB data, 18908 MB used, 1549 MB / 20457 MB avail; 79739 kB/s wr, 19 op/s
2016-10-25 13:58:31.472783 mon.0 [INF] pgmap v93: 8 pgs: 8 active+clean; 9344 MB data, 19071 MB used, 1386 MB / 20457 MB avail; 72964 kB/s wr, 17 op/s
2016-10-25 13:58:34.058278 mon.0 [INF] pgmap v94: 8 pgs: 8 active+clean; 9460 MB data, 19376 MB used, 1081 MB / 20457 MB avail; 94371 kB/s wr, 23 op/s
2016-10-25 13:58:36.100040 mon.0 [INF] pgmap v95: 8 pgs: 8 active+clean; 9780 MB data, 19513 MB used, 944 MB / 20457 MB avail; 96508 kB/s wr, 23 op/s
2016-10-25 13:58:37.437900 mon.0 [INF] osdmap e22: 2 osds: 2 up, 2 in full
2016-10-25 13:58:37.441116 mon.0 [INF] pgmap v96: 8 pgs: 8 active+clean; 9780 MB data, 19513 MB used, 944 MB / 20457 MB avail; 96869 kB/s wr, 23 op/s
2016-10-25 13:58:39.467619 mon.0 [INF] pgmap v97: 8 pgs: 8 active+clean; 9832 MB data, 19645 MB used, 812 MB / 20457 MB avail; 15803 kB/s wr, 3 op/s
2016-10-25 13:58:41.481405 mon.0 [INF] pgmap v98: 8 pgs: 8 active+clean; 9944 MB data, 19757 MB used, 700 MB / 20457 MB avail; 41581 kB/s wr, 10 op/s
2016-10-25 13:58:44.063871 mon.0 [INF] pgmap v99: 8 pgs: 8 active+clean; 9988 MB data, 19913 MB used, 544 MB / 20457 MB avail; 34754 kB/s wr, 8 op/s
2016-10-25 13:58:40.335147 osd.2 [ERR] OSD full dropping all updates 97% full
2016-10-25 13:58:46.077324 mon.0 [INF] pgmap v100: 8 pgs: 8 active+clean; 9988 MB data, 20060 MB used, 397 MB / 20457 MB avail; 9802 kB/s wr, 2 op/s
2016-10-25 13:58:43.085854 osd.3 [ERR] OSD full dropping all updates 98% full
2016-10-25 13:58:49.105441 mon.0 [INF] pgmap v101: 8 pgs: 8 active+clean; 9988 MB data, 20060 MB used, 397 MB / 20457 MB avail

Clearly, the irregularly-sized journals affect the usage calculation.

Logs: http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_13%3a42%3a28-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/28/remote/target167114235145/log/

#9 Updated by Nathan Cutler 11 months ago

Fifth test, all parameters the same as the fourth test, except we use a jewel (latest from git - almost 10.2.4) cluster:

ubuntu@target167114237209:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 9E55F8DA-98A3-446E-B89F-D4924C376B9D
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16781177 sectors (8.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2097185   1023.0 MiB  FFFF  
   2         2099200         4194337   1023.0 MiB  FFFF 

ceph -w excerpt

2016-10-25 14:05:42.853358 mon.0 [INF] pgmap v74: 8 pgs: 8 active+clean; 8480 MB data, 17708 MB used, 2748 MB / 20457 MB avail; 27977 kB/s wr, 6 op/s
2016-10-25 14:05:43.148348 osd.2 [WRN] OSD near full (90%)
2016-10-25 14:05:44.867965 mon.0 [INF] pgmap v75: 8 pgs: 8 active+clean; 8784 MB data, 17921 MB used, 2536 MB / 20457 MB avail; 100 MB/s wr, 25 op/s
2016-10-25 14:05:45.145358 osd.3 [WRN] OSD near full (92%)
2016-10-25 14:05:47.893578 mon.0 [INF] pgmap v76: 8 pgs: 8 active+clean; 8900 MB data, 18260 MB used, 2197 MB / 20457 MB avail; 85311 kB/s wr, 20 op/s
2016-10-25 14:05:49.916282 mon.0 [INF] pgmap v77: 8 pgs: 8 active+clean; 9336 MB data, 18737 MB used, 1720 MB / 20457 MB avail; 109 MB/s wr, 27 op/s
2016-10-25 14:05:52.092371 mon.0 [INF] pgmap v78: 8 pgs: 8 active+clean; 9468 MB data, 19228 MB used, 1229 MB / 20457 MB avail; 135 MB/s wr, 33 op/s
2016-10-25 14:05:54.221496 mon.0 [INF] pgmap v79: 8 pgs: 8 active+clean; 9920 MB data, 19635 MB used, 822 MB / 20457 MB avail; 136 MB/s wr, 34 op/s
2016-10-25 14:05:55.855625 mon.0 [INF] osdmap e20: 2 osds: 2 up, 2 in full
2016-10-25 14:05:55.858893 mon.0 [INF] pgmap v80: 8 pgs: 8 active+clean; 9920 MB data, 19635 MB used, 822 MB / 20457 MB avail; 119 MB/s wr, 29 op/s
2016-10-25 14:05:57.861462 mon.0 [INF] pgmap v81: 8 pgs: 8 active+clean; 9988 MB data, 19932 MB used, 525 MB / 20457 MB avail; 19016 kB/s wr, 4 op/s
2016-10-25 14:05:59.877270 mon.0 [INF] pgmap v82: 8 pgs: 8 active+clean; 10052 MB data, 20076 MB used, 381 MB / 20457 MB avail; 33653 kB/s wr, 8 op/s
2016-10-25 14:06:02.088427 mon.0 [INF] pgmap v83: 8 pgs: 8 active+clean; 10052 MB data, 20077 MB used, 380 MB / 20457 MB avail; 15509 kB/s wr, 3 op/s
2016-10-25 14:05:55.450072 osd.2 [ERR] OSD full dropping all updates 98% full
2016-10-25 14:05:56.248672 osd.3 [ERR] OSD full dropping all updates 98% full
2016-10-25 14:06:05.112068 mon.0 [INF] pgmap v84: 8 pgs: 8 active+clean; 10052 MB data, 20136 MB used, 321 MB / 20457 MB avail
2016-10-25 14:06:07.125225 mon.0 [INF] pgmap v85: 8 pgs: 8 active+clean; 10052 MB data, 20188 MB used, 269 MB / 20457 MB avail
2016-10-25 14:06:25.053419 osd.2 [ERR] OSD full dropping all updates 98% full
2016-10-25 14:06:27.151812 osd.3 [ERR] OSD full dropping all updates 98% full

#10 Updated by Nathan Cutler 11 months ago

We have a report that, on a larger cluster (24 OSDs, 1TB each, external 87GB journals), several OSDs fill up to 100% in this test. Unfortunately I don't have logs from that.

Is the information above enough to go on, or do I need to reproduce on a bigger cluster and try to get up to 100%?

#11 Updated by Kefu Chai 11 months ago

Nathan,

yeah, if the journal is abnormally large, then when we sync/flush the journal to the filestore, there is a chance that the utilization percentage of the filestore could reach osd_failsafe_full_ratio before we notice it. maybe we should set a watermark taking the worst case into consideration, where the journal is full when we sync the filestore.

in other words, we should be more conservative when evaluating NEAR_FULL, instead of using

is_near_full = (osd_stat.kb - osd_stat.kb_avail) / osd_stat.kb > osd_failsafe_nearfull_ratio

we should use

(osd_stat.kb - osd_stat.kb_avail + journal_size.kb) / osd_stat.kb > osd_failsafe_nearfull_ratio

what do you think?

#12 Updated by Alexey Sheplyakov 11 months ago

(osd_stat.kb - osd_stat.kb_avail + journal_size.kb) / osd_stat.kb

I guess this expression should be used for external journals only. If the journal is co-located with the data the filesystem driver accounts for blocks allocated for the journal file.
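Following that reasoning, the conservative formula would only add the journal size when the journal lives on a separate device; a co-located journal's blocks are already counted in the filesystem's used figure. A small sketch with hypothetical names, building on the formula above:

```python
# Only count the journal against the data disk when it is external;
# a co-located journal file is already included in the "used" figure.
def effective_used_kb(kb_used, journal_kb, journal_is_external):
    return kb_used + journal_kb if journal_is_external else kb_used

# Same 10 GiB disk, 1 GiB journal, 8.5 GiB used:
print(effective_used_kb(8912896, 1048576, True))   # external: 9961472
print(effective_used_kb(8912896, 1048576, False))  # co-located: 8912896
```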

#13 Updated by Nathan Cutler 11 months ago

@Kefu, first of all thank you very much for looking at this. Keep in mind that the only way I have been able to trigger the bug so far is by having the journal be irregularly sized (number of sectors not evenly divisible by 2048).

In other words:

- when the journal size (in sectors) is evenly divisible by 2048, the "write-to-full" test works as expected: cluster fills up to 95% and the usage ratio does not rise any further.

- when the journal size is not evenly divisible by 2048, buggy behavior is seen (cluster fills to 98%, possibly even to 100%).

That, combined with the fact that the journal is abnormally large (and full) in both cases, would indicate that the bug is more subtle, right? Still, I will try your patch and report back.

#14 Updated by Nathan Cutler 11 months ago

@Alexey: Good point. I'll keep that in mind.

#15 Updated by Nathan Cutler 11 months ago

Regarding my "number of sectors evenly divisible by 2048" comments above, it has been brought to my attention that 1023 MiB actually meets this criterion, provided we take "sector" to mean "512 bytes", since 1023 MiB % (2048 * 512) is 0:

>>> print (1023 * 1024**2) % (2048 * 512)
0

However, if "sector" is taken to mean 4096 bytes, this no longer holds:

>>> print (1023 * 1024**2) % (2048 * 4096)
7340032

Now, it seems to me that the Ceph journal code does not have any concept of sectors, but it does have a concept of 4096-byte blocks, and the journal block size in these tests was 4096:

2016-10-25 13:58:41.121694 7f81ac4fb700 10 journal header: block_size 4096 alignment 4096 max_size 1072709632

max_size 1072709632 (just over 1023 MiB) is divisible by 4096, so I'm still looking for an explanation and will run some more tests with different journal sizes, to gather more data.
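For reference, the arithmetic behind that statement, in the same style as the snippets above (plain Python on the max_size value from the log line):

```python
# Arithmetic on the journal max_size reported in the log line above.
max_size = 1072709632      # bytes, from "journal header: ... max_size"

print(max_size / 1024**2)  # 1023.015625 -> just over 1023 MiB
print(max_size % 4096)     # 0 -> evenly divisible by the 4096-byte block size
```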

#16 Updated by Nathan Cutler 11 months ago

  • Description updated (diff)

Rewrote description in light of Kefu's analysis.

#17 Updated by Nathan Cutler 10 months ago

  • Related to Bug #18153: No space cause osd crash (bluestore) added

#18 Updated by Nathan Cutler 7 months ago

  • Related to Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed added

#19 Updated by Nathan Cutler 7 months ago

  • Status changed from In Progress to Verified
  • Assignee deleted (Nathan Cutler)

#20 Updated by David Zafman 7 months ago

  • Assignee set to David Zafman

#21 Updated by David Zafman 7 months ago

  • Priority changed from Normal to Urgent

#22 Updated by Nathan Cutler 7 months ago

@David We found a reproducer for this, let me know if you need/want it.

#23 Updated by David Zafman 7 months ago

@nathan I would love to know whether the fix I have for 15912, which accounts for the amount of dirty journal data, also fixes the issue in this tracker. If everything works, I'll close this as Duplicate.

#25 Updated by Nathan Cutler 7 months ago

@David: We were able to reproduce the bug on a very recent version of master plus your patches from https://github.com/ceph/ceph/pull/13425 - here are the details:

We deployed the following cluster:

4 Nodes
Each Node has 1GB RAM, 2 cores
8 OSDs (2 OSDs per node)
Each OSD uses 2GB disk with journal co-located
4 OSDs use 1GB journal, 2 OSDs use 512M journal, 2 OSDs use 256M journal.
3 monitors sharing 3/4 OSD nodes
248 PGs

We then ran rados -p rbd bench 60000 write --no-cleanup and, after about 5-10 minutes, one of the OSDs crashed.

I just uploaded three files:

  • log-12.0-2-client.log (log of rados bench)
  • log-12.0-2.log (log of ceph -w and the output of ceph -s and ceph osd
    tree following execution of rados bench)
  • ceph-osd.1.log.bz2 (log of the OSD that crashed)

Please let me know if this is useful and if you need anything else.

#26 Updated by David Zafman 7 months ago

  • Related to deleted (Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed)

#27 Updated by David Zafman 7 months ago

  • Related to Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed added

#28 Updated by David Zafman 7 months ago

  • Status changed from Verified to Resolved
