Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=756952016-08-01T14:00:22ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/75695/diff?detail_id=72113">diff</a>)</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=756962016-08-01T14:01:27ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/75696/diff?detail_id=72114">diff</a>)</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=801912016-10-21T11:13:17ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/80191/diff?detail_id=77209">diff</a>)</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803012016-10-25T10:37:30ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>My first attempt to reproduce the bug was a partial success.</p>
<p>The test cluster was 11.0.2, single node, one MON, and two OSDs. Each OSD has a raw size of 10GB and an external journal that is 1GB in size. The journal partitions are somewhat irregular:</p>
<pre>
ubuntu@target167114235145:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): C44659B7-3254-4D88-AA39-5EC24B9BB86F
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16781177 sectors (8.0 GiB)
Number  Start (sector)    End (sector)  Size        Code  Name
   1            2048         2097185    1023.0 MiB  FFFF
   2         2099200         4194337    1023.0 MiB  FFFF
</pre>
<p>The command used to fill up the cluster is "sudo rados -p rbd bench 60000 write --no-cleanup", which starts 16 concurrent threads, each writing 4MB objects to the cluster ("Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60000 seconds or 0 objects").</p>
<p>When I run the rados bench command on this cluster, the OSDs fill up to 97% before the cluster realizes that something is amiss (first appearance of "[ERR] : OSD full dropping all updates" in the log). Even after this message appears, the full ratio rises further, to 98%. After that, nothing more is written to the disk.</p>
<p>Upon closer analysis of the "heartbeat: osd_stat" messages, the OSD fills up at a linear rate until it hits just over 95% full. At this point, the behavior is strange - for a few heartbeats the usage level actually goes down (!).</p>
<pre>
2016-10-25 08:51:17.538793 7f210cc63700 5 osd.3 22 heartbeat: osd_stat(8402 MB used, 1826 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,1,6,8,1])
NOTE: 8402/10228 = 82.14%
2016-10-25 08:51:21.639633 7f210cc63700 5 osd.3 22 heartbeat: osd_stat(8714 MB used, 1514 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,2,6,4,0,4])
NOTE: 8714/10228 = 85.19%
2016-10-25 08:51:23.939854 7f210cc63700 5 osd.3 22 heartbeat: osd_stat(8962 MB used, 1266 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,2,0,3,5,6])
NOTE: 8962/10228 = 87.62%
2016-10-25 08:51:27.440389 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9210 MB used, 1018 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,3,0,0,0,3,10])
NOTE: 9210/10228 = 90.04%
2016-10-25 08:51:30.940709 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9515 MB used, 713 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,0,0,0,0,16,1])
NOTE: 9515/10228 = 93.02%
</pre>
<p>(i.e. 82.14%, 85.19%, 87.62%, 90.04%, 93.02% - linear increase per heartbeat)</p>
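<p>As a sanity check, the NOTE percentages can be recomputed directly from the used/total pairs in the heartbeats (a quick sketch; the NOTE lines appear to truncate rather than round, so the last digit can differ by 0.01):</p>

```python
# Recompute the utilization percentages annotated in the NOTE lines above
# from the (used MB, total MB) pairs reported in the osd_stat heartbeats.
heartbeats = [(8402, 10228), (8714, 10228), (8962, 10228),
              (9210, 10228), (9515, 10228)]

for used, total in heartbeats:
    pct = 100.0 * used / total
    print("%d/%d = %.2f%%" % (used, total, pct))

# Successive heartbeats differ by roughly 2.4-3.1 percentage points,
# i.e. the OSD fills at an approximately linear rate.
```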
<p>Then something changes:</p>
<pre>
2016-10-25 08:51:32.641183 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9670 MB used, 558 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,0,0,0,0,6,8])
2016-10-25 08:51:36.141500 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9726 MB used, 502 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:37.241852 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9725 MB used, 503 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:37.742268 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9725 MB used, 503 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:38.842620 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9724 MB used, 504 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:51:42.343043 7f210cc63700 5 osd.3 23 heartbeat: osd_stat(9724 MB used, 504 MB avail, 10228 MB total, peers [2]/[] op hist [])
</pre>
<p>The usage actually goes down (!) and then levels off at 9724 MB, which is 95.07%.</p>
<p>But then, we see 87 (eighty-seven) messages like this one:</p>
<pre>
08:52:04.414782 filestore(/var/lib/ceph/osd/ceph-3) queue_op 0x7f2140f74900 seq 2431 osr(1.2 0x7f2139474dc0) 4195666 bytes (queue has 16 ops and 67129672 bytes)
</pre>
<p>These messages indicate operations being queued. Each operation has a "seq" number, from "seq 2431" to "seq 25128".</p>
<p>The messages are very similar to each other. I believe each of these messages indicates a ~4MB write operation taking place.</p>
<p>Then there is another heartbeat:</p>
<pre>
2016-10-25 08:52:06.454980 7f210cc63700 5 osd.3 25 heartbeat: osd_stat(9813 MB used, 415 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,1,0,4,3,6,3,4])
</pre>
<p>Already we see OSD disk usage has exceeded 95%. But the ops keep coming, and further heartbeats show usage continues to increase:</p>
<pre>
2016-10-25 08:52:12.355231 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10024 MB used, 204 MB avail, 10228 MB total, peers [2]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,4,18])
2016-10-25 08:52:18.255550 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:20.555869 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:21.656170 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:26.356491 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:27.456790 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:30.957166 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:33.857496 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:37.357791 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:40.258233 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
2016-10-25 08:52:41.958618 7f210cc63700 5 osd.3 26 heartbeat: osd_stat(10104 MB used, 124 MB avail, 10228 MB total, peers [2]/[] op hist [])
</pre>
<p>And, finally:</p>
<pre>
2016-10-25 08:52:44.258986 7f210cc63700 -1 log_channel(cluster) log [ERR] : OSD full dropping all updates 98% full
</pre>
<p>Log excerpt (~7MB) can be viewed here: <a class="external" href="https://gist.githubusercontent.com/smithfarm/13d8c959243b604f63c5573219f4a72b/raw/fd93f769200da5c60f020254e471c9b352e94150/wip-16878-test-1.log">https://gist.githubusercontent.com/smithfarm/13d8c959243b604f63c5573219f4a72b/raw/fd93f769200da5c60f020254e471c9b352e94150/wip-16878-test-1.log</a></p>
<p>Next I will try with 2GB journals to see if that makes any difference.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803062016-10-25T12:00:34ZNathan Cutlerncutler@suse.cz
<ul></ul><p>The second test was on an identical cluster (11.0.2, 2 OSDs, 10GB each, external journals) with the only difference that the journal partitions were 2GB in size.</p>
<pre>
ubuntu@target167114248085:/var/log/ceph$ sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 8A044A84-965A-4B8E-BA16-378034D31E89
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 12582845 sectors (6.0 GiB)
Number  Start (sector)    End (sector)  Size        Code  Name
   1            2048         4196351    2.0 GiB     FFFF
   2         4196352         8390655    2.0 GiB     FFFF
</pre>
<p>In the first test, by contrast, the journal partitions were 1023 MiB in size.</p>
<p>Interestingly, in this test run cur_state is never set to "FULL", even though the journal partitions are twice as big as in the first test, and it isn't changed to "NEAR" until the usage ratio reaches 92%. Even after that, usage continues to rise to 95% before levelling off:</p>
<pre>
smithfarm@wilbur:~/Downloads> grep 'near' ceph-osd.2.log
2016-10-25 11:42:35.178483 7f70cdfe5700 0 log_channel(cluster) log [WRN] : OSD near full (92%)
2016-10-25 11:43:07.380567 7f70cdfe5700 0 log_channel(cluster) log [WRN] : OSD near full (95%)
2016-10-25 11:43:37.983834 7f70cdfe5700 0 log_channel(cluster) log [WRN] : OSD near full (95%)
2016-10-25 11:44:07.887403 7f70cdfe5700 0 log_channel(cluster) log [WRN] : OSD near full (95%)
2016-10-25 11:44:42.090935 7f70cdfe5700 0 log_channel(cluster) log [WRN] : OSD near full (95%)
</pre>
<p>Actually, as the heartbeats show, the usage ratio reaches a peak of 95.99% (9818/10228) before falling to 95.95% (9814/10228), where it levels off:</p>
<pre>
2016-10-25 11:42:28.177795 7f70cdfe5700 5 osd.2 19 heartbeat: osd_stat(8876 MB used, 1352 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,16])
2016-10-25 11:42:29.278027 7f70cdfe5700 5 osd.2 19 heartbeat: osd_stat(9074 MB used, 1154 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,5])
2016-10-25 11:42:35.178499 7f70cdfe5700 5 osd.2 20 heartbeat: osd_stat(9415 MB used, 813 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,14,2])
2016-10-25 11:42:38.078735 7f70cdfe5700 5 osd.2 20 heartbeat: osd_stat(9674 MB used, 554 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,1,4,7,0,2])
2016-10-25 11:42:40.979012 7f70cdfe5700 5 osd.2 20 heartbeat: osd_stat(9727 MB used, 501 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,9,7])
2016-10-25 11:42:45.679254 7f70cdfe5700 5 osd.2 20 heartbeat: osd_stat(9815 MB used, 413 MB avail, 10228 MB total, peers [3]/[] op hist [0,0,0,0,0,0,0,0,0,0,3,4,1])
2016-10-25 11:42:51.579456 7f70cdfe5700 5 osd.2 20 heartbeat: osd_stat(9818 MB used, 410 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:42:53.879772 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:42:56.780082 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:02.680321 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:07.380583 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:12.680811 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:13.781036 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
2016-10-25 11:43:14.881330 7f70cdfe5700 5 osd.2 21 heartbeat: osd_stat(9814 MB used, 414 MB avail, 10228 MB total, peers [3]/[] op hist [])
</pre>
<p>Later, after "rados bench" is terminated and then restarted, usage goes down to 9751 MB.</p>
<p>I have no explanation for the usage going down, since the only cluster traffic is "sudo rados -p rbd bench 60000 write --no-cleanup".</p>
<p>Also, the "OSD full dropping all updates" message does not appear in the log.</p>
<p>Full ceph logs for this test are available here: <a class="external" href="http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_11%3a24%3a30-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/25/remote/target167114248085/log/">http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_11%3a24%3a30-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/25/remote/target167114248085/log/</a></p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803112016-10-25T12:59:18ZNathan Cutlerncutler@suse.cz
<ul></ul><p>The "heartbeat:" message comes from here: <a class="external" href="https://github.com/ceph/ceph/blob/v11.0.2/src/osd/OSD.cc#L4247">https://github.com/ceph/ceph/blob/v11.0.2/src/osd/OSD.cc#L4247</a></p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803132016-10-25T13:29:04ZNathan Cutlerncutler@suse.cz
<ul></ul><p>Third test has 1.0GB journals, precisely sized - all other parameters are the same.</p>
<pre>
ubuntu@target167114225016:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 3FA69EBF-29A0-45E3-82EE-DA415B1D6FC8
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16777149 sectors (8.0 GiB)
Number  Start (sector)    End (sector)  Size        Code  Name
   1            2048         2099199    1024.0 MiB  FFFF
   2         2099200         4196351    1024.0 MiB  FFFF
</pre>
<p>The cluster fills up linearly until "9870 MB used, 358 MB avail, 10228 MB total" (96.49%) and doesn't go up (or down) from there, until I Ctrl-C the "rados bench" command and then start it again after some minutes. After restarting it, the heartbeat reports a lower usage figure (96.19%):</p>
<pre>
2016-10-25 13:34:41.714955 7fd1637c7700 5 osd.2 21 heartbeat: osd_stat(9839 MB used, 389 MB avail, 10228 MB total, peers [3]/[] op hist [])
</pre>
<p>Logs: <a class="external" href="http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_12%3a52%3a12-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/26/remote/target167114225016/log/">http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_12%3a52%3a12-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/26/remote/target167114225016/log/</a></p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803252016-10-25T14:02:49ZNathan Cutlerncutler@suse.cz
<ul></ul><p>For the fourth test, I return to the same parameters as the first test. Journal partitions are 1023.0 MiB in size and do not end on a 2048-sector boundary:</p>
<pre>
ubuntu@target167114235145:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): C44659B7-3254-4D88-AA39-5EC24B9BB86F
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16781177 sectors (8.0 GiB)
Number  Start (sector)    End (sector)  Size        Code  Name
   1            2048         2097185    1023.0 MiB  FFFF
   2         2099200         4194337    1023.0 MiB  FFFF
</pre>
<p>I do not let "rados bench" run continuously - I stop and start it periodically.</p>
<p>Excerpt from "ceph -w":</p>
<pre>
2016-10-25 13:58:29.456965 mon.0 [INF] pgmap v92: 8 pgs: 8 active+clean; 9036 MB data, 18908 MB used, 1549 MB / 20457 MB avail; 79739 kB/s wr, 19 op/s
2016-10-25 13:58:31.472783 mon.0 [INF] pgmap v93: 8 pgs: 8 active+clean; 9344 MB data, 19071 MB used, 1386 MB / 20457 MB avail; 72964 kB/s wr, 17 op/s
2016-10-25 13:58:34.058278 mon.0 [INF] pgmap v94: 8 pgs: 8 active+clean; 9460 MB data, 19376 MB used, 1081 MB / 20457 MB avail; 94371 kB/s wr, 23 op/s
2016-10-25 13:58:36.100040 mon.0 [INF] pgmap v95: 8 pgs: 8 active+clean; 9780 MB data, 19513 MB used, 944 MB / 20457 MB avail; 96508 kB/s wr, 23 op/s
2016-10-25 13:58:37.437900 mon.0 [INF] osdmap e22: 2 osds: 2 up, 2 in full
2016-10-25 13:58:37.441116 mon.0 [INF] pgmap v96: 8 pgs: 8 active+clean; 9780 MB data, 19513 MB used, 944 MB / 20457 MB avail; 96869 kB/s wr, 23 op/s
2016-10-25 13:58:39.467619 mon.0 [INF] pgmap v97: 8 pgs: 8 active+clean; 9832 MB data, 19645 MB used, 812 MB / 20457 MB avail; 15803 kB/s wr, 3 op/s
2016-10-25 13:58:41.481405 mon.0 [INF] pgmap v98: 8 pgs: 8 active+clean; 9944 MB data, 19757 MB used, 700 MB / 20457 MB avail; 41581 kB/s wr, 10 op/s
2016-10-25 13:58:44.063871 mon.0 [INF] pgmap v99: 8 pgs: 8 active+clean; 9988 MB data, 19913 MB used, 544 MB / 20457 MB avail; 34754 kB/s wr, 8 op/s
2016-10-25 13:58:40.335147 osd.2 [ERR] OSD full dropping all updates 97% full
2016-10-25 13:58:46.077324 mon.0 [INF] pgmap v100: 8 pgs: 8 active+clean; 9988 MB data, 20060 MB used, 397 MB / 20457 MB avail; 9802 kB/s wr, 2 op/s
2016-10-25 13:58:43.085854 osd.3 [ERR] OSD full dropping all updates 98% full
2016-10-25 13:58:49.105441 mon.0 [INF] pgmap v101: 8 pgs: 8 active+clean; 9988 MB data, 20060 MB used, 397 MB / 20457 MB avail
</pre>
<p>Clearly, the irregularly-sized journals affect the usage calculation.</p>
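<p>The cluster-wide ratios implied by the ceph -w excerpt bear this out (a quick sketch; the used/total figures are copied verbatim from the pgmap lines above):</p>

```python
# Cluster-wide utilization at two key moments in the "ceph -w" excerpt.
snapshots = [
    ("osdmap marks both OSDs 'in full'", 19513, 20457),   # 13:58:37
    ("after 'dropping all updates'",     20060, 20457),   # 13:58:46 onward
]

for label, used, total in snapshots:
    print("%s: %d/%d = %.1f%%" % (label, used, total, 100.0 * used / total))
```

<p>Both OSDs are marked "in full" at roughly 95.4%, yet usage keeps climbing to about 98.1%, matching the "97% full"/"98% full" errors in the log.</p>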
<p>Logs: <a class="external" href="http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_13%3a42%3a28-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/28/remote/target167114235145/log/">http://teuthology-logs.public.ceph.com/ubuntu-2016-10-25_13%3a42%3a28-rados%3asingleton-nomsgr%3aall%3a16878.yaml-wip-16878---basic-openstack/28/remote/target167114235145/log/</a></p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803262016-10-25T14:08:34ZNathan Cutlerncutler@suse.cz
<ul></ul><p>Fifth test, all parameters the same as the fourth test, except we use a jewel (latest from git - almost 10.2.4) cluster:</p>
<pre>
ubuntu@target167114237209:~$ sudo sgdisk -p /dev/vdd
Disk /dev/vdd: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 9E55F8DA-98A3-446E-B89F-D4924C376B9D
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 16781177 sectors (8.0 GiB)
Number  Start (sector)    End (sector)  Size        Code  Name
   1            2048         2097185    1023.0 MiB  FFFF
   2         2099200         4194337    1023.0 MiB  FFFF
</pre>
<p>ceph -w excerpt:</p>
<pre>
2016-10-25 14:05:42.853358 mon.0 [INF] pgmap v74: 8 pgs: 8 active+clean; 8480 MB data, 17708 MB used, 2748 MB / 20457 MB avail; 27977 kB/s wr, 6 op/s
2016-10-25 14:05:43.148348 osd.2 [WRN] OSD near full (90%)
2016-10-25 14:05:44.867965 mon.0 [INF] pgmap v75: 8 pgs: 8 active+clean; 8784 MB data, 17921 MB used, 2536 MB / 20457 MB avail; 100 MB/s wr, 25 op/s
2016-10-25 14:05:45.145358 osd.3 [WRN] OSD near full (92%)
2016-10-25 14:05:47.893578 mon.0 [INF] pgmap v76: 8 pgs: 8 active+clean; 8900 MB data, 18260 MB used, 2197 MB / 20457 MB avail; 85311 kB/s wr, 20 op/s
2016-10-25 14:05:49.916282 mon.0 [INF] pgmap v77: 8 pgs: 8 active+clean; 9336 MB data, 18737 MB used, 1720 MB / 20457 MB avail; 109 MB/s wr, 27 op/s
2016-10-25 14:05:52.092371 mon.0 [INF] pgmap v78: 8 pgs: 8 active+clean; 9468 MB data, 19228 MB used, 1229 MB / 20457 MB avail; 135 MB/s wr, 33 op/s
2016-10-25 14:05:54.221496 mon.0 [INF] pgmap v79: 8 pgs: 8 active+clean; 9920 MB data, 19635 MB used, 822 MB / 20457 MB avail; 136 MB/s wr, 34 op/s
2016-10-25 14:05:55.855625 mon.0 [INF] osdmap e20: 2 osds: 2 up, 2 in full
2016-10-25 14:05:55.858893 mon.0 [INF] pgmap v80: 8 pgs: 8 active+clean; 9920 MB data, 19635 MB used, 822 MB / 20457 MB avail; 119 MB/s wr, 29 op/s
2016-10-25 14:05:57.861462 mon.0 [INF] pgmap v81: 8 pgs: 8 active+clean; 9988 MB data, 19932 MB used, 525 MB / 20457 MB avail; 19016 kB/s wr, 4 op/s
2016-10-25 14:05:59.877270 mon.0 [INF] pgmap v82: 8 pgs: 8 active+clean; 10052 MB data, 20076 MB used, 381 MB / 20457 MB avail; 33653 kB/s wr, 8 op/s
2016-10-25 14:06:02.088427 mon.0 [INF] pgmap v83: 8 pgs: 8 active+clean; 10052 MB data, 20077 MB used, 380 MB / 20457 MB avail; 15509 kB/s wr, 3 op/s
2016-10-25 14:05:55.450072 osd.2 [ERR] OSD full dropping all updates 98% full
2016-10-25 14:05:56.248672 osd.3 [ERR] OSD full dropping all updates 98% full
2016-10-25 14:06:05.112068 mon.0 [INF] pgmap v84: 8 pgs: 8 active+clean; 10052 MB data, 20136 MB used, 321 MB / 20457 MB avail
2016-10-25 14:06:07.125225 mon.0 [INF] pgmap v85: 8 pgs: 8 active+clean; 10052 MB data, 20188 MB used, 269 MB / 20457 MB avail
2016-10-25 14:06:25.053419 osd.2 [ERR] OSD full dropping all updates 98% full
2016-10-25 14:06:27.151812 osd.3 [ERR] OSD full dropping all updates 98% full
</pre> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=803272016-10-25T14:12:28ZNathan Cutlerncutler@suse.cz
<ul></ul><p>We have a report that, on a larger cluster (24 OSDs, 1TB each, external 87GB journals), several OSDs fill up to 100% in this test. Unfortunately I don't have logs from that.</p>
<p>Is the information above enough to go on, or do I need to reproduce on a bigger cluster and try to get up to 100%?</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=804122016-10-26T12:11:55ZKefu Chaitchaikov@gmail.com
<ul></ul><p>Nathan,</p>
<p>Yeah, if the journal is abnormally large, then when we sync/flush the journal to the filestore there is a chance that the utilization percentage of the filestore could reach osd_failsafe_full_ratio before we notice it. Maybe we should set a watermark that takes the worst case into consideration, where the journal is completely full when we sync the filestore.</p>
<p>In other words, we should be more conservative when evaluating NEAR_FULL. Instead of using</p>
<pre>
is_near_full = (osd_stat.kb - osd_stat.kb_avail) / osd_stat.kb > osd_failsafe_nearfull_ratio
</pre>
<p>we should use</p>
<pre>
(osd_stat.kb - osd_stat.kb_avail + journal_size.kb) / osd_stat.kb > osd_failsafe_nearfull_ratio
</pre>
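<p>Plugged into the heartbeat figures from the first test, the two checks diverge well before the OSD runs into trouble. A minimal numerical sketch (the 0.90 threshold is an assumed stand-in for osd_failsafe_nearfull_ratio, and the variable names are illustrative, not Ceph code):</p>

```python
# Compare the existing nearfull check with the journal-aware variant,
# using osd.3's 82.14% heartbeat from the first test. Values are in MB;
# only the ratio matters, so the KB/MB unit difference cancels out.
NEARFULL_RATIO = 0.90        # assumed stand-in for osd_failsafe_nearfull_ratio

kb, kb_avail = 10228, 1826   # total and available on the data partition
journal_kb = 1023            # external journal partition size

current = (kb - kb_avail) / kb
journal_aware = (kb - kb_avail + journal_kb) / kb

print("current check:       %.4f -> nearfull=%s" % (current, current > NEARFULL_RATIO))
print("journal-aware check: %.4f -> nearfull=%s" % (journal_aware, journal_aware > NEARFULL_RATIO))
```

<p>With these numbers the current check reports about 0.82 (not near full) while the journal-aware check already exceeds the threshold, so the OSD would be throttled a full journal's worth of writes earlier.</p>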
<p>what do you think?</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=804132016-10-26T12:34:47ZAlexey Sheplyakovasheplyakov@mirantis.com
<ul></ul><blockquote>
<p>(osd_stat.kb - osd_stat.kb_avail + journal_size.kb) / osd_stat.kb</p>
</blockquote>
<p>I guess this expression should be used for external journals only. If the journal is co-located with the data the filesystem driver accounts for blocks allocated for the journal file.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=804292016-10-26T20:27:35ZNathan Cutlerncutler@suse.cz
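<p>Alexey's refinement can be sketched as a single condition (a hypothetical helper; the function name, the journal_is_external flag, and the 0.90 default are illustrative assumptions, not Ceph's actual code):</p>

```python
def near_full(kb, kb_avail, journal_kb, journal_is_external, ratio=0.90):
    """Journal-aware nearfull check (illustrative sketch, not Ceph code)."""
    used = kb - kb_avail
    if journal_is_external:
        # statfs() on the data partition knows nothing about a journal on a
        # separate device, so reserve headroom for a worst-case full journal.
        used += journal_kb
    return used / kb > ratio

# External journal: count its size against the data partition.
print(near_full(10228, 1826, 1023, journal_is_external=True))
# Co-located journal: the filesystem already accounts for its blocks,
# so adding journal_kb again would double-count it.
print(near_full(10228, 1826, 1023, journal_is_external=False))
```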
<ul></ul><p>@Kefu, first of all thank you very much for looking at this. Keep in mind that the only way I have been able to trigger the bug so far is by having the journal be irregularly sized (number of sectors not evenly divisible by 2048).</p>
<p>In other words:</p>
<ul>
<li>when the journal size (in sectors) is evenly divisible by 2048, the "write-to-full" test works as expected: the cluster fills up to 95% and the usage ratio does not rise any further;</li>
<li>when the journal size is <strong>not</strong> evenly divisible by 2048, buggy behavior is seen (the cluster fills to 98%, possibly even to 100%).</li>
</ul>
<p>That, combined with the fact that the journal is abnormally large (and full) in both cases, would indicate that the bug is more subtle, right? Still, I will try your patch and report back.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=804312016-10-26T20:28:52ZNathan Cutlerncutler@suse.cz
<ul></ul><p>@Alexey: Good point. I'll keep that in mind.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=804782016-10-27T20:04:57ZNathan Cutlerncutler@suse.cz
<ul></ul><p>Regarding my "number of sectors evenly divisible by 2048" comments above, it has been brought to my attention that 1023 MiB actually meets this criterion, provided we take "sector" to mean "512 bytes", since 1023 MiB % (2048 * 512) is 0:</p>
<pre>
>>> print (1023 * 1024**2) % (2048 * 512)
0
</pre>
<p>However, if "sector" is taken to mean 4096 bytes, this no longer holds:</p>
<pre>
>>> print (1023 * 1024**2) % (2048 * 4096)
7340032
</pre>
<p>Now, it seems to me that the Ceph journal code does not have any concept of sectors, but it does have a concept of 4096-byte blocks, and the journal block size in these tests was 4096:</p>
<pre>
2016-10-25 13:58:41.121694 7f81ac4fb700 10 journal header: block_size 4096 alignment 4096 max_size 1072709632
</pre>
<p>max_size 1072709632 is just over 1023 MiB and is divisible by 4096, so I'm still looking for an explanation and will run some more tests with different journal sizes, to gather more data.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=804852016-10-28T07:30:26ZNathan Cutlerncutler@suse.cz
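<p>The divisibility arithmetic from the last few comments can be collected into one quick check (plain Python, sizes in bytes):</p>

```python
# Alignment checks for the journal sizes used in these tests:
# modulo 2048 512-byte sectors, 2048 4096-byte sectors, and one 4096 block.
MiB = 1024 ** 2

for size_mib in (1023, 1024, 2048):
    size = size_mib * MiB
    print("%4d MiB: mod 2048*512 = %8d, mod 2048*4096 = %8d, mod 4096 = %d"
          % (size_mib, size % (2048 * 512), size % (2048 * 4096), size % 4096))

# Only the 1023 MiB size fails the "2048 4096-byte sectors" criterion.
# The journal header's max_size from the log above is itself 4096-aligned:
print(1072709632 % 4096)
```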
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/80485/diff?detail_id=77521">diff</a>)</li></ul><p>Rewrote description in light of Kefu's analysis.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=826902016-12-07T22:51:18ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-4 priority-default closed" href="/issues/18153">Bug #18153</a>: No space cause osd crash (bluestore)</i> added</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=860912017-02-15T09:04:55ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-6 priority-high2 closed" href="/issues/15912">Bug #15912</a>: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed</i> added</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=860932017-02-15T09:05:16ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>12</i></li><li><strong>Assignee</strong> deleted (<del><i>Nathan Cutler</i></del>)</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=863992017-02-20T18:13:41ZDavid Zafmandzafman@redhat.com
<ul><li><strong>Assignee</strong> set to <i>David Zafman</i></li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=864002017-02-20T18:13:59ZDavid Zafmandzafman@redhat.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=864142017-02-20T22:25:10ZNathan Cutlerncutler@suse.cz
<ul></ul><p>@David We found a reproducer for this, let me know if you need/want it.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=869012017-02-28T21:55:05ZDavid Zafmandzafman@redhat.com
<ul></ul><p>@nathan I would love to know if the fix I have for 15912, which accounts for the amount of dirty journal, fixes the issue in this tracker. If everything works, I'll close this as Duplicate.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=871552017-03-03T12:33:57ZNathan Cutlerncutler@suse.cz
<ul><li><strong>File</strong> <a href="/attachments/download/2743/ceph-12.0-2-client.log">ceph-12.0-2-client.log</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/2743/ceph-12.0-2-client.log">View</a> added</li><li><strong>File</strong> <a href="/attachments/download/2744/ceph-12.0-2.log">ceph-12.0-2.log</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/2744/ceph-12.0-2.log">View</a> added</li><li><strong>File</strong> <a href="/attachments/download/2745/ceph-osd.1.log.bz2">ceph-osd.1.log.bz2</a> added</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=871562017-03-03T12:39:31ZNathan Cutlerncutler@suse.cz
<ul></ul><p>@David: We were able to reproduce the bug on a very recent version of master plus your patches from <a class="external" href="https://github.com/ceph/ceph/pull/13425">https://github.com/ceph/ceph/pull/13425</a> - here are the details:</p>
<p>We deployed the following cluster:</p>
<ul>
<li>4 nodes</li>
<li>each node has 1GB RAM, 2 cores</li>
<li>8 OSDs (2 OSDs per node)</li>
<li>each OSD uses a 2GB disk with the journal co-located</li>
<li>4 OSDs use a 1GB journal, 2 OSDs use a 512M journal, 2 OSDs use a 256M journal</li>
<li>3 monitors sharing 3 of the 4 OSD nodes</li>
<li>248 PGs</li>
</ul>
<p>We then ran <code>rados -p rbd bench 60000 write --no-cleanup</code> and after about 5-10 minutes one of the OSDs crashes.</p>
<p>I just uploaded three files:</p>
<ul>
<li>log-12.0-2-client.log (log of rados bench)</li>
<li>log-12.0-2.log (log of ceph -w and the output of ceph -s and ceph osd<br />tree following execution of rados bench)</li>
<li>ceph-osd.1.log.bz2 (log of the OSD that crashed)</li>
</ul>
<p>Please let me know if this is useful and if you need anything else.</p> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=872472017-03-07T02:11:17ZDavid Zafmandzafman@redhat.com
<ul><li><strong>Related to</strong> deleted (<i><a class="issue tracker-1 status-3 priority-6 priority-high2 closed" href="/issues/15912">Bug #15912</a>: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed</i>)</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=872492017-03-07T02:11:37ZDavid Zafmandzafman@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-6 priority-high2 closed" href="/issues/15912">Bug #15912</a>: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed</i> added</li></ul> Ceph - Bug #16878: filestore: utilization ratio calculation does not take journal size into accounthttps://tracker.ceph.com/issues/16878?journal_id=872512017-03-07T02:12:30ZDavid Zafmandzafman@redhat.com
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Resolved</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/13425">https://github.com/ceph/ceph/pull/13425</a></p>