bluestore - Bug #38745: spillover that doesn't make sense
https://tracker.ceph.com/issues/38745

--- Chris Callegari, 2019-03-25T00:15:40Z ---
File osd5_perf_log.log added.

I recently upgraded from the latest Mimic to Nautilus. My cluster then displayed 'BLUEFS_SPILLOVER BlueFS spillover detected on OSD(s)'. It took a long conversation and a manual scan of all my OSDs to find the culprit. The output of '/usr/bin/ceph daemon osd.5 perf dump | /usr/bin/jq .' is attached. Unfortunately I did not let this OSD hang around for long; I zapped and re-created it.
Thanks,
/Chris Callegari

--- Chris Callegari, 2019-03-25T00:17:19Z ---

Also, my cluster did not display the detailed message "osd.X spilled over 123 GiB metadata from 'blah' device (20 GiB used of 31 GiB) to slow device".

--- Igor Fedotov (igor.fedotov@croit.io), 2019-03-25T11:10:51Z ---

Chris Callegari wrote:

> Also, my cluster did not display the detailed message "osd.X spilled over 123 GiB metadata from 'blah' device (20 GiB used of 31 GiB) to slow device".

Chris, you need to invoke "ceph health detail" to get that output.

--- Igor Fedotov (igor.fedotov@croit.io), 2019-03-25T14:12:51Z ---

Generally I suppose this is a valid state: RocksDB puts next-level data on the slow device when it expects that it won't fit on the fast one. Please recall that by default RocksDB uses 250 MB as the level base size and 10 as the next-level multiplier, so the drive has to have 250+ GB to fit L3.
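
For a quick sanity check of that progression, here is a minimal sketch (assuming the cited 250 MB base and 10x multiplier; in Ceph these come from max_bytes_for_level_base and max_bytes_for_level_multiplier in bluestore_rocksdb_options, and level numbering conventions vary - the ~250 GB entry below is the level this comment calls L3):

```python
# Per-level RocksDB capacity targets under leveled compaction.
base_mb, multiplier, levels = 250, 10, 4

total_mb = 0
for i in range(levels):
    size_mb = base_mb * multiplier ** i
    total_mb += size_mb
    print(f"level {i + 1}: {size_mb / 1024:8.2f} GB  (cumulative {total_mb / 1024:8.2f} GB)")
# level 1: 0.24 GB, level 2: 2.44 GB, level 3: 24.41 GB, level 4: 244.14 GB;
# a ~30 GB db device can therefore never hold the ~250 GB level in full.
```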

--- Sage Weil (sage@newdream.net), 2019-03-25T15:15:10Z ---

I tried a compaction on osd.50. Before:

```
osd.50 spilled over 1.3 GiB metadata from 'db' device (18 GiB used of 31 GiB) to slow device
```

During compaction the log showed:

```
2019-03-25 14:46:13.616 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:14.272 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:15.032 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:15.976 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:16.888 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:17.820 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:18.756 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:19.668 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:20.452 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:21.352 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:22.240 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:23.160 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:24.000 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:24.884 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:25.780 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:26.688 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:28.356 7f85bf4ea700 1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0xa00000; fallback to bdev 2
2019-03-25 14:48:22.707 7f85d12e6080 1 bluestore(/var/lib/ceph/osd/ceph-50) umount
```

After:

```
osd.50 spilled over 198 MiB metadata from 'db' device (17 GiB used of 31 GiB) to slow device
```

so... that is kind of weird.

Rerunning with bluefs and bluestore debug enabled.

--- Igor Fedotov (igor.fedotov@croit.io), 2019-03-25T15:27:22Z ---

@Sage, I have observed up to a 2x increase in space utilization during compaction. You can inspect the l_bluefs_max_bytes_wal, l_bluefs_max_bytes_db, and l_bluefs_max_bytes_slow perf counters to confirm that. That looks like the case here as well: the log output shows that BlueFS is unable to allocate ~69 MB (which seems to be close to the average SST size in this RocksDB) on the fast device, since it has just 47 MB free. Hence the fallback.
So the root cause for the spillover is probably a combination of the level layout (as per my previous comment) and lack of space during compaction.
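
For reference, the sizes in Sage's log above are hex byte counts; decoding them confirms the ~69 MB vs. 47 MB figures:

```python
# Decode the bluefs _allocate failure from the log above:
#   "failed to allocate 0x4200000 on bdev 1, free 0x2d00000"
requested = 0x4200000  # bytes BlueFS tried to allocate
free_fast = 0x2d00000  # bytes free on the fast device (bdev 1)
print(f"requested: {requested / 1e6:.1f} MB")  # ~69.2 MB
print(f"free:      {free_fast / 1e6:.1f} MB")  # ~47.2 MB -> fallback to bdev 2 (slow)
```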

--- Sage Weil (sage@newdream.net), 2019-03-25T15:56:36Z ---

ceph-post-file: a6ef2d24-56c0-486d-bb1e-f82080c0da9e

--- Igor Fedotov (igor.fedotov@croit.io), 2019-03-25T16:20:21Z ---

A curious thing is that one cannot tell how much space is occupied on the slow device due to these fallbacks from either the "kvstore-tool stats" or the "bluestore-tool export" command; "bluestore-tool show-bdev-sizes" is the only (implicit) means.

E.g. "kvstore-tool stats" output:

```
"": " L0 3/0 667.25 MB 2.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L1 5/0 246.06 MB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L2 39/0 2.45 GB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L3 11/0 707.18 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " Sum 58/0 4.03 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
```

"bluestore-tool show-bdev-sizes" output:

```
1 : device size 0x100000000 : own 0x[2000~ffffe000] = 0xffffe000 : using 0xc12fe000(3.0 GiB)
2 : device size 0x100000000 : own 0x[1ee00000~3c300000,60000000~40000000] = 0x7c300000 : using 0x44500000(1.1 Gi
```

Please note that the space used on the slow device is higher than the space required by L3, which is IMO caused by the fallback.

--- Konstantin Shalygin (k0ste@k0ste.ru), 2019-03-26T02:35:34Z ---
File Selection-001.png added.

> @Sage, I observed up to 2x space utilization increase during compaction.

This is normal behavior for the first compaction (see the attached screenshot).

--- Rafal Wadolowski (rwadolowski@cloudferro.com), 2019-03-27T08:55:23Z ---

The slow bytes used are the problem we've been seeing for a year. One of our servers has a 20 GB db.wal for an 8 TB raw device. There is one main pool built on EC 4+2. This is an object-storage-only cluster, and the slow bytes are hurting performance when listing buckets.
Usage statistics:

```
OSD.384 DB used: 8.28 GiB SLOW used= 4.51 GiB WAL used= 252.00 MiB
OSD.404 DB used: 17.57 GiB SLOW used= 77.00 MiB WAL used= 252.00 MiB
OSD.374 DB used: 8.81 GiB SLOW used= 4.87 GiB WAL used= 252.00 MiB
OSD.385 DB used: 15.46 GiB SLOW used= 0 Bytes WAL used= 252.00 MiB
OSD.382 DB used: 9.11 GiB SLOW used= 5.05 GiB WAL used= 267.00 MiB
OSD.386 DB used: 14.94 GiB SLOW used= 1.83 GiB WAL used= 252.00 MiB
OSD.401 DB used: 15.12 GiB SLOW used= 4.20 GiB WAL used= 252.00 MiB
OSD.396 DB used: 15.37 GiB SLOW used= 89.00 MiB WAL used= 252.00 MiB
OSD.377 DB used: 16.55 GiB SLOW used= 202.00 MiB WAL used= 371.00 MiB
OSD.392 DB used: 10.44 GiB SLOW used= 4.29 GiB WAL used= 304.00 MiB
OSD.403 DB used: 15.93 GiB SLOW used= 76.00 MiB WAL used= 252.00 MiB
OSD.395 DB used: 15.33 GiB SLOW used= 0 Bytes WAL used= 264.00 MiB
OSD.375 DB used: 16.16 GiB SLOW used= 4.71 GiB WAL used= 252.00 MiB
OSD.379 DB used: 6.78 GiB SLOW used= 1.75 GiB WAL used= 540.00 MiB
OSD.407 DB used: 16.47 GiB SLOW used= 141.00 MiB WAL used= 252.00 MiB
OSD.393 DB used: 15.59 GiB SLOW used= 4.19 GiB WAL used= 264.00 MiB
OSD.399 DB used: 15.28 GiB SLOW used= 4.14 GiB WAL used= 252.00 MiB
OSD.381 DB used: 15.47 GiB SLOW used= 225.00 MiB WAL used= 580.00 MiB
OSD.405 DB used: 17.07 GiB SLOW used= 6.09 GiB WAL used= 858.00 MiB
OSD.398 DB used: 14.86 GiB SLOW used= 166.00 MiB WAL used= 252.00 MiB
OSD.383 DB used: 15.12 GiB SLOW used= 78.00 MiB WAL used= 253.00 MiB
OSD.402 DB used: 18.08 GiB SLOW used= 6.04 GiB WAL used= 280.00 MiB
OSD.391 DB used: 15.78 GiB SLOW used= 3.49 GiB WAL used= 256.00 MiB
OSD.389 DB used: 9.13 GiB SLOW used= 4.58 GiB WAL used= 256.00 MiB
OSD.376 DB used: 16.92 GiB SLOW used= 1.33 GiB WAL used= 2.62 GiB
OSD.388 DB used: 15.47 GiB SLOW used= 141.00 MiB WAL used= 248.00 MiB
OSD.394 DB used: 7.74 GiB SLOW used= 4.45 GiB WAL used= 272.00 MiB
OSD.380 DB used: 15.82 GiB SLOW used= 79.00 MiB WAL used= 252.00 MiB
OSD.390 DB used: 10.88 GiB SLOW used= 5.50 GiB WAL used= 256.00 MiB
OSD.397 DB used: 8.31 GiB SLOW used= 3.73 GiB WAL used= 442.00 MiB
OSD.406 DB used: 16.69 GiB SLOW used= 195.00 MiB WAL used= 311.00 MiB
OSD.400 DB used: 16.05 GiB SLOW used= 145.00 MiB WAL used= 256.00 MiB
OSD.378 DB used: 15.22 GiB SLOW used= 152.00 MiB WAL used= 445.00 MiB
OSD.387 DB used: 13.03 GiB SLOW used= 0 Bytes WAL used= 262.00 MiB
SUM DB used: 475.00 GiB SUM SLOW used= 76.55 GiB SUM WAL used= 12.64 GiB
```

IMHO compaction is only a short-term fix. I think the real problem is how RocksDB stores data on disk. For example, when you delete some data, the entries are deleted from the DB, but the DB still occupies the same space, and only compaction (when it is triggered) frees it. Maybe there is a method that tries to optimize used space periodically.
Currently compaction is started by the user, or by RocksDB when a level's score goes above 1.0.
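
A per-OSD table like the one above can be assembled from the BlueFS perf counters; a minimal sketch, assuming local admin-socket access and the db/slow/wal *_used_bytes counter names from the bluefs section of perf dump:

```python
import json
import subprocess

def bluefs_usage(osd_id: int) -> dict:
    """Return BlueFS db/slow/wal usage (bytes) for a local OSD via its admin socket."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    bluefs = json.loads(out)["bluefs"]
    return {k: bluefs[f"{k}_used_bytes"] for k in ("db", "slow", "wal")}

if __name__ == "__main__":
    gib = 1024 ** 3
    u = bluefs_usage(5)  # osd.5 is just an example id
    print(f"OSD.5 DB used: {u['db'] / gib:.2f} GiB "
          f"SLOW used: {u['slow'] / gib:.2f} GiB "
          f"WAL used: {u['wal'] / gib:.2f} GiB")
```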

--- Konstantin Shalygin (k0ste@k0ste.ru), 2019-03-27T08:58:56Z ---

Rafal, that is not your case! Your spillover happens because your db is smaller than 30 GB. Please consult http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

--- Rafal Wadolowski (rwadolowski@cloudferro.com), 2019-03-27T12:13:38Z ---

Konstantin Shalygin wrote:

> Rafal, that is not your case! Your spillover happens because your db is smaller than 30 GB. Please consult http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

This problem is related. The data should live only on the DB device while there is free space there. There is an interesting split:

```
osd-379
{
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "": "----------------------------------------------------------------------------------------------------------------------------------------------------------",
    "": " L0 2/0 56.73 MB 0.2 0.0 0.0 0.0 0.1 0.1 0.0 1.0 0.0 73.4 1 2 0.386 0 0",
    "": " L3 135/0 8.47 GB 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
    "": " Sum 137/0 8.53 GB 0.0 0.0 0.0 0.0 0.1 0.1 0.0 1.0 0.0 73.4 1 2 0.386 0 0",
    "": " Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.1 0.1 0.0 1.0 0.0 73.4 1 2 0.386 0 0",
    "": "Uptime(secs): 136650.4 total, 136650.4 interval",
    "": "Flush(GB): cumulative 0.055, interval 0.055",
    "": "AddFile(GB): cumulative 0.000, interval 0.000",
    "": "AddFile(Total Files): cumulative 0, interval 0",
    "": "AddFile(L0 Files): cumulative 0, interval 0",
    "": "AddFile(Keys): cumulative 0, interval 0",
    "": "Cumulative compaction: 0.06 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.8 seconds",
    "": "Interval compaction: 0.06 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.8 seconds",
    "": "Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count",
    "": "",
    "": "** File Read Latency Histogram By Level [default] **",
    "": "",
    "": "** DB Stats **",
    "": "Uptime(secs): 136650.4 total, 136650.4 interval",
    "": "Cumulative writes: 259K writes, 794K keys, 259K commit groups, 1.0 writes per commit group, ingest: 0.39 GB, 0.00 MB/s",
    "": "Cumulative WAL: 259K writes, 129K syncs, 2.00 writes per sync, written: 0.39 GB, 0.00 MB/s",
    "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
    "": "Interval writes: 259K writes, 794K keys, 259K commit groups, 1.0 writes per commit group, ingest: 395.78 MB, 0.00 MB/s",
    "": "Interval WAL: 259K writes, 129K syncs, 2.00 writes per sync, written: 0.39 MB, 0.00 MB/s",
    "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
}
```

"Your spillover happens because your db is smaller than 30 GB" - why is 30 GB a problem in my case?

According to your screenshot, we observed that this is the normal effect and that compacting the cluster twice clears the slow device.

--- Konstantin Shalygin (k0ste@k0ste.ru), 2019-03-27T13:00:07Z ---

> Why is 30 GB a problem in my case?

Because of the compaction levels: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

--- Rafal Wadolowski (rwadolowski@cloudferro.com), 2019-03-27T13:27:07Z ---

Konstantin, okay, but the documentation describes the default settings. We have:

```
bluestore_rocksdb_options = "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1610612736,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8"
```

so L0 is 1.5 GB, L1 is 1.5 GB, L2 is 15 GB; L0+L1+L2 = 18 GB.
In https://github.com/ceph/ceph/pull/22025 I proposed an option to RocksDB that would help osd.379, but I'm not sure that change covers all spillover cases.
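
As a cross-check, the level capacities implied by such an options string can be computed directly; a small sketch (assuming only the level-related options matter for the layout, with targets indexed from L1 as RocksDB does):

```python
# Derive RocksDB level capacity targets from a bluestore_rocksdb_options string.
opts_str = ("write_buffer_size=67108864,max_bytes_for_level_base=1610612736,"
            "max_bytes_for_level_multiplier=10,num_levels=5")
opts = dict(kv.split("=", 1) for kv in opts_str.split(","))

base = int(opts["max_bytes_for_level_base"])        # L1 target: 1.5 GiB
mult = int(opts["max_bytes_for_level_multiplier"])  # 10
levels = int(opts["num_levels"])                    # 5 (L0..L4)

for n in range(1, levels):
    print(f"L{n}: {base * mult ** (n - 1) / 2 ** 30:.1f} GiB")
# L1: 1.5, L2: 15.0, L3: 150.0, L4: 1500.0 GiB -> with a ~20 GiB db volume,
# everything from L3 up is forced to the slow device.
```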

--- Konstantin Shalygin (k0ste@k0ste.ru), 2019-03-28T02:04:31Z ---

256 MB + 2.56 GB + 25.6 GB = ~28-29 GB, for the default Luminous options.

--- Xiaoxi Chen (xiaoxchen@ebay.com), 2019-05-30T10:07:27Z ---

```
BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD(s)
    osd.248 spilled over 257 MiB metadata from 'db' device (3.2 GiB used of 20 GiB) to slow device
    osd.266 spilled over 264 MiB metadata from 'db' device (3.4 GiB used of 20 GiB) to slow device
    osd.283 spilled over 330 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
    osd.294 spilled over 594 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
    osd.320 spilled over 279 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
    osd.371 spilled over 264 MiB metadata from 'db' device (3.4 GiB used of 20 GiB) to slow device
    osd.391 spilled over 264 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
```

One more instance, even weirder... I have no idea how I can be using only 3.3 GiB and already spill over.

--- Josh Durgin, 2019-05-30T14:15:44Z ---
Assignee set to Adam Kupczyk.

Adam's looking at similar spillover issues.

--- Brett Chancellor, 2019-06-14T15:16:48Z ---

This is also showing up in 14.2.1 in instances where the db is over-provisioned.

```
HEALTH_WARN BlueFS spillover detected on 3 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
    osd.52 spilled over 804 MiB metadata from 'db' device (29 GiB used of 148 GiB) to slow device
    osd.221 spilled over 3.9 GiB metadata from 'db' device (29 GiB used of 148 GiB) to slow device
    osd.245 spilled over 1.7 GiB metadata from 'db' device (28 GiB used of 148 GiB) to slow device
```

--- Sage Weil (sage@newdream.net), 2019-07-25T14:11:32Z ---
Priority changed from High to Normal.

--- Sage Weil (sage@newdream.net), 2019-07-25T14:11:50Z ---
Status changed from 12 to In Progress.

--- Dan van der Ster, 2019-09-10T15:38:50Z ---

Us too, on HDD-only OSDs (no dedicated block.db or WAL):

```
BLUEFS_SPILLOVER BlueFS spillover detected on 94 OSD(s)
    osd.98 spilled over 1.1 GiB metadata from 'db' device (147 MiB used of 26 GiB) to slow device
    osd.135 spilled over 1.3 GiB metadata from 'db' device (183 MiB used of 26 GiB) to slow device
    ...
```

--- Igor Fedotov (igor.fedotov@croit.io), 2019-09-10T16:03:40Z ---

@Dan - this sounds weird - spillover without a dedicated db... Could you please share the 'ceph osd metadata' output?

--- Rafal Wadolowski (rwadolowski@cloudferro.com), 2019-09-11T05:05:35Z ---

@Adam, is there any news about this problem? We have about ~1500 OSDs with spillover.
If you need more data, feel free to contact me :)

@Dan, do you have the default configuration for RocksDB?

--- Dan van der Ster, 2019-09-11T07:56:28Z ---

@Igor, @Rafal: please ignore. I was confused about the configuration of this cluster. It indeed has 26 GB RocksDBs, so the spillover makes perfect sense.

--- Marcin W, 2019-10-29T14:41:16Z ---

Due to spillover, I'm trying to optimize the RocksDB options based on the data partition size (roughly 9 TB in the example below). This helps estimate how much space is needed for a new DB partition on NVMe and adjust the parameters accordingly.
Would this calculation make sense, and would it prevent spillover?

```python
import logging

log = logging.getLogger(__name__)


def calculate_rocksdb_levels(self, base, multiplier, levels):
    # default Ceph setting:
    # base=256, multiplier=10, levels=5
    level_sizes = [0, base]
    for level in range(2, levels + 1):
        # L(n+1) = (Ln) * max_bytes_for_level_multiplier
        level_prev = level - 1
        level_size_prev = level_sizes[level_prev]
        level_size = level_size_prev * multiplier
        level_sizes.append(level_size)
    # Closed form for the geometric series: total size of all levels.
    level_sizes_all = int((base * (1 - (multiplier ** levels))) / (1 - multiplier))
    # log.debug('level_sizes_all=%s, level_sizes=%s', level_sizes_all, level_sizes)
    return level_sizes_all, level_sizes


def calculate_rocksdb_new_size(self, partition_db_size):
    # Reserve 10% of the partition as spare space for compaction and others. Enough?
    partition_db_size_w_spare = int(partition_db_size - (partition_db_size / 10))
    nearest_size = 0
    nearest_settings = {}
    # Only a subset of base, multiplier and levels is considered here.
    # Brute-force the geometric progression to find the total space of all
    # DB levels that comes closest to partition_db_size_w_spare.
    for base in [64, 96, 128, 192, 256]:
        for multiplier in range(3, 10 + 1):
            for levels in range(4, 6 + 1):
                # log.debug('base=%s, multiplier=%s, levels=%s', base, multiplier, levels)
                level_sizes_all, level_sizes = self.calculate_rocksdb_levels(base, multiplier, levels)
                if level_sizes_all < partition_db_size_w_spare:
                    if level_sizes_all > nearest_size:
                        nearest_size = level_sizes_all
                        nearest_settings = {
                            'base': base,
                            'multiplier': multiplier,
                            'levels': levels,
                            'level_sizes': level_sizes[1:],
                        }
    nearest_db_size = int(nearest_size + (nearest_size / 10))
    base = nearest_settings['base']
    multiplier = nearest_settings['multiplier']
    levels = nearest_settings['levels']
    mem_buffers = int(base / 16)  # base / write_buffer_size
    db_settings = {
        'compaction_readahead_size': '2MB',
        'compaction_style': 'kCompactionStyleLevel',
        'compaction_threads': '%s' % (mem_buffers * 2),
        'compression': 'kNoCompression',
        'flusher_threads': '8',
        'level0_file_num_compaction_trigger': '%s' % int(mem_buffers / 2),
        'level0_slowdown_writes_trigger': '%s' % (mem_buffers + 8),
        'level0_stop_writes_trigger': '%s' % (mem_buffers + 16),
        'max_background_compactions': '%s' % (mem_buffers * 2),
        'max_bytes_for_level_base': '%sMB' % base,
        'max_bytes_for_level_multiplier': '%s' % multiplier,
        'max_write_buffer_number': '%s' % mem_buffers,
        'min_write_buffer_number_to_merge': '%s' % int(mem_buffers / 2),
        'num_levels': '%s' % levels,
        'recycle_log_file_num': '2',
        'target_file_size_base': '16MB',
        'write_buffer_size': '16MB',
    }
    db_settings_join = ','.join(['%s=%s' % (k, db_settings[k]) for k in sorted(db_settings.keys())])
    log.debug('partition_db_size=%s, partition_db_size_w_spare=%s, nearest_db_size=%s, %s',
              partition_db_size, partition_db_size_w_spare, nearest_db_size, db_settings_join)
    log.debug('final partition_db_size=%s, levels=%s, levels_total=%s',
              nearest_db_size, nearest_settings['level_sizes'], nearest_size)
    return nearest_db_size, db_settings_join


def calc_db_size_for_block(self, data_size):
    # Assume DB size is about 5% of the data partition.
    db_size = int(int(int((data_size * 5 / 100) + 1) / 2) * 2)
    log.debug('data_size=%s, db_size_free=%s', data_size, db_size)
    return self.calculate_rocksdb_new_size(db_size)
```

```
/dev/sdc: for data, partition size: 9537532 MB, 5% = partition_db_size(476876 MB)
nearest_db_size=437888 MB, compaction_readahead_size=2MB,compaction_style=kCompactionStyleLevel,compaction_threads=32,compression=kNoCompression,flusher_threads=8,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=24,level0_stop_writes_trigger=32,max_background_compactions=32,max_bytes_for_level_base=256MB,max_bytes_for_level_multiplier=6,max_write_buffer_number=16,min_write_buffer_number_to_merge=8,num_levels=5,recycle_log_file_num=2,target_file_size_base=16MB,write_buffer_size=16MB
final partition_db_size=437888, levels=[256, 1536, 9216, 55296, 331776], levels_total=398080
```

At this point, LVM will create a 437888 MB db_lv, and ceph.conf needs an extra section:

```
[osd.X]
bluestore_rocksdb_options = {{ db_settings_join }}
```

--- Yoann Moulin, 2020-03-11T08:00:12Z ---

Hello,

I also have this message on a Nautilus cluster; should I be worried about it?

```
artemis@icitsrv5:~$ ceph --version
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
```

```
artemis@icitsrv5:~$ ceph -s
  cluster:
    id:     815ea021-7839-4a63-9dc1-14f8c5feecc6
    health: HEALTH_WARN
            BlueFS spillover detected on 1 OSD(s)

  services:
    mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 7w)
    mgr: iccluster021(active, since 10h), standbys: iccluster009, iccluster023
    mds: cephfs:5 5 up:active
    osd: 120 osds: 120 up (since 22h), 120 in (since 45h)
    rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, iccluster021.rgw0, iccluster023.rgw0)

  data:
    pools:   10 pools, 2161 pgs
    objects: 204.46M objects, 234 TiB
    usage:   367 TiB used, 295 TiB / 662 TiB avail
    pgs:     2157 active+clean
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing

  io:
    client: 504 B/s rd, 270 KiB/s wr, 0 op/s rd, 9 op/s wr
```

```
artemis@icitsrv5:~$ ceph health detail
HEALTH_WARN BlueFS spillover detected on 1 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
    osd.1 spilled over 324 MiB metadata from 'db' device (28 GiB used of 64 GiB) to slow device
```

Thanks,

Best regards,

Yoann

--- Eneko Lacunza, 2020-03-24T09:25:52Z ---
File ceph-osd.5.log added.

Hi, we're seeing this issue too, using 14.2.8 (Proxmox build).

We originally had a 1 GB rocks.db partition:

```
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 3 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
    osd.3 spilled over 78 MiB metadata from 'db' device (1024 MiB used of 1024 MiB) to slow device
    osd.4 spilled over 78 MiB metadata from 'db' device (1024 MiB used of 1024 MiB) to slow device
    osd.5 spilled over 84 MiB metadata from 'db' device (1024 MiB used of 1024 MiB) to slow device
```

We created new 6 GiB partitions for rocks.db, copied the original partition over, then extended it with "ceph-bluestore-tool bluefs-bdev-expand". Now we get:

```
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 3 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
    osd.3 spilled over 5 MiB metadata from 'db' device (555 MiB used of 6.0 GiB) to slow device
    osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of 6.0 GiB) to slow device
    osd.5 spilled over 5 MiB metadata from 'db' device (561 MiB used of 6.0 GiB) to slow device
```

Issuing "ceph daemon osd.X compact" doesn't help, but it shows the following transitional state:

```
# ceph daemon osd.5 compact
{
    "elapsed_time": 5.4560688339999999
}
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 3 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
    osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of 6.0 GiB) to slow device
    osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of 6.0 GiB) to slow device
    osd.5 spilled over 5 MiB metadata from 'db' device (1.1 GiB used of 6.0 GiB) to slow device
(...and after a while...)
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 3 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
    osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of 6.0 GiB) to slow device
    osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of 6.0 GiB) to slow device
    osd.5 spilled over 5 MiB metadata from 'db' device (551 MiB used of 6.0 GiB) to slow device
```

Please find the manual compaction log attached.

--- Igor Fedotov (igor.fedotov@croit.io), 2020-03-24T10:00:45Z ---

Eneko,
could you please attach the output of 'ceph-kvstore-tool bluestore-kv <path-to-osd> stats'?

I suppose this is expected behavior and can simply be ignored (and the warning suppressed if needed). Given the minor amount of spilled-over data, I expect no visible performance impact.
Once this backport PR (https://github.com/ceph/ceph/pull/33889) is merged, there will be some tuning means to avoid or smooth out the case.

Also, generally I would recommend a 64 GB DB/WAL volume for low- to mid-size deployments.

--- Marcin W, 2020-03-24T11:47:18Z ---

@Igor,

I've read somewhere in the docs that the recommended DB size is no less than 5% of block. IMO whether we go with 64 GB or 5% of the block size, the DB is still likely to spill over or be underutilized, because the default RocksDB settings are static, ergo the levels and their sizes are constant. IIRC the default level multiplier is 10 (recommended by FB), so levels grow quickly. If the block device is small, the DB will spill eventually; if the block device is enormous, the DB may be underprovisioned.

In my experience, pre-calculating the RocksDB parameters makes sense because it considers each block size independently. For this purpose I'm using the code I posted above, with small modifications. Currently its limitation is that the entire RocksDB has to fit into the dedicated LV; it doesn't allow spilling and will fall on its face if the DB content grows too much. I'd advise including an additional level on top of what this code produces, to make sure there is enough room to spill if the entire DB space gets allocated. I know that calculating settings may be cumbersome and error-prone, but IMO one size doesn't fit all.

--- Marcin W, 2020-03-24T11:56:29Z ---

BTW, these are the default RocksDB level sizes IIRC (in GB), for base=256 MB, multiplier=10, levels=5:

L0 = 0.25
L1 = 0.25
L2 = 2.5
L3 = 25
L4 = 250
L5 = 2500

So if the LV size of the DB is 64 GB, levels 1-3 will fit like a glove, but level 4 guarantees spillage due to RocksDB's nature: the code makes sure there is enough space to fit all the files of a new level before creating the first file.

--- Eneko Lacunza, 2020-03-25T11:24:00Z ---

Thanks @Igor,

Sorry for the delay getting back; I didn't receive any email about your update from the tracker.

I've just attached the stats info for osd.5. We have currently suppressed the spillover warning for those 3 OSDs, yes, but I'd like to get a warning if the current 6 GB isn't enough once the db grows to values near 3 GB... :)

Is your 64 GB recommendation the same as the 60 GB in other ceph-users mailing list messages? Is there a reason for it to be 64 GB instead of 60 GB? (I understand that 60 GB instead of 30 GB is to have enough space for compaction...)

In this case the db partitions were originally 1 GB due to Proxmox default values. We may not have enough space to allocate 60 GB for all OSDs; that's why I tried 6 GB ;)

Thanks a lot

--- Igor Fedotov (igor.fedotov@croit.io), 2020-03-25T21:50:00Z ---

Eneko,

to be honest, 60 GB and 64 GB are pretty much the same to me. The estimate which produced this value did some rounding and made some assumptions:
1) Take 30 GB to fit L1+L2+L3 (250 MB + 2.5 GB + 25 GB). Rounding!
2) Take another 100% to fit the worst possible compaction. Assumption!
3) Take an additional 4 GB for the WAL, which in fact I've never seen above 1 GB. Some spare, just in case!

So 64 GB is just a nice final value with some spare volume included.
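
Written out as arithmetic (just restating the three steps above):

```python
# Reconstruct the 64 GB DB/WAL estimate from the steps above.
levels = 0.25 + 2.5 + 25           # L1 + L2 + L3 target sizes, GB
static = 30                        # step 1: rounded up to 30 GB
with_compaction = static * 2       # step 2: +100% for worst-case compaction
with_wal = with_compaction + 4     # step 3: +4 GB WAL spare
print(levels, static, with_compaction, with_wal)  # 27.75 30 60 64
```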

--- Igor Fedotov (igor.fedotov@croit.io), 2020-03-25T21:58:09Z ---

Marcin W wrote:

> So if the LV size of the DB is 64 GB, levels 1-3 will fit like a glove, but level 4 guarantees spillage due to RocksDB's nature [...]

Marcin, yeah, you're almost right. But we recommend 100% spare volume for interim processes like compaction; I've observed up to 100% overhead for this purpose in the lab. Certainly, production might not experience such space-usage peaks most of the time, if at all. Hence one can actually apply any value in the 30 GB - 64 GB range; a higher value just provides more reliability. 64 GB is a recommendation meant to keep things simple and straightforward while being the most reliable.

--- Igor Fedotov (igor.fedotov@croit.io), 2020-03-25T22:01:49Z ---

In short: I'm trying to be pretty conservative when suggesting 64 GB for the DB/WAL volume.

--- Eneko Lacunza, 2020-03-26T07:47:44Z ---

Thanks a lot Igor. I will wait for the backport PR and report back the results.

--- Marcin W, 2020-03-26T08:53:23Z ---

Hi Igor,

So your recommendation is to create a volume with enough space for levels 1-3 plus compaction and file-reuse purposes only.

Does that mean that according to your tests/observations the DB is never going to grow beyond level 3 (never exceed ~30 GB)?

--- Igor Fedotov (igor.fedotov@croit.io), 2020-03-26T14:44:36Z ---

Marcin,

30-64 GB is an optimal configuration. Certainly, if one can afford a 250+ GB drive and hence serve L4, that's OK too, and it definitely makes sense if the OSD load needs it.

30 GB isn't always enough IMO. It is OK for a static DB store, but compaction can produce temporary peak loads; as I mentioned, I've seen up to 100% of L3 once in the lab.
The above-mentioned PR introduces some statistics-collection means to give better awareness of this peak utilization.

--- Seena Fallah, 2020-06-10T09:49:39Z ---

I'm experiencing this on Nautilus 14.2.9.
Should the above PR solve this issue? And what does the message really mean? I have a 200 GB db for my 10 TB block device, and as I see in both Prometheus and `ceph osd df`, only 28 GB of the db is used. Can you explain more about it?

```
osd.35 spilled over 1.5 GiB metadata from 'db' device (28 GiB used of 191 GiB) to slow device
```

--- Marcin W, 2020-06-10T11:07:36Z ---

Hi Seena,

Metadata is stored in RocksDB, which is a log-structured database. It doesn't replace or remove entries in place; it just appends changes. It's normal for it to accumulate lots of changes over time, and it needs to be compacted (flattened, to discard obsolete objects). RocksDB is split into levels, which you can see in my comment above; if it needs more space for changes, it creates a new level (group of files), and every level is 10x bigger than the previous one. Your total allocated DB size is 28 GB, and at this point RocksDB has created files for level 4, which needs 250 GB. The condition is that a whole level has to fit on a single device, and your DB partition doesn't offer enough space (191 GB total). Levels 0-3 reside on the DB partition, but in this case the DB expanded onto the slow device (the block device where the data is stored). That can affect performance, depending on the class of the data device (HDD, SSD, NVMe).

Running RocksDB compaction from time to time prevents that issue. I have the bash script below in cron, and it runs at night. You will have to think of a good time of day/week when it's OK to run it, when the cluster is not busy; perhaps running it on a weekend would be safer. The script doesn't stop the cluster; it just locks the DB on each disk for a couple of seconds/minutes.
Perhaps Igor can clarify if I missed something.

```bash
#!/bin/bash
export CEPH_DEV=1
/usr/bin/ceph osd ls | xargs -rn1 -I '{}' /usr/bin/ceph tell osd.'{}' compact
/usr/bin/ceph mon compact
```

--- Igor Fedotov (igor.fedotov@croit.io), 2020-06-10T13:18:21Z ---

@Marcin - perfect overview, thanks!
Just want to mention that this "granular" level space allocation has been fixed in both Octopus and the upcoming Nautilus release.
See https://github.com/ceph/ceph/pull/33889

--- Seena Fallah, 2020-06-10T16:02:49Z ---

@Marcin Really, thanks for your overview. Now I get what's going on.

--- Seena Fallah, 2020-06-10T16:07:30Z ---

@Marcin One more question, and I would be thankful if you could answer it: how did you find out that the db is now on level 4? I mean, what's the size limit on level one?

--- Seena Fallah, 2020-06-10T16:15:34Z ---

Igor Fedotov wrote:

> Just want to mention that this "granular" level space allocation has been fixed in both Octopus and the upcoming Nautilus release.
> See https://github.com/ceph/ceph/pull/33889

@Igor Can you please explain what this "granular" level space allocation fix does? I mean, after the next Nautilus release, should I still compact the db, or will it automatically compact previous levels?

--- Marcin W, 2020-06-11T07:54:16Z ---

Hi Seena,

"How did you find out that the db is now on level 4? I mean, what's the size limit on level one?"

These are the sizes of each level in GB:

L0 = 0.25  (in memory)
L1 = 0.25
L2 = 2.5
L3 = 25
L4 = 250
L5 = 2500

You mentioned that the allocated space is 28 GB. That would include levels 1+2+3 (from the list above) = 27.75 GB, plus some files for reuse.
Level 4 is the reported spillage.

--- Igor Fedotov (igor.fedotov@croit.io), 2020-06-12T13:14:24Z ---

Seena Fallah wrote:

> @Igor Can you please explain what this "granular" level space allocation fix does? I mean, after the next Nautilus release, should I still compact the db, or will it automatically compact previous levels?

@Seena - first of all, compacting the DB is generally not a means of avoiding spillover. It can help sometimes, but spillovers generally happen when the fast device lacks the space to fit all the data. Compaction just helps keep the data in a more optimal form, i.e. levels tend to take less space, but at some point you get the next data level anyway.

And as Marcin explained, currently RocksDB spills every byte of a level if there is not enough space on the fast device to fit that level completely. E.g. with the default settings L3 needs 25 GB and L4 needs 250 GB; hence L4 data is unconditionally spilled over on, say, a 100 GB drive. And given that L1+L2+L3 take at most 30 GB (a somewhat simplified measurement!), 70 GB of available space is wasted permanently.

The above-mentioned patch alleviates these losses by allowing [partial] use of that "wasted" space for L4 data. I.e. from now on, RocksDB is able to keep level data on the fast volume even when the level at its full capacity doesn't fit there.

You can find some additional info in the master PR:
https://github.com/ceph/ceph/pull/29687
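
To make the pre-fix behavior concrete, a small sketch applying the "a level spills entirely unless it fits" rule to the 191 GB device discussed in this thread (an illustration of the rule as described above, not the actual BlueFS algorithm):

```python
# Pre-fix placement rule: a level stays on the fast device only if its whole
# capacity target fits in the space still free there.
def fast_device_usage_gb(device_gb, base_gb=0.25, mult=10, levels=5):
    used = 0.0
    for n in range(1, levels + 1):
        level = base_gb * mult ** (n - 1)
        if used + level <= device_gb:
            used += level  # whole level fits on the fast device
        else:
            break          # this level (and everything above it) spills to slow
    return used

print(fast_device_usage_gb(191))  # 27.75 -> only L1..L3 on the fast device;
                                  # ~163 GB sit unused until the "granular" fix
```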

--- Seena Fallah, 2020-07-10T12:58:54Z ---

Thanks @Igor for your help again.
I now see a new behavior: no level reaches a score of 1.0, but Ceph says the OSD has spilled over. Does it still need compaction? If it does, why is there no full level in the compaction stats?

```
osd.41 spilled over 66 MiB metadata from 'db' device (30 GiB used of 191 GiB) to slow device
```

```
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 54.9 54.9 0.0 1.0 0.0 171.0 328.58 312.96 1281 0.257 0 0
L1 1/0 65.41 MB 0.7 105.8 55.0 50.7 97.3 46.6 0.0 1.8 241.6 222.3 448.21 417.64 333 1.346 185M 15M
L2 8/0 498.21 MB 0.9 220.6 45.0 175.6 213.0 37.4 1.7 4.7 237.2 229.1 952.22 872.29 780 1.221 378M 9906K
L3 42/0 1.78 GB 0.9 180.7 38.4 142.4 164.5 22.1 1.2 4.3 229.1 208.5 807.94 714.33 616 1.312 350M 41M
L4 429/0 27.06 GB 0.1 31.5 5.0 26.4 29.5 3.1 24.0 5.9 261.6 245.3 123.22 108.54 16 7.701 39M 18M
Sum 480/0 29.39 GB 0.0 538.5 143.4 395.1 559.2 164.1 26.9 10.2 207.3 215.2 2660.17 2425.76 3026 0.879 954M 84M
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
```

And one more question: my db device is 191 GB, and with this leveling, level 4 would need 250 GB of space, which I will never have. Should I change the default settings to fit my db device's space? If yes, can you help me with the config name? I can't find it.

Thanks.

--- Seena Fallah, 2020-07-10T14:50:24Z ---
Files Screenshot from 2020-07-10 19-07-33.png and Screenshot from 2020-07-10 19-19-33.png added.

Also, in the attached graphs you can see that bluefs db used decreases after slow used increases, but slow bytes remain in use.

--- Seena Fallah, 2020-07-20T06:37:55Z ---

I found

```
uint64_t target_file_size_base = 64 * 1048576;
```

in the RocksDB code in the Ceph repo, and it seems each level is 64 MB. I have also checked bluestore_rocksdb_options on my OSDs and there was no override for target_file_size_base.
Am I wrong?