Project

General

Profile

Bug #38745

spillover that doesn't make sense

Added by Sage Weil over 1 year ago. Updated 23 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
     osd.50 spilled over 1.3 GiB metadata from 'db' device (20 GiB used of 31 GiB) to slow device
     osd.94 spilled over 1.1 GiB metadata from 'db' device (16 GiB used of 31 GiB) to slow device
     osd.103 spilled over 1.0 GiB metadata from 'db' device (18 GiB used of 31 GiB) to slow device

this is on the sepia lab cluster.

osd5_perf_log.log View (21.1 KB) Chris Callegari, 03/25/2019 12:15 AM

Selection-001.png View (79.3 KB) Konstantin Shalygin, 03/26/2019 02:34 AM

ceph-osd.5.log View - Manual compact log (15.3 KB) Eneko Lacunza, 03/24/2020 09:25 AM

History

#1 Updated by Chris Callegari over 1 year ago

I recently upgraded from latest mimic to nautilus. My cluster displayed 'BLUEFS_SPILLOVER BlueFS spillover detected on OSD'. It took a long conversation and a manual scan of all my osds to find the culprit. The '/usr/bin/ceph daemon osd.5 perf dump | /usr/bin/jq .' output is attached. Unfortunately I did not let this osd hang around for a long time. I zapped and created him.

Thanks,
/Chris Callegari

#2 Updated by Chris Callegari over 1 year ago

Also my cluster did not display the 'osd.X spilled over 123 GiB metadata from 'blah' device (20 GiB used of 31 GiB) to slow device"

#3 Updated by Igor Fedotov over 1 year ago

Chris Callegari wrote:

Also my cluster did not display the 'osd.X spilled over 123 GiB metadata from 'blah' device (20 GiB used of 31 GiB) to slow device"

Chris, you should invoke "ceph health detail" to get such an output.

#4 Updated by Igor Fedotov over 1 year ago

Generally I suppose this is a valid state - RocksDB put next level data to slow device when it expects it wouldn't fit into the fast one. Please recall that RocksDB uses 250MB as a level base size and 10 as a next level multiplier by default. So drive has to have 250+GB to fit L3.

#5 Updated by Sage Weil over 1 year ago

I tried a compaction on osd.50. Before,

     osd.50 spilled over 1.3 GiB metadata from 'db' device (18 GiB used of 31 GiB) to slow device

compaction showed:

2019-03-25 14:46:13.616 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:14.272 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:15.032 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:15.976 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:16.888 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:17.820 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:18.756 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:19.668 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:20.452 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:21.352 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:22.240 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:23.160 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:24.000 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:24.884 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:25.780 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:26.688 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:28.356 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0xa00000; fallback to bdev 2
2019-03-25 14:48:22.707 7f85d12e6080  1 bluestore(/var/lib/ceph/osd/ceph-50) umount

after,

     osd.50 spilled over 198 MiB metadata from 'db' device (17 GiB used of 31 GiB) to slow device

so... that is kind of weird.

rerunning with bluefs and bluestore debug enabled.

#6 Updated by Igor Fedotov over 1 year ago

@Sage, I observed up to 2x space utilization increase during compaction. You can inspect l_bluefs_max_bytes_wal, l_bluefs_max_bytes_db, l_bluefs_max_bytes_slow perf counter to confirm that. Looks like the case here as well. Log output shows that BlueFS is unable to allocate ~69 MB (which seems to be close to average SST size in RocksDB) at fast device since it has just 47MB. Hence the fallback.
So perhaps the root cause for spillover might be both level layout (as per my previous comment) and lack of space during compaction.

#7 Updated by Sage Weil over 1 year ago

ceph-post-file: a6ef2d24-56c0-486d-bb1e-f82080c0da9e

#8 Updated by Igor Fedotov over 1 year ago

Curious thing is that one is unable to realize what space is occupied at slow device due to fallbacks from neither "kvstore-tool stats" nor "bluestore-tool export" commands. bluestore-tool show-bdev-sizes is the only (implicit) mean.
E.g.
"kvstore-tool stats" output:
"": " L0 3/0 667.25 MB 2.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L1 5/0 246.06 MB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L2 39/0 2.45 GB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L3 11/0 707.18 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " Sum 58/0 4.03 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",

"bluestore-tool show-bdev-sizes" output:
1 : device size 0x100000000 : own 0x[2000~ffffe000] = 0xffffe000 : using 0xc12fe000(3.0 GiB)
2 : device size 0x100000000 : own 0x[1ee00000~3c300000,60000000~40000000] = 0x7c300000 : using 0x44500000(1.1 Gi

Please note that used space at slow device is higher than space required by L3. Which is IMO caused by the fallback.

#9 Updated by Konstantin Shalygin over 1 year ago

@Sage, I observed up to 2x space utilization increase during compaction.

This is normal behavior for first compaction.

#10 Updated by Rafal Wadolowski over 1 year ago

The slow bytes used is the problem we've been seeing for one year.
One of server has 20GB db.wal for 8TB RAW device. There is one main pool build on EC 4+2. This cluster is only object storage cluster and the slows are hitting performance on listing buckets.
Statistics for usage:

OSD.384 DB used: 8.28 GiB SLOW used= 4.51 GiB WAL used= 252.00 MiB
OSD.404 DB used: 17.57 GiB SLOW used= 77.00 MiB WAL used= 252.00 MiB
OSD.374 DB used: 8.81 GiB SLOW used= 4.87 GiB WAL used= 252.00 MiB
OSD.385 DB used: 15.46 GiB SLOW used= 0 Bytes WAL used= 252.00 MiB
OSD.382 DB used: 9.11 GiB SLOW used= 5.05 GiB WAL used= 267.00 MiB
OSD.386 DB used: 14.94 GiB SLOW used= 1.83 GiB WAL used= 252.00 MiB
OSD.401 DB used: 15.12 GiB SLOW used= 4.20 GiB WAL used= 252.00 MiB
OSD.396 DB used: 15.37 GiB SLOW used= 89.00 MiB WAL used= 252.00 MiB
OSD.377 DB used: 16.55 GiB SLOW used= 202.00 MiB WAL used= 371.00 MiB
OSD.392 DB used: 10.44 GiB SLOW used= 4.29 GiB WAL used= 304.00 MiB
OSD.403 DB used: 15.93 GiB SLOW used= 76.00 MiB WAL used= 252.00 MiB
OSD.395 DB used: 15.33 GiB SLOW used= 0 Bytes WAL used= 264.00 MiB
OSD.375 DB used: 16.16 GiB SLOW used= 4.71 GiB WAL used= 252.00 MiB
OSD.379 DB used: 6.78 GiB SLOW used= 1.75 GiB WAL used= 540.00 MiB
OSD.407 DB used: 16.47 GiB SLOW used= 141.00 MiB WAL used= 252.00 MiB
OSD.393 DB used: 15.59 GiB SLOW used= 4.19 GiB WAL used= 264.00 MiB
OSD.399 DB used: 15.28 GiB SLOW used= 4.14 GiB WAL used= 252.00 MiB
OSD.381 DB used: 15.47 GiB SLOW used= 225.00 MiB WAL used= 580.00 MiB
OSD.405 DB used: 17.07 GiB SLOW used= 6.09 GiB WAL used= 858.00 MiB
OSD.398 DB used: 14.86 GiB SLOW used= 166.00 MiB WAL used= 252.00 MiB
OSD.383 DB used: 15.12 GiB SLOW used= 78.00 MiB WAL used= 253.00 MiB
OSD.402 DB used: 18.08 GiB SLOW used= 6.04 GiB WAL used= 280.00 MiB
OSD.391 DB used: 15.78 GiB SLOW used= 3.49 GiB WAL used= 256.00 MiB
OSD.389 DB used: 9.13 GiB SLOW used= 4.58 GiB WAL used= 256.00 MiB
OSD.376 DB used: 16.92 GiB SLOW used= 1.33 GiB WAL used= 2.62 GiB
OSD.388 DB used: 15.47 GiB SLOW used= 141.00 MiB WAL used= 248.00 MiB
OSD.394 DB used: 7.74 GiB SLOW used= 4.45 GiB WAL used= 272.00 MiB
OSD.380 DB used: 15.82 GiB SLOW used= 79.00 MiB WAL used= 252.00 MiB
OSD.390 DB used: 10.88 GiB SLOW used= 5.50 GiB WAL used= 256.00 MiB
OSD.397 DB used: 8.31 GiB SLOW used= 3.73 GiB WAL used= 442.00 MiB
OSD.406 DB used: 16.69 GiB SLOW used= 195.00 MiB WAL used= 311.00 MiB
OSD.400 DB used: 16.05 GiB SLOW used= 145.00 MiB WAL used= 256.00 MiB
OSD.378 DB used: 15.22 GiB SLOW used= 152.00 MiB WAL used= 445.00 MiB
OSD.387 DB used: 13.03 GiB SLOW used= 0 Bytes WAL used= 262.00 MiB
SUM DB used: 475.00 GiB SUM SLOW used= 76.55 GiB SUM WAL used= 12.64 GiB

IMHO The compaction is only short-term resolution. I think the real problem is in how the rocksdb store data on disk. For example if you delete some of the data, they are deleted from DB, but DB still used this same space and the compaction (when it was triggered) will free it. Maybe there is a method, which is trying to optimize used space periodicaly.
Actually compaction is started by the user or by rocksdb when level ratio is above 1.0.

#11 Updated by Konstantin Shalygin over 1 year ago

Rafal, there is not your case! You spillover is because your db is lower than 30Gb. Please consult with http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

#12 Updated by Rafal Wadolowski over 1 year ago

Konstantin Shalygin wrote:

Rafal, there is not your case! You spillover is because your db is lower than 30Gb. Please consult with http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

This problem is related. The data should be only on DB, if there is free space. There is interesting split:

osd-379
{
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "": "----------------------------------------------------------------------------------------------------------------------------------------------------------",
    "": "  L0      2/0   56.73 MB   0.2      0.0     0.0      0.0       0.1      0.1       0.0   1.0      0.0     73.4         1         2    0.386       0      0",
    "": "  L3    135/0    8.47 GB   0.1      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0",
    "": " Sum    137/0    8.53 GB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.0      0.0     73.4         1         2    0.386       0      0",
    "": " Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.0      0.0     73.4         1         2    0.386       0      0",
    "": "Uptime(secs): 136650.4 total, 136650.4 interval",
    "": "Flush(GB): cumulative 0.055, interval 0.055",
    "": "AddFile(GB): cumulative 0.000, interval 0.000",
    "": "AddFile(Total Files): cumulative 0, interval 0",
    "": "AddFile(L0 Files): cumulative 0, interval 0",
    "": "AddFile(Keys): cumulative 0, interval 0",
    "": "Cumulative compaction: 0.06 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.8 seconds",
    "": "Interval compaction: 0.06 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.8 seconds",
    "": "Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count",
    "": "",
    "": "** File Read Latency Histogram By Level [default] **",
    "": "",
    "": "** DB Stats **",
    "": "Uptime(secs): 136650.4 total, 136650.4 interval",
    "": "Cumulative writes: 259K writes, 794K keys, 259K commit groups, 1.0 writes per commit group, ingest: 0.39 GB, 0.00 MB/s",
    "": "Cumulative WAL: 259K writes, 129K syncs, 2.00 writes per sync, written: 0.39 GB, 0.00 MB/s",
    "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
    "": "Interval writes: 259K writes, 794K keys, 259K commit groups, 1.0 writes per commit group, ingest: 395.78 MB, 0.00 MB/s",
    "": "Interval WAL: 259K writes, 129K syncs, 2.00 writes per sync, written: 0.39 MB, 0.00 MB/s",
    "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent" 
}

"You spillover is because your db is lower than 30Gb", why 30Gb are problem in my case?

According to your screen, we noticed this is normal effect and twice compacting cluster are clearing slow.

#13 Updated by Konstantin Shalygin over 1 year ago

why 30Gb are problem in my case?

Because of compaction levels: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

#14 Updated by Rafal Wadolowski over 1 year ago

Konstanin, okey but in documentation are default settings.
We have

bluestore_rocksdb_options = "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1610612736,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8" 

so L0 is 1.5GB,L1 1.5GB, L2 15GB
L0+L1+L2 18GB
In https://github.com/ceph/ceph/pull/22025 I proposed option to rocksdb, that will help osd.379, but I'm not sure if that change will cover all spillover cases.

#15 Updated by Konstantin Shalygin over 1 year ago

256MB+2.56GB+25.6GB=~28-29GB - for default Luminous options.

#16 Updated by Xiaoxi Chen about 1 year ago

BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD
osd.248 spilled over 257 MiB metadata from 'db' device (3.2 GiB used of 20 GiB) to slow device
osd.266 spilled over 264 MiB metadata from 'db' device (3.4 GiB used of 20 GiB) to slow device
osd.283 spilled over 330 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
osd.294 spilled over 594 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
osd.320 spilled over 279 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
osd.371 spilled over 264 MiB metadata from 'db' device (3.4 GiB used of 20 GiB) to slow device
osd.391 spilled over 264 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device

one more instance, even more wired... have no idea how come I only use 3.3GB but already split over

#17 Updated by Josh Durgin about 1 year ago

  • Assignee set to Adam Kupczyk

Adam's looking at similar spillover issues

#18 Updated by Brett Chancellor about 1 year ago

This is also showing up in 14.2.1 in instances where the db is overly provisioned.

HEALTH_WARN BlueFS spillover detected on 3 OSD
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
osd.52 spilled over 804 MiB metadata from 'db' device (29 GiB used of 148 GiB) to slow device
osd.221 spilled over 3.9 GiB metadata from 'db' device (29 GiB used of 148 GiB) to slow device
osd.245 spilled over 1.7 GiB metadata from 'db' device (28 GiB used of 148 GiB) to slow device

#19 Updated by Sage Weil 12 months ago

  • Priority changed from High to Normal

#20 Updated by Sage Weil 12 months ago

  • Status changed from 12 to In Progress

#21 Updated by Dan van der Ster 10 months ago

Us to, on hdd-only OSDs (no dedicated block.db or wal):

BLUEFS_SPILLOVER BlueFS spillover detected on 94 OSD
osd.98 spilled over 1.1 GiB metadata from 'db' device (147 MiB used of 26 GiB) to slow device
osd.135 spilled over 1.3 GiB metadata from 'db' device (183 MiB used of 26 GiB) to slow device
...

#22 Updated by Igor Fedotov 10 months ago

@Dan - this sounds weird - spillover without dedicated db... Could you please share 'ceph osd metadata output'?

#23 Updated by Rafal Wadolowski 10 months ago

@Adam are there any news about this problem? We have about ~1500 osds with spillover.
If you need some more data, feel free to contact me :)

@Dan do you have default configuration for rocksdb?

#24 Updated by Dan van der Ster 10 months ago

@Igor, @Rafal: please ignore. I was confused about the configuration of this cluster. It indeed has 26GB rocksdb's, so the spillover makes perfect sense.

#25 Updated by Marcin W 8 months ago

Due to spillover, I'm trying to optimize RocksDB options based on data partition size(roughly 9TB in example below). This will help to estimate how much space is needed for new DB partition on NVME and adjust parameters accordingly.
Would this calculation make sense and would it prevent spillover?

    def calculate_rocksdb_levels(self, base, multiplier, levels):
        # default Ceph setting:
        # base=256, multiplier=10, levels=5

        level_sizes = [ 0, base ]

        for level in range(2, levels+1):
            # L(n+1) = (Ln) * max_bytes_for_level_multiplier
            level_prev      = level - 1
            level_size_prev = level_sizes[level_prev]
            level_size      = level_size_prev * multiplier
            level_sizes.append(level_size)
        level_sizes_all = int((base * (1 - (multiplier**levels))) / (1-multiplier))
        # log.debug('level_sizes_all=%s, level_sizes=%s', level_sizes_all, level_sizes)
        return level_sizes_all, level_sizes

    def calculate_rocksdb_new_size(self, partition_db_size):
        # Add 10% of extra space for compaction and others. Enough?
        partition_db_size_w_spare = int(partition_db_size - (partition_db_size / 10))
        nearest_size        = 0
        nearest_settings    = {}

        # Only subset of base, multiplier and levels is taken into consideration here
        # Brute force geometric progression calculations to find
        # total space of all DB levels to be as close to partition_db_size_w_spare as possible
        for base in [ 64, 96, 128, 192, 256 ]:
            for multiplier in range(3, 10+1):
                for levels in range(4, 6+1):
                    # log.debug('base=%s, multiplier=%s, levels=%s', base, multiplier, levels)
                    level_sizes_all, level_sizes = self.calculate_rocksdb_levels(base, multiplier, levels)
                    if level_sizes_all < partition_db_size_w_spare:
                        if level_sizes_all > nearest_size:
                            nearest_size        = level_sizes_all
                            nearest_settings    = {
                                'base':         base,
                                'multiplier':   multiplier,
                                'levels':       levels,
                                'level_sizes':  level_sizes[1:],
                            }

        nearest_db_size = int(nearest_size + (nearest_size / 10))
        base            = nearest_settings['base']
        multiplier      = nearest_settings['multiplier']
        levels          = nearest_settings['levels']
        mem_buffers     = int(base/16)      # base / write_buffer_size

        db_settings = {
            'compaction_readahead_size':            '2MB',
            'compaction_style':                     'kCompactionStyleLevel',
            'compaction_threads':                   '%s' % (mem_buffers * 2),
            'compression':                          'kNoCompression',
            'flusher_threads':                      '8',
            'level0_file_num_compaction_trigger':   '%s' % int(mem_buffers / 2),
            'level0_slowdown_writes_trigger':       '%s' % (mem_buffers + 8),
            'level0_stop_writes_trigger':           '%s' % (mem_buffers + 16),
            'max_background_compactions':           '%s' % (mem_buffers * 2),
            'max_bytes_for_level_base':             '%sMB' % base,
            'max_bytes_for_level_multiplier':       '%s' % multiplier,
            'max_write_buffer_number':              '%s' % mem_buffers,
            'min_write_buffer_number_to_merge':     '%s' % int(mem_buffers / 2),
            'num_levels':                           '%s' % levels,
            'recycle_log_file_num':                 '2',
            'target_file_size_base':                '16MB',
            'write_buffer_size':                    '16MB',
        }

        db_settings_join = ','.join([ '%s=%s' % (k, db_settings[k]) for k in sorted(db_settings.keys()) ])
        log.debug('partition_db_size=%s, partition_db_size_w_spare=%s, nearest_db_size=%s, %s', partition_db_size, partition_db_size_w_spare, nearest_db_size, db_settings_join)
        log.debug('final partition_db_size=%s, levels=%s, levels_total=%s', nearest_db_size, nearest_settings['level_sizes'], nearest_size)
        return nearest_db_size, db_settings_join

    def calc_db_size_for_block(self, data_size):
        # Assume DB size is about 5% of data partition
        db_size         = int(int(int((data_size * 5 / 100) + 1) / 2) * 2)
        log.debug('data_size=%s, db_size_free=%s', data_size, db_size)

        return self.calculate_rocksdb_new_size(db_size)

/dev/sdc: for data, partition size: 9537532 MB, 5% = partition_db_size(476876 MB)
nearest_db_size=437888 MB, compaction_readahead_size=2MB,compaction_style=kCompactionStyleLevel,compaction_threads=32,compression=kNoCompression,flusher_threads=8,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=24,level0_stop_writes_trigger=32,max_background_compactions=32,max_bytes_for_level_base=256MB,max_bytes_for_level_multiplier=6,max_write_buffer_number=16,min_write_buffer_number_to_merge=8,num_levels=5,recycle_log_file_num=2,target_file_size_base=16MB,write_buffer_size=16MB'
final partition_db_size=437888, levels=[256, 1536, 9216, 55296, 331776], levels_total=398080'

At this point, LVM will create 437888MB db_lv and ceph.conf needs extra section:

[osd.X]
bluestore_rocksdb_options = {{ db_settings_join }}

#26 Updated by Yoann Moulin 4 months ago

Hello,

I also have this message on a Nautilus cluster, should I must worry about ?

artemis@icitsrv5:~$ ceph --version
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
artemis@icitsrv5:~$ ceph -s
  cluster:
    id:     815ea021-7839-4a63-9dc1-14f8c5feecc6
    health: HEALTH_WARN
            BlueFS spillover detected on 1 OSD(s)

  services:
    mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 7w)
    mgr: iccluster021(active, since 10h), standbys: iccluster009, iccluster023
    mds: cephfs:5 5 up:active
    osd: 120 osds: 120 up (since 22h), 120 in (since 45h)
    rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, iccluster021.rgw0, iccluster023.rgw0)

  data:
    pools:   10 pools, 2161 pgs
    objects: 204.46M objects, 234 TiB
    usage:   367 TiB used, 295 TiB / 662 TiB avail
    pgs:     2157 active+clean
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing

  io:
    client:   504 B/s rd, 270 KiB/s wr, 0 op/s rd, 9 op/s wr

artemis@icitsrv5:~$ ceph health detail
HEALTH_WARN BlueFS spillover detected on 1 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
     osd.1 spilled over 324 MiB metadata from 'db' device (28 GiB used of 64 GiB) to slow device

Thanks,

Best regards,

Yoann

#27 Updated by Eneko Lacunza 3 months ago

Hi, we're seeing this issue too. Using 14.2.8 (proxmox build)

We originally had 1GB rocks.db partition:

  1. ceph health detail
    HEALTH_WARN BlueFS spillover detected on 3 OSD
    BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
    osd.3 spilled over 78 MiB metadata from 'db' device (1024 MiB used of 1024 MiB) to slow device
    osd.4 spilled over 78 MiB metadata from 'db' device (1024 MiB used of 1024 MiB) to slow device
    osd.5 spilled over 84 MiB metadata from 'db' device (1024 MiB used of 1024 MiB) to slow device

We have created new 6GiB partitions for rocks.db, copied the original partition, then extended it with "ceph-bluestore-tool bluefs-bdev-expand". Now we get:

  1. ceph health detail
    HEALTH_WARN BlueFS spillover detected on 3 OSD
    BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
    osd.3 spilled over 5 MiB metadata from 'db' device (555 MiB used of 6.0 GiB) to slow device
    osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of 6.0 GiB) to slow device
    osd.5 spilled over 5 MiB metadata from 'db' device (561 MiB used of 6.0 GiB) to slow device

Issuing "ceph daemon osd.X compact" doesn't help, but shows the following transitional state:

  1. ceph daemon osd.5 compact {
    "elapsed_time": 5.4560688339999999
    }
  2. ceph health detail
    HEALTH_WARN BlueFS spillover detected on 3 OSD
    BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
    osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of 6.0 GiB) to slow device
    osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of 6.0 GiB) to slow device
    osd.5 spilled over 5 MiB metadata from 'db' device (1.1 GiB used of 6.0 GiB) to slow device
    (...and after a while...)
  3. ceph health detail
    HEALTH_WARN BlueFS spillover detected on 3 OSD
    BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
    osd.3 spilled over 5 MiB metadata from 'db' device (556 MiB used of 6.0 GiB) to slow device
    osd.4 spilled over 5 MiB metadata from 'db' device (552 MiB used of 6.0 GiB) to slow device
    osd.5 spilled over 5 MiB metadata from 'db' device (551 MiB used of 6.0 GiB) to slow device

Please find attached manual compaction log.

#28 Updated by Igor Fedotov 3 months ago

Eneko,
could you please attach output for 'ceph-kvstore-tool bluestore-kv <path-to-osd> stats' command?

I suppose this is an expected behavior and can be simply ignored (and warning suppressed if needed). Given minor amount of spilled over data I expect no visible performance impact.
Once this backport PR (https://github.com/ceph/ceph/pull/33889) is merged one will have some tuning means to avoid/smooth the case.

Also generally I would recommend to have 64G DB/WAL volume for low-/mid-size deployments.

#29 Updated by Marcin W 3 months ago

@Igor,

I've read somewhere in docs that recommended DB size is no less that 5% of block. IMO if we consider either 64GB or 5% of block size, it is still likely to spill or underutilise because default RocksDB settings are static ergo levels and their sizes are constant. IIRC, the default level multiplier is 10(recommended by FB) so, levels will grow quickly. If the block device size is small, it will spill eventually. If block size is enormous, DB may be underprovisioned. According to my experience, pre-calculating RocksDB parameters makes sense because it takes into consideration block sizes independently. For this purpose, I'm using the code I posted above with small modifications. Currently, its limitation is that entire RocksDB has to fit into dedicated LV - it doesn't allow to spill and will fall on face if DB content grows too much. I'd advise to include additional level on top of what this code produces to make sure there is enough room to spill if entire DB space is allocated. I know that calculating settings may be cumbersome and error prone but IMO one size doesn't fit all.

#30 Updated by Marcin W 3 months ago

BTW. These are the default RocksDB params IIRC (in GB):
base=256, multiplier=10, levels=5

0= 0.25
1= 0,25
2= 2,5
3= 25
4= 250
5= 2500

So, If LV size of DB is 64GB, levels 1-3 will fit like a glove but level 4 guaranties spillage due to RocksDB nature - the code makes sure there is enough space to fit all files for new level before creating first file.

#31 Updated by Eneko Lacunza 3 months ago

Thanks @Igor

Sorry for the delay getting back, I didn't receive any email regarding your update from tracker.

I've just attached stats info for osd.5 . We have currently suppressed spillover warning for those 3 OSD, yes, but I'd like to get a warning if the current 6GB isn't enough when db increases to values near 3GB... :)

Is your 64GB recommendation the same as 60GB in other ceph-user mailing list messages? Is there a reason to be 64GB instead of 60GB? (I understand that being 60GB instead of 30GB is to have enough space for compaction...)

In this case, originally db partitions where 1 GB due proxmox default values. We may not have enough space to allocate 60GB for all OSDs, that's why I tried 6GB ;)

Thanks a lot

#32 Updated by Igor Fedotov 3 months ago

Eneko,

to be honest 60GB and 64GB are pretty the same to me. The estimate which resulted in this value did some rounding and assumptions like:
1) Take 30 GB to fit L1+L3 (250MB + 2.5GB + 25GB). Rounding!
2) Take 100% to fit the worst possible compaction. Assumption!
3) Take additional 4GB for WAL which in fact I've never seen above 1GB. Taking some spare just in case!

So 64GB is just a nice final value which has some spare volume included.

#33 Updated by Igor Fedotov 3 months ago

Marcin W wrote:

BTW. These are the default RocksDB params IIRC (in GB):
base=256, multiplier=10, levels=5

0= 0.25
1= 0,25
2= 2,5
3= 25
4= 250
5= 2500

So, If LV size of DB is 64GB, levels 1-3 will fit like a glove but level 4 guaranties spillage due to RocksDB nature - the code makes sure there is enough space to fit all files for new level before creating first file.

Marcin, yeah,you're almost right. But we recommend 100% spare volume for interim processes like compaction. I've observed up to 100% overhead for this purposes in the lab. Certainly production might not experience such space usage peaks most of time if any. And hence one can actually apply some value in 30GB - 64GB range. Higher value will provide more reliability though. 64GB is just a recommendation to make things simple and straightforward while being the most reliable.

#34 Updated by Igor Fedotov 3 months ago

In short - I'm trying to pretty conservative when suggesting 64GB for DB/WAL volume.

#35 Updated by Eneko Lacunza 3 months ago

Thanks a lot Igor. Will wait for the backport PR and will report back the results.

#36 Updated by Marcin W 3 months ago

Hi Igor,

So, your recommendation is to create a volume which can serve enough space for levels 1-3, compaction and file reuse purposes only.

Does it mean that according to your tests/observations the DB is never going to grow bigger than level 3(not exceed ~30GB)?

#37 Updated by Igor Fedotov 3 months ago

Marcin,

30-64GB is an optimal configuration. Certainly if one can affort 250+ GB drive and hence serve L4 - it's OK too. And definitely make sense if OSD load needs it.

30GB isn't enough sometimes IMO. This is OK for static DB store. But compaction might result in some temporary/peak loads. As I mentioned I've seen once up to 100% of L3 in the lab.
The above mentioned PR introduces some statistics collection means to be better aware about this peak utilization.

#38 Updated by Seena Fallah 25 days ago

I'm experiencing this in nautilus 14.2.9
Should the above PR solve this issue? I get what does the message really mean? I have 200GB db for my 10TB block and as I see in both prometheus and `ceph osd df` there is only 28GB of db is used. Can you explain more about it?

osd.35 spilled over 1.5 GiB metadata from 'db' device (28 GiB used of 191 GiB) to slow device

#39 Updated by Marcin W 25 days ago

Hi Seena,

Metadata is stored in RocksDB which a 'logging database'. It doesn't replace or remove entries, it just appends changes to end of file. It's normal that it accumulates lots of changes over time and it needs to be compacted(flatten to discard obsolete objects). RocksDB is split into levels which you can see in my comment above. If it needs more space for changes, it creates new level(group of files). Every level is 10x bigger than previous one. Your total allocated DB size is 28GB and at this point it created files for level 4 which needs 250GB. The condition is that whole level has to fit on single device and your DB partition doesn't offer enough space(total 191GB). Levels 0-3 reside on DB partition but in this case, DB expanded to slow device(block device where data is stored). It can affect performance depending on class of data partition(HDD, SSD, NVME).

Running RocksDB compaction from time to time prevents that issue. I have the below bash script in cron and it runs at night. You will have to think of good time of day/week when it's OK to schedule when cluster is not busy. Perhaps running it on weekend would be safer. The script doesn't stop the cluster, it will just lock DB on each disk for couple of seconds/minutes.
Perhaps Igor can clarify if I missed sth.

#!/bin/bash

export CEPH_DEV=1

/usr/bin/ceph osd ls | xargs -rn1 -I '{}' /usr/bin/ceph tell osd.'{}' compact
/usr/bin/ceph mon compact

#40 Updated by Igor Fedotov 25 days ago

@Marcin - perfect overview, thanks!
Just want to mention that this "granular" level space allocation has been fixed in both Octopus and upcoming Nautilus release.
See https://github.com/ceph/ceph/pull/33889

#41 Updated by Seena Fallah 25 days ago

@Marcin Really thanks for your overview. Now I get's what's going on.

#42 Updated by Seena Fallah 25 days ago

@Marcin One more question I would be so thankful if you answer me, How did you find out that db is now on level 4? I mean what's the size limit on level one?

#43 Updated by Seena Fallah 25 days ago

Igor Fedotov wrote:

@Marcin - perfect overview, thanks!
Just want to mention that this "granular" level space allocation has been fixed in both Octopus and upcoming Nautilus release.
See https://github.com/ceph/ceph/pull/33889

@Igor Can you please explain how does this "granular" level space allocation will fix? I mean after next release of nautilus should I still compact db or it will automatically compact previous levels?

#44 Updated by Marcin W 24 days ago

Hi Seena,

"How did you find out that db is now on level 4? I mean what's the size limit on level one?"

These are the sizes of each level in GB:
0= 0.25 <- in memory
1= 0,25
2= 2,5
3= 25
4= 250
5= 2500

You mentioned that allocated space is 28GB. It would include levels 1+2+3(from the list above) = 27,75GB plus some files for reuse.
Level 4 is the reported spillage.

#45 Updated by Igor Fedotov 23 days ago

Seena Fallah wrote:

Igor Fedotov wrote:

@Marcin - perfect overview, thanks!
Just want to mention that this "granular" level space allocation has been fixed in both Octopus and upcoming Nautilus release.
See https://github.com/ceph/ceph/pull/33889

@Igor Can you please explain how does this "granular" level space allocation will fix? I mean after next release of nautilus should I still compact db or it will automatically compact previous levels?

@Seena - first of all compacting DB is generally not a mean to avoid spillover. It can help sometimes but generally spillovers happen when fast device lacks space to fit all the data. Compaction just helps to keep data in a more optimal way, i.e. levels tend to take less space. But at some point one can get the next data level anyway.

And as Marcin explained currently RockDB spills the every level's byte if there is no enough space at fast device to fit the level completely. E.g. with the default settings L3 needs 25GB and L4 needs 250GB. Hence L4 data are unconditionally spilled over for e.g. 100GB drive. And given that L1+L2+l3 takes max 30GB (a bit simplified measurement!) 70GB of available space are wasted permanently.

The above-mentioned patch alleviates these losses by allowing to [partially] use that "wasted" space for L4 data. I.e. from now one RocksDB is able to keep level data at fast volume even when level at its full capacity doesn't fit into fast volume.

You can find some additional info in the master PR:
https://github.com/ceph/ceph/pull/29687

Also available in: Atom PDF