Bug #38745

spillover that doesn't make sense

Added by Sage Weil 8 months ago. Updated 24 days ago.

Status: In Progress
Priority: Normal
Assignee:
Target version: -
Start date: 03/14/2019
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD(s)
     osd.50 spilled over 1.3 GiB metadata from 'db' device (20 GiB used of 31 GiB) to slow device
     osd.94 spilled over 1.1 GiB metadata from 'db' device (16 GiB used of 31 GiB) to slow device
     osd.103 spilled over 1.0 GiB metadata from 'db' device (18 GiB used of 31 GiB) to slow device

this is on the sepia lab cluster.

osd5_perf_log.log View (21.1 KB) Chris Callegari, 03/25/2019 12:15 AM

Selection-001.png View (79.3 KB) Konstantin Shalygin, 03/26/2019 02:34 AM

History

#1 Updated by Chris Callegari 8 months ago

I recently upgraded from the latest mimic to nautilus. My cluster displayed 'BLUEFS_SPILLOVER BlueFS spillover detected on OSD'. It took a long conversation and a manual scan of all my OSDs to find the culprit. The '/usr/bin/ceph daemon osd.5 perf dump | /usr/bin/jq .' output is attached. Unfortunately I did not let this OSD hang around for long; I zapped and re-created it.

Thanks,
/Chris Callegari
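For reference, a rough sketch (not part of the original report) of the kind of per-OSD scan described above, reading BlueFS usage from each local admin socket. The socket path pattern and the perf counter names (a "bluefs" section containing "slow_used_bytes" and "db_used_bytes") are assumptions that should be verified against the release in use:

# Sketch: flag OSDs on this host whose BlueFS reports bytes on the slow device.
# Assumes default admin socket paths under /var/run/ceph and a "bluefs" perf
# section exposing "slow_used_bytes" / "db_used_bytes" (verify on your build).
import glob
import json
import re
import subprocess

for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
    osd_id = re.search(r'ceph-osd\.(\d+)\.asok', sock).group(1)
    out = subprocess.check_output(['ceph', 'daemon', 'osd.%s' % osd_id, 'perf', 'dump'])
    bluefs = json.loads(out).get('bluefs', {})
    slow = bluefs.get('slow_used_bytes', 0)
    if slow:
        print('osd.%s: %.2f GiB spilled to slow device (db used %.2f GiB)' % (
            osd_id, slow / 2.0 ** 30, bluefs.get('db_used_bytes', 0) / 2.0 ** 30))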

#2 Updated by Chris Callegari 8 months ago

Also, my cluster did not display the "osd.X spilled over 123 GiB metadata from 'blah' device (20 GiB used of 31 GiB) to slow device" message.

#3 Updated by Igor Fedotov 8 months ago

Chris Callegari wrote:

Also, my cluster did not display the "osd.X spilled over 123 GiB metadata from 'blah' device (20 GiB used of 31 GiB) to slow device" message.

Chris, you should invoke "ceph health detail" to get such an output.

#4 Updated by Igor Fedotov 8 months ago

Generally I suppose this is a valid state: RocksDB puts next-level data on the slow device when it expects it won't fit into the fast one. Please recall that RocksDB uses 250 MB as the level base size and 10 as the next-level multiplier by default, so the drive has to have 250+ GB to fit L3.
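For illustration, a quick sketch of that math (using a ~256 MB level base and the 10x multiplier, close to the 250 MB Igor quotes; the exact defaults depend on the release):

# Sketch: cumulative space needed to keep the first N RocksDB levels entirely
# on the fast device, assuming a ~256 MB level base and a 10x multiplier.
base_gb, multiplier = 0.25, 10
total = 0.0
for n in range(4):
    total += base_gb * multiplier ** n
    print('first %d level(s): ~%.2f GB' % (n + 1, total))
# first 1 level(s): ~0.25 GB
# first 2 level(s): ~2.75 GB
# first 3 level(s): ~27.75 GB
# first 4 level(s): ~277.75 GB
# => a ~31 GB db partition can hold the first three levels (~28 GB); the next
#    level would need another ~250 GB, hence data beyond that spills to slow.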

#5 Updated by Sage Weil 8 months ago

I tried a compaction on osd.50. Before,

     osd.50 spilled over 1.3 GiB metadata from 'db' device (18 GiB used of 31 GiB) to slow device

compaction showed:

2019-03-25 14:46:13.616 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:14.272 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:15.032 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:15.976 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:16.888 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:17.820 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:18.756 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:19.668 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:20.452 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:21.352 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:22.240 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:23.160 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:24.000 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:24.884 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:25.780 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:26.688 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0x2d00000; fallback to bdev 2
2019-03-25 14:46:28.356 7f85bf4ea700  1 bluefs _allocate failed to allocate 0x4200000 on bdev 1, free 0xa00000; fallback to bdev 2
2019-03-25 14:48:22.707 7f85d12e6080  1 bluestore(/var/lib/ceph/osd/ceph-50) umount

after,

     osd.50 spilled over 198 MiB metadata from 'db' device (17 GiB used of 31 GiB) to slow device

so... that is kind of weird.

rerunning with bluefs and bluestore debug enabled.

#6 Updated by Igor Fedotov 8 months ago

@Sage, I observed up to a 2x space utilization increase during compaction. You can inspect the l_bluefs_max_bytes_wal, l_bluefs_max_bytes_db, and l_bluefs_max_bytes_slow perf counters to confirm that. It looks like that is the case here as well. The log output shows that BlueFS is unable to allocate ~69 MB (which seems to be close to the average SST size in RocksDB) on the fast device since it has just 47 MB free; hence the fallback.
So perhaps the root cause for the spillover is a combination of the level layout (as per my previous comment) and a lack of space during compaction.
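A sketch of the kind of check Igor suggests, assuming those counters show up in "perf dump" under a "bluefs" section as "max_bytes_wal" / "max_bytes_db" / "max_bytes_slow" (the usual JSON spelling of the l_bluefs_* names, but worth verifying on the build in question):

# Sketch: compare current vs. peak BlueFS usage for one OSD, to see whether
# compaction temporarily needed roughly twice the space on the db device.
import json
import subprocess

osd = 'osd.50'  # example OSD from this ticket
perf = json.loads(subprocess.check_output(['ceph', 'daemon', osd, 'perf', 'dump']))
bluefs = perf.get('bluefs', {})
for dev in ('wal', 'db', 'slow'):
    used = bluefs.get('%s_used_bytes' % dev, 0)
    peak = bluefs.get('max_bytes_%s' % dev, 0)
    print('%4s: used %.2f GiB, peak %.2f GiB' % (dev, used / 2.0 ** 30, peak / 2.0 ** 30))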

#7 Updated by Sage Weil 8 months ago

ceph-post-file: a6ef2d24-56c0-486d-bb1e-f82080c0da9e

#8 Updated by Igor Fedotov 8 months ago

A curious thing is that one cannot tell how much space is occupied on the slow device due to these fallbacks from either the "kvstore-tool stats" or the "bluestore-tool export" command; "bluestore-tool show-bdev-sizes" is the only (implicit) means.
E.g.
"kvstore-tool stats" output:
"": " L0 3/0 667.25 MB 2.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L1 5/0 246.06 MB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L2 39/0 2.45 GB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " L3 11/0 707.18 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",
"": " Sum 58/0 4.03 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0",

"bluestore-tool show-bdev-sizes" output:
1 : device size 0x100000000 : own 0x[2000~ffffe000] = 0xffffe000 : using 0xc12fe000(3.0 GiB)
2 : device size 0x100000000 : own 0x[1ee00000~3c300000,60000000~40000000] = 0x7c300000 : using 0x44500000(1.1 Gi

Please note that the space used on the slow device is higher than the space required by L3, which is IMO caused by the fallback.
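To make that comparison concrete, the hex "using" figures above convert as follows (a quick sketch):

# Sketch: convert the "using" values reported by show-bdev-sizes above into GiB.
GiB = 2.0 ** 30
db_used = 0xc12fe000    # bdev 1 (db device)
slow_used = 0x44500000  # bdev 2 (slow device)
print('db:   %.2f GiB' % (db_used / GiB))    # ~3.02 GiB
print('slow: %.2f GiB' % (slow_used / GiB))  # ~1.07 GiB, vs. the ~0.7 GiB (707.18 MB) that L3 holds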

#9 Updated by Konstantin Shalygin 8 months ago

@Sage, I observed up to a 2x space utilization increase during compaction.

This is normal behavior for the first compaction.

#10 Updated by Rafal Wadolowski 8 months ago

The slow bytes used is a problem we've been seeing for a year.
One of the servers has a 20 GB db.wal for an 8 TB raw device. There is one main pool built on EC 4+2. This cluster is an object-storage-only cluster, and the slow-device spillovers are hurting performance when listing buckets.
Statistics for usage:

OSD.384 DB used: 8.28 GiB SLOW used= 4.51 GiB WAL used= 252.00 MiB
OSD.404 DB used: 17.57 GiB SLOW used= 77.00 MiB WAL used= 252.00 MiB
OSD.374 DB used: 8.81 GiB SLOW used= 4.87 GiB WAL used= 252.00 MiB
OSD.385 DB used: 15.46 GiB SLOW used= 0 Bytes WAL used= 252.00 MiB
OSD.382 DB used: 9.11 GiB SLOW used= 5.05 GiB WAL used= 267.00 MiB
OSD.386 DB used: 14.94 GiB SLOW used= 1.83 GiB WAL used= 252.00 MiB
OSD.401 DB used: 15.12 GiB SLOW used= 4.20 GiB WAL used= 252.00 MiB
OSD.396 DB used: 15.37 GiB SLOW used= 89.00 MiB WAL used= 252.00 MiB
OSD.377 DB used: 16.55 GiB SLOW used= 202.00 MiB WAL used= 371.00 MiB
OSD.392 DB used: 10.44 GiB SLOW used= 4.29 GiB WAL used= 304.00 MiB
OSD.403 DB used: 15.93 GiB SLOW used= 76.00 MiB WAL used= 252.00 MiB
OSD.395 DB used: 15.33 GiB SLOW used= 0 Bytes WAL used= 264.00 MiB
OSD.375 DB used: 16.16 GiB SLOW used= 4.71 GiB WAL used= 252.00 MiB
OSD.379 DB used: 6.78 GiB SLOW used= 1.75 GiB WAL used= 540.00 MiB
OSD.407 DB used: 16.47 GiB SLOW used= 141.00 MiB WAL used= 252.00 MiB
OSD.393 DB used: 15.59 GiB SLOW used= 4.19 GiB WAL used= 264.00 MiB
OSD.399 DB used: 15.28 GiB SLOW used= 4.14 GiB WAL used= 252.00 MiB
OSD.381 DB used: 15.47 GiB SLOW used= 225.00 MiB WAL used= 580.00 MiB
OSD.405 DB used: 17.07 GiB SLOW used= 6.09 GiB WAL used= 858.00 MiB
OSD.398 DB used: 14.86 GiB SLOW used= 166.00 MiB WAL used= 252.00 MiB
OSD.383 DB used: 15.12 GiB SLOW used= 78.00 MiB WAL used= 253.00 MiB
OSD.402 DB used: 18.08 GiB SLOW used= 6.04 GiB WAL used= 280.00 MiB
OSD.391 DB used: 15.78 GiB SLOW used= 3.49 GiB WAL used= 256.00 MiB
OSD.389 DB used: 9.13 GiB SLOW used= 4.58 GiB WAL used= 256.00 MiB
OSD.376 DB used: 16.92 GiB SLOW used= 1.33 GiB WAL used= 2.62 GiB
OSD.388 DB used: 15.47 GiB SLOW used= 141.00 MiB WAL used= 248.00 MiB
OSD.394 DB used: 7.74 GiB SLOW used= 4.45 GiB WAL used= 272.00 MiB
OSD.380 DB used: 15.82 GiB SLOW used= 79.00 MiB WAL used= 252.00 MiB
OSD.390 DB used: 10.88 GiB SLOW used= 5.50 GiB WAL used= 256.00 MiB
OSD.397 DB used: 8.31 GiB SLOW used= 3.73 GiB WAL used= 442.00 MiB
OSD.406 DB used: 16.69 GiB SLOW used= 195.00 MiB WAL used= 311.00 MiB
OSD.400 DB used: 16.05 GiB SLOW used= 145.00 MiB WAL used= 256.00 MiB
OSD.378 DB used: 15.22 GiB SLOW used= 152.00 MiB WAL used= 445.00 MiB
OSD.387 DB used: 13.03 GiB SLOW used= 0 Bytes WAL used= 262.00 MiB
SUM DB used: 475.00 GiB SUM SLOW used= 76.55 GiB SUM WAL used= 12.64 GiB

IMHO compaction is only a short-term resolution. I think the real problem is in how RocksDB stores data on disk. For example, if you delete some data, it is removed from the DB, but the DB still occupies the same space on disk, and only a compaction (when it is triggered) will free it. Maybe there is a method that tries to optimize the used space periodically.
Currently compaction is started by the user or by RocksDB when a level's ratio (score) goes above 1.0.
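For the "started by the user" part, one way to trigger a compaction by hand is via the OSD admin socket; a sketch (assuming the "compact" admin socket command is available on the release in use, and keeping in mind that compaction temporarily needs extra space, as noted in #6):

# Sketch: manually trigger a key-value store compaction on one running OSD.
import subprocess

osd = 'osd.379'  # example OSD from this comment
# Ask the OSD to compact its RocksDB; expect extra db-device usage (and possibly
# temporary spillover) while the compaction runs.
subprocess.check_call(['ceph', 'daemon', osd, 'compact'])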

#11 Updated by Konstantin Shalygin 8 months ago

Rafal, that is not your case! Your spillover is because your db is smaller than 30 GB. Please consult http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

#12 Updated by Rafal Wadolowski 8 months ago

Konstantin Shalygin wrote:

Rafal, that is not your case! Your spillover is because your db is smaller than 30 GB. Please consult http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

This problem is related. The data should live only on the DB device as long as there is free space there. There is an interesting split:

osd-379
{
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "": "----------------------------------------------------------------------------------------------------------------------------------------------------------",
    "": "  L0      2/0   56.73 MB   0.2      0.0     0.0      0.0       0.1      0.1       0.0   1.0      0.0     73.4         1         2    0.386       0      0",
    "": "  L3    135/0    8.47 GB   0.1      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0",
    "": " Sum    137/0    8.53 GB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.0      0.0     73.4         1         2    0.386       0      0",
    "": " Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.0      0.0     73.4         1         2    0.386       0      0",
    "": "Uptime(secs): 136650.4 total, 136650.4 interval",
    "": "Flush(GB): cumulative 0.055, interval 0.055",
    "": "AddFile(GB): cumulative 0.000, interval 0.000",
    "": "AddFile(Total Files): cumulative 0, interval 0",
    "": "AddFile(L0 Files): cumulative 0, interval 0",
    "": "AddFile(Keys): cumulative 0, interval 0",
    "": "Cumulative compaction: 0.06 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.8 seconds",
    "": "Interval compaction: 0.06 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.8 seconds",
    "": "Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count",
    "": "",
    "": "** File Read Latency Histogram By Level [default] **",
    "": "",
    "": "** DB Stats **",
    "": "Uptime(secs): 136650.4 total, 136650.4 interval",
    "": "Cumulative writes: 259K writes, 794K keys, 259K commit groups, 1.0 writes per commit group, ingest: 0.39 GB, 0.00 MB/s",
    "": "Cumulative WAL: 259K writes, 129K syncs, 2.00 writes per sync, written: 0.39 GB, 0.00 MB/s",
    "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
    "": "Interval writes: 259K writes, 794K keys, 259K commit groups, 1.0 writes per commit group, ingest: 395.78 MB, 0.00 MB/s",
    "": "Interval WAL: 259K writes, 129K syncs, 2.00 writes per sync, written: 0.39 MB, 0.00 MB/s",
    "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent" 
}

"You spillover is because your db is lower than 30Gb", why 30Gb are problem in my case?

According to your screen, we noticed this is normal effect and twice compacting cluster are clearing slow.

#13 Updated by Konstantin Shalygin 8 months ago

why is 30 GB a problem in my case?

Because of compaction levels: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

#14 Updated by Rafal Wadolowski 8 months ago

Konstantin, okay, but those are the default settings described in the documentation. We have

bluestore_rocksdb_options = "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1610612736,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8" 

so L0 is 1.5 GB, L1 is 1.5 GB, L2 is 15 GB; L0+L1+L2 = 18 GB.
In https://github.com/ceph/ceph/pull/22025 I proposed an option for RocksDB that will help osd.379, but I'm not sure that change will cover all spillover cases.
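As a cross-check on those figures, a sketch deriving them from the options quoted above (using the common rule of thumb that L0 holds roughly write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger before compaction kicks in):

# Sketch: rough per-level capacity implied by the bluestore_rocksdb_options above.
GB = 1024.0 ** 3
write_buffer_size = 67108864                # 64 MB
min_write_buffer_number_to_merge = 3
level0_file_num_compaction_trigger = 8
max_bytes_for_level_base = 1610612736       # 1.5 GB, the L1 target
max_bytes_for_level_multiplier = 10

l0 = write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger
l1 = max_bytes_for_level_base
l2 = l1 * max_bytes_for_level_multiplier
print('L0 ~%.1f GB, L1 ~%.1f GB, L2 ~%.1f GB, total ~%.1f GB' %
      (l0 / GB, l1 / GB, l2 / GB, (l0 + l1 + l2) / GB))
# -> L0 ~1.5 GB, L1 ~1.5 GB, L2 ~15.0 GB, total ~18.0 GB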

#15 Updated by Konstantin Shalygin 8 months ago

256 MB + 2.56 GB + 25.6 GB = ~28-29 GB for the default Luminous options.

#16 Updated by Xiaoxi Chen 6 months ago

BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD
osd.248 spilled over 257 MiB metadata from 'db' device (3.2 GiB used of 20 GiB) to slow device
osd.266 spilled over 264 MiB metadata from 'db' device (3.4 GiB used of 20 GiB) to slow device
osd.283 spilled over 330 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
osd.294 spilled over 594 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
osd.320 spilled over 279 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device
osd.371 spilled over 264 MiB metadata from 'db' device (3.4 GiB used of 20 GiB) to slow device
osd.391 spilled over 264 MiB metadata from 'db' device (3.3 GiB used of 20 GiB) to slow device

One more instance, even weirder... I have no idea how I can be using only 3.3 GB and already spilling over.

#17 Updated by Josh Durgin 6 months ago

  • Assignee set to Adam Kupczyk

Adam's looking at similar spillover issues

#18 Updated by Brett Chancellor 5 months ago

This is also showing up in 14.2.1 in instances where the db is over-provisioned.

HEALTH_WARN BlueFS spillover detected on 3 OSD
BLUEFS_SPILLOVER BlueFS spillover detected on 3 OSD
osd.52 spilled over 804 MiB metadata from 'db' device (29 GiB used of 148 GiB) to slow device
osd.221 spilled over 3.9 GiB metadata from 'db' device (29 GiB used of 148 GiB) to slow device
osd.245 spilled over 1.7 GiB metadata from 'db' device (28 GiB used of 148 GiB) to slow device

#19 Updated by Sage Weil 4 months ago

  • Priority changed from High to Normal

#20 Updated by Sage Weil 4 months ago

  • Status changed from Verified to In Progress

#21 Updated by Dan van der Ster 2 months ago

Us too, on HDD-only OSDs (no dedicated block.db or WAL):

BLUEFS_SPILLOVER BlueFS spillover detected on 94 OSD
osd.98 spilled over 1.1 GiB metadata from 'db' device (147 MiB used of 26 GiB) to slow device
osd.135 spilled over 1.3 GiB metadata from 'db' device (183 MiB used of 26 GiB) to slow device
...

#22 Updated by Igor Fedotov 2 months ago

@Dan - this sounds weird - spillover without a dedicated db... Could you please share the 'ceph osd metadata' output?

#23 Updated by Rafal Wadolowski 2 months ago

@Adam, is there any news about this problem? We have ~1500 OSDs with spillover.
If you need more data, feel free to contact me :)

@Dan, do you have the default configuration for RocksDB?

#24 Updated by Dan van der Ster 2 months ago

@Igor, @Rafal: please ignore. I was confused about the configuration of this cluster. It indeed has 26GB rocksdb's, so the spillover makes perfect sense.

#25 Updated by Marcin W 24 days ago

Due to spillover, I'm trying to optimize the RocksDB options based on the data partition size (roughly 9 TB in the example below). This will help estimate how much space is needed for a new DB partition on NVMe and adjust the parameters accordingly.
Would this calculation make sense, and would it prevent spillover?

    def calculate_rocksdb_levels(self, base, multiplier, levels):
        # default Ceph setting:
        # base=256, multiplier=10, levels=5

        level_sizes = [ 0, base ]

        for level in range(2, levels+1):
            # L(n+1) = (Ln) * max_bytes_for_level_multiplier
            level_prev      = level - 1
            level_size_prev = level_sizes[level_prev]
            level_size      = level_size_prev * multiplier
            level_sizes.append(level_size)
        level_sizes_all = int((base * (1 - (multiplier**levels))) / (1-multiplier))
        # log.debug('level_sizes_all=%s, level_sizes=%s', level_sizes_all, level_sizes)
        return level_sizes_all, level_sizes

    def calculate_rocksdb_new_size(self, partition_db_size):
        # Add 10% of extra space for compaction and others. Enough?
        partition_db_size_w_spare = int(partition_db_size - (partition_db_size / 10))
        nearest_size        = 0
        nearest_settings    = {}

        # Only subset of base, multiplier and levels is taken into consideration here
        # Brute force geometric progression calculations to find
        # total space of all DB levels to be as close to partition_db_size_w_spare as possible
        for base in [ 64, 96, 128, 192, 256 ]:
            for multiplier in range(3, 10+1):
                for levels in range(4, 6+1):
                    # log.debug('base=%s, multiplier=%s, levels=%s', base, multiplier, levels)
                    level_sizes_all, level_sizes = self.calculate_rocksdb_levels(base, multiplier, levels)
                    if level_sizes_all < partition_db_size_w_spare:
                        if level_sizes_all > nearest_size:
                            nearest_size        = level_sizes_all
                            nearest_settings    = {
                                'base':         base,
                                'multiplier':   multiplier,
                                'levels':       levels,
                                'level_sizes':  level_sizes[1:],
                            }

        nearest_db_size = int(nearest_size + (nearest_size / 10))
        base            = nearest_settings['base']
        multiplier      = nearest_settings['multiplier']
        levels          = nearest_settings['levels']
        mem_buffers     = int(base/16)      # base / write_buffer_size

        db_settings = {
            'compaction_readahead_size':            '2MB',
            'compaction_style':                     'kCompactionStyleLevel',
            'compaction_threads':                   '%s' % (mem_buffers * 2),
            'compression':                          'kNoCompression',
            'flusher_threads':                      '8',
            'level0_file_num_compaction_trigger':   '%s' % int(mem_buffers / 2),
            'level0_slowdown_writes_trigger':       '%s' % (mem_buffers + 8),
            'level0_stop_writes_trigger':           '%s' % (mem_buffers + 16),
            'max_background_compactions':           '%s' % (mem_buffers * 2),
            'max_bytes_for_level_base':             '%sMB' % base,
            'max_bytes_for_level_multiplier':       '%s' % multiplier,
            'max_write_buffer_number':              '%s' % mem_buffers,
            'min_write_buffer_number_to_merge':     '%s' % int(mem_buffers / 2),
            'num_levels':                           '%s' % levels,
            'recycle_log_file_num':                 '2',
            'target_file_size_base':                '16MB',
            'write_buffer_size':                    '16MB',
        }

        db_settings_join = ','.join([ '%s=%s' % (k, db_settings[k]) for k in sorted(db_settings.keys()) ])
        log.debug('partition_db_size=%s, partition_db_size_w_spare=%s, nearest_db_size=%s, %s', partition_db_size, partition_db_size_w_spare, nearest_db_size, db_settings_join)
        log.debug('final partition_db_size=%s, levels=%s, levels_total=%s', nearest_db_size, nearest_settings['level_sizes'], nearest_size)
        return nearest_db_size, db_settings_join

    def calc_db_size_for_block(self, data_size):
        # Assume DB size is about 5% of data partition
        db_size         = int(int(int((data_size * 5 / 100) + 1) / 2) * 2)
        log.debug('data_size=%s, db_size_free=%s', data_size, db_size)

        return self.calculate_rocksdb_new_size(db_size)

/dev/sdc: for data, partition size: 9537532 MB, 5% = partition_db_size(476876 MB)
nearest_db_size=437888 MB, compaction_readahead_size=2MB,compaction_style=kCompactionStyleLevel,compaction_threads=32,compression=kNoCompression,flusher_threads=8,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=24,level0_stop_writes_trigger=32,max_background_compactions=32,max_bytes_for_level_base=256MB,max_bytes_for_level_multiplier=6,max_write_buffer_number=16,min_write_buffer_number_to_merge=8,num_levels=5,recycle_log_file_num=2,target_file_size_base=16MB,write_buffer_size=16MB
final partition_db_size=437888, levels=[256, 1536, 9216, 55296, 331776], levels_total=398080
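As a quick sanity check on those numbers (a sketch using only the figures from the output above):

# Sketch: verify the /dev/sdc result above (base=256 MB, multiplier=6, 5 levels).
levels_mb = [256, 1536, 9216, 55296, 331776]
total = sum(levels_mb)
print(total)                    # 398080 MB = levels_total
print(int(total + total / 10))  # 437888 MB = nearest_db_size (total plus ~10% spare)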

At this point, LVM will create a 437888 MB db_lv, and ceph.conf needs an extra section:

[osd.X]
bluestore_rocksdb_options = {{ db_settings_join }}
