Backport #24375

Updated by Kefu Chai about 1 year ago

https://github.com/ceph/ceph/pull/22361

In RocksDB, "max_bytes_for_level_base" defaults to 256MB and "max_bytes_for_level_multiplier" defaults to 10. With these settings, the size limit of each level of a RocksDB instance looks like:

# L0: files flushed from the memtables (limited by file count, not by max_bytes)
# L1: 256MB
# L2: 2.56 GB
# L3: 25.6 GB
# L4: 256 GB
# L5: 2.56 TB
# L6: 25.6 TB
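The limits above follow from multiplying the base size by the multiplier once per level. A minimal sketch of that arithmetic, assuming sizes are expressed in MB and using a hypothetical helper name (`level_limits` is not a RocksDB or Ceph API):

```python
def level_limits(base_mb, multiplier, num_levels=7):
    """Return {level: size limit in MB} for L1 .. L(num_levels - 1).

    L0 is skipped because it is not governed by max_bytes_for_level_base.
    Hypothetical helper for illustration only.
    """
    return {n: base_mb * multiplier ** (n - 1) for n in range(1, num_levels)}


limits = level_limits(base_mb=256, multiplier=10)
for level, size_mb in limits.items():
    print(f"L{level}: {size_mb} MB")
```

With the defaults quoted above, L2 comes out at 2560 MB (2.56 GB) and L6 at 25,600,000 MB (25.6 TB).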

For the monitor, 2.56 GB is relatively large even for a large cluster. For the OSD, depending on the workload, I'd say 2.56 GB is also quite large for omap, even taking the RGW load into consideration.

In the case of the monitor, if the cluster has been running for a long time in a large-scale deployment, there is a chance that old and stale data has been migrated to a deeper level (L3 or beyond). When new K/V data comes in, it is written to the lower levels, such as L0 and L1. For example:

# L1: 250MB
# L2: 2 GB
# L3: 25 GB
# L4: 25 GB // stale data. non-user data

Then we suffer from "space amplification", because the space amplification is (25 + 2 + 0.25 + 25) / (25 + 2 + 0.25) ≈ 1.92.
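The ratio can be checked with a few lines of Python. This is only a sketch of the arithmetic in the example; the function name and the split into "live" and "stale" sizes are illustrative assumptions, not RocksDB terminology:

```python
def space_amplification(live_gb, stale_gb):
    """Ratio of total on-disk size to the size of the live data.

    live_gb:  per-level sizes (in GB) holding current data
    stale_gb: size (in GB) of the stale data stuck in a deep level
    Hypothetical helper for illustration only.
    """
    live = sum(live_gb)
    return (live + stale_gb) / live


# L1 = 0.25 GB, L2 = 2 GB, L3 = 25 GB of live data; L4 = 25 GB of stale data
amp = space_amplification([0.25, 2, 25], stale_gb=25)
print(f"{amp:.3f}")  # 52.25 / 27.25, prints 1.917
```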

Auto compaction does not help in this case, because none of the level sizes exceeds its max_bytes limit. So a more flexible approach is to enable dynamic level sizes for compaction.
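As a rough illustration of how dynamic level sizing (RocksDB's "level_compaction_dynamic_level_bytes" option) is described in [0] and [1]: the target of the last level is its actual size, and each level above it gets the next level's target divided by the multiplier, stopping once a target would fall below base / multiplier. The sketch below is a simplified model under those assumptions, not the RocksDB implementation; the function name and the use of MB are hypothetical.

```python
def dynamic_level_targets(last_level_mb, base_mb, multiplier, num_levels=7):
    """Simplified model of dynamic level targets, computed bottom-up.

    Returns {level: target size in MB}; levels that stay unused get 0.
    Hypothetical helper for illustration only.
    """
    targets = {n: 0 for n in range(1, num_levels)}
    size = last_level_mb
    for n in range(num_levels - 1, 0, -1):
        targets[n] = size
        size = size / multiplier
        # Stop once the next target would drop below base / multiplier;
        # the remaining upper levels are left empty.
        if size < base_mb / multiplier:
            break
    return targets


# 25 GB of data in the last level, with the default base and multiplier
targets = dynamic_level_targets(last_level_mb=25000, base_mb=256, multiplier=10)
for level, size_mb in targets.items():
    print(f"L{level}: {size_mb} MB")
```

In this model the targets are derived from the data actually stored in the last level, so the levels above it are sized at 2.5 GB and 250 MB instead of following the fixed 256MB / 2.56 GB / 25.6 GB ladder.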

----
[0] https://rocksdb.org/blog/2015/07/23/dynamic-level.html
[1] https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
