Bug #61953 (Closed): bluestore slow latency report when rocksdb compaction
Description
2023-07-11T06:30:19.492+0000 7f692d8c4700 4 rocksdb: [compaction/compaction_job.cc:1676] [P] [JOB 37] Compacting 2@0 + 1@1 files to L1, score 1.19
2023-07-11T06:30:19.492+0000 7f692d8c4700 4 rocksdb: [compaction/compaction_job.cc:1680] [P] Compaction start summary: Base version 95 Base level 0, inputs: [25842(30MB) 25824(30MB)], [25822(52MB)]
2023-07-11T06:30:19.492+0000 7f692d8c4700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1689057019492695, "job": 37, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [25842, 25824], "files_L1": [25822], "score": 1.18745, "input_data_size": 119089362}
2023-07-11T06:30:19.492+0000 7f692d0c3700 4 rocksdb: [db_impl/db_impl_files.cc:353] [JOB 36] Try to delete WAL files size 275931261, prev total WAL file size 744269564, number of live WAL files 4.
2023-07-11T06:30:19.967+0000 7f6919035700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for submit_transact, latency = 0.033606321s
2023-07-11T06:30:19.967+0000 7f691502d700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for submit_transact, latency = 0.031010913s
2023-07-11T06:30:19.967+0000 7f692beb2700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 0.034002703s, txc = 0x56193fab0700
2023-07-11T06:30:19.967+0000 7f691382a700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for submit_transact, latency = 0.036432911s
2023-07-11T06:30:19.968+0000 7f692beb2700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 0.032162540s, txc = 0x5618f1cee380
2023-07-11T06:30:19.968+0000 7f692beb2700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 0.037519719s, txc = 0x5617328bca80
2023-07-11T06:30:20.138+0000 7f692d8c4700 4 rocksdb: [compaction/compaction_job.cc:1349] [P] [JOB 37] Generated table #25843: 278981 keys, 55386292 bytes
Updated by Igor Fedotov 10 months ago
What disk is used for DB - SSD or HDD? Which model?
Updated by yite gu 10 months ago
Igor Fedotov wrote:
What disk is used for DB - SSD or HDD? Which model?
# ceph device ls
DEVICE                                  HOST:DEV        DAEMONS  WEAR  LIFE EXPECTANCY
INTEL_SSDPF2KX076T1_BTAX226502JE7P6DGN  ceph01:nvme6n1  osd.4    0%
INTEL_SSDPF2KX076T1_BTAX226502YL7P6DGN  ceph01:nvme5n1  osd.1    0%
INTEL_SSDPF2KX076T1_BTAX226502Z47P6DGN  ceph01:nvme2n1  osd.6    0%
INTEL_SSDPF2KX076T1_BTAX226503QS7P6DGN  ceph01:nvme8n1  osd.7    0%
INTEL_SSDPF2KX076T1_BTAX226503V97P6DGN  ceph01:nvme3n1  osd.0    0%
INTEL_SSDPF2KX076T1_BTAX226504XR7P6DGN  ceph01:nvme1n1  osd.5    0%
INTEL_SSDPF2KX076T1_BTAX2265054M7P6DGN  ceph01:nvme7n1  osd.2    0%
INTEL_SSDPF2KX076T1_BTAX226505877P6DGN  ceph01:nvme4n1  osd.3    0%
# smartctl -i /dev/nvme6n1
smartctl 6.6 (build date Nov 11 2020) [x86_64-linux-5.4.56.bsk.9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPF2KX076T1
Serial Number:                      BTAX226502JE7P6DGN
Firmware Version:                   9CV10200
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
Number of Namespaces:               128
Namespace 1 Size/Capacity:          7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            5cd2e4 5e02e00500
Local Time is:                      Tue Jul 11 16:57:04 2023 CST
Updated by Igor Fedotov 10 months ago
Do you use custom settings for bluestore_log_op_age? It's equal to 5 seconds by default but your warnings pop up after ~0.03 sec delay. Which makes me think your setting is too low.
0.03 sec latency for a few ops during DB compaction looks acceptable to me.
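The warning fires whenever an individual operation's latency exceeds the bluestore_log_op_age threshold. A minimal sketch of that comparison (a simplified stand-in for BlueStore's log_latency check, not the actual implementation; the function name is hypothetical), using the latencies from the log above:

```python
# Simplified sketch: an operation is reported as "slow" when its latency
# exceeds the configured bluestore_log_op_age threshold (default 5.0 s;
# the reporter lowered it to 0.02 s).
def slow_ops(latencies_s, bluestore_log_op_age=5.0):
    """Return the latencies that would trigger a 'slow operation' warning."""
    return [lat for lat in latencies_s if lat > bluestore_log_op_age]

observed = [0.0336, 0.0310, 0.0364]    # submit_transact latencies from the log
print(slow_ops(observed))              # default 5 s threshold: no warnings -> []
print(slow_ops(observed, 0.02))        # custom 0.02 s threshold: all three warn
```

With the default 5 s threshold, none of these ~0.03 s operations would have been logged at all, which is the point Igor is making.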
Updated by yite gu 10 months ago
Igor Fedotov wrote:
Do you use custom settings for bluestore_log_op_age? It's equal to 5 seconds by default but your warnings pop up after ~0.03 sec delay. Which makes me think your setting is too low.
0.03 sec latency for a few ops during DB compaction looks acceptable to me.
Yes, I set it to a custom value of 0.02. When these latency spikes occur, RBD latency exceeds 100 ms. If this is expected behavior, is there a way to reduce the amplitude of the spikes? Alternatively, I should consider expanding the cluster to improve performance.
Updated by Mark Nelson 10 months ago
I tend to agree with Igor. We are unlikely to see consistent low latency from RocksDB during compaction events on the read path. We may or may not see low write latency depending on the ability of the WAL to absorb writes without triggering throttling during L0 compaction.
One strategy for avoiding some of this is to ensure that there is plenty of memory for onode cache to avoid onode misses to the maximum extent possible. We have pretty much a mountain of evidence at this point that we really want to avoid onode misses whenever possible. While it doesn't solve the specific issue raised here, it would be worth looking at "ceph tell osd.<num> perf dump" and seeing what the onode cache hit/miss counters look like. If they are below 90%, it may be worth increasing the osd_memory_target.
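The hit-ratio check Mark describes can be sketched from the perf dump JSON; a minimal example (the input here is a hypothetical fragment of `ceph tell osd.<num> perf dump` output, with counter names as they appear later in this thread):

```python
import json

# Hypothetical fragment of `ceph tell osd.<num> perf dump` output; on a
# live cluster you would load the real JSON instead of this literal.
dump = json.loads("""
{"bluestore": {"bluestore_onode_hits": 75784717,
               "bluestore_onode_misses": 1511906}}
""")

bs = dump["bluestore"]
hits, misses = bs["bluestore_onode_hits"], bs["bluestore_onode_misses"]
ratio = hits / (hits + misses)
print(f"onode cache hit ratio: {ratio:.2%}")  # -> 98.04%
if ratio < 0.90:
    print("hit ratio below 90% -- consider raising osd_memory_target")
```

With the counters quoted below, the ratio is about 98%, above Mark's suggested 90% threshold.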
Updated by yite gu 10 months ago
Mark Nelson wrote:
I tend to agree with Igor. We are unlikely to see consistent low latency from RocksDB during compaction events on the read path. We may or may not see low write latency depending on the ability of the WAL to absorb writes without triggering throttling during L0 compaction.
One strategy for avoiding some of this is to ensure that there is plenty of memory for onode cache to avoid onode misses to the maximum extent possible. We have pretty much a mountain of evidence at this point that we really want to avoid onode misses whenever possible. While it doesn't solve the specific issue raised here, it would be worth looking at "ceph tell osd.<num> perf dump" and seeing what the onode cache hit/miss counters look like. If they are below 90%, it may be worth increasing the osd_memory_target.
Thanks for your reply. First, here is the information you mentioned:
$ ceph tell osd.6 config get osd_memory_target
{
    "osd_memory_target": "13743895347"
}
$ ceph tell osd.6 heap stats
osd.6 tcmalloc heap stats:------------------------------------------------
MALLOC:    11929995344 (11377.3 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +   1169686704 ( 1115.5 MiB) Bytes in central cache freelist
MALLOC: +      2769408 (    2.6 MiB) Bytes in transfer cache freelist
MALLOC: +     35075328 (   33.5 MiB) Bytes in thread cache freelists
MALLOC: +     64618496 (   61.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  13202145280 (12590.5 MiB) Actual memory used (physical + swap)
MALLOC: +   2997968896 ( 2859.1 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  16200114176 (15449.6 MiB) Virtual address space used
MALLOC:
MALLOC:         822370 Spans in use
MALLOC:             52 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
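As a quick sanity check (plain arithmetic on the numbers above, not a Ceph API), the heap stats can be compared against the configured osd_memory_target:

```python
# Compare the tcmalloc "Actual memory used" figure against the configured
# osd_memory_target (both values taken from the outputs above).
osd_memory_target = 13_743_895_347   # bytes, from `config get osd_memory_target`
actual_used = 13_202_145_280         # bytes, "Actual memory used" in heap stats

print(f"target:   {osd_memory_target / 2**20:.1f} MiB")   # -> 13107.2 MiB
print(f"used:     {actual_used / 2**20:.1f} MiB")         # -> 12590.5 MiB
print(f"headroom: {(osd_memory_target - actual_used) / 2**20:.1f} MiB")  # -> 516.7 MiB
```

So the OSD is running close to, but within, its memory target.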
$ ceph tell osd.6 perf dump | grep bluestore_onode
    "bluestore_onodes": 352910,
    "bluestore_onode_hits": 75784717,
    "bluestore_onode_misses": 1511906,
    "bluestore_onode_shard_hits": 48889504,
    "bluestore_onode_shard_misses": 275548,
    "bluestore_onode_reshard": 27165,
75784717.0 / (75784717 + 1511906) = 0.9804, i.e. a ~98% onode cache hit ratio.
Updated by yite gu 10 months ago
Mark Nelson wrote:
I tend to agree with Igor. We are unlikely to see consistent low latency from RocksDB during compaction events on the read path. We may or may not see low write latency depending on the ability of the WAL to absorb writes without triggering throttling during L0 compaction.
One strategy for avoiding some of this is to ensure that there is plenty of memory for onode cache to avoid onode misses to the maximum extent possible. We have pretty much a mountain of evidence at this point that we really want to avoid onode misses whenever possible. While it doesn't solve the specific issue raised here, it would be worth looking at "ceph tell osd.<num> perf dump" and seeing what the onode cache hit/miss counters look like. If they are below 90%, it may be worth increasing the osd_memory_target.
I think we should have a metric to notify the admin when the DB performs auto compaction.
Updated by Adam Kupczyk 8 months ago
- Status changed from New to Won't Fix
It's normal behaviour for BlueStore to be a bit slower while RocksDB is compacting.