Bug #61953 (Closed): bluestore slow latency report when rocksdb compaction
Description
2023-07-11T06:30:19.492+0000 7f692d8c4700 4 rocksdb: [compaction/compaction_job.cc:1676] [P] [JOB 37] Compacting 2@0 + 1@1 files to L1, score 1.19
2023-07-11T06:30:19.492+0000 7f692d8c4700 4 rocksdb: [compaction/compaction_job.cc:1680] [P] Compaction start summary: Base version 95 Base level 0, inputs: [25842(30MB) 25824(30MB)], [25822(52MB)]
2023-07-11T06:30:19.492+0000 7f692d8c4700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1689057019492695, "job": 37, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [25842, 25824], "files_L1": [25822], "score": 1.18745, "input_data_size": 119089362}
2023-07-11T06:30:19.492+0000 7f692d0c3700 4 rocksdb: [db_impl/db_impl_files.cc:353] [JOB 36] Try to delete WAL files size 275931261, prev total WAL file size 744269564, number of live WAL files 4.
2023-07-11T06:30:19.967+0000 7f6919035700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for submit_transact, latency = 0.033606321s
2023-07-11T06:30:19.967+0000 7f691502d700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for submit_transact, latency = 0.031010913s
2023-07-11T06:30:19.967+0000 7f692beb2700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 0.034002703s, txc = 0x56193fab0700
2023-07-11T06:30:19.967+0000 7f691382a700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency slow operation observed for submit_transact, latency = 0.036432911s
2023-07-11T06:30:19.968+0000 7f692beb2700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 0.032162540s, txc = 0x5618f1cee380
2023-07-11T06:30:19.968+0000 7f692beb2700 0 bluestore(/var/lib/ceph/osd/ceph-6) log_latency_fn slow operation observed for _txc_committed_kv, latency = 0.037519719s, txc = 0x5617328bca80
2023-07-11T06:30:20.138+0000 7f692d8c4700 4 rocksdb: [compaction/compaction_job.cc:1349] [P] [JOB 37] Generated table #25843: 278981 keys, 55386292 bytes
Updated by Igor Fedotov 10 months ago
What disk is used for DB - SSD or HDD? Which model?
Updated by yite gu 10 months ago
Igor Fedotov wrote:
What disk is used for DB - SSD or HDD? Which model?
# ceph device ls
DEVICE                                  HOST:DEV        DAEMONS  WEAR  LIFE EXPECTANCY
INTEL_SSDPF2KX076T1_BTAX226502JE7P6DGN  ceph01:nvme6n1  osd.4    0%
INTEL_SSDPF2KX076T1_BTAX226502YL7P6DGN  ceph01:nvme5n1  osd.1    0%
INTEL_SSDPF2KX076T1_BTAX226502Z47P6DGN  ceph01:nvme2n1  osd.6    0%
INTEL_SSDPF2KX076T1_BTAX226503QS7P6DGN  ceph01:nvme8n1  osd.7    0%
INTEL_SSDPF2KX076T1_BTAX226503V97P6DGN  ceph01:nvme3n1  osd.0    0%
INTEL_SSDPF2KX076T1_BTAX226504XR7P6DGN  ceph01:nvme1n1  osd.5    0%
INTEL_SSDPF2KX076T1_BTAX2265054M7P6DGN  ceph01:nvme7n1  osd.2    0%
INTEL_SSDPF2KX076T1_BTAX226505877P6DGN  ceph01:nvme4n1  osd.3    0%
# smartctl -i /dev/nvme6n1
smartctl 6.6 (build date Nov 11 2020) [x86_64-linux-5.4.56.bsk.9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPF2KX076T1
Serial Number:                      BTAX226502JE7P6DGN
Firmware Version:                   9CV10200
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
Number of Namespaces:               128
Namespace 1 Size/Capacity:          7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            5cd2e4 5e02e00500
Local Time is:                      Tue Jul 11 16:57:04 2023 CST
Updated by Igor Fedotov 10 months ago
Do you use custom settings for bluestore_log_op_age? It's equal to 5 seconds by default but your warnings pop up after ~0.03 sec delay. Which makes me think your setting is too low.
0.03 sec latency for a few ops during DB compaction looks acceptable to me.
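The warning fires whenever an individual operation's latency exceeds the bluestore_log_op_age threshold. A minimal sketch of that comparison (a simplified stand-in for BlueStore's log_latency check, not the actual implementation; the function name is hypothetical), using the latencies from the log above:

```python
# Simplified sketch: an operation is reported as "slow" when its latency
# exceeds the configured bluestore_log_op_age threshold (default 5.0 s;
# the reporter lowered it to 0.02 s).
def slow_ops(latencies_s, bluestore_log_op_age=5.0):
    """Return the latencies that would trigger a 'slow operation' warning."""
    return [lat for lat in latencies_s if lat > bluestore_log_op_age]

observed = [0.0336, 0.0310, 0.0364]    # submit_transact latencies from the log
print(slow_ops(observed))              # default 5 s threshold: no warnings -> []
print(slow_ops(observed, 0.02))        # custom 0.02 s threshold: all three warn
```

With the default 5 s threshold, none of these ~0.03 s operations would have been logged at all, which is the point Igor is making.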
Updated by yite gu 10 months ago
Igor Fedotov wrote:
Do you use custom settings for bluestore_log_op_age? It's equal to 5 seconds by default but your warnings pop up after ~0.03 sec delay. Which makes me think your setting is too low.
0.03 sec latency for a few ops during DB compaction looks acceptable to me.
Yes, I set it to a custom value of 0.02. When these latency spikes occur, RBD latency exceeds 100 ms. If this is expected behavior, is there a way to reduce the amplitude of the spikes? Alternatively, I should consider expanding the cluster to improve performance.
Updated by Mark Nelson 10 months ago
I tend to agree with Igor. We are unlikely to see consistent low latency from RocksDB during compaction events on the read path. We may or may not see low write latency depending on the ability of the WAL to absorb writes without triggering throttling during L0 compaction.
One strategy for avoiding some of this is to ensure that there is plenty of memory for onode cache to avoid onode misses to the maximum extent possible. We have pretty much a mountain of evidence at this point that we really want to avoid onode misses whenever possible. While it doesn't solve the specific issue raised here, it would be worth looking at "ceph tell osd.<num> perf dump" and seeing what the onode cache hit/miss counters look like. If they are below 90%, it may be worth increasing the osd_memory_target.
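The hit-ratio check Mark describes can be sketched from the perf dump JSON; a minimal example (the input here is a hypothetical fragment of `ceph tell osd.<num> perf dump` output, with counter names as they appear later in this thread):

```python
import json

# Hypothetical fragment of `ceph tell osd.<num> perf dump` output; on a
# live cluster you would load the real JSON instead of this literal.
dump = json.loads("""
{"bluestore": {"bluestore_onode_hits": 75784717,
               "bluestore_onode_misses": 1511906}}
""")

bs = dump["bluestore"]
hits, misses = bs["bluestore_onode_hits"], bs["bluestore_onode_misses"]
ratio = hits / (hits + misses)
print(f"onode cache hit ratio: {ratio:.2%}")  # -> 98.04%
if ratio < 0.90:
    print("hit ratio below 90% -- consider raising osd_memory_target")
```

With the counters quoted below, the ratio is about 98%, above Mark's suggested 90% threshold.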
Updated by yite gu 10 months ago
Mark Nelson wrote:
I tend to agree with Igor. We are unlikely to see consistent low latency from RocksDB during compaction events on the read path. We may or may not see low write latency depending on the ability of the WAL to absorb writes without triggering throttling during L0 compaction.
One strategy for avoiding some of this is to ensure that there is plenty of memory for onode cache to avoid onode misses to the maximum extent possible. We have pretty much a mountain of evidence at this point that we really want to avoid onode misses whenever possible. While it doesn't solve the specific issue raised here, it would be worth looking at "ceph tell osd.<num> perf dump" and seeing what the onode cache hit/miss counters look like. If they are below 90%, it may be worth increasing the osd_memory_target.
Thanks for your reply. First, here is the information you mentioned:
$ ceph tell osd.6 config get osd_memory_target
{
    "osd_memory_target": "13743895347"
}
$ ceph tell osd.6 heap stats
osd.6 tcmalloc heap stats:------------------------------------------------
MALLOC:    11929995344 (11377.3 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +   1169686704 ( 1115.5 MiB) Bytes in central cache freelist
MALLOC: +      2769408 (    2.6 MiB) Bytes in transfer cache freelist
MALLOC: +     35075328 (   33.5 MiB) Bytes in thread cache freelists
MALLOC: +     64618496 (   61.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  13202145280 (12590.5 MiB) Actual memory used (physical + swap)
MALLOC: +   2997968896 ( 2859.1 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  16200114176 (15449.6 MiB) Virtual address space used
MALLOC:
MALLOC:         822370 Spans in use
MALLOC:             52 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
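As a quick sanity check (plain arithmetic on the numbers above, not a Ceph API), the heap stats can be compared against the configured osd_memory_target:

```python
# Compare the tcmalloc "Actual memory used" figure against the configured
# osd_memory_target (both values taken from the outputs above).
osd_memory_target = 13_743_895_347   # bytes, from `config get osd_memory_target`
actual_used = 13_202_145_280         # bytes, "Actual memory used" in heap stats

print(f"target:   {osd_memory_target / 2**20:.1f} MiB")   # -> 13107.2 MiB
print(f"used:     {actual_used / 2**20:.1f} MiB")         # -> 12590.5 MiB
print(f"headroom: {(osd_memory_target - actual_used) / 2**20:.1f} MiB")  # -> 516.7 MiB
```

So the OSD is running close to, but within, its memory target.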
$ ceph tell osd.6 perf dump | grep bluestore_onode
    "bluestore_onodes": 352910,
    "bluestore_onode_hits": 75784717,
    "bluestore_onode_misses": 1511906,
    "bluestore_onode_shard_hits": 48889504,
    "bluestore_onode_shard_misses": 275548,
    "bluestore_onode_reshard": 27165,
75784717.0 / (75784717 + 1511906) = 0.9804, i.e. a ~98% onode cache hit ratio.
Updated by yite gu 10 months ago
Mark Nelson wrote:
I tend to agree with Igor. We are unlikely to see consistent low latency from RocksDB during compaction events on the read path. We may or may not see low write latency depending on the ability of the WAL to absorb writes without triggering throttling during L0 compaction.
One strategy for avoiding some of this is to ensure that there is plenty of memory for onode cache to avoid onode misses to the maximum extent possible. We have pretty much a mountain of evidence at this point that we really want to avoid onode misses whenever possible. While it doesn't solve the specific issue raised here, it would be worth looking at "ceph tell osd.<num> perf dump" and seeing what the onode cache hit/miss counters look like. If they are below 90%, it may be worth increasing the osd_memory_target.
I think we should have a metric to notify the admin when the DB performs auto compaction.
Updated by Adam Kupczyk 8 months ago
- Status changed from New to Won't Fix
It's normal behaviour for BlueStore to be a bit slower while RocksDB is compacting.