Bug #48729
Closed: BlueStore memory leak on scrub operations
Description
We observed unbounded RAM growth in OSD processes.
During our investigation with Valgrind, we gathered the following information about the memory leak:
==3288134== 1,482,956,416 bytes in 6,390,722 blocks are possibly lost in loss record 4,882 of 4,883
==3288134==    at 0xABF4E76: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3288134==    by 0xABF4FE1: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3288134==    by 0x10DA5CA: ceph::buffer::v14_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int) (buffer.cc:121)
==3288134==    by 0x10DA7A1: ceph::buffer::v14_2_0::create_aligned(unsigned int, unsigned int) (buffer.cc:417)
==3288134==    by 0x10DB15B: ceph::buffer::v14_2_0::copy(char const*, unsigned int) (buffer.cc:365)
==3288134==    by 0xD3858E: BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&) (buffer.h:233)
==3288134==    by 0xD3BB84: BlueStore::Collection::get_onode(ghobject_t const&, bool) (BlueStore.cc:3467)
==3288134==    by 0xD566DE: BlueStore::stat(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, stat*, bool) (BlueStore.cc:8908)
==3288134==    by 0xAFCBA7: PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&) (PGBackend.cc:586)
==3288134==    by 0x98CD1A: PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&) (PG.cc:5047)
==3288134==    by 0x9BCADA: PG::chunky_scrub(ThreadPool::TPHandle&) (PG.cc:5673)
==3288134==    by 0x9BDF5E: PG::scrub(unsigned int, ThreadPool::TPHandle&) (PG.cc:5237)
After some time, the OSD process is killed by the OOM killer.
Updated by Igor Fedotov over 3 years ago
Wondering if you can try a patch from https://tracker.ceph.com/issues/46027 and check whether it's helpful in your case...
Updated by Rafal Wadolowski over 3 years ago
I will put this on a test environment and see.
Updated by Rafal Wadolowski over 3 years ago
Unfortunately, that's not the case. After 4 days, some of the OSDs consumed more than 10 GB of RAM.
For example:
{
  "mempool": {
    "by_pool": {
      "bloom_filter": { "items": 0, "bytes": 0 },
      "bluestore_alloc": { "items": 618237, "bytes": 49458960 },
      "bluestore_cache_data": { "items": 2706603, "bytes": 256537796 },
      "bluestore_cache_onode": { "items": 2707283, "bytes": 1775977648 },
      "bluestore_cache_meta": { "items": 288856905, "bytes": 1867155824 },
      "bluestore_cache_other": { "items": 17828824, "bytes": 302498676 },
      "bluestore_Buffer": { "items": 286, "bytes": 38896 },
      "bluestore_Extent": { "items": 17758438, "bytes": 852405024 },
      "bluestore_Blob": { "items": 17758439, "bytes": 1846877656 },
      "bluestore_SharedBlob": { "items": 17758439, "bytes": 1988945168 },
      "bluestore_inline_bl": { "items": 75, "bytes": 9822 },
      "bluestore_fsck": { "items": 0, "bytes": 0 },
      "bluestore_txc": { "items": 32, "bytes": 23296 },
      "bluestore_writing_deferred": { "items": 475, "bytes": 2434417 },
      "bluestore_writing": { "items": 0, "bytes": 0 },
      "bluefs": { "items": 957, "bytes": 27936 },
      "bluefs_file_reader": { "items": 100, "bytes": 749944 },
      "bluefs_file_writer": { "items": 4, "bytes": 1216 },
      "buffer_anon": { "items": 1202, "bytes": 4023826 },
      "buffer_meta": { "items": 1159, "bytes": 101992 },
      "osd": { "items": 172, "bytes": 2238752 },
      "osd_mapbl": { "items": 0, "bytes": 0 },
      "osd_pglog": { "items": 873105, "bytes": 229538616 },
      "osdmap": { "items": 20191, "bytes": 1143144 },
      "osdmap_mapping": { "items": 0, "bytes": 0 },
      "pgmap": { "items": 0, "bytes": 0 },
      "mds_co": { "items": 0, "bytes": 0 },
      "unittest_1": { "items": 0, "bytes": 0 },
      "unittest_2": { "items": 0, "bytes": 0 }
    },
    "total": { "items": 366890926, "bytes": 9180188609 }
  }
}
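To spot the dominant consumers in a dump like the one above, the JSON can be summarized programmatically. A minimal sketch, assuming input in the shape produced by `ceph daemon osd.N dump_mempools` (a trimmed sample is inlined here for illustration):

```python
import json

# Trimmed, illustrative excerpt of a `dump_mempools` output
dump = json.loads("""
{"mempool": {"by_pool": {
  "bluestore_cache_onode": {"items": 2707283, "bytes": 1775977648},
  "bluestore_cache_meta": {"items": 288856905, "bytes": 1867155824},
  "bluestore_SharedBlob": {"items": 17758439, "bytes": 1988945168},
  "osd_pglog": {"items": 873105, "bytes": 229538616}
}}}
""")

def top_pools(dump, n=3):
    """Return the n mempools with the highest byte usage, largest first."""
    pools = dump["mempool"]["by_pool"].items()
    return sorted(pools, key=lambda kv: kv[1]["bytes"], reverse=True)[:n]

for name, stats in top_pools(dump):
    print(f"{name}: {stats['bytes'] / 2**30:.2f} GiB in {stats['items']} items")
```

In the full dump above, the cache- and blob-related pools together account for most of the ~9 GB total, far beyond the 3 GB osd_memory_target.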
ceph.conf
[global]
bluestore_rocksdb_options = "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1610612736,max_bytes_for_level_multiplier=10"
cluster network = 10.0.20.0/24
err_to_syslog = true
fsid = 874ec41f-e6eb-4c0b-be4e-e82be057195b
log_to_syslog = true
mon host = [v2:10.0.20.11:3300,v1:10.0.20.11:6789],[v2:10.0.20.12:3300,v1:10.0.20.12:6789],[v2:10.0.20.14:3300,v1:10.0.20.14:6789]
mon initial members = ceph1-hdd,ceph2-hdd,ceph4-hdd
osd pool default crush rule = -1
osd_command_thread_suicide_timeout = 7200
osd_command_thread_timeout = 4800
osd_crush_update_on_start = true
osd_max_pg_per_osd_hard_ratio = 32
osd_max_scrubs = 10
osd_memory_target = 3221225472
osd_op_thread_suicide_timeout = 600
osd_op_thread_timeout = 300
osd_pool_default_ec_fast_read = true
osd_recovery_max_active = 1
public network = 10.0.20.0/24

[osd]
osd memory target = 3221225472
PS: I see that I duplicated osd_memory_target; that's not the problem :)
Updated by Igor Fedotov over 3 years ago
Presuming memory utilization is still that high, could you please temporarily set debug_bluestore to 20 for the OSD in question and leave it running (without any load, if possible) for 1 minute?
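For reference, the debug level can be raised and restored at runtime via `ceph tell`; the OSD id below is a placeholder for the affected daemon:

```shell
# Raise BlueStore debug logging on the affected OSD (osd.17 is illustrative)
ceph tell osd.17 injectargs '--debug_bluestore 20'
# ... let the OSD run for about a minute ...
sleep 60
# Restore the default debug level afterwards
ceph tell osd.17 injectargs '--debug_bluestore 1/5'
```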
Updated by Rafal Wadolowski over 3 years ago
Here you are:
https://cf2.cloudferro.com:8080/swift/v1/AUTH_5b9ea421deb745bfb4dab930cebe153f/ceph-sharings/ceph_17_debug_leak.txt
Without load, only normal cluster ops.
Updated by Igor Fedotov over 3 years ago
- Status changed from New to Triaged
It looks like the high RAM usage is caused by improper onode cache trimming inside BlueStore, which in turn might be caused by a bug in onode reference counting.
One can see this from the repetitive log output, which performs no actual cache trimming due to pinned onodes:
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.MempoolThread(0x5570288a2b70) _trim_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 62310608 meta_alloc: 67108864 meta_used: 7976997419 data_alloc: 67108864 data_used: 0
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim onodes 2499728 / 4234 buffers 0 / 13421772
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim considering 0x557054ff7e40
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim #-1:e2a35965:::inc_osdmap.7948:0# has 2 refs; skipping
...
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.MempoolThread(0x5570288a2b70) _trim_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 62310608 meta_alloc: 67108864 meta_used: 7976908459 data_alloc: 67108864 data_used: 0
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim onodes 2499728 / 4234 buffers 0 / 13421772
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim considering 0x557054ff7e40
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim #-1:e2a35965:::inc_osdmap.7948:0# has 2 refs; skipping
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim considering 0x557054ff8100
...
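To illustrate the failure mode in the log above: the trim loop walks the LRU and skips any onode whose reference count keeps it pinned, so if a refcounting bug leaves every onode pinned, nothing is ever evicted. A simplified Python sketch (not the actual C++ implementation; the `max_skip_pinned` parameter loosely mirrors the bluestore_cache_trim_max_skip_pinned option):

```python
# Illustrative model of a trim loop that skips pinned cache entries.
class Onode:
    def __init__(self, name, nref):
        self.name, self.nref = name, nref

def trim(lru, target, max_skip_pinned=64):
    """Evict onodes until the cache shrinks to `target` entries.
    Entries with nref > 1 are pinned elsewhere and cannot be evicted."""
    skipped = 0
    evicted = []
    i = 0
    while len(lru) - len(evicted) > target and i < len(lru):
        o = lru[i]
        if o.nref > 1:              # pinned: skip it
            skipped += 1
            if skipped > max_skip_pinned:
                break               # give up for this trim pass
            i += 1
            continue
        evicted.append(o)
        i += 1
    for o in evicted:
        lru.remove(o)
    return evicted

# If a refcounting bug leaves every onode pinned, trim frees nothing:
lru = [Onode(f"o{i}", 2) for i in range(10)]
assert trim(lru, target=0) == []
```

This is exactly the pattern visible in the log: every `_trim considering` line is followed by `has 2 refs; skipping`, so the onode count never drops toward the target.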
Starting with Octopus, onode caching has been significantly reworked, and I presume the bug isn't present anymore.
I'm not sure we want to backport all the relevant changes to Nautilus, though, due to their complexity.
Just in case, I'd suggest trying the latest Nautilus, but it is highly likely this won't fix the issue.
Another (quite questionable) workaround might be to tune the bluestore_cache_trim_max_skip_pinned parameter to force cache trimming to skip the first pinned entries and proceed. But IMO this isn't a very good solution, since those pinned entries might live forever and their number might grow high...
Hence IMO upgrading to Octopus is the preferred way of dealing with this bug.
Updated by Igor Fedotov about 3 years ago
- Status changed from Triaged to Pending Backport
Updated by Yuri Weinstein about 3 years ago
Updated by Igor Fedotov about 3 years ago
- Status changed from Pending Backport to Resolved
Updated by Igor Fedotov almost 2 years ago
- Has duplicate Bug #55761: ceph osd bluestore memory leak added