Bug #48729
Closed: BlueStore memory leak on scrub operations
Description
We observed unbounded RAM growth in OSD processes.
During our investigation with Valgrind, we gathered the following information about the memory leak:
==3288134== 1,482,956,416 bytes in 6,390,722 blocks are possibly lost in loss record 4,882 of 4,883
==3288134==    at 0xABF4E76: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3288134==    by 0xABF4FE1: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3288134==    by 0x10DA5CA: ceph::buffer::v14_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int) (buffer.cc:121)
==3288134==    by 0x10DA7A1: ceph::buffer::v14_2_0::create_aligned(unsigned int, unsigned int) (buffer.cc:417)
==3288134==    by 0x10DB15B: ceph::buffer::v14_2_0::copy(char const*, unsigned int) (buffer.cc:365)
==3288134==    by 0xD3858E: BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&) (buffer.h:233)
==3288134==    by 0xD3BB84: BlueStore::Collection::get_onode(ghobject_t const&, bool) (BlueStore.cc:3467)
==3288134==    by 0xD566DE: BlueStore::stat(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, stat*, bool) (BlueStore.cc:8908)
==3288134==    by 0xAFCBA7: PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&) (PGBackend.cc:586)
==3288134==    by 0x98CD1A: PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&) (PG.cc:5047)
==3288134==    by 0x9BCADA: PG::chunky_scrub(ThreadPool::TPHandle&) (PG.cc:5673)
==3288134==    by 0x9BDF5E: PG::scrub(unsigned int, ThreadPool::TPHandle&) (PG.cc:5237)
After some time, the OSD process is killed by the OOM killer.
Updated by Igor Fedotov over 3 years ago
Wondering if you can try a patch from https://tracker.ceph.com/issues/46027 and check whether it's helpful in your case...
Updated by Rafal Wadolowski over 3 years ago
I will put this on a test environment and see.
Updated by Rafal Wadolowski over 3 years ago
Unfortunately, that's not the case. After 4 days, some of the OSDs consumed more than 10 GB of RAM.
For example:
{
  "mempool": {
    "by_pool": {
      "bloom_filter": { "items": 0, "bytes": 0 },
      "bluestore_alloc": { "items": 618237, "bytes": 49458960 },
      "bluestore_cache_data": { "items": 2706603, "bytes": 256537796 },
      "bluestore_cache_onode": { "items": 2707283, "bytes": 1775977648 },
      "bluestore_cache_meta": { "items": 288856905, "bytes": 1867155824 },
      "bluestore_cache_other": { "items": 17828824, "bytes": 302498676 },
      "bluestore_Buffer": { "items": 286, "bytes": 38896 },
      "bluestore_Extent": { "items": 17758438, "bytes": 852405024 },
      "bluestore_Blob": { "items": 17758439, "bytes": 1846877656 },
      "bluestore_SharedBlob": { "items": 17758439, "bytes": 1988945168 },
      "bluestore_inline_bl": { "items": 75, "bytes": 9822 },
      "bluestore_fsck": { "items": 0, "bytes": 0 },
      "bluestore_txc": { "items": 32, "bytes": 23296 },
      "bluestore_writing_deferred": { "items": 475, "bytes": 2434417 },
      "bluestore_writing": { "items": 0, "bytes": 0 },
      "bluefs": { "items": 957, "bytes": 27936 },
      "bluefs_file_reader": { "items": 100, "bytes": 749944 },
      "bluefs_file_writer": { "items": 4, "bytes": 1216 },
      "buffer_anon": { "items": 1202, "bytes": 4023826 },
      "buffer_meta": { "items": 1159, "bytes": 101992 },
      "osd": { "items": 172, "bytes": 2238752 },
      "osd_mapbl": { "items": 0, "bytes": 0 },
      "osd_pglog": { "items": 873105, "bytes": 229538616 },
      "osdmap": { "items": 20191, "bytes": 1143144 },
      "osdmap_mapping": { "items": 0, "bytes": 0 },
      "pgmap": { "items": 0, "bytes": 0 },
      "mds_co": { "items": 0, "bytes": 0 },
      "unittest_1": { "items": 0, "bytes": 0 },
      "unittest_2": { "items": 0, "bytes": 0 }
    },
    "total": { "items": 366890926, "bytes": 9180188609 }
  }
}
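To spot the dominant consumers in a dump like the one above, the JSON can be summarized programmatically. A minimal sketch, assuming input in the shape produced by `ceph daemon osd.N dump_mempools` (a trimmed sample is inlined here for illustration):

```python
import json

# Trimmed, illustrative excerpt of a `dump_mempools` output
dump = json.loads("""
{"mempool": {"by_pool": {
  "bluestore_cache_onode": {"items": 2707283, "bytes": 1775977648},
  "bluestore_cache_meta": {"items": 288856905, "bytes": 1867155824},
  "bluestore_SharedBlob": {"items": 17758439, "bytes": 1988945168},
  "osd_pglog": {"items": 873105, "bytes": 229538616}
}}}
""")

def top_pools(dump, n=3):
    """Return the n mempools with the highest byte usage, largest first."""
    pools = dump["mempool"]["by_pool"].items()
    return sorted(pools, key=lambda kv: kv[1]["bytes"], reverse=True)[:n]

for name, stats in top_pools(dump):
    print(f"{name}: {stats['bytes'] / 2**30:.2f} GiB in {stats['items']} items")
```

In the full dump above, the cache- and blob-related pools together account for most of the ~9 GB total, far beyond the 3 GB osd_memory_target.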
ceph.conf
[global]
bluestore_rocksdb_options = "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1610612736,max_bytes_for_level_multiplier=10"
cluster network = 10.0.20.0/24
err_to_syslog = true
fsid = 874ec41f-e6eb-4c0b-be4e-e82be057195b
log_to_syslog = true
mon host = [v2:10.0.20.11:3300,v1:10.0.20.11:6789],[v2:10.0.20.12:3300,v1:10.0.20.12:6789],[v2:10.0.20.14:3300,v1:10.0.20.14:6789]
mon initial members = ceph1-hdd,ceph2-hdd,ceph4-hdd
osd pool default crush rule = -1
osd_command_thread_suicide_timeout = 7200
osd_command_thread_timeout = 4800
osd_crush_update_on_start = true
osd_max_pg_per_osd_hard_ratio = 32
osd_max_scrubs = 10
osd_memory_target = 3221225472
osd_op_thread_suicide_timeout = 600
osd_op_thread_timeout = 300
osd_pool_default_ec_fast_read = true
osd_recovery_max_active = 1
public network = 10.0.20.0/24

[osd]
osd memory target = 3221225472
PS: I see that I duplicated osd_memory_target; that's not the problem :)
Updated by Igor Fedotov over 3 years ago
Presuming memory utilization is still that high, could you please temporarily set debug_bluestore to 20 for the OSD in question and leave it running (without any load, if possible) for 1 minute?
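For reference, the debug level can be raised and restored at runtime via `ceph tell`; the OSD id below is a placeholder for the affected daemon:

```shell
# Raise BlueStore debug logging on the affected OSD (osd.17 is illustrative)
ceph tell osd.17 injectargs '--debug_bluestore 20'
# ... let the OSD run for about a minute ...
sleep 60
# Restore the default debug level afterwards
ceph tell osd.17 injectargs '--debug_bluestore 1/5'
```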
Updated by Rafal Wadolowski over 3 years ago
Here you are:
https://cf2.cloudferro.com:8080/swift/v1/AUTH_5b9ea421deb745bfb4dab930cebe153f/ceph-sharings/ceph_17_debug_leak.txt
Without load, only normal cluster ops.
Updated by Igor Fedotov over 3 years ago
- Status changed from New to Triaged
It looks like the high RAM usage is caused by improper onode cache trimming inside BlueStore, which in turn might be caused by a bug in onode reference counting.
One can see this from the repetitive log output, which performs no actual cache trimming due to pinned onodes:
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.MempoolThread(0x5570288a2b70) _trim_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 62310608 meta_alloc: 67108864 meta_used: 7976997419 data_alloc: 67108864 data_used: 0
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim onodes 2499728 / 4234 buffers 0 / 13421772
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim considering 0x557054ff7e40
2021-01-12 10:41:27.640 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim #-1:e2a35965:::inc_osdmap.7948:0# has 2 refs; skipping
...
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.MempoolThread(0x5570288a2b70) _trim_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 62310608 meta_alloc: 67108864 meta_used: 7976908459 data_alloc: 67108864 data_used: 0
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim onodes 2499728 / 4234 buffers 0 / 13421772
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim considering 0x557054ff7e40
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim #-1:e2a35965:::inc_osdmap.7948:0# has 2 refs; skipping
2021-01-12 10:41:27.692 7fdd634b0700 20 bluestore.2QCache(0x557028822c40) _trim considering 0x557054ff8100
...
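To illustrate the failure mode in the log above: the trim loop walks the LRU and skips any onode whose reference count keeps it pinned, so if a refcounting bug leaves every onode pinned, nothing is ever evicted. A simplified Python sketch (not the actual C++ implementation; the `max_skip_pinned` parameter loosely mirrors the bluestore_cache_trim_max_skip_pinned option):

```python
# Illustrative model of a trim loop that skips pinned cache entries.
class Onode:
    def __init__(self, name, nref):
        self.name, self.nref = name, nref

def trim(lru, target, max_skip_pinned=64):
    """Evict onodes until the cache shrinks to `target` entries.
    Entries with nref > 1 are pinned elsewhere and cannot be evicted."""
    skipped = 0
    evicted = []
    i = 0
    while len(lru) - len(evicted) > target and i < len(lru):
        o = lru[i]
        if o.nref > 1:              # pinned: skip it
            skipped += 1
            if skipped > max_skip_pinned:
                break               # give up for this trim pass
            i += 1
            continue
        evicted.append(o)
        i += 1
    for o in evicted:
        lru.remove(o)
    return evicted

# If a refcounting bug leaves every onode pinned, trim frees nothing:
lru = [Onode(f"o{i}", 2) for i in range(10)]
assert trim(lru, target=0) == []
```

This is exactly the pattern visible in the log: every `_trim considering` line is followed by `has 2 refs; skipping`, so the onode count never drops toward the target.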
Starting with Octopus, onode caching has been significantly reworked, and I presume the bug isn't present anymore.
I'm not sure we want to backport all the relevant changes to Nautilus, though, due to their complexity.
Just in case, I'd suggest trying the latest Nautilus, but it is highly likely this won't fix the issue.
Another (quite questionable) workaround might be to tune the bluestore_cache_trim_max_skip_pinned parameter to force cache trimming to skip the first pinned entries and proceed. But IMO this isn't a very good solution, since those pinned entries might live forever and their number might grow high...
Hence IMO upgrading to Octopus is the preferred way of dealing with this bug.
Updated by Igor Fedotov about 3 years ago
- Status changed from Triaged to Pending Backport
Updated by Yuri Weinstein about 3 years ago
Updated by Igor Fedotov about 3 years ago
- Status changed from Pending Backport to Resolved
Updated by Igor Fedotov almost 2 years ago
- Has duplicate Bug #55761: ceph osd bluestore memory leak added