Bug #21761
ceph-osd consumes way too much memory during recovery
Status: Closed
Description
The ceph-osd processes have been seen consuming upwards of 20GB of RAM per process while a system is recovering. Even on systems provisioned with 192GB RAM, 8 or 10 OSD processes can consume all of it and make it impossible to bring up all of the OSDs. This further complicates the recovery: the OSDs that don't have enough RAM to start get marked out, which triggers more rebalancing.
This is a critical issue for Jewel users (10.2.9) using filestore.
Example of a system with 128GB RAM, 11 OSDs (4TB drives) running a recovery operation (triggered by cephfs snapshot bug #21412):
$ top
Tasks: 460 total, 1 running, 459 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.4 us, 2.0 sy, 0.0 ni, 83.2 id, 10.3 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 13193300+total, 1150304 free, 12910152+used, 1681184 buff/cache
KiB Swap: 999420 total, 2368 free, 997052 used. 483084 avail Mem

   PID USER  PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  4771 ceph  20  0 18.831g 0.014t  30392 S  25.0 11.3 158:41.12 ceph-osd
 35087 ceph  20  0 21.398g 0.017t  20092 S  25.0 13.5 145:30.42 ceph-osd
  3929 ceph  20  0 22.269g 0.018t  27516 S  18.8 14.5 105:34.46 ceph-osd
 86341 ceph  20  0 23.413g 0.019t  33712 S  12.5 15.4 100:29.06 ceph-osd
108322 ceph  20  0 4398468 2.628g 483264 D  12.5  2.1   0:22.57 ceph-osd
  4068 ceph  20  0 22.543g 0.018t  49780 S   6.2 14.4 161:15.71 ceph-osd
 11898 ceph  20  0 19.647g 0.014t  34524 S   6.2 11.6 144:21.33 ceph-osd

$ free -h
       total  used  free  shared  buff/cache  available
Mem:    125G  123G  1.1G     88M        1.3G       385M
Swap:   975M  973M  2.3M
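The headline numbers above can be cross-checked quickly by summing the resident set size of every ceph-osd process. A minimal sketch, assuming a Linux procps-style `ps` (which reports RSS in KiB) and that the daemons run under the process name `ceph-osd`:

```shell
# Sum the resident set size (RSS) of all ceph-osd processes and print it in GiB.
# Assumes procps 'ps'; RSS is in KiB, so divide by 1048576 to get GiB.
ps -C ceph-osd -o rss= | awk '{ kib += $1 } END { printf "%.1f GiB total RSS\n", kib / 1048576 }'
```

Watching this total while recovery runs makes it easier to see whether memory growth is spread evenly or concentrated in a few OSDs.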
Starting new OSDs becomes nearly impossible once system memory runs low. The ceph-osd process starts, but soon crashes and restarts, endlessly. The crash log looks like this:
-10> 2017-10-11 10:18:44.480698 7f6cf18658c0 5 osd.72 pg_epoch: 303792 pg[28.36b(unlocked)] enter Initial
 -9> 2017-10-11 10:18:44.480996 7f6cf18658c0 5 osd.72 pg_epoch: 303792 pg[28.36b( v 269495'33 (0'0,269495'33] local-les=303763 n=12 ec=269489 les/c/f 303608/300422/0 303758/303760/284901) [100,72,37] r=1 lpr=0 pi=269489-303759/139 crt=269495'33 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.000298 0 0.000000
 -8> 2017-10-11 10:18:44.481010 7f6cf18658c0 5 osd.72 pg_epoch: 303792 pg[28.36b( v 269495'33 (0'0,269495'33] local-les=303763 n=12 ec=269489 les/c/f 303608/300422/0 303758/303760/284901) [100,72,37] r=1 lpr=0 pi=269489-303759/139 crt=269495'33 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
 -7> 2017-10-11 10:18:44.645087 7f6cf18658c0 5 osd.72 pg_epoch: 303768 pg[1.4f4(unlocked)] enter Initial
 -6> 2017-10-11 10:18:46.762577 7f6cf18658c0 5 osd.72 pg_epoch: 303768 pg[1.4f4( v 279387'245575 (202648'235117,279387'245575] local-les=303763 n=8170 ec=23117 les/c/f 303101/280936/0 303758/303760/302603) [67,72,100] r=1 lpr=0 pi=278053-303759/75 crt=279387'245575 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 2.117488 0 0.000000
 -5> 2017-10-11 10:18:46.762616 7f6cf18658c0 5 osd.72 pg_epoch: 303768 pg[1.4f4( v 279387'245575 (202648'235117,279387'245575] local-les=303763 n=8170 ec=23117 les/c/f 303101/280936/0 303758/303760/302603) [67,72,100] r=1 lpr=0 pi=278053-303759/75 crt=279387'245575 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
 -4> 2017-10-11 10:18:46.816171 7f6cf18658c0 5 osd.72 pg_epoch: 303791 pg[9.644(unlocked)] enter Initial
 -3> 2017-10-11 10:18:46.825798 7f6cf18658c0 5 osd.72 pg_epoch: 303791 pg[9.644( v 269495'38555 (35683'35552,269495'38555] local-les=303763 n=27 ec=1076 les/c/f 300534/300248/0 303758/303759/295662) [62,68,72] r=2 lpr=0 pi=203019-303758/217 crt=269495'38555 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.009627 0 0.000000
 -2> 2017-10-11 10:18:46.825815 7f6cf18658c0 5 osd.72 pg_epoch: 303791 pg[9.644( v 269495'38555 (35683'35552,269495'38555] local-les=303763 n=27 ec=1076 les/c/f 300534/300248/0 303758/303759/295662) [62,68,72] r=2 lpr=0 pi=203019-303758/217 crt=269495'38555 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
 -1> 2017-10-11 10:18:46.974700 7f6cf18658c0 5 osd.72 pg_epoch: 303798 pg[1.5a4(unlocked)] enter Initial
  0> 2017-10-11 10:18:47.517124 7f6cf18658c0 -1 *** Caught signal (Aborted) **
 in thread 7f6cf18658c0 thread_name:ceph-osd

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x984c4e) [0x56400a2d3c4e]
 2: (()+0x11390) [0x7f6cf0723390]
 3: (gsignal()+0x38) [0x7f6cee6c1428]
 4: (abort()+0x16a) [0x7f6cee6c302a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f6cef00384d]
 6: (()+0x8d6b6) [0x7f6cef0016b6]
 7: (()+0x8d701) [0x7f6cef001701]
 8: (()+0x8d919) [0x7f6cef001919]
 9: (()+0x1230f) [0x7f6cf13fb30f]
 10: (operator new[](unsigned long)+0x4e7) [0x7f6cf141f4b7]
 11: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x313) [0x7f6cf0dace63]
 12: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x276) [0x7f6cf0db1426]
 13: (()+0x421be) [0x7f6cf0db51be]
 14: (()+0x42240) [0x7f6cf0db5240]
 15: (()+0x4261e) [0x7f6cf0db561e]
 16: (()+0x3d835) [0x7f6cf0db0835]
 17: (()+0x1fffb) [0x7f6cf0d92ffb]
 18: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::next()+0x8f) [0x56400a18f40f]
 19: (DBObjectMap::DBObjectMapIteratorImpl::next(bool)+0x34) [0x56400a142954]
 20: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0xafb) [0x564009f7e97b]
 21: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x313) [0x564009dacfb3]
 22: (OSD::load_pgs()+0x87a) [0x564009ce796a]
 23: (OSD::init()+0x2026) [0x564009cf2c56]
 24: (main()+0x2ef1) [0x564009c64391]
 25: (__libc_start_main()+0xf0) [0x7f6cee6ac830]
 26: (_start()+0x29) [0x564009ca5b99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Wyllys Ingersoll over 6 years ago
Here are some heap stats from one of the OSDs that is currently consuming about 21GB of RAM according to "top":
osd.31 tcmalloc heap stats:
------------------------------------------------
MALLOC:    17991446352 (17158.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    305843008 (  291.7 MiB) Bytes in central cache freelist
MALLOC: +        81920 (    0.1 MiB) Bytes in transfer cache freelist
MALLOC: +    105874800 (  101.0 MiB) Bytes in thread cache freelists
MALLOC: +     66924704 (   63.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  18470170784 (17614.5 MiB) Actual memory used (physical + swap)
MALLOC: +   4306345984 ( 4106.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  22776516768 (21721.4 MiB) Virtual address space used
MALLOC:
MALLOC:         682446 Spans in use
MALLOC:            791 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
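When comparing dumps like this across many OSDs, it helps to pull out just the headline figures. A small sketch that reads a heap-stats dump (as printed by `ceph tell osd.<id> heap stats`) on stdin; the label list is an assumption based on the dump format shown above:

```shell
# Extract the headline figures (in MiB) from a tcmalloc heap-stats dump on stdin.
# Splits each line on parentheses: $2 is the "(NNN.N MiB)" value, $3 the label.
awk -F'[()]' '
  /Bytes in use by application|Actual memory used|Bytes released to OS|Virtual address space used/ {
    label = $3; value = $2
    gsub(/^ +| +$/, "", label)        # trim whitespace around the label
    gsub(/^ +| *MiB *$/, "", value)   # keep only the numeric MiB figure
    print label ": " value " MiB"
  }'
```

The gap between "Actual memory used" and "Virtual address space used" here is the ~4GiB already released back to the OS, which is why top's VIRT runs well ahead of RES.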
Updated by Sage Weil over 6 years ago
- Status changed from New to Need More Info
We recently fixed a problem that could lead to a similar scenario. When this happens, do you have lots of PGs in the recovery_wait state? If so, can you look at
'ceph pg dump | grep recovery_wait' and see if the LOG column is big (> 10000)? If so, this is a priority inversion problem. We just fixed it in luminous (12.2.2, not out yet). A workaround is to set osd_max_backfills to a larger value to get recovery started on the OSDs that have recovery_wait PGs. Once those PGs recover and it's just backfill, the memory usage will drop.
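The check above can be scripted in one pass. A hedged sketch: the LOG field position (`$8` here) is an assumption and should be matched against the header line your release's `ceph pg dump` actually prints:

```shell
# List recovery_wait PGs whose pg log is large (> 10000 entries), then count them.
# NOTE: the LOG column position ($8) is an assumption; verify it against the
# header line of 'ceph pg dump' on your release before trusting the output.
ceph pg dump 2>/dev/null |
  awk '/recovery_wait/ { n++; if ($8 > 10000) print $1 " log=" $8 }
       END { print n+0 " PGs in recovery_wait" }'
```

A handful of PGs with very long logs stuck in recovery_wait is the signature of the priority inversion described here.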
Updated by Sage Weil over 6 years ago
- Related to Bug #21331: pg recovery priority inversion added
Updated by Wyllys Ingersoll over 6 years ago
The problem has since sorted itself out (though it took over a week to get back to HEALTH_OK). I believe the recovery_wait count was in the hundreds, or maybe the 1000-2000 range, at the peak.
We ended up tuning a number of settings down to make the memory issue manageable. This was all triggered by removing a very old cephfs snapshot; many TB of changes had been added to the system after it was originally taken. These are the parameters we set. I'm not sure whether they were all effective, since it's sometimes hard to see the impact of any one of them, but this is where we ended up getting healthy. Clearly our max_backfills value was too low; perhaps recovery would have gone faster with a larger value.
osd map cache size = 200
osd map max advance = 100
osd map share max epochs = 100
osd pg epoch persisted max stale = 100
osd_recovery_max_active = 4
osd_recovery_threads = 1
osd_recovery_max_single_start = 1
osd_recovery_op_priority = 2
osd_op_threads = 8
osd_max_backfills = 2
osd_backfill_scan_max = 16
osd_backfill_scan_min = 4
osd_snap_trim_sleep = 0.1
osd_pg_max_concurrent_snap_trims = 1
osd_max_trimming_pgs = 1
osd_max_pg_log_entries = 1000
osd_min_pg_log_entries = 256
osd_pg_log_trim_min = 200
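The settings above live in ceph.conf, but most of them can also be pushed to running daemons. A sketch assuming the usual injectargs path; the values mirror a few of the ones listed above, and the command is only echoed here so it can be reviewed before running against a live cluster:

```shell
# Build an injectargs string for a few of the settings above and print the
# command; the commented line is what you would actually run on a live cluster.
opts='--osd_max_backfills 2 --osd_recovery_max_active 4 --osd_snap_trim_sleep 0.1'
echo "ceph tell osd.* injectargs '$opts'"
# ceph tell osd.* injectargs "$opts"   # uncomment on a live cluster
```

Injected values do not survive a daemon restart, so anything that works should also be persisted in ceph.conf as above.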
Updated by Sage Weil over 5 years ago
- Status changed from Need More Info to Can't reproduce