Project

General

Profile

Bug #21761

ceph-osd consumes way too much memory during recovery

Added by Wyllys Ingersoll 5 months ago. Updated 4 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
Start date:
10/11/2017
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

The ceph-osd processes have been seen consuming upwards of 20GB of RAM per process when a system is recovering. Even on systems that are provisioned with 192GB RAM, 8 or 10 OSD processes can consume all of the RAM and make it impossible to bring up all of the OSDS which further complicates the recovery since the ones that dont have enough RAM to start get marked out which causes more rebalancing.

This is a critical issue for Jewel users (10.2.9) using filestore.

Example of a system with 128GB RAM, 11 OSDs (4TB drives) running a recovery operation (triggered by cephfs snapshot bug #21412):


$ top

Tasks: 460 total,   1 running, 459 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.4 us,  2.0 sy,  0.0 ni, 83.2 id, 10.3 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 13193300+total,  1150304 free, 12910152+used,  1681184 buff/cache
KiB Swap:   999420 total,     2368 free,   997052 used.   483084 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                            
   4771 ceph      20   0 18.831g 0.014t  30392 S  25.0 11.3 158:41.12 ceph-osd                                                                                           
  35087 ceph      20   0 21.398g 0.017t  20092 S  25.0 13.5 145:30.42 ceph-osd                                                                                           
   3929 ceph      20   0 22.269g 0.018t  27516 S  18.8 14.5 105:34.46 ceph-osd                                                                                           
  86341 ceph      20   0 23.413g 0.019t  33712 S  12.5 15.4 100:29.06 ceph-osd                                                                                           
 108322 ceph      20   0 4398468 2.628g 483264 D  12.5  2.1   0:22.57 ceph-osd                                                                                           
   4068 ceph      20   0 22.543g 0.018t  49780 S   6.2 14.4 161:15.71 ceph-osd                                                                                           
  11898 ceph      20   0 19.647g 0.014t  34524 S   6.2 11.6 144:21.33 ceph-osd  

  $ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        123G        1.1G         88M        1.3G        385M
Swap:          975M        973M        2.3M                                                                                      

Starting new OSDs becomes nearly impossible once the system memory becomes low. The ceph-osd process will start, but then soon crashes and restarts again endlessly, the crash log looks like this:

   -10> 2017-10-11 10:18:44.480698 7f6cf18658c0  5 osd.72 pg_epoch: 303792 pg[28.36b(unlocked)] enter Initial
    -9> 2017-10-11 10:18:44.480996 7f6cf18658c0  5 osd.72 pg_epoch: 303792 pg[28.36b( v 269495'33 (0'0,269495'33] local-les=303763 n=12 ec=269489 les/c/f 303608/300422/0 303758/303760/284901) [100,72,37] r=1 lpr=0 pi=269489-303759/139 crt=269495'33 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.000298 0 0.000000
    -8> 2017-10-11 10:18:44.481010 7f6cf18658c0  5 osd.72 pg_epoch: 303792 pg[28.36b( v 269495'33 (0'0,269495'33] local-les=303763 n=12 ec=269489 les/c/f 303608/300422/0 303758/303760/284901) [100,72,37] r=1 lpr=0 pi=269489-303759/139 crt=269495'33 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
    -7> 2017-10-11 10:18:44.645087 7f6cf18658c0  5 osd.72 pg_epoch: 303768 pg[1.4f4(unlocked)] enter Initial
    -6> 2017-10-11 10:18:46.762577 7f6cf18658c0  5 osd.72 pg_epoch: 303768 pg[1.4f4( v 279387'245575 (202648'235117,279387'245575] local-les=303763 n=8170 ec=23117 les/c/f 303101/280936/0 303758/303760/302603) [67,72,100] r=1 lpr=0 pi=278053-303759/75 crt=279387'245575 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 2.117488 0 0.000000
    -5> 2017-10-11 10:18:46.762616 7f6cf18658c0  5 osd.72 pg_epoch: 303768 pg[1.4f4( v 279387'245575 (202648'235117,279387'245575] local-les=303763 n=8170 ec=23117 les/c/f 303101/280936/0 303758/303760/302603) [67,72,100] r=1 lpr=0 pi=278053-303759/75 crt=279387'245575 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
    -4> 2017-10-11 10:18:46.816171 7f6cf18658c0  5 osd.72 pg_epoch: 303791 pg[9.644(unlocked)] enter Initial
    -3> 2017-10-11 10:18:46.825798 7f6cf18658c0  5 osd.72 pg_epoch: 303791 pg[9.644( v 269495'38555 (35683'35552,269495'38555] local-les=303763 n=27 ec=1076 les/c/f 300534/300248/0 303758/303759/295662) [62,68,72] r=2 lpr=0 pi=203019-303758/217 crt=269495'38555 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.009627 0 0.000000
    -2> 2017-10-11 10:18:46.825815 7f6cf18658c0  5 osd.72 pg_epoch: 303791 pg[9.644( v 269495'38555 (35683'35552,269495'38555] local-les=303763 n=27 ec=1076 les/c/f 300534/300248/0 303758/303759/295662) [62,68,72] r=2 lpr=0 pi=203019-303758/217 crt=269495'38555 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
    -1> 2017-10-11 10:18:46.974700 7f6cf18658c0  5 osd.72 pg_epoch: 303798 pg[1.5a4(unlocked)] enter Initial
     0> 2017-10-11 10:18:47.517124 7f6cf18658c0 -1 *** Caught signal (Aborted) **
 in thread 7f6cf18658c0 thread_name:ceph-osd

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x984c4e) [0x56400a2d3c4e]
 2: (()+0x11390) [0x7f6cf0723390]
 3: (gsignal()+0x38) [0x7f6cee6c1428]
 4: (abort()+0x16a) [0x7f6cee6c302a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f6cef00384d]
 6: (()+0x8d6b6) [0x7f6cef0016b6]
 7: (()+0x8d701) [0x7f6cef001701]
 8: (()+0x8d919) [0x7f6cef001919]
 9: (()+0x1230f) [0x7f6cf13fb30f]
 10: (operator new[](unsigned long)+0x4e7) [0x7f6cf141f4b7]
 11: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x313) [0x7f6cf0dace63]
 12: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x276) [0x7f6cf0db1426]
 13: (()+0x421be) [0x7f6cf0db51be]
 14: (()+0x42240) [0x7f6cf0db5240]
 15: (()+0x4261e) [0x7f6cf0db561e]
 16: (()+0x3d835) [0x7f6cf0db0835]
 17: (()+0x1fffb) [0x7f6cf0d92ffb]
 18: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::next()+0x8f) [0x56400a18f40f]
 19: (DBObjectMap::DBObjectMapIteratorImpl::next(bool)+0x34) [0x56400a142954]
 20: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0xafb) [0x564009f7e97b]
 21: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x313) [0x564009dacfb3]
 22: (OSD::load_pgs()+0x87a) [0x564009ce796a]
 23: (OSD::init()+0x2026) [0x564009cf2c56]
 24: (main()+0x2ef1) [0x564009c64391]
 25: (__libc_start_main()+0xf0) [0x7f6cee6ac830]
 26: (_start()+0x29) [0x564009ca5b99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Related to RADOS - Bug #21331: pg recovery priority inversion Resolved 09/09/2017

History

#1 Updated by Wyllys Ingersoll 4 months ago

Here are some heap stats from one of the OSDs that is currently consuming about 21GB of RAM according to "top":

osd.31 tcmalloc heap stats:------------------------------------------------
MALLOC:    17991446352 (17158.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    305843008 (  291.7 MiB) Bytes in central cache freelist
MALLOC: +        81920 (    0.1 MiB) Bytes in transfer cache freelist
MALLOC: +    105874800 (  101.0 MiB) Bytes in thread cache freelists
MALLOC: +     66924704 (   63.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  18470170784 (17614.5 MiB) Actual memory used (physical + swap)
MALLOC: +   4306345984 ( 4106.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  22776516768 (21721.4 MiB) Virtual address space used
MALLOC:
MALLOC:         682446              Spans in use
MALLOC:            791              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

#2 Updated by Sage Weil 4 months ago

  • Status changed from New to Need More Info

We fixed a problem that could lead to a similar scenario recently. When this happens, do you have lots of PGs in the recovery_wait state? If so, can you look at
'ceph pg dump | grep recovery_wait' and see if the LOG column is big (> 10000)? If so, this is a priority inversion problem. We just fixed it in luminous (12.2.2, not out yet). A workaround is to set osd_max_backfills to a larger value to get recovery started on the OSDs that have recovery_wait PGs. Once those PGs recover and it's just backfill the memory usage will drop.

#3 Updated by Sage Weil 4 months ago

  • Related to Bug #21331: pg recovery priority inversion added

#4 Updated by Wyllys Ingersoll 4 months ago

The problem has since sorted itself out (though it took over a week to get back to HEALTH_OK). I believe the recovery_wait number was in the hundreds or maybe 1000-2000 range the peak.

We ended up tuning a bunch of things down to make the memory issue manageable. This was all triggered by removing a very old snapshot from cephfs that had many TB of changes added to the system after it was originally taken. These are the parameters we set, not sure if they were all effective or not, its sometimes hard to see the impact, but this is where we ended up getting healthy. Clearly our max_backfills value was too low, perhaps it would have gone faster with a larger value.

osd map cache size = 200
osd map max advance = 100
osd map share max epochs = 100
osd pg epoch persisted max stale = 100

osd_recovery_max_active = 4
osd_recovery_threads = 1
osd_recovery_max_single_start = 1
osd_recovery_op_priority = 2
osd_op_threads = 8

osd_max_backfills = 2
osd_backfill_scan_max = 16
osd_backfill_scan_min = 4

osd_snap_trim_sleep = 0.1
osd_pg_max_concurrent_snap_trims = 1
osd_max_trimming_pgs = 1

osd_max_pg_log_entries = 1000
osd_min_pg_log_entries = 256
osd_pg_log_trim_min = 200

Also available in: Atom PDF