Bug #41157
Updated by Patrick Donnelly almost 5 years ago
Smoking gun job: /ceph/teuthology-archive/pdonnell-2019-08-08_18:11:18-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4199128

<pre>
root     10635  0.0  0.0  243252     4636 ?  Ss   18:46   0:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper term ceph-mgr -f --cluster ceph -i x
root     10663  0.0  0.0  151632     6184 ?  S    18:46   0:00 /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i x
root     10665  142 33.9 13638124 11097244 ? Ssl  18:46  93:16 ceph-mgr -f --cluster ceph -i x
</pre>

The ceph-mgr process is using 150% CPU and 10.7GB of RAM (always increasing). Eventually the job fails, as in /ceph/teuthology-archive/pdonnell-2019-08-07_15:57:31-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4193689/teuthology.log, because the system RAM is exhausted.
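One way to confirm the "always increasing" RSS independently of one-off <code>ps</code> snapshots is to sample <code>VmRSS</code> from <code>/proc</code> over time. A minimal sketch; the pid argument, sample count, and interval are illustrative (substitute the runaway ceph-mgr pid, 10665 in the listing above), not part of the original report:

```shell
#!/bin/sh
# Sample the resident set size (VmRSS) of a process a few times to see
# whether it grows monotonically. Pid, count, and interval are illustrative;
# pass the ceph-mgr pid as $1 on an affected node.
pid="${1:-$$}"        # default to this shell so the script is self-testing
samples=3
interval=1
while [ "$samples" -gt 0 ]; do
    # VmRSS in /proc/<pid>/status is the current resident set size in kB
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    echo "$(date +%T) pid=$pid VmRSS=${rss_kb} kB"
    samples=$((samples - 1))
    sleep "$interval"
done
```

A steadily climbing VmRSS across samples, as opposed to a plateau after warm-up, is what distinguishes a leak like this one from a large but stable working set.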
ceph-mgr log is spewing out non-stop, which is probably related to the cause:

<pre>
2019-08-08T19:53:14.436+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.436+0000 7fce857fa700 10 module telemetry health checks:
2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
</pre>

<pre>
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: ceph version 15.0.0-3571-gce59832 (ce598323c9764ebbeac2e10927c0f38008688555) octopus (dev)
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: 1: (()+0xf5d0) [0x7f37826f55d0]
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: 2: (pthread_kill()+0x31) [0x7f37826f29d1]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x244) [0x55f5b570bc64]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 4: (ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x20d) [0x55f5b570c4fd]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 5: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x868) [0x55f5b56038a8]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 6: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x55f5b51e5300]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 7: (OSD::dispatch_context_transaction(PeeringCtx&, PG*, ThreadPool::TPHandle*)+0x5e) [0x55f5b513746e]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x276) [0x55f5b518ea26]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55f5b53984b1]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1508) [0x55f5b51837b8]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55f5b572b766]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f5b572d8c0]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: 13: (()+0x7dd5) [0x7f37826eddd5]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: 14: (clone()+0x6d) [0x7f37815b402d]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

From: /ceph/teuthology-archive/pdonnell-2019-08-07_15:57:31-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4193689/teuthology.log

The ceph-mgr daemons and one ceph-mon also triggered the OOM killer, which is probably related.
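For confirming which daemons the OOM killer actually hit on a node, the kernel log is the authoritative source. A minimal sketch, assuming a systemd host with journal or dmesg access; the grep patterns are the usual kernel OOM messages, and nothing here is taken from the job logs above:

```shell
#!/bin/sh
# List OOM-killer activity from the kernel log so the killed processes
# (ceph-mgr, ceph-mon, ...) can be identified. Which tool is available and
# how much log is retained varies by host; the patterns are illustrative.
if command -v journalctl >/dev/null 2>&1; then
    oom_lines=$(journalctl -k --no-pager 2>/dev/null \
        | grep -iE 'out of memory|oom-kill' || true)
else
    oom_lines=$(dmesg 2>/dev/null | grep -iE 'out of memory|oom-kill' || true)
fi
echo "${oom_lines:-no OOM-killer events in the current kernel log}"
```

On an affected smithi node this should show lines naming the killed process, which would tie the mgr/mon deaths directly to memory exhaustion rather than to a crash of their own.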