Bug #41157
Updated by Patrick Donnelly almost 5 years ago
Smoking gun job: /ceph/teuthology-archive/pdonnell-2019-08-08_18:11:18-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4199128

<pre>
root     10635  0.0  0.0  243252     4636 ?  Ss   18:46   0:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper term ceph-mgr -f --cluster ceph -i x
root     10663  0.0  0.0  151632     6184 ?  S    18:46   0:00 /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i x
root     10665  142 33.9 13638124 11097244 ? Ssl  18:46  93:16 ceph-mgr -f --cluster ceph -i x
</pre>

The ceph-mgr process is using 150% CPU and 10.7GB of RAM (always increasing). Eventually the job fails, as in /ceph/teuthology-archive/pdonnell-2019-08-07_15:57:31-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4193689/teuthology.log, because the system RAM is exhausted.
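One way to confirm the "always increasing" RSS independently of one-off <code>ps</code> snapshots is to sample <code>VmRSS</code> from <code>/proc</code> over time. A minimal sketch; the pid argument, sample count, and interval are illustrative (substitute the runaway ceph-mgr pid, 10665 in the listing above), not part of the original report:

```shell
#!/bin/sh
# Sample the resident set size (VmRSS) of a process a few times to see
# whether it grows monotonically. Pid, count, and interval are illustrative;
# pass the ceph-mgr pid as $1 on an affected node.
pid="${1:-$$}"        # default to this shell so the script is self-testing
samples=3
interval=1
while [ "$samples" -gt 0 ]; do
    # VmRSS in /proc/<pid>/status is the current resident set size in kB
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    echo "$(date +%T) pid=$pid VmRSS=${rss_kb} kB"
    samples=$((samples - 1))
    sleep "$interval"
done
```

A steadily climbing VmRSS across samples, as opposed to a plateau after warm-up, is what distinguishes a leak like this one from a large but stable working set.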
ceph-mgr log is spewing out non-stop, which is probably related to the cause:

<pre>
2019-08-08T19:53:14.436+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.436+0000 7fce857fa700 10 module telemetry health checks:
2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
</pre>

<pre>
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: ceph version 15.0.0-3571-gce59832 (ce598323c9764ebbeac2e10927c0f38008688555) octopus (dev)
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: 1: (()+0xf5d0) [0x7f37826f55d0]
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: 2: (pthread_kill()+0x31) [0x7f37826f29d1]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x244) [0x55f5b570bc64]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 4: (ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x20d) [0x55f5b570c4fd]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 5: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x868) [0x55f5b56038a8]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 6: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x55f5b51e5300]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 7: (OSD::dispatch_context_transaction(PeeringCtx&, PG*, ThreadPool::TPHandle*)+0x5e) [0x55f5b513746e]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x276) [0x55f5b518ea26]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55f5b53984b1]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1508) [0x55f5b51837b8]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55f5b572b766]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f5b572d8c0]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: 13: (()+0x7dd5) [0x7f37826eddd5]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: 14: (clone()+0x6d) [0x7f37815b402d]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

From: /ceph/teuthology-archive/pdonnell-2019-08-07_15:57:31-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4193689/teuthology.log

The ceph-mgr daemons and one ceph-mon also triggered the OOM killer, which is probably related.
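For confirming which daemons the OOM killer actually hit on a node, the kernel log is the authoritative source. A minimal sketch, assuming a systemd host with journal or dmesg access; the grep patterns are the usual kernel OOM messages, and nothing here is taken from the job logs above:

```shell
#!/bin/sh
# List OOM-killer activity from the kernel log so the killed processes
# (ceph-mgr, ceph-mon, ...) can be identified. Which tool is available and
# how much log is retained varies by host; the patterns are illustrative.
if command -v journalctl >/dev/null 2>&1; then
    oom_lines=$(journalctl -k --no-pager 2>/dev/null \
        | grep -iE 'out of memory|oom-kill' || true)
else
    oom_lines=$(dmesg 2>/dev/null | grep -iE 'out of memory|oom-kill' || true)
fi
echo "${oom_lines:-no OOM-killer events in the current kernel log}"
```

On an affected smithi node this should show lines naming the killed process, which would tie the mgr/mon deaths directly to memory exhaustion rather than to a crash of their own.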