Bug #41157

Updated by Patrick Donnelly over 3 years ago

Smoking gun job: /ceph/teuthology-archive/pdonnell-2019-08-08_18:11:18-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4199128

<pre>
root 10635  0.0  0.0 243252  4636 ?  Ss  18:46  0:00 sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper term ceph-mgr -f --cluster ceph -i x
root 10663  0.0  0.0 151632  6184 ?  S   18:46  0:00 /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i x
root 10665  142 33.9 13638124 11097244 ?  Ssl 18:46 93:16 ceph-mgr -f --cluster ceph -i x
</pre>

The ceph-mgr process is using ~150% CPU and 10.7 GB of RAM (always increasing). Eventually the job fails because the system RAM is exhausted, as in:

/ceph/teuthology-archive/pdonnell-2019-08-07_15:57:31-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4193689/teuthology.log

The ceph-mgr log is spewing the following non-stop, which is probably related to the cause:

<pre>
2019-08-08T19:53:14.436+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.436+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:

2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in
2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:
</pre>

The teuthology log also contains this backtrace from osd.3 on smithi198:

<pre>
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: ceph version 15.0.0-3571-gce59832 (ce598323c9764ebbeac2e10927c0f38008688555) octopus (dev)
2019-08-07T20:57:38.971 INFO:tasks.ceph.osd.3.smithi198.stderr: 1: (()+0xf5d0) [0x7f37826f55d0]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 2: (pthread_kill()+0x31) [0x7f37826f29d1]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x244) [0x55f5b570bc64]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 4: (ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x20d) [0x55f5b570c4fd]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 5: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x868) [0x55f5b56038a8]
2019-08-07T20:57:38.972 INFO:tasks.ceph.osd.3.smithi198.stderr: 6: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x55f5b51e5300]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 7: (OSD::dispatch_context_transaction(PeeringCtx&, PG*, ThreadPool::TPHandle*)+0x5e) [0x55f5b513746e]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x276) [0x55f5b518ea26]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55f5b53984b1]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1508) [0x55f5b51837b8]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55f5b572b766]
2019-08-07T20:57:38.973 INFO:tasks.ceph.osd.3.smithi198.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f5b572d8c0]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: 13: (()+0x7dd5) [0x7f37826eddd5]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: 14: (clone()+0x6d) [0x7f37815b402d]
2019-08-07T20:57:38.974 INFO:tasks.ceph.osd.3.smithi198.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

From: /ceph/teuthology-archive/pdonnell-2019-08-07_15:57:31-fs-wip-pdonnell-testing-20190807.132723-distro-basic-smithi/4193689/teuthology.log
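For triage, the spew rate can be quantified by counting how often each mgr message repeats per second of log time. A minimal standalone sketch (not part of the fix; the sample lines are the ones quoted above, and any real log path would be an assumption):

```python
import re
from collections import Counter

# Matches ceph-mgr log lines like:
#   2019-08-08T19:53:14.436+0000 7fce857fa700 20 mgr[telemetry] ...
# capturing the timestamp truncated to the second, and the message text.
LOG_LINE = re.compile(
    r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\+\d{4} \S+ \d+ (.*)$'
)

def spew_counts(lines):
    """Count (second, message) pairs to measure how fast lines repeat."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.groups()] += 1
    return counts

# Sample lines quoted in this report:
sample = [
    "2019-08-08T19:53:14.436+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in",
    "2019-08-08T19:53:14.437+0000 7fce857fa700 20 mgr[telemetry] Not sending report until user re-opts-in",
    "2019-08-08T19:53:14.437+0000 7fce857fa700 10 module telemetry health checks:",
]
counts = spew_counts(sample)
```

Running this over a full teuthology-archived mgr log would show whether the repetition rate correlates with the RSS growth.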

The ceph-mgr daemons and one ceph-mon also triggered the OOM killer, which is probably related.
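The "always increasing" claim can be checked from successive `ps aux` samples like the listing above; a small sketch that pulls out the RSS column (the ceph-mgr line here is the one quoted in this report):

```python
def rss_kib(ps_line):
    """Return the RSS column (field 6 of `ps aux` output) in KiB."""
    return int(ps_line.split()[5])

# The ceph-mgr line from the listing in this report:
mgr = ("root 10665 142 33.9 13638124 11097244 ? Ssl 18:46 93:16 "
       "ceph-mgr -f --cluster ceph -i x")
rss_gib = rss_kib(mgr) / (1024 * 1024)  # ~10.6 GiB at this sample
```

Sampling this every few seconds while a suspect job runs would confirm whether the mgr RSS grows monotonically until the OOM killer fires.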
