Bug #46368


Ceph Manager Crashing in telegraf thread.

Added by Bastian Mäuser almost 4 years ago. Updated almost 4 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently the mgr crashed on one node. The crash appears to be related to the Telegraf module:

-9> 2020-07-05 20:37:29.431 7f48baf54700 10 monclient: _send_mon_message to mon.px3 at v2:10.100.1.3:3300/0
-8> 2020-07-05 20:37:29.555 7f48bbf56700 4 mgr.server handle_report from 0x56403cf09680 osd,15
-7> 2020-07-05 20:37:29.907 7f48bbf56700 4 mgr.server handle_report from 0x56403d2f9a80 osd,33
-6> 2020-07-05 20:37:29.907 7f48bbf56700 4 mgr.server handle_report from 0x56403832ed00 osd,32
-5> 2020-07-05 20:37:29.911 7f48bbf56700 4 mgr.server handle_report from 0x56403cc69180 osd,35
-4> 2020-07-05 20:37:29.919 7f48bbf56700 4 mgr.server handle_report from 0x56403cae2880 osd,8
-3> 2020-07-05 20:37:29.923 7f48bbf56700 4 mgr.server handle_report from 0x56403b748d00 osd,7
-2> 2020-07-05 20:37:29.963 7f48c5f38700 4 mgr ms_dispatch active mgrdigest v1
-1> 2020-07-05 20:37:29.963 7f48c5f38700 4 mgr ms_dispatch mgrdigest v1
0> 2020-07-05 20:37:29.963 7f48af7fe700 -1 *** Caught signal (Segmentation fault) **
in thread 7f48af7fe700 thread_name:telegraf
ceph version 14.2.9 (bed944f8c45b9c98485e99b70e11bbcec6f6659a) nautilus (stable)
1: (()+0x12730) [0x7f48cc693730]
2: (ceph::buffer::v14_2_0::ptr_node::cloner::operator()(ceph::buffer::v14_2_0::ptr_node const&)+0x9) [0x7f48cd9a8b19]
3: (()+0x15f0a3) [0x56401d2200a3]
4: (ActivePyModules::get_python(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1c74) [0x56401d224944]
5: (()+0x172e5b) [0x56401d233e5b]
6: (PyEval_EvalFrameEx()+0x7fd0) [0x7f48cd165f80]
7: (PyEval_EvalFrameEx()+0x7590) [0x7f48cd165540]
8: (()+0x1a2476) [0x7f48cd1f0476]
9: (PyIter_Next()+0xb) [0x7f48cd20e7db]
10: (()+0x88168) [0x7f48cd0d6168]
11: (PyEval_EvalFrameEx()+0x4815) [0x7f48cd1627c5]
12: (PyEval_EvalFrameEx()+0x7590) [0x7f48cd165540]
13: (PyEval_EvalCodeEx()+0x732) [0x7f48cd15d852]
14: (()+0x19647c) [0x7f48cd1e447c]
15: (PyObject_Call()+0x53) [0x7f48cd20edd3]
16: (()+0x1ad6ec) [0x7f48cd1fb6ec]
17: (PyObject_Call()+0x53) [0x7f48cd20edd3]
18: (()+0x1c1311) [0x7f48cd20f311]
19: (PyObject_CallMethod()+0x9d) [0x7f48cd20f58d]
20: (PyModuleRunner::serve()+0x62) [0x56401d2ca402]
21: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1cc) [0x56401d2cabcc]
22: (()+0x7fa3) [0x7f48cc688fa3]
23: (clone()+0x3f) [0x7f48cc01a4cf]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mgr.px2.log
--- end dump of recent events ---

The manager restarted itself again, but there still might be a hiccup somewhere.
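
For anyone hitting this in the meantime: assuming the crash really originates in the telegraf module (as the thread name and backtrace suggest), a possible interim workaround is to disable the module until the cluster is upgraded, and to pull the recorded crash report for comparison. A rough sketch with the standard ceph CLI (the crash commands require the crash module to be enabled):

ceph mgr module ls                # confirm telegraf is listed under enabled_modules
ceph mgr module disable telegraf  # stop the module hosting the crashing thread
ceph crash ls                     # list crash reports archived by the crash module
ceph crash info <crash-id>        # show backtrace/metadata for one report
ceph mgr module enable telegraf   # re-enable once running a fixed release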

#1

Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Need More Info

This looks very similar to https://tracker.ceph.com/issues/24995, which has been fixed in 14.2.10. Can you check whether upgrading to 14.2.10 fixes the problem for you?
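
If it helps, a quick way to confirm which version the running daemons (including the mgrs) are actually on before and after the upgrade, using the standard ceph CLI:

ceph versions    # counts of running daemons per ceph version, grouped by daemon type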

