Bug #19348
Updated by Kefu Chai about 7 years ago
# start a cluster with 3 monitors: mon.a, mon.b and mon.c
# stop mon.c
# ceph ping mon.c --connect-timeout=5
It prints the following backtrace:
<pre>
timeout = 5
(-4, None, 'Interrupted!')
/var/ceph/ceph/src/msg/async/Event.cc: In function 'EventCenter::~EventCenter()' thread 7f2ccdffb700 time 2017-03-22 16:57:47.437992
/var/ceph/ceph/src/msg/async/Event.cc: 174: FAILED assert(time_events.empty())
ceph version Development (no_version)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x137) [0x7f2cdd3516b8]
2: (EventCenter::~EventCenter()+0xd2) [0x7f2cdd56ebc6]
3: (Worker::~Worker()+0x7f) [0x7f2cdd57fb89]
4: (PosixWorker::~PosixWorker()+0x2a) [0x7f2cdd58298c]
5: (PosixWorker::~PosixWorker()+0x18) [0x7f2cdd5829a8]
6: (NetworkStack::~NetworkStack()+0xa9) [0x7f2cdd57fc71]
7: (PosixNetworkStack::~PosixNetworkStack()+0x4a) [0x7f2cdd582a00]
8: (void __gnu_cxx::new_allocator<PosixNetworkStack>::destroy<PosixNetworkStack>(PosixNetworkStack*)+0x23) [0x7f2cdd57e701]
9: (void std::allocator_traits<std::allocator<PosixNetworkStack> >::destroy<PosixNetworkStack>(std::allocator<PosixNetworkStack>&, PosixNetworkStack*)+0x23) [0x7f2cdd57e6cd]
10: (std::_Sp_counted_ptr_inplace<PosixNetworkStack, std::allocator<PosixNetworkStack>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x37) [0x7f2cdd57e5b7]
11: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x42) [0x7f2ce678b93c]
12: (std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()+0x27) [0x7f2ce6785a69]
13: (std::__shared_ptr<NetworkStack, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr()+0x1c) [0x7f2cdd56a750]
14: (std::shared_ptr<NetworkStack>::~shared_ptr()+0x18) [0x7f2cdd56a76c]
15: (StackSingleton::~StackSingleton()+0x34) [0x7f2cdd56a85c]
16: (CephContext::TypedSingletonWrapper<StackSingleton>::~TypedSingletonWrapper()+0x34) [0x7f2cdd56de92]
17: (CephContext::TypedSingletonWrapper<StackSingleton>::~TypedSingletonWrapper()+0x18) [0x7f2cdd56dec6]
18: (CephContext::~CephContext()+0x8f) [0x7f2cdd65f697]
19: (CephContext::put()+0x14a) [0x7f2cdd6600e8]
20: (()+0x1d5b9e) [0x7f2ce67bfb9e]
21: (()+0x1dcc87) [0x7f2ce67c6c87]
22: (std::function<void (CephContext*)>::operator()(CephContext*) const+0x49) [0x7f2ce67d5245]
23: (std::unique_ptr<CephContext, std::function<void (CephContext*)> >::~unique_ptr()+0x49) [0x7f2ce67d0f5b]
24: (librados::RadosClient::~RadosClient()+0x140) [0x7f2ce67c25a8]
25: (librados::RadosClient::~RadosClient()+0x18) [0x7f2ce67c25d0]
26: (rados_shutdown()+0x129) [0x7f2ce6751338]
27: (()+0x17f34) [0x7f2ce6b3ef34]
28: (PyEval_EvalFrameEx()+0x7a06) [0x555c52385736]
29: (PyEval_EvalCodeEx()+0x255) [0x555c5237c535]
30: (PyEval_EvalFrameEx()+0x6968) [0x555c52384698]
31: (PyEval_EvalFrameEx()+0x5eef) [0x555c52383c1f]
32: (PyEval_EvalCodeEx()+0x255) [0x555c5237c535]
33: (()+0x115cee) [0x555c52398cee]
34: (PyObject_Call()+0x43) [0x555c5236a673]
35: (()+0x12bfee) [0x555c523aefee]
36: (PyObject_Call()+0x43) [0x555c5236a673]
37: (PyEval_CallObjectWithKeywords()+0x30) [0x555c52388430]
38: (()+0x1ce8b2) [0x555c524518b2]
39: (()+0x7424) [0x7f2ce807d424]
40: (clone()+0x5f) [0x7f2ce749b9bf]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted
</pre>
Instead, we should rely on the "timeout" parameter of the Rados method being called. If that method does not support a timeout parameter, set "client_mount_timeout" before connecting to the monitor, for example:
<pre><code class="python">
cluster_handle.conf_set("client_mount_timeout", str(timeout))
</code></pre>
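The choice between the two approaches could be sketched roughly as follows. This is only an illustration: @call_with_timeout@ is a hypothetical helper, and @FakeHandle@ is a stand-in for a cluster handle so the snippet runs without a cluster. The only call taken from the snippet above is @conf_set()@.

```python
import inspect


def call_with_timeout(handle, method_name, timeout, *args):
    """Hypothetical helper: prefer the method's own timeout parameter."""
    method = getattr(handle, method_name)
    if "timeout" in inspect.signature(method).parameters:
        # The method accepts a timeout itself, so pass it straight through.
        return method(*args, timeout=timeout)
    # Otherwise fall back to the config option. As described above, this
    # has to be set before connecting to the monitor.
    handle.conf_set("client_mount_timeout", str(timeout))
    return method(*args)


class FakeHandle:
    """Stand-in for a cluster handle, for illustration only."""

    def __init__(self):
        self.conf = {}

    def conf_set(self, key, value):
        self.conf[key] = value

    def connect(self, timeout=0):
        # Has its own timeout parameter.
        return ("connect", timeout)

    def ping_monitor(self, mon_id):
        # No timeout parameter; relies on the config option.
        return ("ping", mon_id, self.conf.get("client_mount_timeout"))
```

With this sketch, @call_with_timeout(handle, "connect", 5)@ passes the timeout directly, while @call_with_timeout(handle, "ping_monitor", 5, "c")@ sets "client_mount_timeout" first.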
Please note that, with the fix, SIGINT should still be able to terminate the wait. See @run_in_thread()@.
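
One way to keep a wait interruptible is to run the blocking call in a worker thread and have the main thread join it in short slices. The sketch below is not the actual @run_in_thread()@ implementation. @run_with_timeout@ and its @poll@ parameter are hypothetical names used only for illustration.

```python
import threading
import time


def run_with_timeout(func, timeout, *args, poll=0.1):
    """Hypothetical helper: run func(*args) in a daemon thread and wait
    at most `timeout` seconds for it (None means wait indefinitely)."""
    result = {}

    def target():
        try:
            result["value"] = func(*args)
        except Exception as exc:
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()

    deadline = None if timeout is None else time.monotonic() + timeout
    while worker.is_alive():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("call did not finish within %s seconds" % timeout)
        # Join in short slices so the main thread never blocks for long
        # in one call and a KeyboardInterrupt (SIGINT) is raised promptly.
        worker.join(poll)

    if "error" in result:
        raise result["error"]
    return result["value"]
```

Because the worker is a daemon thread, a call that is abandoned after a timeout or Ctrl-C does not keep the process alive at exit.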