Bug #19348

closed

"ceph ping mon.c" cli prints assertion failure on timeout

Added by Kefu Chai about 7 years ago. Updated over 5 years ago.

Status: Can't reproduce
Priority: Low
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Development
Tags: low-hanging-fruit
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): MonClient, ceph cli, librados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  1. start a cluster with 3 monitors: mon.a, mon.b and mon.c
  2. stop mon.c
  3. ceph ping mon.c --connect-timeout=5

It prints the following backtrace:

timeout =  5
(-4, None, 'Interrupted!')
/var/ceph/ceph/src/msg/async/Event.cc: In function 'EventCenter::~EventCenter()' thread 7f2ccdffb700 time 2017-03-22 16:57:47.437992
/var/ceph/ceph/src/msg/async/Event.cc: 174: FAILED assert(time_events.empty())
 ceph version Development (no_version)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x137) [0x7f2cdd3516b8]
 2: (EventCenter::~EventCenter()+0xd2) [0x7f2cdd56ebc6]
 3: (Worker::~Worker()+0x7f) [0x7f2cdd57fb89]
 4: (PosixWorker::~PosixWorker()+0x2a) [0x7f2cdd58298c]
 5: (PosixWorker::~PosixWorker()+0x18) [0x7f2cdd5829a8]
 6: (NetworkStack::~NetworkStack()+0xa9) [0x7f2cdd57fc71]
 7: (PosixNetworkStack::~PosixNetworkStack()+0x4a) [0x7f2cdd582a00]
 8: (void __gnu_cxx::new_allocator<PosixNetworkStack>::destroy<PosixNetworkStack>(PosixNetworkStack*)+0x23) [0x7f2cdd57e701]
 9: (void std::allocator_traits<std::allocator<PosixNetworkStack> >::destroy<PosixNetworkStack>(std::allocator<PosixNetworkStack>&, PosixNetworkStack*)+0x23) [0x7f2cdd57e6cd]
 10: (std::_Sp_counted_ptr_inplace<PosixNetworkStack, std::allocator<PosixNetworkStack>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x37) [0x7f2cdd57e5b7]
 11: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x42) [0x7f2ce678b93c]
 12: (std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()+0x27) [0x7f2ce6785a69]
 13: (std::__shared_ptr<NetworkStack, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr()+0x1c) [0x7f2cdd56a750]
 14: (std::shared_ptr<NetworkStack>::~shared_ptr()+0x18) [0x7f2cdd56a76c]
 15: (StackSingleton::~StackSingleton()+0x34) [0x7f2cdd56a85c]
 16: (CephContext::TypedSingletonWrapper<StackSingleton>::~TypedSingletonWrapper()+0x34) [0x7f2cdd56de92]
 17: (CephContext::TypedSingletonWrapper<StackSingleton>::~TypedSingletonWrapper()+0x18) [0x7f2cdd56dec6]
 18: (CephContext::~CephContext()+0x8f) [0x7f2cdd65f697]
 19: (CephContext::put()+0x14a) [0x7f2cdd6600e8]
 20: (()+0x1d5b9e) [0x7f2ce67bfb9e]
 21: (()+0x1dcc87) [0x7f2ce67c6c87]
 22: (std::function<void (CephContext*)>::operator()(CephContext*) const+0x49) [0x7f2ce67d5245]
 23: (std::unique_ptr<CephContext, std::function<void (CephContext*)> >::~unique_ptr()+0x49) [0x7f2ce67d0f5b]
 24: (librados::RadosClient::~RadosClient()+0x140) [0x7f2ce67c25a8]
 25: (librados::RadosClient::~RadosClient()+0x18) [0x7f2ce67c25d0]
 26: (rados_shutdown()+0x129) [0x7f2ce6751338]
 27: (()+0x17f34) [0x7f2ce6b3ef34]
 28: (PyEval_EvalFrameEx()+0x7a06) [0x555c52385736]
 29: (PyEval_EvalCodeEx()+0x255) [0x555c5237c535]
 30: (PyEval_EvalFrameEx()+0x6968) [0x555c52384698]
 31: (PyEval_EvalFrameEx()+0x5eef) [0x555c52383c1f]
 32: (PyEval_EvalCodeEx()+0x255) [0x555c5237c535]
 33: (()+0x115cee) [0x555c52398cee]
 34: (PyObject_Call()+0x43) [0x555c5236a673]
 35: (()+0x12bfee) [0x555c523aefee]
 36: (PyObject_Call()+0x43) [0x555c5236a673]
 37: (PyEval_CallObjectWithKeywords()+0x30) [0x555c52388430]
 38: (()+0x1ce8b2) [0x555c524518b2]
 39: (()+0x7424) [0x7f2ce807d424]
 40: (clone()+0x5f) [0x7f2ce749b9bf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted

Instead, we should rely on the "timeout" parameter of the called Rados method. If the involved Rados method does not support a timeout parameter, set "client_mount_timeout" before connecting to the monitor instead, for example:

cluster_handle.conf_set("client_mount_timeout", str(timeout))
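The suggestion can be sketched as a small helper that applies the mount timeout to a handle before it connects. This is a minimal sketch, assuming the handle exposes `conf_set(key, value)` as `rados.Rados` does; `prepare_handle` and `RecordingHandle` are hypothetical names used only for illustration, not part of the ceph CLI.

```python
def prepare_handle(cluster_handle, timeout):
    """Apply a mount timeout to a not-yet-connected cluster handle.

    Assumes cluster_handle has conf_set(key, value) like rados.Rados;
    the value is passed as a string, as in the conf_set snippet.
    """
    if timeout is not None and timeout > 0:
        # Let librados give up on its own when the monitors cannot be
        # reached, instead of interrupting the call from outside.
        cluster_handle.conf_set("client_mount_timeout", str(timeout))
    return cluster_handle


class RecordingHandle:
    """Hypothetical stand-in for rados.Rados that just records settings."""

    def __init__(self):
        self.conf = {}

    def conf_set(self, key, value):
        self.conf[key] = value


handle = prepare_handle(RecordingHandle(), 5)
print(handle.conf)  # {'client_mount_timeout': '5'}
```

With a real handle, the call would be made before `connect()`, so the timeout governs the monitor connection itself.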

Please note that, with this fix, SIGINT should still be able to interrupt the wait; see run_in_thread().
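One way to keep the wait interruptible while leaving the deadline to librados is to run the call in a worker thread and join it in short, bounded steps. This is a sketch under assumptions: `run_interruptibly` is a hypothetical helper, not the ceph CLI's actual `run_in_thread()`, and the premise is that a bounded `join()` returns periodically so the interpreter can deliver `KeyboardInterrupt` to the main thread (on some Python versions an untimed `join()` cannot be interrupted).

```python
import threading


def run_interruptibly(func, *args, poll_interval=0.1, **kwargs):
    """Run func in a worker thread and wait for it with no overall deadline.

    The deadline is assumed to be enforced by the called function itself
    (e.g. via client_mount_timeout); this helper only keeps Ctrl-C usable.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = func(*args, **kwargs)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    while t.is_alive():
        # Bounded join: returns every poll_interval seconds, giving the
        # main thread a chance to receive KeyboardInterrupt.
        t.join(poll_interval)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]


print(run_interruptibly(sum, [1, 2, 3]))  # 6
```

Because the helper imposes no timeout of its own, it never returns while the call is still running, unless the user presses Ctrl-C.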

Actions #1

Updated by Kefu Chai about 7 years ago

  • Description updated (diff)
Actions #2

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category changed from ceph cli to Correctness/Safety
  • Component(RADOS) MonClient, ceph cli, librados added
Actions #3

Updated by Anonymous over 6 years ago

  • Assignee set to Anonymous
Actions #4

Updated by Joao Eduardo Luis about 6 years ago

  • Assignee deleted (Anonymous)
  • Tags changed from low-hanging to low-hanging-fruit
Actions #7

Updated by Kefu Chai over 5 years ago

  • Status changed from New to Can't reproduce

Not able to reproduce with master HEAD anymore.
