Bug #61969

open

Ceph-mgr Hangup

Added by weifeng liu 10 months ago. Updated 10 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am hitting a bug similar to Bug #39264: ceph-mgr hangs every few hours. The pstack output shows a deadlock.
My Ceph version is:
[Wed Jul 12 15:20:18 root@node16745 site-packages]# ceph -v
ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)


Actions #1

Updated by weifeng liu 10 months ago

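The pstack output below shows the deadlock. Several threads are blocked waiting on a mutex, an RWLock, or the Python GIL: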

Thread 4 (Thread 0x7f268a546700 (LWP 3705)):
#0 0x00007f268faf465d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1 0x00007f268faed979 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2 0x00007f26925099a1 in Mutex::lock(bool) () from target:/usr/lib64/ceph/libceph-common.so.0
#3 0x0000564d3a3b13d6 in DaemonServer::ms_handle_authentication(Connection*) ()
#4 0x00007f2692819a24 in MonClient::handle_auth_request(Connection*, AuthConnectionMeta*, bool, unsigned int, ceph::buffer::v14_2_0::list const&, ceph::buffer::v14_2_0::list*) () from target:/usr/lib64/ceph/libceph-common.so.0

Thread 22 (Thread 0x7f26602b5700 (LWP 2102479)):
#0 0x00007f268faf03af in pthread_rwlock_wrlock () from target:/lib64/libpthread.so.0
#1 0x0000564d3a3d048b in RWLock::get_write(bool) ()
#2 0x0000564d3a3b2684 in DaemonServer::handle_open(MMgrOpen*) ()
#3 0x0000564d3a3cd495 in DaemonServer::ms_dispatch(Message*) ()
#4 0x0000564d3a3e1f1a in Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#5 0x00007f26926983da in DispatchQueue::entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#6 0x00007f269274dab1 in DispatchQueue::DispatchThread::entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#7 0x00007f268faeb14a in start_thread () from target:/lib64/libpthread.so.0
#8 0x00007f268e603dc3 in clone () from target:/lib64/libc.so.6

Thread 26 (Thread 0x7f2665ac0700 (LWP 2102483)):
#0 0x00007f268faf164a in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1 0x00007f2691d84fb9 in take_gil () from target:/lib64/libpython3.6m.so.1.0
#2 0x00007f2691d8527d in PyEval_RestoreThread () from target:/lib64/libpython3.6m.so.1.0
...
#19 0x00007f2691e28013 in PyObject_CallMethod () from target:/lib64/libpython3.6m.so.1.0
#20 0x0000564d3a41a196 in PyModuleRunner::serve() ()
#21 0x0000564d3a41aa15 in PyModuleRunner::PyModuleRunnerThread::entry() ()

Thread 33 (Thread 0x7f265d2af700 (LWP 2102496)):
#0 0x00007f268faf164a in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1 0x00007f2691d84fb9 in take_gil () from target:/lib64/libpython3.6m.so.1.0
#2 0x00007f2691d8527d in PyEval_RestoreThread () from target:/lib64/libpython3.6m.so.1.0
#3 0x0000564d3a3653c0 in ActivePyModules::dump_server(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<DaemonKey, std::shared_ptr<DaemonState>, std::less<DaemonKey>, std::allocator<std::pair<DaemonKey const, std::shared_ptr<DaemonState> > > > const&, ceph::Formatter*) ()
#4 0x0000564d3a366126 in ActivePyModules::list_servers_python() ()
#5 0x0000564d3a37a078 in ceph_get_server(BaseMgrModule*, _object*) ()

Thread 55 (Thread 0x7f265329b700 (LWP 2102520)):
#0 0x00007f268faefd79 in pthread_rwlock_rdlock () from target:/lib64/libpthread.so.0
#1 0x0000564d3a373fdf in RWLock::get_read() const ()
#2 0x0000564d3a3e2c39 in DaemonStateIndex::get(DaemonKey const&) ()
#3 0x0000564d3a36629d in ActivePyModules::get_metadata_python(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#4 0x0000564d3a37a120 in get_metadata(BaseMgrModule*, _object*) ()
#5 0x00007f2691e29287 in call_function () from target:/lib64/libpython3.6m.so.1.0
#6 0x00007f2691e2a168 in _PyEval_EvalFrameDefault () from target:/lib64/libpython3.6m.so.1.0
#7 0x00007f2691d85b54 in _PyEval_EvalCodeWithName () from target:/lib64/libpython3.6m.so.1.0
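
For anyone triaging a similar dump, here is a minimal Python sketch that groups threads from pstack-style output by the primitive they are blocked in at frame #0. The helper names (`parse_threads`, `blocked_threads`) and the list of wait primitives are my own assumptions for illustration, not part of Ceph or pstack. The sketch only shows which threads are waiting. It cannot tell which thread holds each lock.

```python
import re

# Header and frame patterns for the pstack/gdb-style output quoted above.
THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):")
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+)")

# Assumption: these innermost-frame symbols mean "thread is waiting on a lock".
WAIT_PRIMITIVES = (
    "__lll_lock_wait",
    "pthread_rwlock_wrlock",
    "pthread_rwlock_rdlock",
    "pthread_cond_timedwait",
)


def parse_threads(text):
    """Map thread number -> list of function names, innermost frame first."""
    threads = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            threads[current] = []
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            # Drop the argument list and any symbol-version suffix (@@GLIBC_...).
            name = m.group(2).split("(")[0].split("@@")[0]
            threads[current].append(name)
    return threads


def blocked_threads(threads):
    """Map wait primitive -> sorted thread numbers blocked in it at frame #0."""
    result = {}
    for tid, frames in threads.items():
        if frames and frames[0] in WAIT_PRIMITIVES:
            result.setdefault(frames[0], []).append(tid)
    return {name: sorted(tids) for name, tids in result.items()}


# Shortened copy of three threads from the dump above.
SAMPLE = """\
Thread 4 (Thread 0x7f268a546700 (LWP 3705)):
#0 0x00007f268faf465d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1 0x00007f268faed979 in pthread_mutex_lock () from target:/lib64/libpthread.so.0

Thread 22 (Thread 0x7f26602b5700 (LWP 2102479)):
#0 0x00007f268faf03af in pthread_rwlock_wrlock () from target:/lib64/libpthread.so.0
#1 0x0000564d3a3d048b in RWLock::get_write(bool) ()

Thread 26 (Thread 0x7f2665ac0700 (LWP 2102483)):
#0 0x00007f268faf164a in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1 0x00007f2691d84fb9 in take_gil () from target:/lib64/libpython3.6m.so.1.0
"""

if __name__ == "__main__":
    for primitive, tids in blocked_threads(parse_threads(SAMPLE)).items():
        print(primitive, tids)
```

Feeding it the full dump instead of `SAMPLE` should give a quick overview of how many threads sit in each kind of wait. Working out the actual lock ownership cycle still needs a debugger session or a read through the source.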

Actions #2

Updated by weifeng liu 10 months ago

mgr log:

2023-07-04 16:24:42.320 7fc52a62f700 0 log_channel(audit) log [DBG] : from='client.23925328 -' entity='client.admin' cmd=[{"prefix": "osd pool stats", "target": ["mgr", ""], "format": "json"}]: dispatch
2023-07-04 16:24:43.208 7fc52962d700 0 log_channel(cluster) log [DBG] : pgmap v45055: 3593 pgs: 1 active+clean+scrubbing+deep, 1 active+clean+scrubbing, 9 active+undersized+degraded+remapped+backfilling, 210 active+remapped+backfill_wait, 248 active+undersized+degraded+remapped+backfill_wait, 3124 active+clean; 16 TiB data, 49 TiB used, 124 TiB / 172 TiB avail; 1.3 MiB/s rd, 1.29k op/s; 1751060/66238278 objects degraded (2.644%); 2187552/66238278 objects misplaced (3.303%); 61 MiB/s, 75 objects/s recovering
2023-07-04 16:24:44.996 7fc52a62f700 0 log_channel(audit) log [DBG] : from='client.23900677 -' entity='client.admin' cmd=[{"format":"json","prefix":"rbd perf image stats"}]: dispatch
2023-07-04 16:24:45.220 7fc52962d700 0 log_channel(cluster) log [DBG] : pgmap v45056: 3593 pgs: 1 active+clean+scrubbing+deep, 1 active+clean+scrubbing, 9 active+undersized+degraded+remapped+backfilling, 210 active+remapped+backfill_wait, 248 active+undersized+degraded+remapped+backfill_wait, 3124 active+clean; 16 TiB data, 49 TiB used, 124 TiB / 172 TiB avail; 1.3 MiB/s rd, 1.29k op/s; 1750675/66238278 objects degraded (2.643%); 2187552/66238278 objects misplaced (3.303%); 86 MiB/s, 107 objects/s recovering
2023-07-04 17:39:28.653 7fc54f8b6700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-04 16:39:28.658056)
2023-07-04 17:39:38.657 7fc54f8b6700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-04 16:39:38.658235)
2023-07-04 17:39:48.657 7fc54f8b6700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-04 16:39:48.658423)
2023-07-04 17:39:58.657 7fc54f8b6700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-04 16:39:58.658595)
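
In the excerpt as posted, consecutive entries jump from 16:24:45 to 17:39:28. If no lines were omitted, that silent window is a likely candidate for when the mgr stopped making progress. A small Python sketch for finding such gaps in a mgr log follows. The helper names and the 5-minute threshold are assumptions chosen for illustration, and the sample lines are truncated copies of the entries above.

```python
from datetime import datetime, timedelta


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (start, end, duration) for consecutive entries further apart than threshold."""
    stamps = [parse_ts(line) for line in lines if line.strip()]
    return [(a, b, b - a) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


# Truncated copies of log lines from this comment; only the timestamps matter here.
SAMPLE_LOG = [
    "2023-07-04 16:24:44.996 7fc52a62f700 0 log_channel(audit) log [DBG] : ...",
    "2023-07-04 16:24:45.220 7fc52962d700 0 log_channel(cluster) log [DBG] : ...",
    "2023-07-04 17:39:28.653 7fc54f8b6700 -1 monclient: _check_auth_rotating possible clock skew, ...",
    "2023-07-04 17:39:38.657 7fc54f8b6700 -1 monclient: _check_auth_rotating possible clock skew, ...",
]

if __name__ == "__main__":
    for start, end, duration in find_gaps(SAMPLE_LOG):
        print(f"no log output from {start} to {end} ({duration})")
```

On the sample this reports a single gap of roughly 1 hour 15 minutes. The threshold would need tuning for a quieter cluster, where longer pauses between entries are normal.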

Actions #3

Updated by Radoslaw Zarzynski 10 months ago

  • Status changed from New to Need More Info

Nautilus is EOL. Does this happen on Pacific / Quincy / Reef maybe?
