Bug #61869

closed

pybind/cephfs: holds GIL during rmdir

Added by Patrick Donnelly 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Immediate
Category: -
Target version:
% Done: 0%
Source: Development
Tags: backport_processed
Backport: reef, quincy, pacific
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Component(FS): cephfs.pyx
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://github.com/ceph/ceph/blob/c42efbf5874de8454e4c7cb3c22bd41bcc0e71f5/src/pybind/cephfs/cephfs.pyx#L1356

Holding the GIL prevents any other work in Python from proceeding until the MDS responds to the RPC. This particularly affects the ceph-mgr daemon, which of course has many other things it needs to do in Python.
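This failure mode is not specific to Ceph: any C-level call that blocks while the GIL is held freezes every Python thread in the process. A minimal stand-alone sketch (plain Python with ctypes, not Ceph code): `ctypes.PyDLL` keeps the GIL held across a foreign call, much like the current `rmdir` binding, while `ctypes.CDLL` releases it, which is the behavior the fix needs. The sketch counts how many loop iterations a second thread manages while the main thread sits in a 0.3 s C-level sleep:

```python
import ctypes
import ctypes.util
import threading
import time

def ticks_during_blocking_call(lib):
    """Count iterations a worker thread completes while the main thread
    is inside a 0.3 s C call made through the given ctypes binding."""
    ticks = []
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            ticks.append(time.monotonic())
            time.sleep(0.01)

    t = threading.Thread(target=worker)
    t.start()
    time.sleep(0.1)           # let the worker get going
    start = time.monotonic()
    lib.usleep(300_000)       # blocking call into libc (0.3 s)
    stop.set()
    t.join()
    return sum(1 for ts in ticks if ts > start)

libc = ctypes.util.find_library("c") or "libc.so.6"
# PyDLL does NOT release the GIL around foreign calls: the worker starves.
held = ticks_during_blocking_call(ctypes.PyDLL(libc))
# CDLL releases the GIL around foreign calls: the worker keeps running.
released = ticks_during_blocking_call(ctypes.CDLL(libc))
print(held, released)
```

The `released` count comes out far higher than `held`. The CDLL-style behavior is what the binding should do: drop the GIL for the duration of the blocking libcephfs call.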

An example deadlock:

Thread 12 (Thread 0x7feefb185700 (LWP 822162)):
#0  0x00007fef0339081d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fef03389ac9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005610fa9bd6cf in ClusterState::set_service_map(ServiceMap const&) ()
#3  0x00005610faa271ea in Mgr::handle_service_map(boost::intrusive_ptr<MServiceMap>) ()
#4  0x00005610faa29ec4 in Mgr::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#5  0x00005610faa33e95 in MgrStandby::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#6  0x00007fef047a92aa in DispatchQueue::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#7  0x00007fef0485af91 in DispatchQueue::DispatchThread::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#8  0x00007fef033871cf in start_thread () from target:/lib64/libpthread.so.0
#9  0x00007fef01ddadd3 in clone () from target:/lib64/libc.so.6

has DaemonServer::lock; wants ClusterState::lock

Thread 22 (Thread 0x7fedd2e39700 (LWP 834730)):
#0  0x00007fef0338d838 in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x00007fef0d4c7edc in take_gil () from target:/lib64/libpython3.6m.so.1.0
#2  0x00007fef0d4c80fd in PyEval_RestoreThread () from target:/lib64/libpython3.6m.so.1.0
#3  0x00005610faa1e993 in with_gil_t::with_gil_t(without_gil_t&) ()
#4  0x00005610fa976226 in ActivePyModules::get_python(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()

probably has ClusterState::lock; wants GIL

Thread 3 (Thread 0x7feeff18d700 (LWP 822150)):
#0  0x00007fef0339081d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fef03389ac9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005610fa986d17 in std::mutex::lock() ()
#3  0x00005610fa9dee2c in DaemonServer::ms_handle_authentication(Connection*) ()
#4  0x00007fef04906e55 in MonClient::handle_auth_request(Connection*, AuthConnectionMeta*, bool, unsigned int, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*) () from target:/usr/lib64/ceph/libceph-common.so.2
#5  0x00007fef0489165f in ProtocolV2::_handle_auth_request(ceph::buffer::v15_2_0::list&, bool) () from target:/usr/lib64/ceph/libceph-common.so.2
#6  0x00007fef0489261e in ProtocolV2::handle_auth_request_more(ceph::buffer::v15_2_0::list&) () from target:/usr/lib64/ceph/libceph-common.so.2
#7  0x00007fef0489b0c3 in ProtocolV2::handle_frame_payload() () from target:/usr/lib64/ceph/libceph-common.so.2
#8  0x00007fef0489b380 in ProtocolV2::handle_read_frame_dispatch() () from target:/usr/lib64/ceph/libceph-common.so.2
#9  0x00007fef0489b575 in ProtocolV2::_handle_read_frame_epilogue_main() () from target:/usr/lib64/ceph/libceph-common.so.2
#10 0x00007fef0489b622 in ProtocolV2::_handle_read_frame_segment() () from target:/usr/lib64/ceph/libceph-common.so.2
#11 0x00007fef0489c781 in ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int) () from target:/usr/lib64/ceph/libceph-common.so.2
#12 0x00007fef04884eec in ProtocolV2::run_continuation(Ct<ProtocolV2>&) () from target:/usr/lib64/ceph/libceph-common.so.2
#13 0x00007fef0484d3f9 in AsyncConnection::process() () from target:/usr/lib64/ceph/libceph-common.so.2
#14 0x00007fef048a7507 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () from target:/usr/lib64/ceph/libceph-common.so.2
#15 0x00007fef048ada1c in std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from target:/usr/lib64/ceph/libceph-common.so.2
#16 0x00007fef027c2ba3 in execute_native_thread_routine () from target:/lib64/libstdc++.so.6
#17 0x00007fef033871cf in start_thread () from target:/lib64/libpthread.so.0
#18 0x00007fef01ddadd3 in clone () from target:/lib64/libc.so.6

wants DaemonServer::lock; this blocks reads on the AsyncConnection event center! (No more messages are read off the wire.) (See followup #61874.)

The question is: who holds the GIL, creating this three-way deadlock?

Thread 77 (Thread 0x7fedc0614700 (LWP 834795)):
#0  0x00007fef0338d44c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x00007fef027bc8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from target:/lib64/libstdc++.so.6
#2  0x00007feef1e9b5ab in Client::make_request(MetaRequest*, UserPerm const&, boost::intrusive_ptr<Inode>*, bool*, int, ceph::buffer::v15_2_0::list*) () from target:/lib64/libcephfs.so.2
#3  0x00007feef1ebbea6 in Client::_rmdir(Inode*, char const*, UserPerm const&) () from target:/lib64/libcephfs.so.2
#4  0x00007feef1ebc705 in Client::unlinkat(int, char const*, int, UserPerm const&) () from target:/lib64/libcephfs.so.2
#5  0x00007feef2274a87 in __pyx_pw_6cephfs_9LibCephFS_95rmdir () from target:/lib64/python3.6/site-packages/cephfs.cpython-36m-x86_64-linux-gnu.so

We do :(

Note that the thread blocked in ms_handle_authentication is key to the deadlock because it prevents any messages from being processed that might nudge things in the right direction! In particular, the RPC response from the MDS, which would allow the GIL to be released, cannot be processed. I will file a followup ticket for that issue.
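For illustration, the cycle in the backtraces above can be modeled with plain threading primitives (a toy sketch, not Ceph code: the locks and events are hypothetical stand-ins for the GIL, ClusterState::lock, DaemonServer::lock, and message dispatch, with staggered timeouts added so the model terminates instead of hanging):

```python
import threading

gil = threading.Lock()            # stand-in for the Python GIL
cluster_lock = threading.Lock()   # stand-in for ClusterState::lock
daemon_lock = threading.Lock()    # stand-in for DaemonServer::lock
mds_reply = threading.Event()     # the RPC response rmdir is waiting for

gil_held = threading.Event()
cluster_held = threading.Event()
daemon_held = threading.Event()
blocked = []

def rmdir():
    # Thread 77: holds the GIL, waits for the MDS reply.
    with gil:
        gil_held.set()
        if not mds_reply.wait(timeout=2.0):
            blocked.append("rmdir: holds GIL, MDS reply never arrives")

def get_python():
    # Thread 22: holds ClusterState::lock, wants the GIL.
    with cluster_lock:
        cluster_held.set()
        gil_held.wait()
        if not gil.acquire(timeout=1.0):
            blocked.append("get_python: holds ClusterState::lock, wants GIL")

def handle_service_map():
    # Thread 12: holds DaemonServer::lock, wants ClusterState::lock.
    with daemon_lock:
        daemon_held.set()
        cluster_held.wait()
        if not cluster_lock.acquire(timeout=0.5):
            blocked.append("handle_service_map: holds DaemonServer::lock, "
                           "wants ClusterState::lock")

def ms_handle_authentication():
    # Thread 3: wants DaemonServer::lock; until it gets it, no messages
    # (including the MDS reply) are dispatched.
    daemon_held.wait()
    if not daemon_lock.acquire(timeout=0.25):
        blocked.append("ms_handle_authentication: wants DaemonServer::lock")
    else:
        mds_reply.set()           # would unblock rmdir, but never happens
        daemon_lock.release()

threads = [threading.Thread(target=f) for f in
           (rmdir, get_python, handle_service_map, ms_handle_authentication)]
for t in threads:
    t.start()
for t in threads:
    t.join()
for msg in blocked:
    print(msg)
```

All four threads report themselves blocked and `mds_reply` is never set: the MDS response that would let the GIL drop is stuck behind the very locks waiting on the GIL.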


Related issues: 4 (0 open, 4 closed)

Related to mgr - Bug #61874: mgr: DaemonServer::ms_handle_authentication acquires daemon locks (Resolved, Patrick Donnelly)

Copied to CephFS - Backport #61898: quincy: pybind/cephfs: holds GIL during rmdir (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #61899: reef: pybind/cephfs: holds GIL during rmdir (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #61900: pacific: pybind/cephfs: holds GIL during rmdir (Resolved, Patrick Donnelly)
#1 Updated by Patrick Donnelly 10 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52290

#2 Updated by Patrick Donnelly 10 months ago

  • Description updated (diff)

#3 Updated by Patrick Donnelly 10 months ago

  • Description updated (diff)

#4 Updated by Patrick Donnelly 10 months ago

  • Description updated (diff)

#5 Updated by Patrick Donnelly 10 months ago

  • Related to Bug #61874: mgr: DaemonServer::ms_handle_authentication acquires daemon locks added

#6 Updated by Patrick Donnelly 10 months ago

  • Description updated (diff)

#7 Updated by Patrick Donnelly 10 months ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Backport Bot 10 months ago

  • Copied to Backport #61898: quincy: pybind/cephfs: holds GIL during rmdir added

#9 Updated by Backport Bot 10 months ago

  • Copied to Backport #61899: reef: pybind/cephfs: holds GIL during rmdir added

#10 Updated by Backport Bot 10 months ago

  • Copied to Backport #61900: pacific: pybind/cephfs: holds GIL during rmdir added

#11 Updated by Backport Bot 10 months ago

  • Tags set to backport_processed

#12 Updated by Patrick Donnelly 9 months ago

  • Status changed from Pending Backport to Resolved
  • ceph-qa-suite ceph-ansible added