Bug #49605
pybind/mgr/volumes: deadlock on async job hangs finisher thread
Status: Closed
Description
2021-03-04T14:37:08.589+0000 7fd38cd44700 10 mgr.server ms_handle_authentication ms_handle_authentication new session 0x56077af7f8c0 con 0x56077b160800 entity client.admin addr
2021-03-04T14:37:08.589+0000 7fd38cd44700 10 mgr.server ms_handle_authentication session 0x56077af7f8c0 client.admin has caps allow * 'allow *'
2021-03-04T14:37:08.590+0000 7fd38cd44700 1 --2- [v2:172.21.15.136:6832/32141,v1:172.21.15.136:6833/32141] >> 172.21.15.136:0/205030034 conn(0x56077b160800 0x56077b1f1200 secure :-1 s=READY pgs=4 cs=0 l=1 rev1=1 rx=0x56077b0ee5d0 tx=0x56077b1b9200).ready entity=client.5841 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2021-03-04T14:37:08.818+0000 7fd36b1fb700 1 -- [v2:172.21.15.136:6832/32141,v1:172.21.15.136:6833/32141] <== client.5841 172.21.15.136:0/205030034 1 ==== mgr_command(tid 0: {"prefix": "fs volume ls", "target": ["mon-mgr", ""]}) v1 ==== 77+0+0 (secure 0 0 0) 0x56077adff760 con 0x56077b160800
2021-03-04T14:37:08.818+0000 7fd36b1fb700 10 mgr.server _handle_command decoded-size=2 prefix=fs volume ls
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20 is_capable service=py module=volumes command=fs volume ls read addr - on cap allow *
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20 allow so far , doing grant allow *
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20 allow all
2021-03-04T14:37:08.819+0000 7fd36b1fb700 10 mgr.server _allowed_command client.admin capable
2021-03-04T14:37:08.819+0000 7fd36b1fb700 0 log_channel(audit) log [DBG] : from='client.5841 -' entity='client.admin' cmd=[{"prefix": "fs volume ls", "target": ["mon-mgr", ""]}]: dispatch
2021-03-04T14:37:08.819+0000 7fd36b1fb700 10 mgr.server _handle_command passing through 2
2021-03-04T14:37:08.821+0000 7fd385535700 10 mgr tick tick
From: /ceph/teuthology-archive/pdonnell-2021-03-04_03:51:01-fs-wip-pdonnell-testing-20210303.195715-distro-basic-smithi/5932239/remote/smithi136/log/ceph-mgr.y.log.gz
The `volume ls` command was never dispatched. It looks like the queued finisher context was dropped somehow.
Updated by Patrick Donnelly about 3 years ago
/ceph/teuthology-archive/pdonnell-2021-03-15_22:16:56-fs-wip-pdonnell-testing-20210315.182203-distro-basic-smithi/5968050/remote/smithi135/log/ceph-mgr.y.log.gz
Updated by Patrick Donnelly about 3 years ago
Debugging PR: https://github.com/ceph/ceph/pull/40264
Updated by Patrick Donnelly about 3 years ago
- Project changed from mgr to CephFS
- Subject changed from mgr: drops command on the floor to pybind/mgr/volumes: recursive lock hangs finisher thread
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Priority changed from Urgent to Immediate
- Backport changed from pacific,octopus to pacific,octopus,nautilus
- Component(FS) mgr/volumes added
2021-03-20T19:18:17.058+0000 7f9137506700 10 finisher(0x556471c24050) finisher_thread doing [0x556472c7b820,0,0x556476def120,0,0x5564721a7d00,0,0x556471c7c7a0,0,0x556472c7b6a0,0,0x5564721a6180,0,0x5564721a6220,0,0x556475230060,0,0x5564730edf00,0,0x5564730ede20,0,0x5564730edf20,0,0x5564730ede40,0]
...
2021-03-20T19:18:17.066+0000 7f9137506700 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4
and then the finisher thread deadlocks.
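To make the failure mode concrete, here is a minimal, self-contained sketch (not ceph-mgr code; the queue, callbacks, and names are invented for illustration) of why a single context that blocks forever in a one-thread finisher keeps every later context, including the one that would dispatch `fs volume ls`, from ever running:

import queue
import threading
import time

work = queue.Queue()

def finisher_thread():
    # Drain queued contexts in order; one blocking context stalls everything behind it.
    while True:
        ctx = work.get()
        ctx()

held_elsewhere = threading.Lock()
held_elsewhere.acquire()              # some other thread holds this and never returns

def reconfigure_callback():
    with held_elsewhere:              # blocks forever
        pass

def dispatch_volume_ls():
    print("fs volume ls dispatched")  # never reached

threading.Thread(target=finisher_thread, daemon=True).start()
work.put(reconfigure_callback)
work.put(dispatch_volume_ls)
time.sleep(2)                         # nothing is printed: the second context never runs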
Updated by Patrick Donnelly about 3 years ago
- Subject changed from pybind/mgr/volumes: recursive lock hangs finisher thread to pybind/mgr/volumes: cyclic lock hangs finisher thread
Updated by Patrick Donnelly about 3 years ago
- Subject changed from pybind/mgr/volumes: cyclic lock hangs finisher thread to pybind/mgr/volumes: deadlock on async job hangs finisher thread
So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock held. Getting the next job then involves opening the volume which, for this test, was promptly deleted. That hangs the client and therefore keeps the lock held. There are two problems here:
1) The reconfigure_max_concurrent_clones method tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable. A tick thread should then apply the desired state.
2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?
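To illustrate the shape of the problem, a minimal sketch follows; the class and method names come from the analysis above, but the bodies (and the _open_volume / nr_concurrent_jobs placeholders) are invented for illustration and are not the actual mgr/volumes code:

import threading

class AsyncJobs:
    def __init__(self):
        self.lock = threading.Lock()
        self.nr_concurrent_jobs = 4            # hypothetical field, for illustration only

    def get_job(self):
        with self.lock:                        # lock held across a slow operation
            return self._open_volume()         # hangs if the volume was deleted

    def _open_volume(self):
        # Stand-in for a client open that never returns once the volume is gone.
        threading.Event().wait()

    def reconfigure_max_concurrent_clones(self, count):
        # Invoked from the mgr finisher thread on a config change.  Because it
        # takes the same lock that get_job() is holding while blocked above,
        # the finisher thread blocks here and stops servicing anything else.
        with self.lock:
            self.nr_concurrent_jobs = count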
Updated by Venky Shankar about 3 years ago
Patrick Donnelly wrote:
So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock held. Getting the next job then involves opening the volume which, for this test, was promptly deleted. That hangs the client and therefore keeps the lock held. There are two problems here:
1) The reconfigure_max_concurrent_clones method tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable. A tick thread should then apply the desired state.
2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?
This is not a deadlock per se:
Thread-1: lock1 -> lock2
Thread-2: lock2 (blocks; Thread-1 never releases lock2 since the file system was deleted, causing the operation to never return).
I would prefer doing #1. Even if we do #2, the thread (worst case all threads) would be blocked on the operation anyway.
Updated by Patrick Donnelly about 3 years ago
Venky Shankar wrote:
Patrick Donnelly wrote:
So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock held. Getting the next job then involves opening the volume which, for this test, was promptly deleted. That hangs the client and therefore keeps the lock held. There are two problems here:
1) The reconfigure_max_concurrent_clones method tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable. A tick thread should then apply the desired state.
2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?
This is not a deadlock per se:
Thread-1: lock1 -> lock2
Thread-2: lock2 (blocks; Thread-1 never releases lock2 since the file system was deleted, causing the operation to never return).
Right!
I would prefer doing #1. Even if we do #2, the thread (worst case all threads) would be blocked on the operation anyway.
I will proceed with merging #1 since it fixes test failures now.
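For reference, a rough sketch of what approach #1 amounts to (illustrative only; attribute names such as desired_nr_jobs are invented here and this is not the merged patch): the config callback only records the desired value, and a periodic tick applies it later, so the finisher thread can never block on the jobs lock.

import threading

class AsyncJobs:
    def __init__(self):
        self.lock = threading.Lock()
        self.nr_concurrent_jobs = 4
        self.desired_nr_jobs = 4               # plain attribute, updated without the lock

    def reconfigure_max_concurrent_clones(self, count):
        # Safe to call from the finisher thread: a simple attribute write, cannot block.
        self.desired_nr_jobs = count

    def tick(self):
        # Runs on a separate timer/worker thread and converges to the desired state.
        with self.lock:
            if self.nr_concurrent_jobs != self.desired_nr_jobs:
                self.nr_concurrent_jobs = self.desired_nr_jobs
                # ...spawn or retire worker threads here as needed...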
Updated by Kefu Chai about 3 years ago
- Status changed from In Progress to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #50126: octopus: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
Updated by Backport Bot about 3 years ago
- Copied to Backport #50127: pacific: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
Updated by Backport Bot about 3 years ago
- Copied to Backport #50128: nautilus: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
Updated by Patrick Donnelly about 1 year ago
- Status changed from Pending Backport to Resolved