Bug #49605

pybind/mgr/volumes: deadlock on async job hangs finisher thread

Added by Patrick Donnelly about 1 month ago. Updated 14 days ago.

Status:
Pending Backport
Priority:
Immediate
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-04T14:37:08.589+0000 7fd38cd44700 10 mgr.server ms_handle_authentication ms_handle_authentication new session 0x56077af7f8c0 con 0x56077b160800 entity client.admin addr
2021-03-04T14:37:08.589+0000 7fd38cd44700 10 mgr.server ms_handle_authentication  session 0x56077af7f8c0 client.admin has caps allow * 'allow *'
2021-03-04T14:37:08.590+0000 7fd38cd44700  1 --2- [v2:172.21.15.136:6832/32141,v1:172.21.15.136:6833/32141] >> 172.21.15.136:0/205030034 conn(0x56077b160800 0x56077b1f1200 secure :-1 s=READY pgs=4 cs=0 l=1 rev1=1 rx=0x56077b0ee5d0 tx=0x56077b1b9200).ready entity=client.5841 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2021-03-04T14:37:08.818+0000 7fd36b1fb700  1 -- [v2:172.21.15.136:6832/32141,v1:172.21.15.136:6833/32141] <== client.5841 172.21.15.136:0/205030034 1 ==== mgr_command(tid 0: {"prefix": "fs volume ls", "target": ["mon-mgr", ""]}) v1 ==== 77+0+0 (secure 0 0 0) 0x56077adff760 con 0x56077b160800
2021-03-04T14:37:08.818+0000 7fd36b1fb700 10 mgr.server _handle_command decoded-size=2 prefix=fs volume ls
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20 is_capable service=py module=volumes command=fs volume ls read addr - on cap allow *
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20  allow so far , doing grant allow *
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20  allow all
2021-03-04T14:37:08.819+0000 7fd36b1fb700 10 mgr.server _allowed_command  client.admin capable
2021-03-04T14:37:08.819+0000 7fd36b1fb700  0 log_channel(audit) log [DBG] : from='client.5841 -' entity='client.admin' cmd=[{"prefix": "fs volume ls", "target": ["mon-mgr", ""]}]: dispatch
2021-03-04T14:37:08.819+0000 7fd36b1fb700 10 mgr.server _handle_command passing through 2
2021-03-04T14:37:08.821+0000 7fd385535700 10 mgr tick tick

From: /ceph/teuthology-archive/pdonnell-2021-03-04_03:51:01-fs-wip-pdonnell-testing-20210303.195715-distro-basic-smithi/5932239/remote/smithi136/log/ceph-mgr.y.log.gz

The `volume ls` command was never dispatched. It looks like the queued finisher context was dropped somehow.


Related issues

Copied to CephFS - Backport #50126: octopus: pybind/mgr/volumes: deadlock on async job hangs finisher thread New
Copied to CephFS - Backport #50127: pacific: pybind/mgr/volumes: deadlock on async job hangs finisher thread In Progress
Copied to CephFS - Backport #50128: nautilus: pybind/mgr/volumes: deadlock on async job hangs finisher thread Need More Info

History

#1 Updated by Patrick Donnelly 29 days ago

/ceph/teuthology-archive/pdonnell-2021-03-15_22:16:56-fs-wip-pdonnell-testing-20210315.182203-distro-basic-smithi/5968050/remote/smithi135/log/ceph-mgr.y.log.gz

#3 Updated by Patrick Donnelly 27 days ago

  • Project changed from mgr to CephFS
  • Subject changed from mgr: drops command on the floor to pybind/mgr/volumes: recursive lock hangs finisher thread
  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Priority changed from Urgent to Immediate
  • Backport changed from pacific,octopus to pacific,octopus,nautilus
  • Component(FS) mgr/volumes added
2021-03-20T19:18:17.058+0000 7f9137506700 10 finisher(0x556471c24050) finisher_thread doing [0x556472c7b820,0,0x556476def120,0,0x5564721a7d00,0,0x556471c7c7a0,0,0x556472c7b6a0,0,0x5564721a6180,0,0x5564721a6220,0,0x556475230060,0,0x5564730edf00,0,0x5564730ede20,0,0x5564730edf20,0,0x5564730ede40,0]
...
2021-03-20T19:18:17.066+0000 7f9137506700  0 [volumes DEBUG root]  mgr option max_concurrent_clones = 4

and then the finisher thread deadlocks.

#4 Updated by Patrick Donnelly 27 days ago

  • Subject changed from pybind/mgr/volumes: recursive lock hangs finisher thread to pybind/mgr/volumes: cyclic lock hangs finisher thread

#5 Updated by Patrick Donnelly 27 days ago

  • Subject changed from pybind/mgr/volumes: cyclic lock hangs finisher thread to pybind/mgr/volumes: deadlock on async job hangs finisher thread

So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock held. Getting the next job then involves opening the volume which, for this test, got promptly deleted. That hangs the client and therefore keeps the lock held. There are two problems here:

1) The reconfigure_max_concurrent_clones tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable; a tick thread should then apply the desired state.

2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?
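The hazardous pattern can be sketched in Python. This is a minimal sketch, assuming a structure like the one described above; `open_volume`, `nr_concurrent_jobs` and the timeout-based demonstration are illustrative, not the actual mgr/volumes code:

```python
import threading
import time

class AsyncJobs:
    """Sketch of the blocking pattern; names are illustrative, not the real code."""
    def __init__(self):
        self.lock = threading.Lock()
        self.nr_concurrent_jobs = 4

    def open_volume(self):
        # Stand-in for the client call that never returns once the
        # volume has been deleted out from under us.
        time.sleep(3600)

    def get_job(self):
        with self.lock:          # lock held across a blocking client call
            self.open_volume()

    def reconfigure_max_concurrent_clones(self, count):
        # Runs on the ceph-mgr finisher thread. The real code would block
        # forever here; a timeout is used only to demonstrate the hang.
        if self.lock.acquire(timeout=0.5):
            try:
                self.nr_concurrent_jobs = count
            finally:
                self.lock.release()
            return True
        return False

jobs = AsyncJobs()
threading.Thread(target=jobs.get_job, daemon=True).start()
time.sleep(0.1)                  # let get_job() grab the lock
print(jobs.reconfigure_max_concurrent_clones(8))  # False: the finisher would hang
```

Because `get_job()` blocks inside the critical section, the finisher-thread callback can never take the lock, matching the behaviour seen in the logs above.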

#6 Updated by Patrick Donnelly 26 days ago

  • Pull request ID set to 40316

#7 Updated by Venky Shankar 25 days ago

Patrick Donnelly wrote:

So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock held. Getting the next job then involves opening the volume which, for this test, got promptly deleted. That hangs the client and therefore keeps the lock held. There are two problems here:

1) The reconfigure_max_concurrent_clones tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable; a tick thread should then apply the desired state.

2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?

This is not a deadlock per se:

Thread-1: lock1 -> lock2
Thread-2: lock2 (blocks; Thread-1 never releases lock2, as the file system is deleted, causing the operation to never return).

I would prefer doing #1. Even if we do #2, the thread (worst case all threads) would be blocked on the operation anyway.
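The preferred fix (#1) can be sketched like this; the attribute names and the `tick()` reconciliation step are illustrative assumptions, not the actual code from the pull request:

```python
import threading

class AsyncJobs:
    """Sketch of fix #1: the config callback only records the desired
    value; a periodic tick thread applies it under the lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.nr_concurrent_jobs = 4
        self.desired_nr_jobs = 4

    def reconfigure_max_concurrent_clones(self, count):
        # Called on the finisher thread: a plain attribute write, never
        # touching self.lock, so a stuck worker cannot hang the finisher.
        self.desired_nr_jobs = count

    def tick(self):
        # Periodic thread reconciles actual state with the desired state.
        with self.lock:
            if self.nr_concurrent_jobs != self.desired_nr_jobs:
                self.nr_concurrent_jobs = self.desired_nr_jobs
                # ...spawn or retire worker threads here...

jobs = AsyncJobs()
jobs.reconfigure_max_concurrent_clones(8)  # returns immediately
jobs.tick()
print(jobs.nr_concurrent_jobs)             # 8
```

The design point is that the finisher thread only ever performs a lock-free write, and the contended lock is taken solely by threads that can tolerate blocking.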

#8 Updated by Patrick Donnelly 24 days ago

Venky Shankar wrote:

Patrick Donnelly wrote:

So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock held. Getting the next job then involves opening the volume which, for this test, got promptly deleted. That hangs the client and therefore keeps the lock held. There are two problems here:

1) The reconfigure_max_concurrent_clones tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable; a tick thread should then apply the desired state.

2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?

This is not a deadlock per se:

Thread-1: lock1 -> lock2
Thread-2: lock2 (blocks; Thread-1 never releases lock2, as the file system is deleted, causing the operation to never return).

Right!

I would prefer doing #1. Even if we do #2, the thread (worst case all threads) would be blocked on the operation anyway.

I will proceed with merging #1 since it fixes test failures now.

#9 Updated by Kefu Chai 14 days ago

  • Status changed from In Progress to Pending Backport

#10 Updated by Backport Bot 14 days ago

  • Copied to Backport #50126: octopus: pybind/mgr/volumes: deadlock on async job hangs finisher thread added

#11 Updated by Backport Bot 14 days ago

  • Copied to Backport #50127: pacific: pybind/mgr/volumes: deadlock on async job hangs finisher thread added

#12 Updated by Backport Bot 14 days ago

  • Copied to Backport #50128: nautilus: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
