Bug #49605

pybind/mgr/volumes: deadlock on async job hangs finisher thread

Added by Patrick Donnelly about 3 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Immediate
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags: backport_processed
Backport: pacific,octopus,nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-04T14:37:08.589+0000 7fd38cd44700 10 mgr.server ms_handle_authentication ms_handle_authentication new session 0x56077af7f8c0 con 0x56077b160800 entity client.admin addr
2021-03-04T14:37:08.589+0000 7fd38cd44700 10 mgr.server ms_handle_authentication  session 0x56077af7f8c0 client.admin has caps allow * 'allow *'
2021-03-04T14:37:08.590+0000 7fd38cd44700  1 --2- [v2:172.21.15.136:6832/32141,v1:172.21.15.136:6833/32141] >> 172.21.15.136:0/205030034 conn(0x56077b160800 0x56077b1f1200 secure :-1 s=READY pgs=4 cs=0 l=1 rev1=1 rx=0x56077b0ee5d0 tx=0x56077b1b9200).ready entity=client.5841 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2021-03-04T14:37:08.818+0000 7fd36b1fb700  1 -- [v2:172.21.15.136:6832/32141,v1:172.21.15.136:6833/32141] <== client.5841 172.21.15.136:0/205030034 1 ==== mgr_command(tid 0: {"prefix": "fs volume ls", "target": ["mon-mgr", ""]}) v1 ==== 77+0+0 (secure 0 0 0) 0x56077adff760 con 0x56077b160800
2021-03-04T14:37:08.818+0000 7fd36b1fb700 10 mgr.server _handle_command decoded-size=2 prefix=fs volume ls
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20 is_capable service=py module=volumes command=fs volume ls read addr - on cap allow *
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20  allow so far , doing grant allow *
2021-03-04T14:37:08.819+0000 7fd36b1fb700 20  allow all
2021-03-04T14:37:08.819+0000 7fd36b1fb700 10 mgr.server _allowed_command  client.admin capable
2021-03-04T14:37:08.819+0000 7fd36b1fb700  0 log_channel(audit) log [DBG] : from='client.5841 -' entity='client.admin' cmd=[{"prefix": "fs volume ls", "target": ["mon-mgr", ""]}]: dispatch
2021-03-04T14:37:08.819+0000 7fd36b1fb700 10 mgr.server _handle_command passing through 2
2021-03-04T14:37:08.821+0000 7fd385535700 10 mgr tick tick

From: /ceph/teuthology-archive/pdonnell-2021-03-04_03:51:01-fs-wip-pdonnell-testing-20210303.195715-distro-basic-smithi/5932239/remote/smithi136/log/ceph-mgr.y.log.gz

The `volume ls` command was never dispatched. It looks like the queued finisher context was dropped somehow.


Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #50126: octopus: pybind/mgr/volumes: deadlock on async job hangs finisher thread (Rejected, Cory Snyder)
Copied to CephFS - Backport #50127: pacific: pybind/mgr/volumes: deadlock on async job hangs finisher thread (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #50128: nautilus: pybind/mgr/volumes: deadlock on async job hangs finisher thread (Resolved, Patrick Donnelly)
Actions #1

Updated by Patrick Donnelly about 3 years ago

/ceph/teuthology-archive/pdonnell-2021-03-15_22:16:56-fs-wip-pdonnell-testing-20210315.182203-distro-basic-smithi/5968050/remote/smithi135/log/ceph-mgr.y.log.gz

Actions #3

Updated by Patrick Donnelly about 3 years ago

  • Project changed from mgr to CephFS
  • Subject changed from "mgr: drops command on the floor" to "pybind/mgr/volumes: recursive lock hangs finisher thread"
  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Priority changed from Urgent to Immediate
  • Backport changed from pacific,octopus to pacific,octopus,nautilus
  • Component(FS) mgr/volumes added
2021-03-20T19:18:17.058+0000 7f9137506700 10 finisher(0x556471c24050) finisher_thread doing [0x556472c7b820,0,0x556476def120,0,0x5564721a7d00,0,0x556471c7c7a0,0,0x556472c7b6a0,0,0x5564721a6180,0,0x5564721a6220,0,0x556475230060,0,0x5564730edf00,0,0x5564730ede20,0,0x5564730edf20,0,0x5564730ede40,0]
...
2021-03-20T19:18:17.066+0000 7f9137506700  0 [volumes DEBUG root]  mgr option max_concurrent_clones = 4

and then the finisher thread deadlocks.
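
For context, the finisher here is ceph-mgr's thread that drains a FIFO of queued callbacks, so a single blocking callback starves everything queued behind it; that is why the `fs volume ls` dispatch in the description never ran. Below is a minimal Python sketch of that failure mode, a generic queue-and-thread illustration rather than the actual C++ Finisher:

```python
import queue
import threading
import time

# Simplified stand-in for ceph-mgr's Finisher: one thread draining a FIFO
# of queued callbacks. This is only an illustration of the failure mode,
# not the real implementation.
contexts = queue.Queue()

def finisher_thread():
    while True:
        ctx = contexts.get()   # take the next queued context
        ctx()                  # if this blocks, nothing behind it ever runs

threading.Thread(target=finisher_thread, daemon=True).start()

lock = threading.Lock()
lock.acquire()                 # held by a stuck worker (think: get_job)

# A config-change callback that needs the same lock (think:
# reconfigure_max_concurrent_clones) wedges the finisher thread...
contexts.put(lambda: lock.acquire())

# ...so a later context (think: dispatching "fs volume ls") never executes.
contexts.put(lambda: print("fs volume ls dispatched"))

time.sleep(1)
print("contexts still pending:", contexts.qsize())   # prints 1
```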

Actions #4

Updated by Patrick Donnelly about 3 years ago

  • Subject changed from "pybind/mgr/volumes: recursive lock hangs finisher thread" to "pybind/mgr/volumes: cyclic lock hangs finisher thread"
Actions #5

Updated by Patrick Donnelly about 3 years ago

  • Subject changed from "pybind/mgr/volumes: cyclic lock hangs finisher thread" to "pybind/mgr/volumes: deadlock on async job hangs finisher thread"

So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock locked. Then getting the next job involves opening the volume which, for this test, got promptly deleted. That hangs the client and therefore keeps the lock locked. There are two problems here:

1) The reconfigure_max_concurrent_clones tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable. A tick thread should update to the desired state.

2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?
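
To make that interaction concrete, here is a minimal sketch of the pattern; the class and method names mirror the ones discussed above, but the bodies are hypothetical placeholders and not the actual pybind/mgr/volumes code:

```python
import threading

class AsyncJobs:
    """Heavily trimmed, hypothetical sketch of the pattern described above."""

    def __init__(self):
        self.lock = threading.Lock()
        self.nr_concurrent_jobs = 4

    def get_job(self):
        # Called by an async worker thread.
        with self.lock:
            # Opening the volume can block indefinitely once the file
            # system has been deleted, and self.lock stays held meanwhile.
            return self._open_volume_and_pick_next_job()

    def _open_volume_and_pick_next_job(self):
        # Placeholder for the blocking volume open: never returns.
        threading.Event().wait()

    def reconfigure_max_concurrent_clones(self, new_value):
        # Runs on the ceph-mgr finisher thread when the mgr option changes.
        # With a worker wedged inside get_job(), this blocks forever and
        # takes every later queued finisher context down with it.
        with self.lock:
            self.nr_concurrent_jobs = new_value
```

With a worker stuck in get_job(), the option callback never returns, which matches the finisher hanging right after the `mgr option max_concurrent_clones = 4` line in the log above.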

Actions #6

Updated by Patrick Donnelly about 3 years ago

  • Pull request ID set to 40316
Actions #7

Updated by Venky Shankar about 3 years ago

Patrick Donnelly wrote:

So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock locked. Then getting the next job involves opening the volume which, for this test, got promptly deleted. That hangs the client and therefore keeps the lock locked. There are two problems here:

1) The reconfigure_max_concurrent_clones tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable. A tick thread should update to the desired state.

2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?

This is not a deadlock per se:

Thread-1: lock1 -> lock2
Thread-2: lock2 (blocks; Thread-1 never releases lock2 as the file system is deleted, causing the operation to never return).

I would prefer doing #1. Even if we do #2, the thread (worst case all threads) would be blocked on the operation anyway.
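
As a rough sketch of what option #1 could look like (illustrative names only; this is not necessarily the change that was merged for this ticket), the config callback just records the desired value and a periodic tick applies it under the lock, so the finisher thread can never block here:

```python
import threading

class AsyncJobs:
    """Hypothetical sketch of approach #1, not the actual patch."""

    def __init__(self):
        self.lock = threading.Lock()
        self.nr_concurrent_jobs = 4
        self.desired_concurrent_jobs = 4   # written lock-free by the callback

    def reconfigure_max_concurrent_clones(self, new_value):
        # Config callback (finisher thread): only record the new value.
        # No lock is taken, so the finisher cannot be wedged here.
        self.desired_concurrent_jobs = new_value

    def tick(self):
        # A periodic tick thread applies the desired state under the lock.
        # If the lock is held by a stuck worker, only the reconfiguration
        # is delayed; the finisher keeps draining its queue.
        with self.lock:
            self.nr_concurrent_jobs = self.desired_concurrent_jobs
```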

Actions #8

Updated by Patrick Donnelly about 3 years ago

Venky Shankar wrote:

Patrick Donnelly wrote:

So the issue is that AsyncJobs.get_job() is called with AsyncJobs.lock locked. Then getting the next job involves opening the volume which, for this test, got promptly deleted. That hangs the client and therefore keeps the lock locked. There are two problems here:

1) The reconfigure_max_concurrent_clones tries to acquire a mutex which, if there is a deadlock situation, can cause the ceph-mgr finisher thread to stop working completely. This method should just update an internal variable. A tick thread should update to the desired state.

2) get_job could hang when opening the client, which causes this deadlock. This should probably be reworked to operate outside of the lock?

This is not a deadlock per se:

Thread-1: lock1 -> lock2
Thread-2: lock2 (blocks; Thread-1 never releases lock2 as the file system is deleted, causing the operation to never return).

Right!

I would prefer doing #1. Even if we do #2, the thread (worst case all threads) would be blocked on the operation anyway.

I will proceed with merging #1 since it fixes test failures now.

Actions #9

Updated by Kefu Chai about 3 years ago

  • Status changed from In Progress to Pending Backport
Actions #10

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50126: octopus: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
Actions #11

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50127: pacific: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
Actions #12

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50128: nautilus: pybind/mgr/volumes: deadlock on async job hangs finisher thread added
Actions #13

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #14

Updated by Patrick Donnelly about 1 year ago

  • Status changed from Pending Backport to Resolved