Bug #36266
[closed] mgr: deadlock in ClusterState
Description
This is a cluster with 3 mons/mgrs. A few minutes ago, the active mgr stopped responding. The cluster successfully failed over to a standby mgr. I'm using the prometheus and dashboard mgr modules, and Prometheus periodically scrapes the exporter metrics.
$ ceph-mgr --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
I'm attaching a gdb backtrace. Looks like a deadlock:
Thread 46 is holding the Objecter rwlock and trying to lock ClusterState.
Thread 17 is holding the ClusterState lock and is trying to lock the Objecter rwlock.
I think there's a missing lock in ClusterState::with_osdmap:
https://github.com/ceph/ceph/blob/v12.2.8/src/mgr/ClusterState.h#L128
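The cycle between Thread 46 and Thread 17 can be reduced to a small sketch. This is not Ceph code: `objecter_lock`, `cluster_state_lock`, `hold_then_want` and `lock_cycle_detected` are hypothetical names, and the Objecter rwlock is modelled as a plain `std::mutex` for brevity. Each thread takes its first lock, waits until the other thread holds its own, and then uses `try_lock` to report that the second lock is unavailable, where a blocking `lock()` would hang forever.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks seen in the backtrace.
static std::mutex objecter_lock;
static std::mutex cluster_state_lock;

// Locks `held`, waits until the peer thread holds `wanted`, then reports
// whether `wanted` was unavailable. A blocking lock() here would never return.
static bool hold_then_want(std::mutex& held, std::mutex& wanted,
                           std::promise<void>& i_hold,
                           std::future<void>& peer_holds,
                           std::promise<void>& i_checked,
                           std::future<void>& peer_checked) {
  std::lock_guard<std::mutex> guard(held);
  i_hold.set_value();
  peer_holds.wait();
  bool blocked = !wanted.try_lock();
  if (!blocked) wanted.unlock();
  i_checked.set_value();
  peer_checked.wait();  // keep `held` locked until the peer has looked
  return blocked;
}

// Returns true when both threads found the lock they wanted already taken.
bool lock_cycle_detected() {
  std::promise<void> t46_holds, t17_holds, t46_checked, t17_checked;
  std::future<void> t46_holds_f = t46_holds.get_future();
  std::future<void> t17_holds_f = t17_holds.get_future();
  std::future<void> t46_checked_f = t46_checked.get_future();
  std::future<void> t17_checked_f = t17_checked.get_future();
  bool t46_blocked = false;
  bool t17_blocked = false;

  // "Thread 46": holds the Objecter lock, wants ClusterState.
  std::thread t46([&] {
    t46_blocked = hold_then_want(objecter_lock, cluster_state_lock,
                                 t46_holds, t17_holds_f,
                                 t46_checked, t17_checked_f);
  });
  // "Thread 17": holds ClusterState, wants the Objecter lock.
  std::thread t17([&] {
    t17_blocked = hold_then_want(cluster_state_lock, objecter_lock,
                                 t17_holds, t46_holds_f,
                                 t17_checked, t46_checked_f);
  });
  t46.join();
  t17.join();
  return t46_blocked && t17_blocked;
}
```

Calling `lock_cycle_detected()` should return `true`: each thread is stuck behind the lock the other one owns, which matches the pattern in the attached backtrace.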
Updated by John Spray over 5 years ago
- Is duplicate of Bug #23460: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added
Updated by John Spray over 5 years ago
- Status changed from New to Duplicate
Thanks for the detailed report, this looks like the same issue as https://tracker.ceph.com/issues/23460
The underlying issue is that we have one place in the code that takes the PGMap lock before the OSDMap lock, and another place that takes them in the opposite order.
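As a rough illustration of the ordering problem, the sketch below has two code paths that both take the locks in one fixed order. It is not the change made in Ceph: `MapState`, `pgmap_lock`, `osdmap_lock`, `path_one`, `path_two` and `run_both_paths` are hypothetical names, and putting the PGMap lock first is an arbitrary assumption. What matters is only that every path agrees on the same order.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for state guarded by two separate locks.
struct MapState {
  std::mutex pgmap_lock;
  std::mutex osdmap_lock;
  long touched = 0;  // only modified while both locks are held
};

// First code path: PGMap lock, then OSDMap lock.
void path_one(MapState& s, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    std::lock_guard<std::mutex> pg(s.pgmap_lock);
    std::lock_guard<std::mutex> osd(s.osdmap_lock);
    ++s.touched;
  }
}

// Second code path: deliberately the same order as path_one.
// Swapping these two lines would reintroduce the inversion.
void path_two(MapState& s, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    std::lock_guard<std::mutex> pg(s.pgmap_lock);
    std::lock_guard<std::mutex> osd(s.osdmap_lock);
    ++s.touched;
  }
}

// Runs both paths concurrently and returns the number of guarded updates.
long run_both_paths(int iterations) {
  MapState s;
  std::thread a(path_one, std::ref(s), iterations);
  std::thread b(path_two, std::ref(s), iterations);
  a.join();
  b.join();
  return s.touched;
}
```

With a single agreed order, neither thread can hold the second lock while waiting for the first, so the two paths always run to completion.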
Updated by Hector Martin over 5 years ago
Note that in this case it was the active mgr that deadlocked, not a standby.
My suggestion to add a lock in ClusterState would break the deadlock by forcing the Thread 46 code path to lock the ClusterState mutex first and the Objecter mutex second. This assumes the mutexes are re-entrant within a single thread, since that code path locks ClusterState again later.
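A minimal sketch of that suggestion follows, under the re-entrancy assumption stated above. `ClusterStateSketch`, its members and `epoch_via_callback` are hypothetical and do not mirror the real `ClusterState::with_osdmap` signature. `std::recursive_mutex` is used only because it is known to be re-entrant; whether Ceph's own mutex type allows this is the open question.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical, simplified stand-in for the mgr's ClusterState.
struct ClusterStateSketch {
  std::recursive_mutex lock;  // assumed re-entrant ClusterState lock
  std::mutex objecter_lock;   // stand-in for the Objecter rwlock
  int osdmap_epoch = 0;

  // Suggested shape: take the ClusterState lock first, then the Objecter
  // lock, so this path uses the same order as the other thread.
  template <typename Callback>
  int with_osdmap(Callback cb) {
    std::lock_guard<std::recursive_mutex> state_guard(lock);
    std::lock_guard<std::mutex> objecter_guard(objecter_lock);
    return cb();
  }

  // Locks ClusterState again; this only works from inside with_osdmap
  // because the mutex is recursive.
  int current_epoch() {
    std::lock_guard<std::recursive_mutex> state_guard(lock);
    return osdmap_epoch;
  }
};

// Exercises the nested acquisition: the callback re-locks ClusterState.
int epoch_via_callback(int epoch) {
  ClusterStateSketch state;
  state.osdmap_epoch = epoch;
  return state.with_osdmap([&state] { return state.current_epoch(); });
}
```

If `lock` were a plain `std::mutex`, the second acquisition inside the callback would be undefined behaviour, so the approach stands or falls on the mutex being re-entrant.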