Bug #36266: mgr: deadlock in ClusterState - mgr - Ceph

Bug #36266

closed

mgr: deadlock in ClusterState

Added by Hector Martin over 5 years ago. Updated over 5 years ago.

Status:

Duplicate

Priority:

Normal

Category:

ceph-mgr

Severity:

3 - minor

Affected Versions:

Ceph - v12.2.8

Description

This is a cluster with 3 mons/mgrs. A few minutes ago, the active mgr stopped responding. The cluster successfully failed over to a standby mgr. I'm using the prometheus and dashboard mgr modules. The exporter metrics get scraped periodically by Prometheus.

ceph-mgr --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

I'm attaching a gdb backtrace. Looks like a deadlock:

Thread 46 is holding the Objecter rwlock and trying to lock ClusterState.
Thread 17 is holding the ClusterState lock and is trying to lock the Objecter rwlock.

I think there's a missing lock in ClusterState::with_osdmap:
https://github.com/ceph/ceph/blob/v12.2.8/src/mgr/ClusterState.h#L128

Files

ceph-mgr-backtrace.txt (86.6 KB)

Hector Martin, 09/30/2018 11:13 AM

Related issues 1 (0 open — 1 closed)

Updated by John Spray over 5 years ago

Is duplicate of Bug #23460: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added

Updated by John Spray over 5 years ago

Status changed from New to Duplicate

Thanks for the detailed report. This looks like the same issue as https://tracker.ceph.com/issues/23460

The underlying issue is that we have one place in the code that takes the PGMap lock before the OSDMap lock, and another place that takes them in the opposite order.
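To illustrate the lock-ordering problem described above, here is a minimal sketch in standalone C++17. It is not Ceph code: the `State` struct, the lock names, and the function names are hypothetical stand-ins. Two code paths name the two locks in opposite orders, which would risk a deadlock with plain sequential `lock()` calls. The sketch avoids that by acquiring both locks together with `std::scoped_lock`, which uses a deadlock-avoidance algorithm. This is one possible technique, not necessarily how the issue was fixed in Ceph; enforcing a single fixed acquisition order in both paths would be another.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks discussed above; not Ceph's types.
struct State {
    std::mutex pgmap_lock;
    std::mutex osdmap_lock;
    long pg_updates = 0;
    long osd_reads = 0;
};

// Path A: conceptually "PGMap lock, then OSDMap lock".
void update_pgmap(State& s) {
    // scoped_lock acquires both mutexes with deadlock avoidance,
    // so the order in which they are named here does not matter.
    std::scoped_lock guard(s.pgmap_lock, s.osdmap_lock);
    ++s.pg_updates;
}

// Path B: conceptually "OSDMap lock, then PGMap lock" (the opposite order).
void read_osdmap(State& s) {
    std::scoped_lock guard(s.osdmap_lock, s.pgmap_lock);
    ++s.osd_reads;
}

// Runs both paths concurrently and returns the total number of
// completed critical sections.
long run_both(int iterations) {
    assert(iterations >= 0);
    State s;
    std::thread a([&] { for (int i = 0; i < iterations; ++i) update_pgmap(s); });
    std::thread b([&] { for (int i = 0; i < iterations; ++i) read_osdmap(s); });
    a.join();
    b.join();
    return s.pg_updates + s.osd_reads;
}
```

If each path instead called `lock()` on its first mutex and then on its second, thread A could hold `pgmap_lock` while waiting for `osdmap_lock`, and thread B the reverse, which is the pattern seen in the backtrace.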

Updated by Hector Martin over 5 years ago

Note that in this case it was the active mgr that deadlocked, not a standby.

My suggestion to add a lock in ClusterState would break the deadlock by forcing the Thread 46 codepath to lock the ClusterState mutex first, then the Objecter rwlock. This assumes the mutexes are re-entrant within a single thread, since that codepath will try to lock ClusterState again later.
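A rough sketch of that suggestion, in standalone C++17 rather than Ceph's own lock classes. Everything here is hypothetical: `ClusterStateSketch`, its members, and `pending_pg_count` are illustration names, and the real `with_osdmap` has a different signature. The sketch assumes the ClusterState lock is recursive; whether the actual Ceph mutex allows re-entry is exactly the open assumption stated above.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Hypothetical model of the proposed ordering; not Ceph's ClusterState.
struct ClusterStateSketch {
    // Assumed re-entrant. With a plain std::mutex, locking it a second
    // time on the same thread would be undefined behavior.
    std::recursive_mutex cluster_state_lock;
    std::shared_mutex objecter_rwlock;  // stand-in for the Objecter rwlock
    int pending = 3;                    // arbitrary illustration value

    // An inner call that takes the ClusterState lock by itself.
    int pending_pg_count() {
        std::lock_guard<std::recursive_mutex> l(cluster_state_lock);
        assert(pending >= 0);
        return pending;
    }

    // Proposed order: ClusterState lock first, then the Objecter rwlock.
    // The callback may lock ClusterState again on the same thread.
    template <typename Fn>
    auto with_osdmap(Fn&& fn) {
        std::lock_guard<std::recursive_mutex> outer(cluster_state_lock);
        std::shared_lock<std::shared_mutex> reader(objecter_rwlock);
        return fn();
    }
};
```

With this ordering, every path that needs both locks takes the ClusterState lock before the Objecter rwlock, so the inversion between Thread 46 and Thread 17 could not occur, provided the second ClusterState acquisition is allowed to re-enter.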
