Bug #36266

closed

mgr: deadlock in ClusterState

Added by Hector Martin over 5 years ago. Updated over 5 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a cluster with 3 mons/mgrs. A few minutes ago, the active mgr stopped responding. The cluster successfully failed over to a standby mgr. I'm using the prometheus and dashboard mgr modules. The exporter metrics get scraped periodically by Prometheus.

    # ceph-mgr --version
    ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

I'm attaching a gdb backtrace. Looks like a deadlock:

Thread 46 is holding the Objecter rwlock and trying to lock ClusterState.
Thread 17 is holding the ClusterState lock and is trying to lock the Objecter rwlock.
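
To make the inversion concrete, here is a minimal sketch of a lock-order checker fed with the two acquisitions from the backtrace. It is not Ceph code: `LockOrderChecker`, `backtrace_has_inversion` and the lock name strings are hypothetical, and the checker only detects a direct two-lock inversion, not longer cycles.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: records "A was held while B was acquired" edges and
// flags an acquisition whose reverse edge has already been seen.
class LockOrderChecker {
 public:
  // Returns false if taking `next` while holding `held` inverts an ordering
  // recorded earlier, i.e. two threads could deadlock on this pair.
  bool acquire(const std::vector<std::string>& held, const std::string& next) {
    bool ok = true;
    for (const auto& h : held) {
      if (edges_[next].count(h)) {
        ok = false;  // next -> h was seen before, now h -> next: inversion
      }
      edges_[h].insert(next);
    }
    return ok;
  }

 private:
  std::map<std::string, std::set<std::string>> edges_;
};

// Replays the two acquisitions described above (lock names are labels only).
bool backtrace_has_inversion() {
  LockOrderChecker checker;
  // Thread 46: holds the Objecter rwlock, then wants the ClusterState lock.
  bool t46_ok = checker.acquire({"Objecter::rwlock"}, "ClusterState::lock");
  // Thread 17: holds the ClusterState lock, then wants the Objecter rwlock.
  bool t17_ok = checker.acquire({"ClusterState::lock"}, "Objecter::rwlock");
  return t46_ok && !t17_ok;
}
```

The first acquisition is accepted because no ordering is known yet; the second is flagged because it reverses the recorded order.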

I think there's a missing lock in ClusterState::with_osdmap:
https://github.com/ceph/ceph/blob/v12.2.8/src/mgr/ClusterState.h#L128


Files

ceph-mgr-backtrace.txt (86.6 KB) ceph-mgr-backtrace.txt Hector Martin, 09/30/2018 11:13 AM

Related issues 1 (0 open, 1 closed)

Is duplicate of mgr - Bug #23460: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early (Resolved)

Actions #1

Updated by John Spray over 5 years ago

  • Is duplicate of Bug #23460: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added
Actions #2

Updated by John Spray over 5 years ago

  • Status changed from New to Duplicate

Thanks for the detailed report. This looks like the same issue as https://tracker.ceph.com/issues/23460

The underlying issue is that we have one place in the code that takes the PGMap lock before the OSDMap lock, and another place that takes them in the opposite order.
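
As an illustration of how opposite acquisition orders can be made safe, the following sketch has two code paths that name the locks in opposite orders but take them through `std::lock`, which acquires both without deadlocking. This is a generic C++ technique, not necessarily the fix applied for #23460; `pgmap_lock`, `osdmap_lock`, `path_a`, `path_b` and `run_both_paths` are hypothetical stand-ins, not Ceph identifiers.

```cpp
#include <mutex>
#include <thread>

std::mutex pgmap_lock;   // hypothetical stand-in for the PGMap lock
std::mutex osdmap_lock;  // hypothetical stand-in for the OSDMap lock
int shared_updates = 0;  // protected by holding both locks

// Path A names PGMap first, OSDMap second.
void path_a(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    std::lock(pgmap_lock, osdmap_lock);  // deadlock-avoiding acquisition
    std::lock_guard<std::mutex> g1(pgmap_lock, std::adopt_lock);
    std::lock_guard<std::mutex> g2(osdmap_lock, std::adopt_lock);
    ++shared_updates;
  }
}

// Path B names the locks in the opposite order; std::lock still avoids
// the deadlock that two plain lock() calls in this order could cause.
void path_b(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    std::lock(osdmap_lock, pgmap_lock);
    std::lock_guard<std::mutex> g1(osdmap_lock, std::adopt_lock);
    std::lock_guard<std::mutex> g2(pgmap_lock, std::adopt_lock);
    ++shared_updates;
  }
}

// Runs both paths concurrently and returns the number of updates made.
int run_both_paths(int iterations) {
  shared_updates = 0;
  std::thread a(path_a, iterations);
  std::thread b(path_b, iterations);
  a.join();
  b.join();
  return shared_updates;
}
```

The simpler alternative is to pick one global order (for example, always PGMap before OSDMap) and change the offending call site to follow it.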

Actions #3

Updated by Hector Martin over 5 years ago

Note that in this case it was the active mgr that deadlocked, not a standby.

My suggestion to add a lock in ClusterState::with_osdmap would break the deadlock by forcing the Thread 46 codepath to lock the ClusterState mutex first and the Objecter rwlock second, which is the same order Thread 17 uses. This assumes the mutexes are re-entrant within a single thread, since that codepath will try to lock ClusterState again later.
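
A minimal sketch of that suggestion follows, assuming a re-entrant mutex. It is not the real `ClusterState` class: `ClusterStateSketch`, `objecter_rwlock`, `epoch` and `get_epoch` are hypothetical, and `std::recursive_mutex` stands in for whatever mutex type the mgr actually uses.

```cpp
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for ClusterState; names and members are illustrative.
struct ClusterStateSketch {
  std::recursive_mutex lock;          // assumed re-entrant ClusterState lock
  std::shared_mutex objecter_rwlock;  // stand-in for the Objecter rwlock
  int epoch = 0;

  // An accessor that takes the ClusterState lock on its own.
  int get_epoch() {
    std::lock_guard<std::recursive_mutex> l(lock);
    return epoch;
  }

  // Takes the ClusterState lock first, then the Objecter rwlock, so every
  // caller acquires the pair in the same order.
  template <typename Fn>
  auto with_osdmap(Fn&& fn) {
    std::lock_guard<std::recursive_mutex> l(lock);
    std::shared_lock<std::shared_mutex> r(objecter_rwlock);
    // fn may call get_epoch(), which re-locks `lock` on the same thread.
    // That is only safe because the mutex is recursive.
    return fn(*this);
  }
};
```

For example, `cs.with_osdmap([](ClusterStateSketch& s) { return s.get_epoch(); })` re-enters the ClusterState lock inside the callback. With a non-recursive mutex the same call would hang on itself, which is why the re-entrancy assumption matters.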
