Bug #55687 (closed)

pacific: Regressions with holding the GIL while attempting to lock a mutex

Added by Cory Snyder almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: v16.2.9
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: Yes
Severity: 1 - critical
Reviewed: -
Affected Versions: v16.2.8
ceph-qa-suite: -
Pull request ID: 46302
Crash signature (v1): -
Crash signature (v2): -

Description

The mgr process can deadlock if the GIL is held while attempting to lock a mutex. There have been some recent regressions that make this scenario possible again. We have seen this regression cause all 5 of our managers to deadlock and become unavailable in a large cluster.
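
For context, this is a classic lock-order inversion between the CPython GIL and a C++ mutex: one thread holds the GIL and blocks on the mutex, while another thread holds the mutex and blocks waiting for the GIL, so neither can proceed. The sketch below is not the actual ceph-mgr code; the names module_lock, handle_command, and notify_modules are made up, and the raw CPython calls stand in for ceph's own GIL helpers. It only illustrates the pattern and the usual fix of releasing the GIL before waiting on the mutex.

    // Hypothetical sketch of the deadlock pattern (not ceph-mgr source).
    #include <Python.h>
    #include <mutex>

    std::mutex module_lock;  // made-up mutex guarding shared mgr state

    // Runs on a Python thread, i.e. with the GIL held.
    void handle_command()
    {
        // Buggy pattern: locking module_lock here while still holding the GIL
        // can deadlock against notify_modules() below.
        //
        // Fixed pattern: drop the GIL before blocking on the mutex.
        PyThreadState* ts = PyEval_SaveThread();        // release the GIL
        {
            std::lock_guard<std::mutex> l(module_lock); // safe to block now
            // ... touch shared C++ state, no Python calls in here ...
        }
        PyEval_RestoreThread(ts);                       // reacquire the GIL
    }

    // Runs on a C++ thread that already holds module_lock and needs the GIL
    // to call into a Python module.
    void notify_modules()
    {
        std::lock_guard<std::mutex> l(module_lock);
        PyGILState_STATE gstate = PyGILState_Ensure();  // blocks until the GIL is free
        // ... call into Python ...
        PyGILState_Release(gstate);
    }

With that ordering (the GIL is never held while waiting for the mutex), the two threads cannot end up waiting on each other; the regression described here makes the GIL-held-while-locking scenario possible again.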


Files

pacific-mgr-deadlock-gdb.txt (302 KB), Eugen Block, 12/14/2022 11:41 AM
#1

Updated by Cory Snyder almost 2 years ago

  • Affected Versions v16.2.8 added

These regressions appear to have been introduced here: https://github.com/ceph/ceph/pull/44750

Note that the issues do not exist on the master branch or on Quincy; they were introduced by mistakes in the Pacific backport.

#2

Updated by Cory Snyder almost 2 years ago

  • Backport deleted (quincy, pacific)
#3

Updated by Cory Snyder almost 2 years ago

  • Regression changed from No to Yes
#4

Updated by Cory Snyder almost 2 years ago

  • Pull request ID set to 46302
#5

Updated by Neha Ojha almost 2 years ago

  • Subject changed from "Regressions with holding the GIL while attempting to lock a mutex" to "pacific: Regressions with holding the GIL while attempting to lock a mutex"
  • Status changed from New to Resolved
#6

Updated by Ilya Dryomov almost 2 years ago

  • Target version set to v16.2.9
#7

Updated by Eugen Block over 1 year ago

I upgraded our cluster to 16.2.10 last week, and an hour ago I believe I saw this issue for the first time in this cluster. Do I understand correctly that the deadlock would leave the pod "alive" but unresponsive? I was browsing the dashboard when it stopped working (pages didn't load); when I checked, a different mgr had taken over. I read somewhere that the prometheus module could play a role in this, but it is not active in our cluster. Unfortunately, the logs of the failed mgr pod don't contain much information, but please let me know if I can provide anything useful.

#8

Updated by Eugen Block over 1 year ago

Adding a gdb.txt dump from a mgr in deadlock (slightly different ceph version than ours).
