Bug #55687 (closed)

pacific: Regressions with holding the GIL while attempting to lock a mutex

Added by Cory Snyder almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: v16.2.9
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: Yes
Severity: 1 - critical
Reviewed: -
Affected Versions: v16.2.8
ceph-qa-suite: -
Pull request ID: 46302
Crash signature (v1): -
Crash signature (v2): -

Description

The mgr process can deadlock if the GIL is held while attempting to lock a mutex. There have been some recent regressions that make this scenario possible again. We have seen this regression cause all 5 of our managers to deadlock and become unavailable in a large cluster.
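
For context, this is a classic lock-order inversion between the CPython GIL and a C++ mutex: one thread holds the GIL and blocks on the mutex, while another thread holds the mutex and blocks waiting for the GIL, so neither can proceed. The sketch below is not the actual ceph-mgr code; the names module_lock, handle_command, and notify_modules are made up, and the raw CPython calls stand in for ceph's own GIL helpers. It only illustrates the pattern and the usual fix of releasing the GIL before waiting on the mutex.

    // Hypothetical sketch of the deadlock pattern (not ceph-mgr source).
    #include <Python.h>
    #include <mutex>

    std::mutex module_lock;  // made-up mutex guarding shared mgr state

    // Runs on a Python thread, i.e. with the GIL held.
    void handle_command()
    {
        // Buggy pattern: locking module_lock here while still holding the GIL
        // can deadlock against notify_modules() below.
        //
        // Fixed pattern: drop the GIL before blocking on the mutex.
        PyThreadState* ts = PyEval_SaveThread();        // release the GIL
        {
            std::lock_guard<std::mutex> l(module_lock); // safe to block now
            // ... touch shared C++ state, no Python calls in here ...
        }
        PyEval_RestoreThread(ts);                       // reacquire the GIL
    }

    // Runs on a C++ thread that already holds module_lock and needs the GIL
    // to call into a Python module.
    void notify_modules()
    {
        std::lock_guard<std::mutex> l(module_lock);
        PyGILState_STATE gstate = PyGILState_Ensure();  // blocks until the GIL is free
        // ... call into Python ...
        PyGILState_Release(gstate);
    }

With that ordering (the GIL is never held while waiting for the mutex), the two threads cannot end up waiting on each other; the regression described here makes the GIL-held-while-locking scenario possible again.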


Files

pacific-mgr-deadlock-gdb.txt (302 KB), Eugen Block, 12/14/2022 11:41 AM
#1

Updated by Cory Snyder almost 2 years ago

  • Affected Versions v16.2.8 added

These regressions appear to have been introduced here: https://github.com/ceph/ceph/pull/44750

Note that the issues do not exist on the master branch or on Quincy; they were introduced by mistakes in the Pacific backport.

#2

Updated by Cory Snyder almost 2 years ago

  • Backport deleted (quincy, pacific)
#3

Updated by Cory Snyder almost 2 years ago

  • Regression changed from No to Yes
#4

Updated by Cory Snyder almost 2 years ago

  • Pull request ID set to 46302
#5

Updated by Neha Ojha almost 2 years ago

  • Subject changed from "Regressions with holding the GIL while attempting to lock a mutex" to "pacific: Regressions with holding the GIL while attempting to lock a mutex"
  • Status changed from New to Resolved
#6

Updated by Ilya Dryomov almost 2 years ago

  • Target version set to v16.2.9
#7

Updated by Eugen Block over 1 year ago

I upgraded our cluster to 16.2.10 last week, and an hour ago I believe I saw this issue for the first time in this cluster. Do I understand correctly that the deadlock would leave the pod "alive" but unresponsive? I was browsing the dashboard when it stopped working (pages didn't load); when I checked, a different mgr had taken over. I read somewhere that the prometheus module could play a role in this, but it is not active in our cluster. Unfortunately, the logs of the failed mgr pod don't contain much information, but please let me know if I can provide anything useful.

#8

Updated by Eugen Block over 1 year ago

Adding a gdb.txt dump from a mgr in deadlock (slightly different ceph version than ours).
