Bug #55687
closed
pacific: Regressions with holding the GIL while attempting to lock a mutex
Description
The mgr process can deadlock if the GIL is held while attempting to lock a mutex. There have been some recent regressions that make this scenario possible again. We have seen this regression cause all 5 of our managers to deadlock and become unavailable in a large cluster.
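For readers unfamiliar with this failure mode, here is a minimal sketch of the lock-order inversion described above. It is only a simulation under stated assumptions: two plain `threading.Lock` objects stand in for the real GIL and for a C++ mutex inside the mgr, and the names `simulate`, `native_side`, and `python_side` are hypothetical, not taken from the Ceph code. One thread holds the mutex and then needs the "GIL"; the other holds the "GIL" and then tries to lock the mutex.

```python
import threading


def simulate(release_gil_before_lock: bool, wait: float = 0.5) -> bool:
    """Return True if both threads finished, False if they are stuck."""
    gil = threading.Lock()    # stand-in for the interpreter's GIL (assumption)
    mutex = threading.Lock()  # stand-in for a C++ mutex in the mgr (assumption)
    both_holding = threading.Barrier(2)  # ensure each side holds its first lock

    def native_side():
        # Holds the mutex, then needs the GIL to call into Python.
        with mutex:
            both_holding.wait()
            with gil:
                pass

    def python_side():
        # Holds the GIL, then calls into code that locks the mutex.
        gil.acquire()
        both_holding.wait()
        if release_gil_before_lock:
            gil.release()      # drop the GIL before blocking on the mutex
            with mutex:
                pass
            gil.acquire()      # retake the GIL once the mutex work is done
        else:
            with mutex:        # blocks while still holding the GIL
                pass
        gil.release()

    # Daemon threads, so a stuck pair does not keep the interpreter alive.
    threads = [
        threading.Thread(target=fn, daemon=True)
        for fn in (native_side, python_side)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=wait)
    return not any(t.is_alive() for t in threads)


if __name__ == "__main__":
    print("GIL held while locking, finished:", simulate(False))
    print("GIL released first, finished:", simulate(True))
```

With the GIL stand-in held across the mutex acquisition, both threads block on each other and never finish; releasing it first lets both complete. In real extension code the GIL would be dropped with something like the CPython `PyEval_SaveThread` / `PyEval_RestoreThread` pair rather than a Python-level lock; how the mgr itself does this is not shown here.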
Files
Updated by Cory Snyder almost 2 years ago
- Affected Versions v16.2.8 added
These regressions appear to have been introduced here: https://github.com/ceph/ceph/pull/44750
Note that the issues do not exist on the master branch or on Quincy; they were introduced by mistakes in the Pacific backport.
Updated by Neha Ojha almost 2 years ago
- Subject changed from Regressions with holding the GIL while attempting to lock a mutex to pacific: Regressions with holding the GIL while attempting to lock a mutex
- Status changed from New to Resolved
Updated by Eugen Block over 1 year ago
I upgraded our cluster last week to 16.2.10 and I believe I saw this issue an hour ago for the first time in this cluster. Do I understand correctly that the deadlock would cause the pod to still be "alive" but no longer respond? I was browsing the dashboard when it stopped working (pages didn't load), then I checked and a different MGR had taken over. I read somewhere that the prometheus module could play a role in this, but in our cluster it is not active. Unfortunately, the logs of the failed mgr pod don't contain much information, but if I can provide anything useful please let me know.
Updated by Eugen Block over 1 year ago
Adding a gdb.txt dump from a mgr in deadlock (slightly different ceph version than ours).