pacific: Regressions with holding the GIL while attempting to lock a mutex
The mgr process can deadlock if a thread holds the Python GIL while attempting to lock a mutex. Some recent regressions make this scenario possible again. We have seen this regression deadlock all 5 of our managers in a large cluster, leaving them unavailable.
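For readers unfamiliar with this class of bug, here is a minimal sketch of the lock-ordering problem described above. It is not mgr code: two ordinary `threading.Lock` objects stand in for the GIL and for a native mutex, the names `simulate`, `python_caller`, and `native_worker` are hypothetical, and timeouts are used only so that the example reports the deadlock instead of hanging.

```python
import threading


def simulate(release_gil_before_lock: bool, timeout: float = 0.5) -> dict:
    """Model one thread that owns the "GIL" and wants the mutex,
    and another that owns the mutex and wants the "GIL"."""
    gil = threading.Lock()        # stand-in for the Python GIL (assumption)
    mutex = threading.Lock()      # stand-in for a native mutex (assumption)
    ready = threading.Barrier(2)  # ensures both threads hold their first lock
    outcome = {}

    def python_caller():
        gil.acquire()             # running Python code, so it owns the "GIL"
        ready.wait()
        if release_gil_before_lock:
            gil.release()         # safe pattern: drop the GIL before blocking
        got = mutex.acquire(timeout=timeout)
        outcome["python_caller"] = got
        if got:
            mutex.release()
        # On timeout we just give up; a real deadlock would block forever.

    def native_worker():
        mutex.acquire()           # native code path that owns the mutex
        ready.wait()
        got = gil.acquire(timeout=timeout)  # needs the "GIL" to call into Python
        outcome["native_worker"] = got
        if got:
            gil.release()
            mutex.release()       # finished, so the other thread can proceed

    threads = [
        threading.Thread(target=python_caller),
        threading.Thread(target=native_worker),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print("GIL held while locking:", simulate(release_gil_before_lock=False))
    print("GIL released first:   ", simulate(release_gil_before_lock=True))
```

With the GIL held, neither thread should be able to obtain its second lock, so both attempts time out. With the GIL released first, both should succeed. In a real process there is no timeout, so the first case leaves the threads blocked indefinitely.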
#1 Updated by Cory Snyder over 1 year ago
- Affected Versions v16.2.8 added
These regressions appear to have been introduced here: https://github.com/ceph/ceph/pull/44750
Note that the issues do not exist on the master branch or on Quincy; they were introduced by mistakes in the Pacific backport.
#7 Updated by Eugen Block 12 months ago
I upgraded our cluster to 16.2.10 last week, and I believe I saw this issue an hour ago for the first time in this cluster. Do I understand correctly that the deadlock would leave the pod "alive" but no longer responding? I was browsing the dashboard when it stopped working (pages didn't load); when I checked, a different MGR had taken over. I read somewhere that the prometheus module could play a role in this, but it is not active in our cluster. Unfortunately, the logs of the failed mgr pod don't contain much information, but if I can provide anything useful please let me know.