Bug #51736
mgr occasionally hangs forever when executing multiprocessing.pool.ThreadPool
Description
Environment:
We have a Ceph cluster with 30+ hosts: 3 mons, 3 mgrs, 330 osds.
Description:
After running for approximately one day, the mgr gets stuck and stops writing to its log.
We analysed the mgr log and found that it is stuck in the function refresh(host), which is decorated by forall_hosts.
'forall_hosts' is implemented with 'multiprocessing.pool.ThreadPool'.
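For readers unfamiliar with the pattern, here is a minimal sketch of a decorator that runs a function once per host in a thread pool. It is not the actual mgr code: the names forall_hosts_sketch and pool_size, and the body of refresh, are hypothetical and only illustrate why a single worker call that never returns can block the caller.

```python
from functools import wraps
from multiprocessing.pool import ThreadPool


def forall_hosts_sketch(pool_size=10):
    """Hypothetical decorator: call the wrapped function once per host."""
    def decorator(func):
        @wraps(func)
        def wrapper(hosts):
            # Assumption: a fresh pool per call; the context manager
            # terminates the pool when the block exits.
            with ThreadPool(pool_size) as pool:
                # map() blocks until every call has returned, so one call
                # that never returns blocks the caller indefinitely.
                return pool.map(func, list(hosts))
        return wrapper
    return decorator


@forall_hosts_sketch(pool_size=10)
def refresh(host):
    # Placeholder for the real per-host work (hypothetical).
    return f"refreshed {host}"


results = refresh(["host1", "host2", "host3"])
```

Because map() waits for all results, the caller has no way to make progress if any one of the per-host calls is stuck.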
Test:
We changed multiprocessing.pool.ThreadPool(10) to multiprocessing.pool.ThreadPool(1). With this change, it no longer gets stuck.
Guess:
It may be a deadlock in multiprocessing.
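One way to confirm a hang of this kind, sketched here under assumptions, is to replace the blocking map() with map_async() and a bounded get(), so that a stuck call surfaces as a timeout instead of blocking forever. The names slow_refresh and refresh_all, the pool size of 4, and the sleep used to simulate a slow host are all hypothetical. This only detects the hang; it does not remove its cause, and a thread that is truly stuck keeps running after the pool is terminated.

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool


def slow_refresh(host):
    # Hypothetical stand-in for a per-host call; one host is slow.
    time.sleep(0.5 if host == "stuck-host" else 0.0)
    return host


def refresh_all(hosts, timeout_s):
    """Return the list of results, or None if they did not arrive in time."""
    pool = ThreadPool(4)
    try:
        pending = pool.map_async(slow_refresh, hosts)
        try:
            # get() raises multiprocessing.TimeoutError once timeout_s elapses.
            return pending.get(timeout=timeout_s)
        except multiprocessing.TimeoutError:
            return None
    finally:
        # Stop handing out work; worker threads cannot be killed forcibly.
        pool.terminate()
```

Logging the timeout together with the host list would help distinguish a slow host from a deadlock inside the pool itself.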
Updated by Sebastian Wagner almost 3 years ago
you're sure you did not hit #51733 ?
Updated by Sebastian Wagner almost 3 years ago
- Related to Bug #51733: offline host hangs serve loop for 15 mins added
Updated by hongloumeng a almost 3 years ago
Sebastian Wagner wrote:
you're sure you did not hit #51733 ?
I think this bug is different from #51733. #51733 happens only when a host is offline, whereas #51736 happens after the mgr has been running for a while.
This issue is resolved by the PR https://github.com/ceph/ceph/pull/42352
Updated by Sebastian Wagner over 2 years ago
- Priority changed from Normal to Low
should be fixed in quincy
Updated by Sebastian Wagner over 2 years ago
- Status changed from New to Resolved