Bug #51736: mgr hung forever when execute multiprocessing.pool.ThreadPool accidentally - Orchestrator - Ceph

Bug #51736

closed

Added by hongloumeng a almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: cephadm
Target version:
% Done:
Source:
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 42352
Crash signature (v1):
Crash signature (v2):

Description

Environment:
We have a Ceph cluster with 30+ hosts: 3 mons, 3 mgrs, 330 OSDs.

Description:
After running for approximately one day, the mgr gets stuck and stops writing to its log.
We analysed the mgr log and found that it is stuck in the function refresh(host), which is decorated by forall_hosts.
'forall_hosts' is implemented using 'multiprocessing.pool.ThreadPool'.
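
For readers unfamiliar with the pattern, here is a minimal sketch of a decorator that fans a per-host function out over a thread pool. It is an illustration only, not the actual cephadm code: the module-level `_worker_pool`, the wrapper signature, and the body of `refresh` are assumptions. Only the names `forall_hosts`, `refresh(host)` and the pool size of 10 come from this report.

```python
import functools
from multiprocessing.pool import ThreadPool

# Hypothetical shared pool; the size of 10 matches the value quoted in this report.
_worker_pool = ThreadPool(10)


def forall_hosts(func):
    """Run `func` once per host on the shared thread pool (simplified sketch)."""
    @functools.wraps(func)
    def wrapper(hosts):
        # map() blocks until every worker has returned. If one worker never
        # returns, the caller waits forever, which matches the symptom above.
        return _worker_pool.map(func, hosts)
    return wrapper


@forall_hosts
def refresh(host):
    # Placeholder for the real per-host refresh work.
    return f"refreshed {host}"


print(refresh(["host1", "host2", "host3"]))
```

Because `map()` has no timeout, a single blocked worker is enough to stall the calling loop.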

Test:
We changed multiprocessing.pool.ThreadPool(10) to multiprocessing.pool.ThreadPool(1). With that change, the mgr no longer gets stuck.
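
One way to tell a hang from slow work when investigating this kind of problem is to bound the wait on the pool. The sketch below is a debugging aid under assumed names (`refresh_all`, the `slow-host` placeholder, and the timeout values are all hypothetical); it does not reproduce the change made in the pull request.

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool


def refresh(host):
    # Hypothetical stand-in: one host is slow so that the timeout path triggers.
    if host == "slow-host":
        time.sleep(1.0)
    return host


def refresh_all(hosts, pool_size=10, timeout=0.2):
    """Return the per-host results, or None if the pool did not finish in time."""
    pool = ThreadPool(pool_size)
    try:
        # map_async() returns at once; get(timeout=...) raises
        # multiprocessing.TimeoutError instead of blocking forever.
        return pool.map_async(refresh, hosts).get(timeout=timeout)
    except multiprocessing.TimeoutError:
        return None
    finally:
        pool.terminate()


print(refresh_all(["host1", "host2"]))
print(refresh_all(["host1", "slow-host"]))
```

With a bounded wait, a stuck worker shows up as a timeout that can be logged, rather than as a silent stall of the whole loop.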

Guess:
This may be a deadlock in multiprocessing.

Related issues 1 (0 open — 1 closed)

Updated by Sebastian Wagner almost 3 years ago

Are you sure you did not hit #51733?

Updated by Sebastian Wagner almost 3 years ago

Related to Bug #51733: offline host hangs serve loop for 15 mins added

Updated by hongloumeng a almost 3 years ago

Sebastian Wagner wrote:

Are you sure you did not hit #51733?

I think this bug is different from #51733. #51733 happens only when a host is offline, whereas #51736 happens after the mgr has been running for a while.
This issue is resolved by the pull request https://github.com/ceph/ceph/pull/42352
