Bug #51733

closed

offline host hangs serve loop for 15 mins

Added by Daniel Pivonka almost 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: cephadm
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When a host in your cluster goes offline, the next time the serve loop starts, _refresh_hosts_and_daemons() will be called and eventually _run_cephadm(gather-facts) will be called, because cephadm doesn't know the host is offline yet.

In _run_cephadm(), _remote_connection() is called to get a connection to the host.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1166

_remote_connection() calls _get_connection(), which returns the current connection if it has one and otherwise opens a new connection. If it can't make a connection, it marks the host as offline.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1347
https://github.com/ceph/ceph/blob/64dbe17fdbb27abd89755c61ef01744da5d683cc/src/pybind/mgr/cephadm/module.py#L1301
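The get-or-open behaviour described above can be illustrated with a toy model. This is only a sketch, not the cephadm code: `FakeConnection`, `ConnectionCache`, `can_reach` and the `probe` flag are all hypothetical names invented for the example. It shows that a cached connection is handed back without any liveness check, so the "mark offline" path is never reached unless a new connection is attempted.

```python
from typing import Callable, Dict, Set


class FakeConnection:
    """Stand-in for an SSH connection; `alive` models whether the peer still answers."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.alive = True


class ConnectionCache:
    """Toy get-or-open connection cache (hypothetical, not the cephadm implementation)."""

    def __init__(self, can_reach: Callable[[str], bool]) -> None:
        self._can_reach = can_reach          # assumed reachability check
        self._cons: Dict[str, FakeConnection] = {}
        self.offline: Set[str] = set()

    def get_connection(self, host: str, probe: bool = False) -> FakeConnection:
        conn = self._cons.get(host)
        if conn is not None:
            # Without a probe, the cached connection is returned even if it is dead.
            if not probe or conn.alive:
                return conn
            del self._cons[host]             # drop the stale entry and reconnect
        if not self._can_reach(host):
            self.offline.add(host)           # only reached when opening a new connection
            raise ConnectionError(f"cannot connect to {host}")
        conn = FakeConnection(host)
        self._cons[host] = conn
        self.offline.discard(host)
        return conn


reachable = {"vm-03"}
cache = ConnectionCache(lambda h: h in reachable)
first = cache.get_connection("vm-03")

# The host goes down; the cached connection is now dead.
reachable.discard("vm-03")
first.alive = False

stale = cache.get_connection("vm-03")        # same dead connection, host not marked offline
```

With `probe=True` the same call would discard the dead entry, fail to reconnect, and mark the host offline immediately. Whether a probe is cheap enough to run on every call is a design question this sketch does not answer.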

Unfortunately, it's returning an old current connection to the host that is actually offline and tries to run gather-facts on the host through that connection.
It then takes 15 mins for it to error out, because that connection is not going to work while the host is actually offline. During that time the serve loop is stuck.
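One general way to keep a dead connection from stalling a loop for that long is to put an explicit timeout around the remote call. The sketch below uses `asyncio.wait_for` from the Python standard library; `run_remote`, `gather_facts`, the `hang` flag and the timeout value are hypothetical and only simulate the situation. It is not presented as the fix that was applied for this bug.

```python
import asyncio
from typing import Optional, Set


async def run_remote(host: str, hang: bool) -> str:
    """Hypothetical remote call; `hang` simulates a dead peer that never answers."""
    if hang:
        await asyncio.sleep(3600)
    return f"facts from {host}"


async def gather_facts(host: str, offline: Set[str],
                       hang: bool, timeout: float) -> Optional[str]:
    """Run the remote call, but give up and mark the host offline after `timeout` seconds."""
    try:
        return await asyncio.wait_for(run_remote(host, hang), timeout=timeout)
    except asyncio.TimeoutError:
        offline.add(host)
        return None


offline_hosts: Set[str] = set()
# The hung call is cancelled after 0.05 s instead of blocking the caller.
hung_result = asyncio.run(gather_facts("vm-03", offline_hosts, hang=True, timeout=0.05))
ok_result = asyncio.run(gather_facts("vm-01", offline_hosts, hang=False, timeout=0.05))
```

Here the call to the unresponsive host returns `None` after the timeout and the host is recorded as offline, while the healthy host is unaffected.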

Once it errors out, the host is marked as offline correctly the next time the serve loop starts.

I've attached a log of this happening. vm-03 is the offline host.


Files

offlinehostserveloophang.txt (8.86 KB), Daniel Pivonka, 07/19/2021 08:02 PM

Related issues 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51736: mgr hung forever when execute multiprocessing.pool.ThreadPool accidentally (Resolved)
