Bug #51733: offline host hangs serve loop for 15 mins - Orchestrator - Ceph

Bug #51733

closed

offline host hangs serve loop for 15 mins

Added by Daniel Pivonka almost 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: Adam King
Category: cephadm
Backport: quincy,pacific
Severity: 3 - minor
Pull request ID: 45286

Description

When a host in your cluster goes offline, the next time the serve loop starts, _refresh_hosts_and_daemons() is called and eventually _run_cephadm(gather-facts) is called, because cephadm doesn't know the host is offline yet.

In _run_cephadm(), _remote_connection() is called to get a connection to the host.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1166

_remote_connection() calls _get_connection(), which returns the current connection if it has one and otherwise opens a new connection. If it can't make a connection, it marks the host as offline.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1347
https://github.com/ceph/ceph/blob/64dbe17fdbb27abd89755c61ef01744da5d683cc/src/pybind/mgr/cephadm/module.py#L1301
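
To illustrate the get-or-open behaviour described above, here is a minimal toy model. It is not the cephadm code: ConnectionCache, open_conn and the reachable set are hypothetical names made up for this sketch. The only point it shows is that a host is marked offline when opening a connection fails, so a connection that is already cached is handed back without any liveness check.

```python
from typing import Callable, Dict, Set


class ConnectionCache:
    """Toy get-or-open connection cache (hypothetical, not cephadm's API)."""

    def __init__(self, open_conn: Callable[[str], object]) -> None:
        self._open_conn = open_conn            # expected to raise OSError on failure
        self._conns: Dict[str, object] = {}    # host -> cached connection
        self.offline: Set[str] = set()         # hosts currently marked offline

    def get_connection(self, host: str) -> object:
        # A cached connection is returned as-is; nothing checks it still works.
        if host in self._conns:
            return self._conns[host]
        try:
            conn = self._open_conn(host)
        except OSError:
            # Only a failed *open* marks the host offline.
            self.offline.add(host)
            raise
        self._conns[host] = conn
        self.offline.discard(host)
        return conn


# Simulated network: hosts in this set accept new connections.
reachable = {"vm-03"}


def open_conn(host: str) -> object:
    if host not in reachable:
        raise OSError(f"cannot reach {host}")
    return f"conn-to-{host}"


cache = ConnectionCache(open_conn)
first = cache.get_connection("vm-03")   # opens and caches a connection
reachable.clear()                       # vm-03 goes offline
stale = cache.get_connection("vm-03")   # the cached object is returned again
```

In this model, vm-03 is never added to cache.offline after it goes away, because get_connection() never reaches the open path for it.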

Unfortunately, it returns an old cached connection to the host that is actually offline and tries to run gather-facts on the host through that connection.
It then takes 15 minutes to error out, because that connection is never going to work while the host is offline. During that time the serve loop is stuck.
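
One general way to keep a loop from blocking on a dead connection is to put an upper bound on each remote call. The sketch below shows that pattern with asyncio.wait_for. It is an assumption-laden illustration only, not the change made for this bug: gather_facts, check_host and the delay parameter are hypothetical stand-ins, and the timeout values are arbitrary.

```python
import asyncio
from typing import Set


async def gather_facts(host: str, delay: float) -> str:
    # Hypothetical stand-in for a remote command; `delay` simulates how long
    # the call takes (a large value plays the role of a dead connection).
    await asyncio.sleep(delay)
    return f"facts from {host}"


async def check_host(host: str, delay: float, timeout: float,
                     offline: Set[str]) -> bool:
    """Run the remote call with a deadline; mark the host offline on timeout."""
    try:
        await asyncio.wait_for(gather_facts(host, delay), timeout=timeout)
    except asyncio.TimeoutError:
        offline.add(host)
        return False
    offline.discard(host)
    return True


offline_hosts: Set[str] = set()
ok = asyncio.run(check_host("vm-01", delay=0.0, timeout=0.2,
                            offline=offline_hosts))
hung = asyncio.run(check_host("vm-03", delay=5.0, timeout=0.05,
                              offline=offline_hosts))
```

With a bound like this, the call against the unresponsive host gives up after the chosen timeout instead of waiting for the connection itself to fail, and the caller can mark the host offline right away.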

Once it errors out, the next time the serve loop starts the host is correctly marked as offline.

I've attached a log of this happening. vm-03 is the offline host.

Files

offlinehostserveloophang.txt (8.86 KB)

Daniel Pivonka, 07/19/2021 08:02 PM

Related issues 1 (0 open — 1 closed)

Updated by Sebastian Wagner almost 3 years ago

Updated by Daniel Pivonka almost 3 years ago

Updated by Adam King about 2 years ago

Updated by Adam King about 2 years ago

Updated by Redouane Kachach Elhichou almost 2 years ago

Updated by Redouane Kachach Elhichou almost 2 years ago