Bug #51733


offline host hangs serve loop for 15 mins

Added by Daniel Pivonka almost 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: Adam King
Category: cephadm
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 45286
Crash signature (v1):
Crash signature (v2):

Description

When a host in the cluster goes offline, the next time the serve loop runs, _refresh_hosts_and_daemons() is called and eventually _run_cephadm(gather-facts) is run against that host, because cephadm does not yet know it is offline.
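For illustration, here is a minimal Python sketch of that control flow (the function bodies, the offline set, and the refresh interval are my simplifications, not the real serve.py code):

    import time

    REFRESH_INTERVAL = 60  # seconds between serve-loop iterations (illustrative)

    def _run_cephadm(host: str, command: str) -> None:
        # Stand-in for CephadmServe._run_cephadm(): in the real module this
        # obtains an SSH connection to the host and runs the cephadm binary.
        pass

    def _refresh_hosts_and_daemons(hosts, offline) -> None:
        for host in hosts:
            if host in offline:
                continue  # hosts already known to be offline are skipped
            # A host that just died is not in `offline` yet, so gather-facts
            # still runs against it; with a stale cached connection this one
            # call can block for ~15 minutes.
            _run_cephadm(host, 'gather-facts')

    def serve(hosts, offline) -> None:
        # The serve loop is sequential: a single hung _run_cephadm() call
        # stalls the refresh of every other host in the cluster.
        while True:
            _refresh_hosts_and_daemons(hosts, offline)
            time.sleep(REFRESH_INTERVAL)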

In _run_cephadm(), _remote_connection() is called to get a connection to the host.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1166

_remote_connection() calls _get_connection(), which returns the current cached connection if one exists, or otherwise opens a new one. If it cannot make a connection, it marks the host as offline.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1347
https://github.com/ceph/ceph/blob/64dbe17fdbb27abd89755c61ef01744da5d683cc/src/pybind/mgr/cephadm/module.py#L1301
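A simplified model of that caching behavior (the class and the connect_fn callable below are illustrative assumptions, not the actual module.py code):

    class ConnectionCache:
        # Illustrative model of the per-host connection reuse in
        # _get_connection(); not the real implementation.

        def __init__(self, connect_fn):
            self._connect_fn = connect_fn  # callable that opens an SSH session
            self._conns = {}               # host -> cached connection object
            self.offline_hosts = set()

        def get_connection(self, host):
            conn = self._conns.get(host)
            if conn is not None:
                # The bug in this tracker: the cached connection is handed
                # back without any check that the peer is still alive. If
                # the host was powered off ungracefully, this object looks
                # usable, but every operation on it blocks until the TCP
                # stack gives up (~15 minutes).
                return conn
            try:
                conn = self._connect_fn(host)
            except OSError:
                # A *new* connection attempt fails fast, so on this path
                # the host is correctly marked offline.
                self.offline_hosts.add(host)
                raise
            self._conns[host] = conn
            return conn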

Unfortunately, it returns an old cached connection to a host that is actually offline and tries to run gather-facts through that connection. It then takes 15 minutes to error out, because that connection can never work against a dead host, and during that time the serve loop is stuck.

Once it errors out, the host is correctly marked offline on the next run of the serve loop.

I've attached a log of this happening; vm-03 is the offline host.
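One generic way to avoid reusing a dead connection (a sketch of the idea only; not necessarily what the eventual fix in PR 45286 does) is to probe a cached connection with a cheap command under a short timeout before handing it out:

    PROBE_TIMEOUT = 5  # seconds; illustrative, far below the TCP give-up time

    def connection_is_alive(conn) -> bool:
        # Run a no-op remote command under a short timeout before trusting a
        # cached connection. `conn.execute` and its `timeout` parameter are
        # assumed for illustration; they are not cephadm's actual API.
        try:
            conn.execute('true', timeout=PROBE_TIMEOUT)
            return True
        except Exception:
            return False

If the probe fails, the stale entry can be evicted so the next call takes the normal reconnect path, which fails fast and marks the host offline within one serve-loop iteration instead of after fifteen minutes.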


Files

offlinehostserveloophang.txt (8.86 KB), Daniel Pivonka, 07/19/2021 08:02 PM

Related issues: 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51736: mgr hung forever when execute multiprocessing.pool.ThreadPool accidentally (Resolved)

#1

Updated by Sebastian Wagner almost 3 years ago

  • Related to Bug #51736: mgr hung forever when execute multiprocessing.pool.ThreadPool accidentally added
#2

Updated by Daniel Pivonka almost 3 years ago

This only happens if the host was not gracefully shut down. A graceful shutdown closes the TCP connection, so a cached connection errors out immediately; after a hard power-off the remote end never closes the socket, and the cached connection only fails once the kernel's TCP retransmission timeout expires, which is roughly where the 15 minutes comes from.
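If the transport exposed its underlying socket, the detection window could be bounded with standard Linux socket options; a hedged sketch (the option values are illustrative, and whether cephadm's SSH library exposes the socket is an assumption):

    import socket

    def bound_dead_peer_detection(sock: socket.socket) -> None:
        # Cap how long unacknowledged writes may sit in the send queue
        # before the connection is declared dead (milliseconds, Linux-only).
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 60_000)
        # Also probe idle connections: first probe after 30s idle, then
        # every 10s, dead after 3 missed probes (~1 minute total instead
        # of the ~15-minute retransmission timeout).
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)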

#3

Updated by Adam King about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Adam King
  • Pull request ID set to 45286
#4

Updated by Adam King about 2 years ago

  • Status changed from In Progress to Pending Backport
  • Tags set to quincy
#5

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Tags deleted (quincy)
  • Backport set to quincy,pacific
#6

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Status changed from Pending Backport to Resolved