Bug #51733
closed
offline host hangs serve loop for 15 mins
Description
When a host in your cluster goes offline, the next time the serve loop starts, _refresh_hosts_and_daemons() is called and eventually _run_cephadm(gather-facts) is called, because cephadm doesn't know the host is offline yet.
In _run_cephadm(), _remote_connection() is called to get a connection to the host.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1166
_remote_connection() calls _get_connection(), which returns the current connection if it has one and otherwise opens a new connection. If it can't make a connection, it marks the host as offline.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1347
https://github.com/ceph/ceph/blob/64dbe17fdbb27abd89755c61ef01744da5d683cc/src/pybind/mgr/cephadm/module.py#L1301
Unfortunately, it returns an old, already-open connection to the host that is actually offline, and tries to run gather-facts on the host through that connection.
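The pattern can be illustrated with a minimal sketch. This is not the cephadm code: FakeConnection, ConnectionCache and the reachable set are hypothetical names made up for illustration. It only shows how a cache that reuses a connection without a liveness check never reaches the branch that marks the host offline.

```python
class FakeConnection:
    """Hypothetical stand-in for an open remote connection to a host."""

    def __init__(self, host):
        self.host = host


class ConnectionCache:
    """Hypothetical cache: reuse a connection if one exists, else open one."""

    def __init__(self, reachable):
        self.reachable = reachable  # hosts that currently accept new connections
        self.cons = {}              # host -> cached connection
        self.offline = set()        # hosts marked offline

    def get_connection(self, host):
        # A cached connection is returned as-is, with no liveness check.
        if host in self.cons:
            return self.cons[host]
        # Only a failed attempt to open a NEW connection marks the host offline.
        if host not in self.reachable:
            self.offline.add(host)
            raise ConnectionError(f"cannot connect to {host}")
        conn = FakeConnection(host)
        self.cons[host] = conn
        return conn


reachable = {"vm-03"}
cache = ConnectionCache(reachable)
first = cache.get_connection("vm-03")   # opened while the host is up

reachable.discard("vm-03")              # the host goes offline
second = cache.get_connection("vm-03")  # the stale connection is handed back

stale_reused = second is first
marked_offline = "vm-03" in cache.offline
```

In this sketch stale_reused ends up True and marked_offline stays False: the offline branch is only reachable when no cached connection exists.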
It then takes 15 mins to error out, because that connection is never going to work while the host is offline. During that time the serve loop is stuck.
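To make the "stuck" part concrete, here is a rough sketch of how a blocking call could be bounded by a deadline so the caller gets control back early. This is only an illustration under assumptions, not what cephadm does and not a proposed patch: run_with_deadline and hung_call are hypothetical names, and the short sleep stands in for the 15 minute hang.

```python
import concurrent.futures
import time


def run_with_deadline(fn, timeout_s):
    """Run fn() in a worker thread and wait at most timeout_s seconds.

    Returns (True, result) if fn finished in time, else (False, None).
    The worker thread is not killed on timeout; it keeps running in the
    background until fn returns.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block the caller waiting for a hung worker.
        pool.shutdown(wait=False)


def hung_call():
    # Stand-in for a remote command sent over a dead connection.
    time.sleep(0.5)
    return "facts"


start = time.monotonic()
ok, result = run_with_deadline(hung_call, timeout_s=0.05)
elapsed = time.monotonic() - start
# The caller regains control after roughly the deadline, not after the full hang.
```

With a deadline like this, the caller could treat the timeout as a sign that the cached connection is dead and react immediately, instead of blocking until the underlying call gives up on its own.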
Once it errors out, the next time the serve loop starts the host is correctly marked as offline.
I've attached a log of this happening. vm-03 is the offline host.
Files