Bug #54024: mgr/cephadm: timeouts for ssh/binary commands - Orchestrator - Ceph

Bug #54024

closed

mgr/cephadm: timeouts for ssh/binary commands

Added by Adam King about 2 years ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee: Adam King
Tags: backport_processed
Backport: reef, quincy
Severity: 3 - minor
Pull request ID: 50722

Description

Some thoughts from orch weekly

Timeouts (ssh commands in mgr module, commands in binary)
* how do we gracefully recover when an operation is blocked on a host
  * https://tracker.ceph.com/issues/53846
* ssh has a 15 min timeout: https://tracker.ceph.com/issues/51733
* asyncssh: connection.run has a timeout:
  * https://github.com/ronf/asyncssh/blob/215dbf63fd82270716814de63e045c512d0e5b72/asyncssh/connection.py#L4014 
* how long should we wait?
  * ceph-volume ls on dense nodes
    * done from the cephadm agent
  * downloading container images through slow internet connections
    * can we avoid that? https://tracker.ceph.com/issues/53276
* reproduce: artificially create a stale global cephadm lock
* make ssh run command timeout configurable in case a cluster actually runs into those timeouts?
  * make timeout 15 mins? or 5 mins?
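
To illustrate the "configurable timeout" idea from the notes above, here is a minimal sketch in Python. It does not use asyncssh directly (the notes only point out that `connection.run` accepts a timeout); it wraps any awaitable command runner in `asyncio.wait_for`. The names `run_with_timeout`, `RemoteCommandTimeout` and `DEFAULT_SSH_TIMEOUT` are hypothetical and not part of cephadm or asyncssh.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical default: the 15-minute figure floated in the notes, in seconds.
DEFAULT_SSH_TIMEOUT = 15 * 60


class RemoteCommandTimeout(Exception):
    """Hypothetical error raised when a remote command exceeds its deadline."""


async def run_with_timeout(
    run_cmd: Callable[[], Awaitable[str]],
    timeout: Optional[float] = DEFAULT_SSH_TIMEOUT,
) -> str:
    """Await run_cmd() but give up after `timeout` seconds.

    `run_cmd` stands in for whatever actually executes the remote command;
    passing timeout=None waits indefinitely.
    """
    try:
        return await asyncio.wait_for(run_cmd(), timeout=timeout)
    except asyncio.TimeoutError:
        # Convert to a module-specific error so callers can react to it,
        # for example by recording the host and command that got stuck.
        raise RemoteCommandTimeout(
            f"command did not return within {timeout}s"
        ) from None
```

Keeping the timeout as a parameter would let a cluster that legitimately hits the limit (for example `ceph-volume` on dense nodes) raise it without a code change.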

The decision there was ultimately to try passing the --timeout flag that the cephadm binary offers, to see whether it would cause the commands to eventually return, and then to raise a health warning if we see the timeout happen.
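
A rough sketch of that decision, under stated assumptions: the helper names and the health check name below are hypothetical, and placing `--timeout` before the subcommand is an assumption about how the binary parses its arguments. This is not the code from the actual fix.

```python
from typing import Dict, List, Optional


def build_cephadm_args(command: str, args: List[str],
                       timeout: Optional[int]) -> List[str]:
    """Build the argument list for a cephadm binary call.

    Assumption: --timeout is a global flag given before the subcommand.
    """
    final: List[str] = []
    if timeout is not None:
        final += ["--timeout", str(timeout)]
    final.append(command)
    final += args
    return final


def timeout_health_check(timed_out: Dict[str, str]) -> Dict[str, dict]:
    """Turn a host -> command map of timed-out calls into a warning entry.

    The check name is made up for illustration; an empty input yields
    no health checks at all.
    """
    if not timed_out:
        return {}
    detail = [
        f"{cmd} timed out on host {host}"
        for host, cmd in sorted(timed_out.items())
    ]
    return {
        "HYPOTHETICAL_CEPHADM_TIMEOUT": {
            "severity": "warning",
            "summary": f"{len(detail)} cephadm command(s) timed out",
            "count": len(detail),
            "detail": detail,
        }
    }
```

For example, `build_cephadm_args("ceph-volume", ["--", "inventory"], 300)` produces an argument list starting with `--timeout 300`, and any host whose command still times out would be collected into the warning entry.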

Related issues 2 (0 open — 2 closed)