Bug #16826

closed

SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

Added by David Galloway almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Infrastructure Service
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I've noticed an abnormally high number of jobs failing due to SSH failures during ceph-cm-ansible runs. I haven't been able to determine a root cause or common denominator yet, but I am able to reproduce the issue occasionally.

Sentry link to the issue: http://sentry.ceph.com/sepia/teuthology/issues/736/
Note that some failures are being reported to Sentry as other problems, but the SSH failure is the underlying cause.

Other observations:
  • The CPU load on teuthology.front has been extremely high pretty much 24/7 since switching to Ansible v2.0 (load averages of 176.76, 176.48, 171.36 right now, even with a reduced number of workers)
  • With the load so high, the ansible run prior to the actual teuthology test takes an unacceptably long time
    • Example: I just ran a simple job with 6 smithi (1 CentOS, 5 Ubuntu) on systems that already had successful ansible runs on them, and the ansible run took from 2016-07-26 18:36:39 until 2016-07-26 18:45:59 (9 minutes 20 seconds)
  • The ansible run seems to fail pretty reliably on the same four tasks
    • Updating Zack's or Andrew's pubkey
    • Ensure the sudo group exists
    • Remove /etc/ceph
    • Install nrpe package and dependencies (Ubuntu)
  • Machine type doesn't matter (happens on smithi, mira, and VPS)
  • I've checked whether the systems were rebooted or renewed their DHCP lease at the time of the failures, and neither is the case

See attached file listing all the "Unreachable" failures from the past 24 hours. I included 10 lines of context prior to the failure (See line 1).
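For anyone wanting to pull the same kind of listing from their own logs, here is a minimal sketch of one way to collect each failure together with the 10 lines before it. It is not the script used to produce the attachment. The function name `unreachable_with_context`, the default marker string "UNREACHABLE", and the case-insensitive match are all assumptions for illustration; adjust the marker to whatever your ansible output actually prints.

```python
from collections import deque


def unreachable_with_context(lines, marker="UNREACHABLE", before=10):
    """Yield (context, failure_line) for every line containing `marker`.

    `context` holds up to `before` lines that preceded the failure.
    The marker text is an assumption about the log format, and the
    match is case-insensitive so "Unreachable" is caught as well.
    """
    needle = marker.lower()
    context = deque(maxlen=before)  # keeps only the most recent lines
    for line in lines:
        if needle in line.lower():
            yield list(context), line
        context.append(line)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python extract_unreachable.py some-ansible.log
    with open(sys.argv[1], errors="replace") as log:
        for context, failure in unreachable_with_context(log):
            sys.stdout.writelines(context)
            sys.stdout.write(failure)
            sys.stdout.write("--\n")
```

Because the context lines usually include the task header, grouping the output by task afterwards is a quick way to check whether the failures really cluster on the same few tasks.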


Files

24h.txt (149 KB): "Unreachable" failures from the past 24 hours (David Galloway, 07/27/2016 01:47 AM)