Bug #16826

closed

SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

Added by David Galloway almost 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Infrastructure Service
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I've noticed an abnormally high number of jobs failing due to ssh failures during ceph-cm-ansible runs. I haven't been able to determine a root cause or common denominator yet, but I am able to reproduce the issue occasionally.

Sentry link to the issue: http://sentry.ceph.com/sepia/teuthology/issues/736/
Note that some failures are being reported to Sentry as other problems but the ssh failure is the root cause.

Other observations:
  • The CPU load on teuthology.front has been crazy high pretty much 24/7 since switching to Ansible v2.0 (load averages of 176.76, 176.48, 171.36 right now, even with a reduced number of workers)
  • With the load so high, the ansible run prior to the actual teuthology test takes an unacceptably long time
    • Example: I just ran a simple job with 6 smithi (1 CentOS, 5 Ubuntu) on systems that already had successful ansible runs on them and it ran from 2016-07-26 18:36:39 till 2016-07-26 18:45:59
  • The ansible run seems to fail pretty reliably on the same four tasks
    • Updating Zack's or Andrew's pubkey
    • Ensure the sudo group exists
    • Remove /etc/ceph
    • Install nrpe package and dependencies (Ubuntu)
  • Machine type doesn't matter (happens on smithi, mira, and VPS)
  • I've checked whether the systems were rebooted or renewed their DHCP lease at the time of the failures, and neither is the case

See attached file listing all the "Unreachable" failures from the past 24 hours. I included 10 lines of context prior to the failure (See line 1).
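For anyone who wants to pull the same kind of listing out of another log, here is a minimal sketch of the idea: keep a rolling window of the previous 10 lines and emit it whenever a line contains "Unreachable". The function name and the plain substring match are assumptions for illustration; this is not how the attached file was actually produced.

```python
from collections import deque


def unreachable_with_context(lines, needle="Unreachable", before=10):
    """Yield (context, line) for every line that contains `needle`.

    `context` holds up to `before` lines that preceded the match.
    The substring match is an assumption about the log format.
    """
    context = deque(maxlen=before)  # rolling window of earlier lines
    for line in lines:
        if needle in line:
            yield list(context), line
        context.append(line)
```

Passing an open log file as `lines` should work, since the function only iterates over it once.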


Files

24h.txt (149 KB): "Unreachable" failures from the past 24 hours. David Galloway, 07/27/2016 01:47 AM
Actions #1

Updated by David Galloway almost 8 years ago

David Galloway wrote:

  • With the load so high, the ansible run prior to the actual teuthology test takes an unacceptably long time
    • Example: I just ran a simple job with 6 smithi (1 CentOS, 5 Ubuntu) on systems that already had successful ansible runs on them and it ran from 2016-07-26 18:36:39 till 2016-07-26 18:45:59

Maybe ~9 minutes isn't horrible but ansible definitely ran much faster when the queue was paused.
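For reference, the exact duration between the two timestamps quoted above can be checked with a few lines of Python. The helper name `elapsed` is made up for this sketch.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed(start, end):
    """Return the time between two timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Start and end of the example run quoted above.
print(elapsed("2016-07-26 18:36:39", "2016-07-26 18:45:59"))  # 0:09:20
```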

Actions #2

Updated by Dan Mick almost 8 years ago

It looks like the default ansible.cfg ssh connection timeout is 10s, which seems pretty short, especially if the originating host may be slow. Perhaps we could bump that up as a test.

Actions #3

Updated by Dan Mick almost 8 years ago

Looks like ansible-playbook takes a -T/--timeout option, which might be an easy way to play with this.

Actions #4

Updated by David Galloway almost 8 years ago

Dan Mick wrote:

It looks like the default ansible.cfg ssh connection timeout is 10s, which seems pretty short, especially if the originating host may be slow. Perhaps we could bump that up as a test.

The ansible.cfg we use is shipped with ceph-cm-ansible. The timeout set there is 120s.
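If someone still wants to experiment with the timeout per run, a rough sketch of passing --timeout on the ansible-playbook command line is below. The helper name, playbook name, and inventory name are hypothetical; only the -i and -T/--timeout options come from ansible-playbook itself. As far as I know, a value given on the command line takes precedence over the one in ansible.cfg, but treat that as an assumption to verify.

```python
def playbook_command(playbook, inventory, timeout=None):
    """Build an ansible-playbook argument list (hypothetical helper).

    `timeout` is the connection timeout in seconds; when it is None
    the flag is omitted and whatever ansible.cfg says applies.
    """
    cmd = ["ansible-playbook", "-i", inventory]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    cmd.append(playbook)
    return cmd


# Example with made-up file names:
print(" ".join(playbook_command("site.yml", "hosts", timeout=300)))
```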

Actions #5

Updated by Dan Mick over 7 years ago

Just to be clear, we're pretty sure ssh is not connecting/reconnecting at the time of the failures, so the above is just a red herring in several ways.

Actions #6

Updated by Zack Cerza over 7 years ago

Attempts at getting ansible to give us more info:
https://github.com/ceph/teuthology/pull/919

Actions #9

Updated by Zack Cerza over 7 years ago

  • Status changed from 12 to Resolved
  • Assignee changed from David Galloway to Zack Cerza

The retries in the above PR appear to have resolved the issue. No failures in the last ~26h.
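For readers who don't want to dig through the PR, the general idea of retrying a flaky command looks roughly like the sketch below. This is a generic illustration, not the code from the PR: the function name, the attempt count, and the delay are all assumptions.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=5.0, runner=subprocess.call):
    """Run `cmd` until it exits 0 or `attempts` runs have been made.

    Returns the exit code of the last run. `runner` is injectable so
    the loop can be exercised without spawning a real process.
    """
    rc = None
    for attempt in range(1, attempts + 1):
        rc = runner(cmd)
        if rc == 0:
            break  # success, stop retrying
        if attempt < attempts:
            time.sleep(delay)  # brief pause before the next try
    return rc
```

A retry like this only hides transient failures; it does nothing about whatever causes them.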
