Bug #16826: SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh - sepia - Ceph

Bug #16826

closed

SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

Added by David Galloway almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee: Zack Cerza
Category: Infrastructure Service
Source: other
Severity: 3 - minor

Description

I've noticed an abnormally high number of jobs failing due to ssh failures during ceph-cm-ansible runs. I haven't been able to determine a root cause or common denominator yet but am able to occasionally reproduce the issue.

Sentry link to the issue: http://sentry.ceph.com/sepia/teuthology/issues/736/
Note that some failures are being reported to Sentry as other problems but the ssh failure is the root cause.

Other observations:

- The CPU load on teuthology.front has been crazy high pretty much 24/7 since switching to Ansible v2.0 (load averages of 176.76, 176.48, 171.36 right now, even with a reduced number of workers).
- With the load so high, the ansible run prior to the actual teuthology test takes an unacceptably long time.
  - Example: I just ran a simple job with 6 smithi (1 CentOS, 5 Ubuntu) on systems that already had successful ansible runs on them, and it ran from 2016-07-26 18:36:39 until 2016-07-26 18:45:59.
- The ansible run seems to fail pretty reliably on the same four tasks:
  - Updating Zack's or Andrew's pubkey
  - Ensure the sudo group exists
  - Remove /etc/ceph
  - Install nrpe package and dependencies (Ubuntu)
- Machine type doesn't matter (happens on smithi, mira, and VPS).
- I've checked whether the systems were rebooted or renewed their DHCP lease at the time of the failures, and neither is the case.

See attached file listing all the "Unreachable" failures from the past 24 hours. I included 10 lines of context prior to the failure (See line 1).
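For anyone who wants to check which tasks the "Unreachable" failures cluster on, here is a minimal sketch that tallies them per task. It assumes the log contains the usual Ansible 2.x console lines (`TASK [name] ****` followed by `fatal: [host]: UNREACHABLE! => ...`), possibly with a logger prefix in front. The function name is hypothetical, and the attached 24h.txt may need different patterns.

```python
import re
from collections import Counter

# Assumed Ansible 2.x console output; search() tolerates a timestamp/logger prefix.
TASK_RE = re.compile(r"TASK \[(?P<name>.+?)\] \*+")
UNREACHABLE_RE = re.compile(r"fatal: \[(?P<host>[^\]]+)\]: UNREACHABLE!")


def tally_unreachable(lines):
    """Count UNREACHABLE failures, keyed by the most recently seen task name."""
    counts = Counter()
    current_task = "<unknown task>"
    for line in lines:
        task = TASK_RE.search(line)
        if task:
            current_task = task.group("name")
            continue
        if UNREACHABLE_RE.search(line):
            counts[current_task] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < 24h.txt
    for name, n in tally_unreachable(sys.stdin).most_common():
        print(f"{n:5d}  {name}")
```

If the counts really do pile up on the same four tasks, that would support the observation above; an even spread would point more toward general load.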

Files

24h.txt (149 KB): "Unreachable" failures from the past 24 hours (David Galloway, 07/27/2016 01:47 AM)

Updated by David Galloway almost 8 years ago

David Galloway wrote:

> With the load so high, the ansible run prior to the actual teuthology test takes an unacceptably long time
>
> Example: I just ran a simple job with 6 smithi (1 CentOS, 5 Ubuntu) on systems that already had successful ansible runs on them and it ran from 2016-07-26 18:36:39 till 2016-07-26 18:45:59

Maybe ~9 minutes isn't horrible but ansible definitely ran much faster when the queue was paused.
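As a quick check on the "~9 minutes" figure, the two timestamps from the example can be subtracted directly. This is only a small arithmetic sketch; the helper name `elapsed` is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed(start, end):
    """Return the timedelta between two timestamps in the format used above."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


run = elapsed("2016-07-26 18:36:39", "2016-07-26 18:45:59")
print(run)                  # 0:09:20
print(run.total_seconds())  # 560.0
```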

Updated by Dan Mick almost 8 years ago

It looks like the default ansible.cfg ssh connection timeout is 10s, which seems pretty short, especially if the originating host may be slow. Perhaps we could bump that up as a test.

Updated by Dan Mick almost 8 years ago

It looks like ansible-playbook takes a -T/--timeout option, which might be an easy way to play with this.
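A rough sketch of how a wrapper could pass that option through when experimenting with different values. The `-T` and `-i` flags are real ansible-playbook options; the function name, playbook name, and inventory path are placeholders and not taken from teuthology or ceph-cm-ansible.

```python
import shlex


def playbook_command(playbook, inventory, timeout_s):
    """Build an ansible-playbook argv with an explicit connection timeout (-T)."""
    if timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return ["ansible-playbook", "-i", inventory, "-T", str(timeout_s), playbook]


# Hypothetical playbook and inventory names, for illustration only.
cmd = playbook_command("site.yml", "/tmp/inventory", 120)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` as-is, which avoids shell quoting problems.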

Updated by David Galloway almost 8 years ago

Dan Mick wrote:

> It looks like the default ansible.cfg ssh connection timeout is 10s, which seems pretty short, especially if the originating host may be slow. Perhaps we could bump that up as a test.

ansible.cfg is shipped with ceph-cm-ansible. The timeout is 120s.
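To confirm what a given ansible.cfg actually sets, something like the following sketch could be used. It assumes the timeout lives under the `[defaults]` section as `timeout`, and falls back to Ansible's built-in default of 10 seconds when the key is absent. The sample config text in the usage lines is made up; it is not the real file from ceph-cm-ansible. The sketch only looks at the file, so a `-T` flag or environment override would not be reflected.

```python
import configparser

ANSIBLE_DEFAULT_TIMEOUT = 10  # seconds; Ansible's built-in default


def ssh_timeout_from_cfg(cfg_text):
    """Return the [defaults] timeout from ansible.cfg text, or Ansible's default."""
    # interpolation=None so '%' characters elsewhere in the file are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.getint("defaults", "timeout", fallback=ANSIBLE_DEFAULT_TIMEOUT)


# Hypothetical config contents, for illustration only.
print(ssh_timeout_from_cfg("[defaults]\ntimeout = 120\n"))
print(ssh_timeout_from_cfg("[defaults]\nforks = 25\n"))
```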

Updated by Dan Mick over 7 years ago

Just to be clear, we're pretty sure ssh is not connecting/reconnecting at the time of the failures, so the above is just a red herring in several ways.
