Bug #16826
closed
SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
I've noticed an abnormally high number of jobs failing due to ssh failures during ceph-cm-ansible runs. I haven't been able to determine a root cause or common denominator yet, but I am able to reproduce the issue occasionally.
Sentry link to the issue: http://sentry.ceph.com/sepia/teuthology/issues/736/
Note that some failures are being reported to Sentry as other problems but the ssh failure is the root cause.
- The CPU load on teuthology.front has been crazy high pretty much 24/7 since switching to Ansible v2.0 (load averages of 176.76, 176.48, and 171.36 right now, even with a reduced number of workers)
- With the load so high, the ansible run prior to the actual teuthology test takes an unacceptably long time
  - Example: I just ran a simple job with 6 smithi (1 CentOS, 5 Ubuntu) on systems that already had successful ansible runs on them, and it ran from 2016-07-26 18:36:39 till 2016-07-26 18:45:59 (9 minutes 20 seconds)
- The ansible run seems to fail pretty reliably on the same four tasks:
  - Updating Zack's or Andrew's pubkey
  - Ensure the sudo group exists
  - Remove /etc/ceph
  - Install nrpe package and dependencies (Ubuntu)
- Machine type doesn't matter (happens on smithi, mira, and VPS)
- I've checked whether the systems were rebooted or renewed their DHCP leases at the time of the failures, and neither is the case
See attached file listing all the "Unreachable" failures from the past 24 hours. I included 10 lines of context prior to the failure (See line 1).
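A listing like the attached one could be generated with something along these lines. This is only a minimal sketch, not the script used for the attachment. The error text is taken from the bug title; the `TASK [name]` banner format, the function names, and the log path passed on the command line are assumptions for illustration and may not match the actual teuthology log layout.

```python
import re
import sys
from collections import Counter, deque

# Error text taken from the bug title; everything else here is an assumption.
SSH_ERROR = "data could not be sent to the remote host"
# Assumed Ansible 2.x task banner format: "TASK [some task name] ****"
TASK_RE = re.compile(r"TASK \[(?P<name>.+?)\]")


def find_unreachable(lines, context=10):
    """Yield (task_name, context_lines) for each line containing SSH_ERROR.

    context_lines holds up to `context` preceding lines plus the failing line.
    """
    recent = deque(maxlen=context)  # rolling window of preceding lines
    current_task = None
    for line in lines:
        line = line.rstrip("\n")
        match = TASK_RE.search(line)
        if match:
            current_task = match.group("name")
        if SSH_ERROR in line:
            yield current_task, list(recent) + [line]
        recent.append(line)


def count_by_task(lines, context=10):
    """Count SSH failures per task name (None if no banner was seen yet)."""
    return Counter(task for task, _ in find_unreachable(lines, context))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_unreachable.py /path/to/teuthology.log
    with open(sys.argv[1], errors="replace") as log:
        for task, block in find_unreachable(log):
            print(f"--- task: {task}")
            print("\n".join(block))
```

Tallying the results with `count_by_task` would show whether the failures really cluster on the four tasks listed above or are spread more evenly across the playbook.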
Files