Bug #62650

closed

Various SSH errors are preventing many jobs from completing properly

Added by Zack Cerza 8 months ago. Updated 6 months ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

We've been having more and more issues with SSH errors recently:
https://sentry.ceph.com/organizations/ceph/issues/?end=2023-08-30T23%3A59%3A59&query=paramiko&start=2023-08-22T00%3A00%3A00&utc=true

I found a fix for the AttributeError, but there's clearly more going on at this point:
Sentry issue: https://sentry.ceph.com/share/issue/e9092ab6059e4ea299350022b9b2cb52/
Fix: https://github.com/ceph/teuthology/pull/1886

This issue alone has occurred over 600 times in the last 24h: https://sentry.ceph.com/share/issue/ef95cc1bf37f4e89a849c9a1c5e26a6b/

I noticed that all of the hosts it affected were running CentOS Stream 9, and I've narrowed this particular issue down to an SSH key incompatibility.

Actions #1

Updated by Zack Cerza 8 months ago

Adding a new key for teuthworker: https://github.com/ceph/keys/pull/431
Fixing keys.git's broken update script: https://github.com/ceph/keys/pull/432

Actions #2

Updated by Zack Cerza 8 months ago

A freshly reimaged node didn't have the new key present, and it's in @all: https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub

It's possible something FOG-related needs to be updated to close the loop here.
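
To check whether a given key has actually landed on a node, something like the following could help. It is only a minimal sketch: `key_fields` and `has_key` are hypothetical helper names, not part of teuthology or keys.git, and the parsing is simplified (quoted options containing spaces are not handled). It compares key type and key blob and ignores the trailing comment.

```python
def key_fields(line):
    """Return (key_type, key_blob) for one public-key line, or None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split()
    # authorized_keys lines may start with options, so look for the
    # key-type field and take the blob that follows it.
    # Simplification: quoted options containing spaces are not handled.
    for i, part in enumerate(parts[:-1]):
        if part.startswith(("ssh-", "ecdsa-", "sk-")):
            return part, parts[i + 1]
    return None


def has_key(authorized_keys_text, pub_key_line):
    """True if the key in pub_key_line appears in authorized_keys_text."""
    wanted = key_fields(pub_key_line)
    if wanted is None:
        raise ValueError("not a public key line")
    return any(
        key_fields(line) == wanted
        for line in authorized_keys_text.splitlines()
    )
```

One would feed it the contents of the node's `~/.ssh/authorized_keys` and the new public key line, assuming that file is where the image installs the keys.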

Actions #3

Updated by adam kraitman 8 months ago

Hey Zack, the testnodes ansible role adds that new SSH key.

Actions #4

Updated by Zack Cerza 8 months ago

adam kraitman wrote:

Hey Zack, the testnodes ansible role adds that new SSH key.

Right. The issue is that the jobs are failing before they can even run ansible.

Actions #5

Updated by Zack Cerza 8 months ago

While testing a teuthology branch that is intended to improve error handling and logging during SSH connection attempts, I ran `journalctl -f` on a testnode while a job was about to fail with the `EOFError` issue.

http://qa-proxy.ceph.com/teuthology/zack-2023-08-31_18:49:42-smoke-main-distro-default-smithi/7385977/teuthology.log

<snip>
2023-08-31T18:58:34.920 DEBUG:teuthology.orchestra.connection:{'hostname': 'smithi033.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2023-08-31T18:58:34.938 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:58:38.959 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:58:45.983 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:58:56.011 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:59:09.039 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:59:25.072 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:59:44.109 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:06.149 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:31.192 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:59.236 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:59.237 ERROR:teuthology.run_tasks:Saw exception from tasks.
<snip>
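
The gaps between the failed attempts above grow by roughly three seconds each time (about 4 s, 7 s, 10 s, up to 28 s), which suggests a linearly increasing delay between retries rather than a fixed one. A small sketch for pulling those gaps out of a teuthology.log excerpt follows; `retry_gaps` is a hypothetical helper, and it assumes each line starts with a timestamp in the format shown above.

```python
from datetime import datetime


def retry_gaps(log_lines, marker="Error authenticating"):
    """Return whole-second gaps between consecutive lines containing marker."""
    stamps = [
        # The timestamp is the first space-separated field of each line.
        datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")
        for line in log_lines
        if marker in line
    ]
    return [
        round((later - earlier).total_seconds())
        for earlier, later in zip(stamps, stamps[1:])
    ]
```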

And from the journal:

Aug 31 19:00:06 smithi033 sshd[6847]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:06 smithi033 sshd[6847]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:06 smithi033 sshd[6847]: fatal: mm_answer_sign: sign: error in libcrypto
Aug 31 19:00:25 smithi033 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Aug 31 19:00:31 smithi033 sshd[6852]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:31 smithi033 sshd[6852]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:31 smithi033 sshd[6852]: fatal: mm_answer_sign: sign: error in libcrypto
Aug 31 19:00:59 smithi033 sshd[6854]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:59 smithi033 sshd[6854]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:59 smithi033 sshd[6854]: fatal: mm_answer_sign: sign: error in libcrypto
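
The three fatal sshd entries (19:00:06, 19:00:31, 19:00:59) line up with the last three `EOFError` attempts in teuthology.log. To spot this pattern in a longer journal, a rough sketch like the one below could group sshd messages by PID and keep only the sessions that ended in a fatal error. `fatal_sshd_sessions` and the regular expression are assumptions based on the line format shown above, not an existing tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Aug 31 19:00:06 smithi033 sshd[6847]: fatal: mm_answer_sign: ..."
SSHD_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) \S+ sshd\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def fatal_sshd_sessions(journal_lines):
    """Map each sshd PID that logged a fatal error to its (timestamp, message) entries."""
    by_pid = defaultdict(list)
    for line in journal_lines:
        match = SSHD_LINE.match(line.strip())
        if match:
            by_pid[match["pid"]].append((match["ts"], match["msg"]))
    return {
        pid: entries
        for pid, entries in by_pid.items()
        if any(msg.startswith("fatal:") for _, msg in entries)
    }
```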

What I did yesterday was generate a new ed25519 key for teuthworker; it may need to be pulled into the FOG image for CentOS 9 before it takes effect, though.

Actions #6

Updated by Dan Mick 8 months ago

I've recreated the FOG image with the new teuthworker key included. A nop test on CentOS 9 ran as teuthworker@teuthology (archive dir left on teuthology for examination if necessary).

Actions #7

Updated by Laura Flores 6 months ago

  • Status changed from In Progress to Resolved