Bug #62650
closed
Various SSH errors are preventing many jobs from completing properly
Description
We've been having more and more issues with SSH errors recently:
https://sentry.ceph.com/organizations/ceph/issues/?end=2023-08-30T23%3A59%3A59&query=paramiko&start=2023-08-22T00%3A00%3A00&utc=true
I found a fix for the AttributeError: https://sentry.ceph.com/share/issue/e9092ab6059e4ea299350022b9b2cb52/ https://github.com/ceph/teuthology/pull/1886 - but there's clearly more going on at this point.
This issue alone has occurred over 600 times in the last 24h: https://sentry.ceph.com/share/issue/ef95cc1bf37f4e89a849c9a1c5e26a6b/
I noticed that all of the hosts it affected were CentOS 9.Stream, and I've narrowed this particular issue down to an SSH key incompatibility.
Updated by Zack Cerza 8 months ago
Adding a new key for teuthworker: https://github.com/ceph/keys/pull/431
Fixing keys.git's broken update script: https://github.com/ceph/keys/pull/432
Updated by Zack Cerza 8 months ago
A freshly reimaged node didn't have the new key present, and it's in @all: https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub
It's possible something FOG-related needs to be updated to close the loop here.
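One way to confirm which keys a reimaged node is missing is to diff its `authorized_keys` against the published key list. The sketch below is only an illustration: the helper names are hypothetical, the key blobs are placeholders rather than real keys, and it assumes each line starts with the key type (no leading options field).

```python
def key_entries(text):
    """Collect (key type, key blob) pairs from authorized_keys-style text."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Assumption: no options prefix, so fields are: type, blob, [comment].
        if len(fields) >= 2:
            entries.add((fields[0], fields[1]))
    return entries


def missing_keys(expected_text, installed_text):
    """Return keys that are published but not installed on the node."""
    return key_entries(expected_text) - key_entries(installed_text)


# Placeholder data standing in for @all.pub and a node's authorized_keys.
published = """\
ssh-rsa PLACEHOLDER_OLD_BLOB teuthworker@old
ssh-ed25519 PLACEHOLDER_NEW_BLOB teuthworker@new
"""
installed = """\
# contents baked into the image
ssh-rsa PLACEHOLDER_OLD_BLOB teuthworker@old
"""

missing = missing_keys(published, installed)
```

With the placeholder data, `missing` contains only the ed25519 entry, which is the situation described here: the key is published but not on the freshly reimaged node.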
Updated by adam kraitman 8 months ago
Hey Zack, the testnodes ansible role adds that new SSH key.
Updated by Zack Cerza 8 months ago
adam kraitman wrote:
Hey Zack, the testnodes ansible role adds that new SSH key.
Right. The issue is that the jobs are failing before they can even run ansible.
Updated by Zack Cerza 8 months ago
While testing a teuthology branch intended to improve error handling and logging during SSH connection attempts, I ran `journalctl -f` on a testnode while a job was about to fail with the `EOFError` issue.
<snip>
2023-08-31T18:58:34.920 DEBUG:teuthology.orchestra.connection:{'hostname': 'smithi033.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2023-08-31T18:58:34.938 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:58:38.959 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:58:45.983 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:58:56.011 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:59:09.039 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:59:25.072 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T18:59:44.109 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:06.149 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:31.192 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:59.236 ERROR:teuthology.orchestra.connection:Error authenticating with smithi033.front.sepia.ceph.com: EOFError
2023-08-31T19:00:59.237 ERROR:teuthology.run_tasks:Saw exception from tasks.
<snip>
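The timestamps in that excerpt give a rough picture of the retry schedule. As a quick sketch (the `retry_gaps` helper is hypothetical, not part of teuthology), the gaps between the ten `EOFError` lines can be computed like this:

```python
from datetime import datetime

# Timestamps of the ten EOFError lines in the excerpt above.
ATTEMPTS = [
    "2023-08-31T18:58:34.938",
    "2023-08-31T18:58:38.959",
    "2023-08-31T18:58:45.983",
    "2023-08-31T18:58:56.011",
    "2023-08-31T18:59:09.039",
    "2023-08-31T18:59:25.072",
    "2023-08-31T18:59:44.109",
    "2023-08-31T19:00:06.149",
    "2023-08-31T19:00:31.192",
    "2023-08-31T19:00:59.236",
]


def retry_gaps(timestamps):
    """Return the gaps between consecutive timestamps, rounded to seconds."""
    times = [datetime.fromisoformat(ts) for ts in timestamps]
    return [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = retry_gaps(ATTEMPTS)   # seconds between consecutive attempts
total = sum(gaps)             # seconds from first to last attempt
```

The gaps come out as 4, 7, 10, 13, 16, 19, 22, 25 and 28 seconds, so each wait is about 3 seconds longer than the last and the job spends roughly 144 seconds retrying before it gives up. Whether that growth comes entirely from the configured sleep or partly from the connection attempt itself isn't visible in the log.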
And from the journal:
Aug 31 19:00:06 smithi033 sshd[6847]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:06 smithi033 sshd[6847]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:06 smithi033 sshd[6847]: fatal: mm_answer_sign: sign: error in libcrypto
Aug 31 19:00:25 smithi033 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Aug 31 19:00:31 smithi033 sshd[6852]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:31 smithi033 sshd[6852]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:31 smithi033 sshd[6852]: fatal: mm_answer_sign: sign: error in libcrypto
Aug 31 19:00:59 smithi033 sshd[6854]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:59 smithi033 sshd[6854]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:59 smithi033 sshd[6854]: fatal: mm_answer_sign: sign: error in libcrypto
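To spot this pattern across more nodes, the journal output can be grouped by sshd PID and filtered for sessions that ended in a `fatal:` message. This is a rough sketch: the regular expression and the `failed_sshd_sessions` helper are assumptions based only on the line format shown above.

```python
import re
from collections import defaultdict

# Journal lines copied from the excerpt above.
JOURNAL = """\
Aug 31 19:00:06 smithi033 sshd[6847]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:06 smithi033 sshd[6847]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:06 smithi033 sshd[6847]: fatal: mm_answer_sign: sign: error in libcrypto
Aug 31 19:00:25 smithi033 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Aug 31 19:00:31 smithi033 sshd[6852]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:31 smithi033 sshd[6852]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:31 smithi033 sshd[6852]: fatal: mm_answer_sign: sign: error in libcrypto
Aug 31 19:00:59 smithi033 sshd[6854]: rexec line 27: Deprecated option UsePrivilegeSeparation
Aug 31 19:00:59 smithi033 sshd[6854]: main: sshd: ssh-rsa algorithm is disabled
Aug 31 19:00:59 smithi033 sshd[6854]: fatal: mm_answer_sign: sign: error in libcrypto
"""

# Matches "<month> <day> <time> <host> sshd[<pid>]: <message>".
SSHD_LINE = re.compile(r"^\w{3} +\d+ [\d:]{8} \S+ sshd\[(\d+)\]: (.*)$")


def failed_sshd_sessions(journal_text):
    """Map sshd PID -> messages, keeping only PIDs that logged a fatal error."""
    messages = defaultdict(list)
    for line in journal_text.splitlines():
        match = SSHD_LINE.match(line)
        if match:  # lines from other units (e.g. systemd) are ignored
            pid, message = match.groups()
            messages[pid].append(message)
    return {
        pid: msgs
        for pid, msgs in messages.items()
        if any(msg.startswith("fatal:") for msg in msgs)
    }


failed = failed_sshd_sessions(JOURNAL)
```

For the excerpt above, all three sshd processes (6847, 6852, 6854) show the same sequence: the `ssh-rsa algorithm is disabled` notice followed by the fatal `error in libcrypto`. Their timestamps line up with the last three `EOFError` lines in the teuthology log.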
What I did yesterday was generate a new ed25519 key for teuthworker; it may need to be pulled into the FOG image for CentOS 9 before it takes effect, though.
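A quick way to see which key types an image actually carries is to count the entries in its key file by algorithm. This is a minimal sketch with a hypothetical helper and placeholder key blobs. It assumes the incompatibility is tied to keys of type `ssh-rsa`, which the journal messages suggest but which isn't spelled out here.

```python
from collections import Counter


def key_types(pubkey_text):
    """Count public key entries by their algorithm field."""
    counts = Counter()
    for line in pubkey_text.splitlines():
        fields = line.split()
        # Expect "<type> <blob> [comment]"; skip blanks and comment lines.
        if len(fields) >= 2 and not fields[0].startswith("#"):
            counts[fields[0]] += 1
    return counts


# Hypothetical entries; the blobs are placeholders, not real keys.
SAMPLE = """\
# keys found in an image
ssh-rsa PLACEHOLDER_OLD_BLOB teuthworker@old
ssh-ed25519 PLACEHOLDER_NEW_BLOB teuthworker@new
"""

types = key_types(SAMPLE)
has_ed25519 = types["ssh-ed25519"] > 0
```

If `has_ed25519` is false for the CentOS 9 image, the new teuthworker key hasn't reached it yet.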
Updated by Laura Flores 6 months ago
- Status changed from In Progress to Resolved