Bug #15236
openstack: SSHException: Key-exchange timed out
Status: Closed
Description
Upstream discussion
https://github.com/paramiko/paramiko/pull/705
Description
The job filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml msgr/async.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/readwrite.yaml}' consistently fails on OpenStack at approximately the same point in the test (~90 minutes in).
Steps to reproduce:
filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml msgr/async.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/readwrite.yaml}'
teuthology-suite --filter="$filter" --throttle 30 --suite rados --suite-branch wip-12471-resources-hint --distro ubuntu --ceph loic-infernalis --machine-type openstack /home/ubuntu/teuthology/teuthology/openstack/test/openstack.yaml
What happens:
- http://167.114.239.68:8081/ubuntu-2015-10-16_21:02:38-rados-loic-infernalis---basic-openstack/1132/ (8GB RAM, 3 x 30GB disk)
- http://167.114.239.68:8081/ubuntu-2015-10-16_23:07:49-rados-loic-infernalis---basic-openstack/1133/ (30GB RAM, 3 x 30GB disk)
- http://167.114.239.68:8081/ubuntu-2015-10-18_08:55:56-rados-loic-infernalis---basic-openstack/1134/ (8GB RAM, 3 x 200GB disk)
What is expected:
filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr/async.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/readwrite.yaml}'
teuthology-suite --priority 101 --suite rados --filter="$filter" --suite-branch master --distro ubuntu --email loic@dachary.org --ceph loic-infernalis --machine-type vps
Updated by Loïc Dachary about 8 years ago
- Has duplicate Bug #8433: SSHException: Key-exchange timed out waiting for key negotiation added
Updated by Loïc Dachary about 8 years ago
ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x --suite-branch master --ceph wip-14828-with-fsmap-fix --ceph-git-url https://github.com/SUSE/ceph --filter 'upgrade:infernalis-x/stress-split/{0-cluster/start.yaml 1-infernalis-install/infernalis.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/centos_7.2.yaml}'
Updated by Loïc Dachary about 8 years ago
paramiko/transport.py:

def _send_user_message(self, data):
    """
    send a message, but block if we're in key negotiation. this is used
    for user-initiated requests.
    """
    start = time.time()
    while True:
        self.clear_to_send.wait(0.1)
        if not self.active:
            self._log(DEBUG, 'Dropping user packet because connection is dead.')
            return
        self.clear_to_send_lock.acquire()
        if self.clear_to_send.is_set():
            break
        self.clear_to_send_lock.release()
        if time.time() > start + self.clear_to_send_timeout:
            raise SSHException('Key-exchange timed out waiting for key negotiation')
    try:
        self._send_message(data)
    finally:
        self.clear_to_send_lock.release()
The default is self.clear_to_send_timeout = 30.0, which looks like a long enough timeout.
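
As a quick mitigation, the attribute can be raised on a live connection without patching paramiko itself. A minimal sketch; the hostname and username are placeholders:

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('target.example.com', username='ubuntu')  # placeholder target

# clear_to_send_timeout is a plain attribute on Transport (30.0 by default),
# so it can be raised directly on the live transport.
client.get_transport().clear_to_send_timeout = 300.0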
Updated by Loïc Dachary about 8 years ago
The discussion that inspired the addition of this timeout (https://bugs.launchpad.net/paramiko/+bug/378511) suggests that increasing it would not be useful: before the timeout existed, the same code path simply blocked forever.
Updated by Loïc Dachary about 8 years ago
http://167.114.246.217:8081/ubuntu-2016-03-17_14:04:33-upgrade:infernalis-x-jewel---basic-openstack/
Running again, but with the following in teuthology/orchestra/monkey.py:
def patch_100_clear_to_send_timeout():
    """
    See https://github.com/paramiko/paramiko/pull/705 for more information
    """
    from paramiko.transport import Transport
    from paramiko.common import DEFAULT_WINDOW_SIZE, DEFAULT_MAX_PACKET_SIZE
    old_init = Transport.__init__

    def new_init(self,
                 sock,
                 default_window_size=DEFAULT_WINDOW_SIZE,
                 default_max_packet_size=DEFAULT_MAX_PACKET_SIZE,
                 gss_kex=False,
                 gss_deleg_creds=True):
        old_init(self,
                 sock,
                 default_window_size,
                 default_max_packet_size,
                 gss_kex,
                 gss_deleg_creds)
        self.clear_to_send_timeout = 300.0

    Transport.__init__ = new_init
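
As I understand it, teuthology's monkey module applies every module-level patch_* function at startup, in sorted name order (hence the 100 prefix). A quick sanity check that the override takes effect; it assumes Transport can be constructed on an unconnected socket, which holds for the paramiko versions in use here since no SSH traffic happens until start_client():

import socket

from paramiko.transport import Transport

patch_100_clear_to_send_timeout()
# No network I/O: __init__ only stores the socket and sets defaults.
t = Transport(socket.socket())
assert t.clear_to_send_timeout == 300.0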
Updated by Loïc Dachary about 8 years ago
2016-03-17T17:51:14.405 INFO:tasks.thrashosds.thrasher:Growing pool rbd
...
2016-03-17T17:56:14.801 INFO:tasks.rados.rados.0.target167114250088.stdout:1912: expect (ObjNum 585 snap 183 seq_num 585)
The monkeypatch helps in the sense that five minutes now pass before it times out. It does not help in the sense that the connection is still stuck forever. I think the root of the problem is in paramiko, which ends up in a deadlock, waiting forever for no good reason.
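
The shape of the failure, as a minimal standalone sketch (not paramiko code): clear_to_send is an event that is cleared while a rekey is in flight and set when it completes. If the rekey never completes, no timeout value makes the wait succeed; it only changes how long we wait before giving up.

import threading
import time

clear_to_send = threading.Event()  # never set: the rekey never completes
timeout = 30.0                     # raising this only postpones the failure

start = time.time()
while True:
    clear_to_send.wait(0.1)
    if clear_to_send.is_set():
        break  # unreachable while the rekey is stuck
    if time.time() > start + timeout:
        raise Exception('Key-exchange timed out waiting for key negotiation')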
Updated by Loïc Dachary about 8 years ago
Let's see how that PR is received: https://github.com/paramiko/paramiko/pull/705
Updated by Loïc Dachary about 8 years ago
Trying again by disabling rekeying entirely
def patch_100_trigger_rekey():
    from paramiko.packet import Packetizer
    Packetizer._trigger_rekey = lambda self: True
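
For context (my reading of paramiko internals): _trigger_rekey is called by the Packetizer once its byte or packet counters cross the rekey thresholds, so the no-op above effectively disables renegotiation. The thresholds themselves:

from paramiko.packet import Packetizer

# With the no-op patch above, crossing these thresholds no longer triggers
# a key re-negotiation.
print(Packetizer.REKEY_PACKETS)  # 2**29 packets
print(Packetizer.REKEY_BYTES)    # 2**29 bytes (~512 MB transferred)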
- fail http://167.114.246.217:8081/ubuntu-2016-03-18_09:42:16-upgrade:infernalis-x-jewel---basic-openstack/
  (the run still fails, but no longer with the re-keying error)
Updated by Loïc Dachary about 8 years ago
Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-18_02:10:02-upgrade:hammer-x-jewel-distro-basic-openstack/
Job: 17686
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-03-18_02:10:02-upgrade:hammer-x-jewel-distro-basic-openstack/17686/teuthology.log
2016-03-18T08:06:50.386 INFO:tasks.rados.rados.0.target066096.stdout:2906: snap_remove snap 295
2016-03-18T08:06:50.484 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 639, in wrapper
return func(self)
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 686, in do_thrash
timeout=self.config.get('timeout')
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1615, in wait_for_recovery
while not self.is_recovered():
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1552, in is_recovered
return self.get_num_active_recovered() == self.get_num_pgs()
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1480, in get_num_active_recovered
pgs = self.get_pg_stats()
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1325, in get_pg_stats
out = self.raw_cluster_cmd('pg', 'dump', '--format=json')
File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 810, in raw_cluster_cmd
stdout=StringIO(),
File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 196, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
r.wait()
File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 94, in wait
greenlet.get()
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 480, in get
self._raise_exception()
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 171, in _raise_exception
reraise(*self.exc_info)
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 534, in run
result = self._run(*self.args, **self.kwargs)
File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 228, in copy_file_to
return handler(f, dst)
File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
buf = fsrc.read(length)
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/file.py", line 153, in read
new_data = self._read(read_size)
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 1215, in _read
return self.channel.recv(size)
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 595, in recv
self.transport._send_user_message(m)
File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1424, in _send_user_message
raise SSHException('Key-exchange timed out waiting for key negotiation')
SSHException: Key-exchange timed out waiting for key negotiation
Updated by Loïc Dachary about 8 years ago
Running the same test as above, but on smithi, to check that it succeeds on bare metal (or not ;-):
filter='upgrade:infernalis-x/stress-split/{0-cluster/start.yaml 1-infernalis-install/infernalis.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/centos_7.2.yaml}'
teuthology-suite --priority 101 --suite upgrade/infernalis-x --suite-branch master --ceph jewel --filter="$filter" --email loic@dachary.org --machine-type smithi
Updated by Loïc Dachary about 8 years ago
fail http://167.114.246.217:8081/ubuntu-2016-03-18_09:42:16-upgrade:infernalis-x-jewel---basic-openstack/
(the run still fails, but no longer with the re-keying error)
It appears to lose one of the six OSDs.
I was just looking at the yaml, and noticed that there is no openstack
stanza, which means the workers are getting provisioned on default instances. The roles are like this:
roles:
- - mon.a
  - mon.b
  - mon.c
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
- - client.0
Maybe we need to increase the memory or disk space on one or more of the instances?
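
For illustration, the missing stanza would give teuthology a per-instance resource hint. Shown here as the Python dict the suite's openstack.yaml would load into; the field names and values are illustrative guesses modelled on the 8GB / 3 x 30GB runs above, not the authoritative schema:

# Illustrative only: mirrors the shape of an openstack stanza once parsed.
openstack_hint = {
    'openstack': [{
        'machine': {'cpus': 1, 'ram': 8000, 'disk': 40},  # ram in MB, disk in GB
        'volumes': {'count': 3, 'size': 30},              # three 30GB volumes
    }],
}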
Updated by Loïc Dachary about 8 years ago
Trying with another test that looks like the one that originally triggered this problem (the exact one does not exist anymore).
filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml msgr/simple.yaml msgr-failures/few.yaml thrashers/pggrow.yaml workloads/small-objects.yaml}'
ceph-workbench ceph-qa-suite --verbose --filter="$filter" --suite rados --suite-branch infernalis --ceph infernalis
Updated by Loïc Dachary about 8 years ago
ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x --suite-branch master --ceph wip-14828-with-fsmap-fix --ceph-git-url https://github.com/SUSE/ceph --filter 'upgrade:infernalis-x/stress-split/{0-cluster/start.yaml 1-infernalis-install/infernalis.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/centos_7.2.yaml}'
- fail http://167.114.241.163:8081/ubuntu-2016-03-22_14:36:59-upgrade:infernalis-x-jewel---basic-openstack/
- not enough disk space
Updated by Loïc Dachary about 8 years ago
https://github.com/ceph/ceph-qa-suite/pull/907
ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite.git --suite-branch wip-15236-infernalis-x --ceph jewel --ceph-git-url https://github.com/ceph/ceph
- pass http://167.114.229.211:8081/ubuntu-2016-03-24_11:06:57-upgrade:infernalis-x-jewel---basic-openstack/
except stress-split-x, because the logs used more than 40GB; at least 100GB would be required
Updated by Loïc Dachary about 8 years ago
ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x/stress-split --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite.git --suite-branch wip-15236-infernalis-x --ceph jewel --ceph-git-url https://github.com/ceph/ceph
The CentOS 7.2 test failed, and it also failed to copy the logs:
2016-03-25T16:39:02.282 DEBUG:teuthology.misc:Transferring archived files from target167114245203:/var/log/ceph to /usr/share/nginx/html/ubuntu-2016-03-25_07:15:12-upgrade:infernalis-x:stress-split-jewel---basic-openstack/1/remote/target167114245203/log
2016-03-25T16:39:02.283 INFO:teuthology.orchestra.run.target167114245203:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'"
2016-03-25T16:39:02.743 INFO:teuthology.orchestra.run.target167114245203:Running: 'sudo tar cz -f /tmp/tmpDvKGle -C /var/log/ceph -- .'
2016-03-25T16:41:28.822 INFO:teuthology.orchestra.run.target167114245203:Running: 'sudo chmod 0666 /tmp/tmpDvKGle'
The run stalled, probably after a disk-full error that went undetected. The logs are known to be over 40GB and the machine running the cluster only has 20GB; even compressed, that may not be enough.
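
A pre-flight check could make this fail loudly instead of stalling. A sketch of the idea (my suggestion, not existing teuthology behaviour; the path and the size estimate are placeholders):

import os

def free_bytes(path):
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

needed = 20 * 1024 ** 3  # placeholder estimate for the compressed log tarball
if free_bytes('/tmp') < needed:
    raise RuntimeError('not enough free space in /tmp for the log tarball')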
Updated by Loïc Dachary about 8 years ago
- Status changed from In Progress to Fix Under Review
Updated by Loïc Dachary about 8 years ago
ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 100 --suite upgrade/infernalis-x/stress-split --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite.git --suite-branch wip-15236-infernalis-x --ceph jewel --ceph-git-url https://github.com/ceph/ceph
Updated by Loïc Dachary about 8 years ago
- Description updated (diff)
- Status changed from Fix Under Review to 15
Updated by Greg Farnum about 7 years ago
- Status changed from 15 to Closed
The whole run-on-openstack approach has changed since then and I see there's a fix commit.