Bug #15236: openstack: SSHException: Key-exchange timed out - Ceph - Ceph

Actions

Copy link

Bug #15236

closed

openstack: SSHException: Key-exchange timed out

Added by Loïc Dachary about 8 years ago. Updated about 7 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Upstream discussion¶

https://github.com/paramiko/paramiko/pull/705

Description¶

The job filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml msgr/async.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/readwrite.yaml} consistently fails with OpenStack at approximately the same point in the test (~90minutes in).

Steps to reproduce:

filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml msgr/async.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/readwrite.yaml}'
teuthology-suite --filter="$filter" --throttle 30 --suite rados --suite-branch wip-12471-resources-hint --distro ubuntu --ceph loic-infernalis --machine-type openstack /home/ubuntu/teuthology/teuthology/openstack/test/openstack.yaml

What happens:

http://167.114.239.68:8081/ubuntu-2015-10-16_21:02:38-rados-loic-infernalis---basic-openstack/1132/ (8GB RAM, 3 x 30GB disk)
http://167.114.239.68:8081/ubuntu-2015-10-16_23:07:49-rados-loic-infernalis---basic-openstack/1133/ (30GB RAM, 3 x 30GB disk)
http://167.114.239.68:8081/ubuntu-2015-10-18_08:55:56-rados-loic-infernalis---basic-openstack/1134/ (8GB RAM, 3 x 200GB disk)

What is expected:

filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr/async.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/readwrite.yaml}'
teuthology-suite --priority 101 --suite rados --filter="$filter" --suite-branch master --distro ubuntu --email loic@dachary.org --ceph loic-infernalis --machine-type vps

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

Has duplicate Bug #8433: SSHException: Key-exchange timed out waiting for key negotiation added

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

cw ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x --suite-branch master --ceph wip-14828-with-fsmap-fix --ceph-git-url https://github.com/SUSE/ceph --filter 'upgrade:infernalis-x/stress-split/{0-cluster/start.yaml 1-infernalis-install/infernalis.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/centos_7.2.yaml}'

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

paramiko/transport.py

    def _send_user_message(self, data):
        """ 
        send a message, but block if we're in key negotiation.  this is used
        for user-initiated requests.
        """ 
        start = time.time()
        while True:
            self.clear_to_send.wait(0.1)
            if not self.active:
                self._log(DEBUG, 'Dropping user packet because connection is dead.')
                return
            self.clear_to_send_lock.acquire()
            if self.clear_to_send.is_set():
                break
            self.clear_to_send_lock.release()
            if time.time() > start + self.clear_to_send_timeout:
                raise SSHException('Key-exchange timed out waiting for key negotiation')
        try:
            self._send_message(data)
        finally:
            self.clear_to_send_lock.release()

and self.clear_to_send_timeout = 30.0 which looks like a long enough timeout.

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

The patch that inspired the addition of this timeout https://bugs.launchpad.net/paramiko/+bug/378511 suggests that it would not be useful to increase the timeout because it would block forever.

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

http://167.114.246.217:8081/ubuntu-2016-03-17_14:04:33-upgrade:infernalis-x-jewel---basic-openstack/

running again but with the following in teuthology/orchestra/monkey.py

def patch_100_clear_to_send_timeout():
    """ 
    See https://github.com/paramiko/paramiko/pull/705 for more information
    """ 
    from paramiko.transport import Transport
    from paramiko.common import DEFAULT_WINDOW_SIZE, DEFAULT_MAX_PACKET_SIZE
    old_init = Transport.__init__
    def new_init(self,
                 sock,
                 default_window_size=DEFAULT_WINDOW_SIZE,
                 default_max_packet_size=DEFAULT_MAX_PACKET_SIZE,
                 gss_kex=False,
                 gss_deleg_creds=True):
        old_init(self,
                 sock,
                 default_window_size,
                 default_max_packet_size,
                 gss_kex,
                 gss_deleg_creds)
        self.clear_to_send_timeout = 300.0
    Transport.__init__ = new_init

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

2016-03-17T17:51:14.405 INFO:tasks.thrashosds.thrasher:Growing pool rbd
...
2016-03-17T17:56:14.801 INFO:tasks.rados.rados.0.target167114250088.stdout:1912:  expect (ObjNum 585 snap 183 seq_num 585)

the monkeypatch helps in the sense that 5min pass before it timesout. It does not help in the sense that it's stuck forever. I think the root of the problem is with paramiko which ends up in a deadlock situation, waiting forever, for no good reason.

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

let see how that PR is received https://github.com/paramiko/paramiko/pull/705

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

Trying again by disabling rekeying entirely

def patch_100_trigger_rekey():

    from paramiko.packet import Packetizer
    Packetizer._trigger_rekey = lambda self: True

fail http://167.114.246.217:8081/ubuntu-2016-03-18_09:42:16-upgrade:infernalis-x-jewel---basic-openstack/
but does not fail with re-keying

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-18_02:10:02-upgrade:hammer-x-jewel-distro-basic-openstack/
Job: 17686
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-03-18_02:10:02-upgrade:hammer-x-jewel-distro-basic-openstack/17686/teuthology.log

2016-03-18T08:06:50.386 INFO:tasks.rados.rados.0.target066096.stdout:2906: snap_remove snap 295
2016-03-18T08:06:50.484 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 639, in wrapper
    return func(self)
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 686, in do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1615, in wait_for_recovery
    while not self.is_recovered():
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1552, in is_recovered
    return self.get_num_active_recovered() == self.get_num_pgs()
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1480, in get_num_active_recovered
    pgs = self.get_pg_stats()
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 1325, in get_pg_stats
    out = self.raw_cluster_cmd('pg', 'dump', '--format=json')
  File "/home/teuthworker/src/ceph-qa-suite_jewel/tasks/ceph_manager.py", line 810, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 196, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 94, in wait
    greenlet.get()
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 480, in get
    self._raise_exception()
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 171, in _raise_exception
    reraise(*self.exc_info)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 228, in copy_file_to
    return handler(f, dst)
  File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
    buf = fsrc.read(length)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/file.py", line 153, in read
    new_data = self._read(read_size)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 1215, in _read
    return self.channel.recv(size)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 595, in recv
    self.transport._send_user_message(m)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1424, in _send_user_message
    raise SSHException('Key-exchange timed out waiting for key negotiation')
SSHException: Key-exchange timed out waiting for key negotiation

Actions

Copy link

#10

Updated by Loïc Dachary about 8 years ago

Running the same test as above but on smithi to assert they succeed on bare metal (or not ;-)

filter='upgrade:infernalis-x/stress-split/{0-cluster/start.yaml 1-infernalis-install/infernalis.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-worklo\
ad/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps\
-many-objects.yaml} distros/centos_7.2.yaml}'
teuthology-suite --priority 101 --suite upgrade/infernalis-x --suite-branch master --ceph jewel --filter="$filter" --email loic@dachary.org --machine-type smithi

pass http://pulpito.ceph.com/loic-2016-03-18_09:49:07-upgrade:infernalis-x-jewel---basic-smithi

Actions

Copy link

#11

Updated by Loïc Dachary about 8 years ago

fail http://167.114.246.217:8081/ubuntu-2016-03-18_09:42:16-upgrade:infernalis-x-jewel---basic-openstack/
but does not fail with re-keying

It appears to lose one of the six OSDs.

I was just looking at the yaml, and noticed that there is no openstack stanza, which means the workers are getting provisioned on default instances. The roles are like this:

  roles:
  - - mon.a
    - mon.b
    - mon.c
    - mds.a
    - osd.0
    - osd.1
    - osd.2
  - - osd.3
    - osd.4
    - osd.5
  - - client.0

Maybe we need to increase the memory or disk space on one or more of the instances?

Actions

Copy link

#12

Updated by Loïc Dachary about 8 years ago

Trying with another test that look like the one that originally triggered this problem (the exact one does not exist anymore).

filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml msgr/simple.yaml msgr-failures/few.yaml thrashers/pggrow.yaml workloads/small-objects.yaml}'
ceph-workbench ceph-qa-suite --verbose --filter="$filter" --suite rados --suite-branch infernalis --ceph infernalis

pass http://167.114.241.163:8081/ubuntu-2016-03-21_10:07:45-rados-infernalis---basic-openstack

Actions

Copy link

#13

Updated by Loïc Dachary about 8 years ago

ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x --suite-branch master --ceph wip-14828-with-fsmap-fix --ceph-git-url https://github.com/SUSE/ceph --filter 'upgrade:infernalis-x/stress-split/{0-cluster/start.yaml 1-infernalis-install/infernalis.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/centos_7.2.yaml}'

fail http://167.114.241.163:8081/ubuntu-2016-03-22_14:36:59-upgrade:infernalis-x-jewel---basic-openstack/
- not enough disk space

Actions

Copy link

#14

Updated by Loïc Dachary about 8 years ago

https://github.com/ceph/ceph-qa-suite/pull/907

ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite.git --suite-branch wip-15236-infernalis-x --ceph jewel --ceph-git-url https://github.com/ceph/ceph

pass http://167.114.229.211:8081/ubuntu-2016-03-24_11:06:57-upgrade:infernalis-x-jewel---basic-openstack/

except stress-split-x because the logs used more than 40GB, required 100GB at least

Actions

Copy link

#15

Updated by Loïc Dachary about 8 years ago

ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 10 --suite upgrade/infernalis-x/stress-split --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite.git --suite-branch wip-15236-infernalis-x --ceph jewel --ceph-git-url https://github.com/ceph/ceph

failed http://167.114.229.211:8081/ubuntu-2016-03-25_07:15:12-upgrade:infernalis-x:stress-split-jewel---basic-openstack/

the centos 7.2 test failed but also failed to copy the logs

2016-03-25T16:39:02.282 DEBUG:teuthology.misc:Transferring archived files from target167114245203:/var/log/ceph to /usr/share/nginx/html/ubuntu-2016-03-25_07:15:12-upgrade:infernalis-x:stress-split-jewel---basic-openstack/1/remote/target167114245203/log
2016-03-25T16:39:02.283 INFO:teuthology.orchestra.run.target167114245203:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 
2016-03-25T16:39:02.743 INFO:teuthology.orchestra.run.target167114245203:Running: 'sudo tar cz -f /tmp/tmpDvKGle -C /var/log/ceph -- .'
2016-03-25T16:41:28.822 INFO:teuthology.orchestra.run.target167114245203:Running: 'sudo chmod 0666 /tmp/tmpDvKGle'

stalled, probably after a disk full error that went undetected. The logs are known to be over 40GB big and the machine running the cluster only has 20GB. Even compressed, it may not be enough.

Actions

Copy link

#16

Updated by Loïc Dachary about 8 years ago

Status changed from In Progress to Fix Under Review

https://github.com/ceph/teuthology/pull/826

Actions

Copy link

#17

Updated by Loïc Dachary about 8 years ago

ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 100 --suite upgrade/infernalis-x/stress-split --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite.git --suite-branch wip-15236-infernalis-x --ceph jewel --ceph-git-url https://github.com/ceph/ceph