Bug #45438

closed

teuthology/orchestra/connection: connection retry misses some exceptions

Added by Patrick Donnelly almost 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Immediate
Category: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2020-04-06T08:14:40.860 INFO:teuthology.orchestra.console:Performing hard reset of smithi205
2020-04-06T08:14:40.893 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi205.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power reset
2020-04-06T08:14:40.917 INFO:teuthology.orchestra.console:Hard reset for smithi205 completed
...
2020-04-06T08:16:11.025 DEBUG:teuthology.orchestra.remote:timed out
2020-04-06T08:16:11.025 DEBUG:teuthology.misc:waited 60.0049200058
2020-04-06T08:16:11.067 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
  File "/home/teuthworker/src/git.ceph.com_ceph_master/qa/tasks/ceph.py", line 162, in invoke_logrotate
    wait=False,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/cluster.py", line 64, in run
    return [remote.run(**kwargs) for remote in remotes]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 202, in run
    self.ensure_online()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 176, in ensure_online
    self.connect()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 72, in connect
    self.ssh = connection.connect(**args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/connection.py", line 108, in connect
    ssh.connect(**connect_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/util.py", line 283, in retry_on_signal
    return function()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/_socket2.py", line 249, in connect
    self._wait(self._write_event)
  File "src/gevent/_hub_primitives.py", line 284, in gevent.__hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 289, in gevent.__hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 280, in gevent.__hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 281, in gevent.__hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 151, in gevent.__waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 64, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/__greenlet_primitives.pxd", line 35, in gevent.__greenlet_primitives._greenlet_switch
timeout: timed out

From: /ceph/teuthology-archive/teuthology-2020-04-06_04:15:02-multimds-master-testing-basic-smithi/4927617/teuthology.log

and

2020-05-06T14:15:42.200 INFO:teuthology.orchestra.console:Performing hard reset of smithi041
2020-05-06T14:15:42.201 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi041.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power reset
2020-05-06T14:15:42.230 INFO:teuthology.orchestra.console:Hard reset for smithi041 completed
...
2020-05-06T14:15:45.204 INFO:teuthology.orchestra.run.smithi041:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-05-06T14:15:45.253 INFO:teuthology.orchestra.run.smithi068:> true
2020-05-06T14:15:45.273 INFO:teuthology.orchestra.run.smithi068:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-05-06T14:15:45.318 INFO:teuthology.orchestra.run.smithi073:> true
2020-05-06T14:15:45.337 INFO:teuthology.orchestra.run.smithi073:> sudo logrotate /etc/logrotate.d/ceph-test.conf
...
2020-05-06T14:16:12.333 INFO:teuthology.misc:Re-opening connections...
2020-05-06T14:16:12.334 INFO:teuthology.misc:trying to connect to ubuntu@smithi041.front.sepia.ceph.com
2020-05-06T14:16:12.336 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-05-06T14:16:12.337 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi041.front.sepia.ceph.com', 'timeout': 60}
2020-05-06T14:16:12.543 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~43s
2020-05-06T14:16:15.450 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi041.front.sepia.ceph.com', 'timeout': 60}
2020-05-06T14:16:19.660 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~50s
2020-05-06T14:16:25.824 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuri-testing-2020-05-05-1439/qa/tasks/ceph.py", line 162, in invoke_logrotate
    wait=False,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/cluster.py", line 64, in run
    return [remote.run(**kwargs) for remote in remotes]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 202, in run
    self.ensure_online()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 176, in ensure_online
    self.connect()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 72, in connect
    self.ssh = connection.connect(**args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/connection.py", line 108, in connect
    ssh.connect(**connect_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 368, in connect
    raise NoValidConnectionsError(errors)
NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.21.15.41

From: /ceph/teuthology-archive/yuriw-2020-05-05_20:57:01-multimds-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/5026248/teuthology.log
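Both tracebacks show an exception escaping from the connect call while the host was still rebooting: a socket timeout in the first log and paramiko's NoValidConnectionsError in the second. As a minimal sketch of the general idea only (this is not the teuthology code or the fix that was merged), a retry wrapper can treat a configurable set of exception classes as transient. The names connect_with_retry and TRANSIENT, and the attempts, delay and sleep parameters, are hypothetical and chosen for illustration.

```python
import socket
import time


def connect_with_retry(connect, retry_on, attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect  -- zero-argument callable that opens the connection (hypothetical)
    retry_on -- tuple of exception classes to treat as transient
    sleep    -- injectable so tests can avoid real waiting
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as exc:
            # Transient failure: remember it and try again after a pause.
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed with a transient error; surface the last one.
    raise last_error


# Assumption: these stdlib classes stand in for the failures in the logs.
# With paramiko installed, paramiko.ssh_exception.NoValidConnectionsError
# could be added to this tuple explicitly.
TRANSIENT = (socket.timeout, socket.error)
```

A caller would pass something like `lambda: ssh.connect(**connect_args)` as the connect argument. Exceptions outside the retry_on tuple propagate immediately, which is the failure mode this issue describes when the tuple is incomplete.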


Related issues 1 (0 open, 1 closed)

Related to teuthology - Bug #45255: Teuthology seems to timeout too soon after reboot and downstream tests fail (Resolved, Kefu Chai)

Actions #1

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from In Progress to Fix Under Review
Actions #2

Updated by Patrick Donnelly almost 4 years ago

  • Related to Bug #45255: Teuthology seems to timeout too soon after reboot and downstream tests fail added
Actions #3

Updated by Kefu Chai almost 4 years ago

  • Status changed from Fix Under Review to Resolved
Actions #4

Updated by David Galloway almost 4 years ago

Actions #5

Updated by Kyrylo Shatskyy almost 4 years ago

I've tried to reproduce the error.

Running the job against py2 teuthology produces a failure:
http://pulpito.ceph.com/kyr-2020-05-14_18:00:59-multimds-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/

Running it against the PR:
https://github.com/ceph/teuthology/pull/1477
which includes the revert of the suspected cause:
https://github.com/ceph/teuthology/commit/478bb3f661621c38ac0b9bb21389cc5b225c318d
makes it pass:
http://pulpito.ceph.com/kyr-2020-05-14_18:02:48-multimds-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/

The following commands were used to reproduce:
Failing:

teuthology-suite --seed 7575 -s multimds -c wip-yuri-testing-2020-05-05-1439 -m smithi --filter 'multimds/basic/{0-supported-random-distro$/{centos_latest.yaml} begin.yaml clusters/9-mds.yaml conf/{client.yaml mds.yaml mon.yaml osd.yaml} inline/yes.yaml mount/kclient/{mount.yaml overrides/{distro/stock/{k-stock.yaml rhel_8.yaml} ms-die-on-skipped.yaml}} objectstore-ec/filestore-xfs.yaml overrides/{basic/{frag_enable.yaml whitelist_health.yaml whitelist_wrongly_marked_down.yaml} fuse-default-perm-no.yaml} q_check_counter/check_counter.yaml tasks/cephfs_test_snapshots.yaml}' --subset 1/10 --teuthology-branch py2

Passing:

teuthology-suite --seed 7575 -s multimds -c wip-yuri-testing-2020-05-05-1439 -m smithi --filter 'multimds/basic/{0-supported-random-distro$/{centos_latest.yaml} begin.yaml clusters/9-mds.yaml conf/{client.yaml mds.yaml mon.yaml osd.yaml} inline/yes.yaml mount/kclient/{mount.yaml overrides/{distro/stock/{k-stock.yaml rhel_8.yaml} ms-die-on-skipped.yaml}} objectstore-ec/filestore-xfs.yaml overrides/{basic/{frag_enable.yaml whitelist_health.yaml whitelist_wrongly_marked_down.yaml} fuse-default-perm-no.yaml} q_check_counter/check_counter.yaml tasks/cephfs_test_snapshots.yaml}' --subset 1/10 --teuthology-branch refs/pull/1477/merge

Actions #7

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from Resolved to New
  • Assignee changed from Patrick Donnelly to Kyrylo Shatskyy
Actions #8

Updated by Kyrylo Shatskyy almost 4 years ago

Patrick, why is this reopened? Any new failures? Where are the logs?

Actions #9

Updated by Kyrylo Shatskyy almost 4 years ago

The issue is supposed to be resolved after we've merged the backport PR https://github.com/ceph/teuthology/pull/1477.

Actions #10

Updated by Patrick Donnelly almost 4 years ago

Kyrylo Shatskyy wrote:

The issue is supposed to be resolved after we've merged the backport PR https://github.com/ceph/teuthology/pull/1477.

Which commit? AFAIK this is now again broken because my (wrong) fix was reverted.

Actions #11

Updated by Kyrylo Shatskyy almost 4 years ago

Patrick Donnelly wrote:

Kyrylo Shatskyy wrote:

The issue is supposed to be resolved after we've merged the backport PR https://github.com/ceph/teuthology/pull/1477.

Which commit? AFAIK this is now again broken because my (wrong) fix was reverted.

The issue was opened against the py2 branch.
This commit https://github.com/ceph/teuthology/pull/1477/commits/e9ca1dc68ea4fa8463b7eca321cf74cf1c8a4213 has been merged to py2 as part of the PR and is supposed to fix it.

Your (wrong) fix was never merged to py2, so it has never been reverted from the py2 branch.
It was reverted from master, so if you have fresh logs for py2 dated after the backport PR was merged, it would be great to see them.

As I pointed out in previous comments, I reran the tests provided in the description of this issue twice, and both runs passed.

Actions #12

Updated by Kyrylo Shatskyy almost 4 years ago

I'm sorry, I got the dates wrong; the backport PR was merged only on May 15:

6844213 2020-05-15 23:00 +0800 Kefu Chai Merge pull request #1477 from kshtsk/wip-py2-backport-20200514

Actions #13

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from New to Closed

Okay, I'll just close this for now then and reopen if necessary.
