Bug #45438
teuthology/orchestra/connection: connection retry misses some exceptions
Status: Closed
Description
2020-04-06T08:14:40.860 INFO:teuthology.orchestra.console:Performing hard reset of smithi205
2020-04-06T08:14:40.893 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi205.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power reset
2020-04-06T08:14:40.917 INFO:teuthology.orchestra.console:Hard reset for smithi205 completed
...
2020-04-06T08:16:11.025 DEBUG:teuthology.orchestra.remote:timed out
2020-04-06T08:16:11.025 DEBUG:teuthology.misc:waited 60.0049200058
2020-04-06T08:16:11.067 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
  File "/home/teuthworker/src/git.ceph.com_ceph_master/qa/tasks/ceph.py", line 162, in invoke_logrotate
    wait=False,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/cluster.py", line 64, in run
    return [remote.run(**kwargs) for remote in remotes]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 202, in run
    self.ensure_online()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 176, in ensure_online
    self.connect()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 72, in connect
    self.ssh = connection.connect(**args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/connection.py", line 108, in connect
    ssh.connect(**connect_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/util.py", line 283, in retry_on_signal
    return function()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/_socket2.py", line 249, in connect
    self._wait(self._write_event)
  File "src/gevent/_hub_primitives.py", line 284, in gevent.__hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 289, in gevent.__hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 280, in gevent.__hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 281, in gevent.__hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 151, in gevent.__waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 64, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/__greenlet_primitives.pxd", line 35, in gevent.__greenlet_primitives._greenlet_switch
timeout: timed out
From: /ceph/teuthology-archive/teuthology-2020-04-06_04:15:02-multimds-master-testing-basic-smithi/4927617/teuthology.log
and
2020-05-06T14:15:42.200 INFO:teuthology.orchestra.console:Performing hard reset of smithi041
2020-05-06T14:15:42.201 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi041.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power reset
2020-05-06T14:15:42.230 INFO:teuthology.orchestra.console:Hard reset for smithi041 completed
...
2020-05-06T14:15:45.204 INFO:teuthology.orchestra.run.smithi041:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-05-06T14:15:45.253 INFO:teuthology.orchestra.run.smithi068:> true
2020-05-06T14:15:45.273 INFO:teuthology.orchestra.run.smithi068:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-05-06T14:15:45.318 INFO:teuthology.orchestra.run.smithi073:> true
2020-05-06T14:15:45.337 INFO:teuthology.orchestra.run.smithi073:> sudo logrotate /etc/logrotate.d/ceph-test.conf
...
2020-05-06T14:16:12.333 INFO:teuthology.misc:Re-opening connections...
2020-05-06T14:16:12.334 INFO:teuthology.misc:trying to connect to ubuntu@smithi041.front.sepia.ceph.com
2020-05-06T14:16:12.336 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-05-06T14:16:12.337 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi041.front.sepia.ceph.com', 'timeout': 60}
2020-05-06T14:16:12.543 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~43s
2020-05-06T14:16:15.450 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi041.front.sepia.ceph.com', 'timeout': 60}
2020-05-06T14:16:19.660 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~50s
2020-05-06T14:16:25.824 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuri-testing-2020-05-05-1439/qa/tasks/ceph.py", line 162, in invoke_logrotate
    wait=False,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/cluster.py", line 64, in run
    return [remote.run(**kwargs) for remote in remotes]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 202, in run
    self.ensure_online()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 176, in ensure_online
    self.connect()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 72, in connect
    self.ssh = connection.connect(**args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/connection.py", line 108, in connect
    ssh.connect(**connect_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 368, in connect
    raise NoValidConnectionsError(errors)
NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.21.15.41
From: /ceph/teuthology-archive/yuriw-2020-05-05_20:57:01-multimds-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/5026248/teuthology.log
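Both logs show the same shape of failure: the reconnect path in teuthology/orchestra/connection.py retries only some exception types, so errors such as paramiko's NoValidConnectionsError (or a gevent socket timeout) escape as uncaught exceptions instead of being retried while the rebooted node comes back. A minimal sketch of that pattern, with hypothetical names (this is not teuthology's actual code, and NoValidConnectionsError here is a stand-in for paramiko.ssh_exception.NoValidConnectionsError):

```python
import socket


class NoValidConnectionsError(Exception):
    """Stand-in for paramiko.ssh_exception.NoValidConnectionsError."""


def connect_with_retry(connect, attempts=3, catch=(socket.timeout,)):
    """Call connect() up to `attempts` times, retrying only when the
    raised exception is one of the types in `catch`.  Any other
    exception propagates immediately -- which is how an unexpected
    NoValidConnectionsError escapes a too-narrow retry loop."""
    last_exc = None
    for _ in range(attempts):
        try:
            return connect()
        except catch as exc:
            last_exc = exc
    raise last_exc
```

With the default narrow `catch`, a connect that raises NoValidConnectionsError fails on the first attempt; broadening `catch` to include that type makes the loop keep retrying until the host is reachable again (or attempts are exhausted), which is the behaviour the issue title asks for.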
Updated by Patrick Donnelly almost 4 years ago
- Status changed from In Progress to Fix Under Review
Updated by Patrick Donnelly almost 4 years ago
- Related to Bug #45255: Teuthology seems to timeout too soon after reboot and downstream tests fail added
Updated by Kefu Chai almost 4 years ago
- Status changed from Fix Under Review to Resolved
Updated by David Galloway almost 4 years ago
Is this the same thing? https://tracker.ceph.com/issues/45543
Updated by Kyrylo Shatskyy almost 4 years ago
I've tried to reproduce the error.
Running the job against the py2 teuthology branch reproduces the failure:
http://pulpito.ceph.com/kyr-2020-05-14_18:00:59-multimds-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/
Running it against this PR:
https://github.com/ceph/teuthology/pull/1477
which includes a revert of the suspected cause:
https://github.com/ceph/teuthology/commit/478bb3f661621c38ac0b9bb21389cc5b225c318d
makes it pass:
http://pulpito.ceph.com/kyr-2020-05-14_18:02:48-multimds-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/
The corresponding commands used to reproduce were:
Failing:
teuthology-suite --seed 7575 -s multimds -c wip-yuri-testing-2020-05-05-1439 -m smithi --filter 'multimds/basic/{0-supported-random-distro$/{centos_latest.yaml} begin.yaml clusters/9-mds.yaml conf/{client.yaml mds.yaml mon.yaml osd.yaml} inline/yes.yaml mount/kclient/{mount.yaml overrides/{distro/stock/{k-stock.yaml rhel_8.yaml} ms-die-on-skipped.yaml}} objectstore-ec/filestore-xfs.yaml overrides/{basic/{frag_enable.yaml whitelist_health.yaml whitelist_wrongly_marked_down.yaml} fuse-default-perm-no.yaml} q_check_counter/check_counter.yaml tasks/cephfs_test_snapshots.yaml}' --subset 1/10 --teuthology-branch py2
Passing:
teuthology-suite --seed 7575 -s multimds -c wip-yuri-testing-2020-05-05-1439 -m smithi --filter 'multimds/basic/{0-supported-random-distro$/{centos_latest.yaml} begin.yaml clusters/9-mds.yaml conf/{client.yaml mds.yaml mon.yaml osd.yaml} inline/yes.yaml mount/kclient/{mount.yaml overrides/{distro/stock/{k-stock.yaml rhel_8.yaml} ms-die-on-skipped.yaml}} objectstore-ec/filestore-xfs.yaml overrides/{basic/{frag_enable.yaml whitelist_health.yaml whitelist_wrongly_marked_down.yaml} fuse-default-perm-no.yaml} q_check_counter/check_counter.yaml tasks/cephfs_test_snapshots.yaml}' --subset 1/10 --teuthology-branch refs/pull/1477/merge
Updated by Kyrylo Shatskyy almost 4 years ago
Updated by Patrick Donnelly almost 4 years ago
- Status changed from Resolved to New
- Assignee changed from Patrick Donnelly to Kyrylo Shatskyy
Updated by Kyrylo Shatskyy almost 4 years ago
Patrick, why is this reopened? Any new failures? Where are the logs?
Updated by Kyrylo Shatskyy almost 4 years ago
The issue should be resolved now that the backport PR https://github.com/ceph/teuthology/pull/1477 has been merged.
Updated by Patrick Donnelly almost 4 years ago
Kyrylo Shatskyy wrote:
The issue should be resolved now that the backport PR https://github.com/ceph/teuthology/pull/1477 has been merged.
Which commit? AFAIK this is now again broken because my (wrong) fix was reverted.
Updated by Kyrylo Shatskyy almost 4 years ago
Patrick Donnelly wrote:
Kyrylo Shatskyy wrote:
The issue should be resolved now that the backport PR https://github.com/ceph/teuthology/pull/1477 has been merged.
Which commit? AFAIK this is now again broken because my (wrong) fix was reverted.
The issue was opened against the py2 branch.
This commit, https://github.com/ceph/teuthology/pull/1477/commits/e9ca1dc68ea4fa8463b7eca321cf74cf1c8a4213, was merged to py2 as part of that PR and is supposed to fix it.
Your (wrong) fix was never merged to py2, so it was never reverted from the py2 branch.
It was only reverted from master. So if you have fresh py2 logs dated after the backport PR was merged, it would be great to see them.
As I pointed out in previous comments, I reran the tests from this issue's description twice, and both runs passed.
Updated by Kyrylo Shatskyy almost 4 years ago
I'm sorry, I had the dates wrong; the backport PR was only merged on May 15:
6844213 2020-05-15 23:00 +0800 Kefu Chai Merge pull request #1477 from kshtsk/wip-py2-backport-20200514
Updated by Patrick Donnelly almost 4 years ago
- Status changed from New to Closed
Okay, I'll just close this for now then and reopen if necessary.