Bug #58724
teuthology jobs in "running" status for 15+ hours
Description
Problem:
We had some teuthology jobs from several runs in "running" status for 15+ hours. Each of these jobs had experienced a failure, but teuthology did not mark them as dead after the failure occurred.
Example:
http://pulpito.front.sepia.ceph.com/lflores-2023-02-14_01:12:29-rados-main-distro-default-smithi/
Solution:
The issue was that the jobs really were dead, but the paddles db wasn't aware of that, likely because the dispatcher had died. The solution was to tell paddles that the jobs were actually dead with this command:
teuthology-report -D -r $run
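When several runs are affected, the same command can be applied to each run in turn. The following is a minimal sketch, assuming teuthology-report is on the PATH of the machine where it is executed; the helper names report_dead_command and mark_runs_dead are hypothetical and are not part of teuthology. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def report_dead_command(run_name):
    """Build the argv used above to mark a run's stale jobs as dead."""
    # Same flags as the command in this description: -D and -r <run>.
    return ["teuthology-report", "-D", "-r", run_name]


def mark_runs_dead(run_names, dry_run=True):
    """Print the command for each run; execute it only if dry_run is False."""
    commands = [report_dead_command(name) for name in run_names]
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Run name taken from the example URL in this description.
    mark_runs_dead(["lflores-2023-02-14_01:12:29-rados-main-distro-default-smithi"])
```

Pass dry_run=False only after checking the printed commands against the runs you intend to update.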
Updated by Laura Flores about 1 year ago
- Status changed from Resolved to New
Reopening this, as we have another instance:
https://pulpito.ceph.com/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/
I tried killing the run with teuthology-kill to no avail.
The running jobs had this traceback:
2023-02-22T10:09:59.841 DEBUG:teuthology.orchestra.run.smithi050:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2023-02-22T10:09:59.844 DEBUG:teuthology.orchestra.run.smithi143:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2023-02-22T10:09:59.885 DEBUG:teuthology.orchestra.run.smithi144:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2023-02-22T10:11:05.227 INFO:tasks.ceph:Archiving logs...
2023-02-22T10:11:05.228 DEBUG:teuthology.misc:Transferring archived files from smithi050:/var/log/ceph to /home/teuthworker/archive/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/7181648/remote/smithi050/log
2023-02-22T10:11:05.229 DEBUG:teuthology.orchestra.run.smithi050:> sudo tar cz -f - -C /var/log/ceph -- .
2023-02-22T10:11:19.775 DEBUG:teuthology.misc:Transferring archived files from smithi143:/var/log/ceph to /home/teuthworker/archive/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/7181648/remote/smithi143/log
2023-02-22T10:11:19.779 DEBUG:teuthology.orchestra.run.smithi143:> sudo tar cz -f - -C /var/log/ceph -- .
2023-02-22T10:11:31.600 DEBUG:teuthology.misc:Transferring archived files from smithi144:/var/log/ceph to /home/teuthworker/archive/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/7181648/remote/smithi144/log
2023-02-22T10:11:31.605 DEBUG:teuthology.orchestra.run.smithi144:> sudo tar cz -f - -C /var/log/ceph -- .
2023-02-22T10:11:43.264 DEBUG:teuthology.run_tasks:Unwinding manager install
2023-02-22T10:11:43.409 INFO:teuthology.task.install.util:Removing shipped files: /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits...
2023-02-22T10:11:43.410 DEBUG:teuthology.orchestra.run.smithi050:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-22T10:11:43.412 DEBUG:teuthology.orchestra.run.smithi143:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-22T10:11:43.414 DEBUG:teuthology.orchestra.run.smithi144:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-22T10:12:11.009 ERROR:teuthology.run_tasks:Manager failed: install
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/run_tasks.py", line 188, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/task/install/__init__.py", line 619, in task
    yield
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/contextutil.py", line 55, in nested
    raise exc[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/contextutil.py", line 47, in nested
    if exit(*exc):
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/task/install/__init__.py", line 222, in install
    remove_packages(ctx, config, package_list)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/task/install/__init__.py", line 105, in remove_packages
    if not remote.is_reimageable or cleanup:
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/orchestra/remote.py", line 471, in is_reimageable
    return self.machine_type in self._reimage_types
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/orchestra/remote.py", line 463, in machine_type
    remote_info = teuthology.lock.query.get_status(self.hostname)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/lock/query.py", line 20, in get_status
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'get_status smithi050.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds
The jobs timed out waiting for the machine status: get_status gave up after its maximum of 10 tries. Zack looked into the state of paddles and noticed it was down.
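The MaxWhileTries error in the traceback comes from a bounded retry loop. The sketch below illustrates that pattern in isolation; it is not teuthology's implementation. The names MaxTriesReached, retry_schedule and call_with_retries are hypothetical, and the starting sleep of 1 second with a 0.5 second increment is an assumption, chosen only because that schedule over 10 tries adds up to the 32.5 seconds reported in the message.

```python
import time


class MaxTriesReached(Exception):
    """Hypothetical stand-in for teuthology.exceptions.MaxWhileTries."""


def retry_schedule(sleep=1.0, increment=0.5, tries=10):
    """Return the delay used after each failed try (assumed values)."""
    return [sleep + increment * i for i in range(tries)]


def call_with_retries(fetch, sleep=1.0, increment=0.5, tries=10,
                      sleeper=time.sleep):
    """Call fetch() until it returns something other than None.

    Raises MaxTriesReached once every try has failed, so a caller
    cannot block forever on a service that is down.
    """
    waited = 0.0
    for delay in retry_schedule(sleep, increment, tries):
        result = fetch()
        if result is not None:
            return result
        sleeper(delay)
        waited += delay
    raise MaxTriesReached(
        f"reached maximum tries ({tries}) after waiting for {waited} seconds"
    )


if __name__ == "__main__":
    # Simulate a status service that never answers, without real sleeping.
    try:
        call_with_retries(lambda: None, sleeper=lambda _delay: None)
    except MaxTriesReached as exc:
        print(exc)
```

With a status service that never responds, as when paddles is down, every try fails and the loop ends in the exception instead of hanging.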