Bug #58724

open

teuthology jobs in "running" status for 15+ hours

Added by Laura Flores about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Problem:
Jobs from several teuthology runs stayed in "running" status for 15+ hours. Each of these jobs had already failed, but teuthology never marked them as dead after the failure occurred.

Example:
http://pulpito.front.sepia.ceph.com/lflores-2023-02-14_01:12:29-rados-main-distro-default-smithi/

Solution:
The jobs really were dead, but the paddles database was not aware of that, likely because the dispatcher had died. The fix was to tell paddles that the jobs were dead with this command:

teuthology-report -D -r $run
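
When several runs are stuck at once, the command above can be repeated per run. The following is a minimal Python sketch of that loop, not part of teuthology. The helper names `build_report_command` and `report_dead` are hypothetical, and only the `-D` and `-r` flags quoted in this ticket are used. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_report_command(run_name):
    """Return the argv for the report command quoted in this ticket.

    Only the flags shown in the ticket (-D and -r) are used.
    """
    return ["teuthology-report", "-D", "-r", run_name]


def report_dead(run_names, dry_run=True):
    """Print, and optionally execute, the report command for each run.

    With dry_run=True (the default) nothing is executed; the commands
    are only printed so they can be reviewed first.
    """
    commands = [build_report_command(name) for name in run_names]
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(argv, check=True)
    return commands
```

Calling `report_dead(["lflores-2023-02-14_01:12:29-rados-main-distro-default-smithi"])` prints the command for the example run without executing it; pass `dry_run=False` only on a host where `teuthology-report` is installed and configured.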

Actions #1

Updated by Laura Flores about 1 year ago

  • Status changed from New to Resolved

Actions #2

Updated by Laura Flores about 1 year ago

  • Status changed from Resolved to New

Reopening this, as we have another instance:
https://pulpito.ceph.com/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/

I tried killing the run with teuthology-kill to no avail.

The running jobs had this traceback:

2023-02-22T10:09:59.841 DEBUG:teuthology.orchestra.run.smithi050:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2023-02-22T10:09:59.844 DEBUG:teuthology.orchestra.run.smithi143:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2023-02-22T10:09:59.885 DEBUG:teuthology.orchestra.run.smithi144:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2023-02-22T10:11:05.227 INFO:tasks.ceph:Archiving logs...
2023-02-22T10:11:05.228 DEBUG:teuthology.misc:Transferring archived files from smithi050:/var/log/ceph to /home/teuthworker/archive/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/7181648/remote/smithi050/log
2023-02-22T10:11:05.229 DEBUG:teuthology.orchestra.run.smithi050:> sudo tar cz -f - -C /var/log/ceph -- .
2023-02-22T10:11:19.775 DEBUG:teuthology.misc:Transferring archived files from smithi143:/var/log/ceph to /home/teuthworker/archive/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/7181648/remote/smithi143/log
2023-02-22T10:11:19.779 DEBUG:teuthology.orchestra.run.smithi143:> sudo tar cz -f - -C /var/log/ceph -- .
2023-02-22T10:11:31.600 DEBUG:teuthology.misc:Transferring archived files from smithi144:/var/log/ceph to /home/teuthworker/archive/yuriw-2023-02-20_23:16:20-rados-wip-yuri11-testing-2023-02-20-1329-distro-default-smithi/7181648/remote/smithi144/log
2023-02-22T10:11:31.605 DEBUG:teuthology.orchestra.run.smithi144:> sudo tar cz -f - -C /var/log/ceph -- .
2023-02-22T10:11:43.264 DEBUG:teuthology.run_tasks:Unwinding manager install
2023-02-22T10:11:43.409 INFO:teuthology.task.install.util:Removing shipped files: /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits...
2023-02-22T10:11:43.410 DEBUG:teuthology.orchestra.run.smithi050:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-22T10:11:43.412 DEBUG:teuthology.orchestra.run.smithi143:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-22T10:11:43.414 DEBUG:teuthology.orchestra.run.smithi144:> sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits
2023-02-22T10:12:11.009 ERROR:teuthology.run_tasks:Manager failed: install
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/run_tasks.py", line 188, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/task/install/__init__.py", line 619, in task
    yield
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/contextutil.py", line 55, in nested
    raise exc[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/contextutil.py", line 47, in nested
    if exit(*exc):
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/task/install/__init__.py", line 222, in install
    remove_packages(ctx, config, package_list)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/task/install/__init__.py", line 105, in remove_packages
    if not remote.is_reimageable or cleanup:
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/orchestra/remote.py", line 471, in is_reimageable
    return self.machine_type in self._reimage_types
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/orchestra/remote.py", line 463, in machine_type
    remote_info = teuthology.lock.query.get_status(self.hostname)
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/lock/query.py", line 20, in get_status
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_teuthology_fbbadb5ff5cfccce0d20e136f8956e65ec955359/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'get_status smithi050.front.sepia.ceph.com' reached maximum tries (10) after waiting for 32.5 seconds

The jobs timed out waiting for a status response: get_status hit its limit of 10 tries. Zack looked into the state of paddles and found that it was down.
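
For readers unfamiliar with the "reached maximum tries (10) after waiting for 32.5 seconds" message, here is a minimal sketch of a bounded retry loop with a linearly growing delay. It is an illustration, not teuthology's implementation. `MaxTriesReached`, `total_wait`, and `retry` are hypothetical names, and the starting delay of 1 second with a 0.5 second increment is an assumption chosen only because it reproduces the 32.5 seconds in the traceback over 10 tries.

```python
import time


class MaxTriesReached(Exception):
    """Hypothetical stand-in for the MaxWhileTries error in the traceback."""


def total_wait(sleep, increment, tries):
    """Total seconds slept if every one of `tries` attempts fails."""
    return sum(sleep + increment * i for i in range(tries))


def retry(action, query, sleep=1.0, increment=0.5, tries=10, sleeper=time.sleep):
    """Call query() until it returns a non-None value or tries run out.

    The delay after each failed attempt grows by `increment` seconds.
    `sleeper` is injectable so the loop can be exercised without waiting.
    """
    waited = 0.0
    for attempt in range(tries):
        result = query()
        if result is not None:
            return result
        delay = sleep + increment * attempt
        sleeper(delay)
        waited += delay
    raise MaxTriesReached(
        f"'{action}' reached maximum tries ({tries}) "
        f"after waiting for {waited} seconds"
    )
```

Under these assumed values, `total_wait(1.0, 0.5, 10)` evaluates to 32.5. The point of the sketch is that a job gives up on a status query after a bounded wait, so a service that stays down longer than that surfaces as a failure in the job's cleanup path.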
