Bug #43291

teuthology run gets stuck with first_in_suite or last_in_suite jobs in queued state

Added by Kyrylo Shatskyy over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

On some systems, runs get stuck because one of the 'fake' jobs
(first_in_suite or last_in_suite) remains in a queued state and stays
displayed in pulpito, even though the run itself may have passed all of
its jobs.

One theory is a race: when all jobs have been queued, the workers start
processing them one by one too quickly, because the paddles process adds
job records slightly slower than the workers take jobs from the beanstalk
queue. The first_in_suite and last_in_suite jobs are special-purpose jobs
that delete themselves as soon as a worker starts processing them, and if
there is no corresponding record in the paddles database yet, the delete
request simply fails.

One possible solution would be a bounded loop that polls paddles for the
existence of the corresponding record for these two jobs before issuing
the delete.
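
A minimal sketch of such a bounded poll, assuming the paddles REST layout
visible in the traceback below; the base URL, endpoint path, and timing
values are illustrative assumptions, not teuthology's actual code:

    import time

    import requests

    PADDLES_URL = 'http://127.0.0.1:8080'  # assumption: local paddles instance

    def wait_for_job_record(run_name, job_id, attempts=10, delay=1.0):
        # Poll paddles until the job record exists, up to `attempts` tries,
        # so the delete request does not race the record's creation.
        url = '%s/runs/%s/jobs/%s/' % (PADDLES_URL, run_name, job_id)
        for _ in range(attempts):
            if requests.get(url).ok:
                return True
            time.sleep(delay)
        return False  # record never appeared; caller may skip the delete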

Temporary workaround: to unlock a stuck run, it is enough to log in to the
teuthology host (as, for example, the worker user) and run teuthology-kill
against the run (typically teuthology-kill -r <run-name>).

(Update) Another issue is that the worker in charge of processing the job
crashes due to an unhandled exception; the corresponding stack trace is
below:

2020-05-14T11:12:51.410 INFO:teuthology.worker:Reserved job 33
2020-05-14T11:12:51.410 INFO:teuthology.worker:Config is: description: null
email: null
first_in_suite: true
last_in_suite: false
machine_type: ovh
name: runner-2020-05-14_11:12:44-powercycle-wip-yuri-octopus_15.2.2_RC0-distro-basic-ovh
owner: scheduled_runner@teuth-kyr
priority: 1000
seed: '657'
tube: ovh
verbose: false

2020-05-14T11:12:51.413 INFO:teuthology.repo_utils:/home/worker/src/github.com_ceph_teuthology_master was just updated; assuming it is current
2020-05-14T11:12:51.414 INFO:teuthology.repo_utils:Resetting repo at /home/worker/src/github.com_ceph_teuthology_master to branch origin/master
2020-05-14T11:12:51.421 INFO:teuthology.repo_utils:Skipping bootstrap as it was already done in the last 60s
2020-05-14T11:12:51.421 INFO:teuthology.repo_utils:/home/worker/src/github.com_ceph_ceph_master was just updated; assuming it is current
2020-05-14T11:12:51.421 INFO:teuthology.repo_utils:Resetting repo at /home/worker/src/github.com_ceph_ceph_master to branch origin/master
2020-05-14T11:12:51.523 CRITICAL:teuthology:Uncaught exception
Traceback (most recent call last):
  File "/home/worker/src/teuthology_master/virtualenv/bin/teuthology-worker", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-worker')()
  File "/home/worker/src/teuthology_master/scripts/worker.py", line 7, in main
    teuthology.worker.main(parse_args())
  File "/home/worker/src/teuthology_master/teuthology/worker.py", line 126, in main
    ctx.verbose,
  File "/home/worker/src/teuthology_master/teuthology/worker.py", line 195, in run_job
    report.try_delete_jobs(job_config['name'], job_config['job_id'])
  File "/home/worker/src/teuthology_master/teuthology/report.py", line 532, in try_delete_jobs
    got_jobs = reporter.get_jobs(run_name, fields=['job_id'])
  File "/home/worker/src/teuthology_master/teuthology/report.py", line 366, in get_jobs
    response.raise_for_status()
  File "/home/worker/src/teuthology_master/virtualenv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8080/runs/runner-2020-05-14_11:12:44-powercycle-wip-yuri-octopus_15.2.2_RC0-distro-basic-ovh/jobs/?fields=job_id
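
The crash itself could be avoided by treating a failed delete of these
placeholder jobs as non-fatal. A minimal sketch of that idea, mirroring the
call site in the traceback above (the wrapper function and logging are
illustrative assumptions, not the actual patch):

    import logging

    import requests

    from teuthology import report

    log = logging.getLogger(__name__)

    def delete_placeholder_job(job_config):
        # Best-effort deletion of first_in_suite/last_in_suite jobs: if
        # paddles errors out (e.g. the record was never created), log a
        # warning and let the worker continue instead of crashing.
        try:
            report.try_delete_jobs(job_config['name'], job_config['job_id'])
        except requests.exceptions.HTTPError as exc:
            log.warning('Failed to delete placeholder job %s: %s',
                        job_config['job_id'], exc)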

History

#1 Updated by Kyrylo Shatskyy over 4 years ago

  • Subject changed from teuthology run get stuck with first_in_suite or laste_in_suite jobs in queued state to teuthology run gets stuck with first_in_suite or laste_in_suite jobs in queued state

#2 Updated by Kyrylo Shatskyy almost 4 years ago

Another possible fix is to not add this kind of job to paddles at all, so
there is no need to delete them; a further advantage is that they would not
be shown in pulpito while queued.
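
A sketch of how scheduling could skip the report step for such jobs; the
helper is hypothetical, though the config keys follow the worker log above:

    def should_report_to_paddles(job_config):
        # Placeholder jobs never run real tests, so there is nothing for
        # paddles/pulpito to track; skip creating a record for them.
        return not (job_config.get('first_in_suite')
                    or job_config.get('last_in_suite'))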

#3 Updated by Kyrylo Shatskyy almost 4 years ago

  • Subject changed from teuthology run gets stuck with first_in_suite or laste_in_suite jobs in queued state to teuthology run gets stuck with first_in_suite or last_in_suite jobs in queued state

#5 Updated by Kyrylo Shatskyy almost 4 years ago

  • Description updated (diff)

#6 Updated by Kyrylo Shatskyy almost 4 years ago

In order to resolve this correctly we need to go in two phases:
1) In general, we do not need to submit these jobs to paddles at all.
2) If users are still running an old version of the scheduler while the
workers stop cleaning up those extra jobs, their runs will hang.

So, we need to proceed in stages:
1. Add a patch so the worker does not break when the first-in-suite and
last-in-suite jobs have no paddles record (along the lines of the sketch
after the traceback above).
2. Restart the workers.
3. Add a patch to stop scheduling those jobs, and ask people to switch to
the recent teuthology.
4. Delete the 'cleanup jobs' code sometime in the future.

#7 Updated by Kyrylo Shatskyy over 3 years ago

  • Status changed from New to Resolved
