Bug #43291
teuthology run gets stuck with first_in_suite or last_in_suite jobs in queued state
Description
On some systems, runs get stuck because one of the 'fake' jobs,
first-in-suite or last-in-suite, remains in the queued state and
continues to be displayed in pulpito, even though all the real jobs in the run may have passed.
The suspected cause is a race: once all the jobs have been queued, the workers
start processing them one by one too quickly. This can happen because the paddles processes
add the jobs slightly more slowly than the workers take them from the beanstalk tube.
The first-in-suite and last-in-suite jobs are special-purpose jobs that
delete themselves as soon as a worker starts processing them.
If there is no record of the job in the paddles database yet, the delete request simply fails.
One solution would be to add a bounded loop that polls paddles for the existence
of the corresponding job for these two job types.
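A minimal sketch of such a bounded polling loop follows. It is not the teuthology implementation: the name wait_for_job and its parameters are hypothetical, and the actual paddles query is left to a caller-supplied job_exists callable so the sketch stays independent of any real API.

```python
import time
from typing import Callable


def wait_for_job(job_exists: Callable[[], bool],
                 attempts: int = 5,
                 delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll job_exists() at most `attempts` times.

    Returns True as soon as the job is seen, False if it never shows up.
    `job_exists` is a hypothetical callable that would ask paddles whether
    the job record has been created yet.
    """
    for attempt in range(attempts):
        if job_exists():
            return True
        # No point sleeping after the final unsuccessful attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The worker would call this before issuing the delete request, and only delete when it returns True. Bounding the number of attempts keeps a missing record from blocking the worker indefinitely.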
Temporary workaround: to unblock a run, it is enough to log in to the
teuthology host (for example, as the worker user) and run teuthology-kill against the run.
(Update) Another issue is that the worker in charge of processing the job crashes
due to an unhandled exception; the corresponding stack trace is below:
2020-05-14T11:12:51.410 INFO:teuthology.worker:Reserved job 33
2020-05-14T11:12:51.410 INFO:teuthology.worker:Config is: description: null
email: null
first_in_suite: true
last_in_suite: false
machine_type: ovh
name: runner-2020-05-14_11:12:44-powercycle-wip-yuri-octopus_15.2.2_RC0-distro-basic-ovh
owner: scheduled_runner@teuth-kyr
priority: 1000
seed: '657'
tube: ovh
verbose: false
2020-05-14T11:12:51.413 INFO:teuthology.repo_utils:/home/worker/src/github.com_ceph_teuthology_master was just updated; assuming it is current
2020-05-14T11:12:51.414 INFO:teuthology.repo_utils:Resetting repo at /home/worker/src/github.com_ceph_teuthology_master to branch origin/master
2020-05-14T11:12:51.421 INFO:teuthology.repo_utils:Skipping bootstrap as it was already done in the last 60s
2020-05-14T11:12:51.421 INFO:teuthology.repo_utils:/home/worker/src/github.com_ceph_ceph_master was just updated; assuming it is current
2020-05-14T11:12:51.421 INFO:teuthology.repo_utils:Resetting repo at /home/worker/src/github.com_ceph_ceph_master to branch origin/master
2020-05-14T11:12:51.523 CRITICAL:teuthology:Uncaught exception
Traceback (most recent call last):
  File "/home/worker/src/teuthology_master/virtualenv/bin/teuthology-worker", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-worker')()
  File "/home/worker/src/teuthology_master/scripts/worker.py", line 7, in main
    teuthology.worker.main(parse_args())
  File "/home/worker/src/teuthology_master/teuthology/worker.py", line 126, in main
    ctx.verbose,
  File "/home/worker/src/teuthology_master/teuthology/worker.py", line 195, in run_job
    report.try_delete_jobs(job_config['name'], job_config['job_id'])
  File "/home/worker/src/teuthology_master/teuthology/report.py", line 532, in try_delete_jobs
    got_jobs = reporter.get_jobs(run_name, fields=['job_id'])
  File "/home/worker/src/teuthology_master/teuthology/report.py", line 366, in get_jobs
    response.raise_for_status()
  File "/home/worker/src/teuthology_master/virtualenv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8080/runs/runner-2020-05-14_11:12:44-powercycle-wip-yuri-octopus_15.2.2_RC0-distro-basic-ovh/jobs/?fields=job_id
History
#1 Updated by Kyrylo Shatskyy over 4 years ago
- Subject changed from teuthology run get stuck with first_in_suite or laste_in_suite jobs in queued state to teuthology run gets stuck with first_in_suite or laste_in_suite jobs in queued state
#2 Updated by Kyrylo Shatskyy almost 4 years ago
Another possible fix is to NOT add this kind of job to paddles at all, so there is no need to delete them. Another advantage is that they will not be shown in pulpito while they are queued.
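As a rough illustration of that idea, the scheduler could check the job config before reporting it. The helper name should_report_to_paddles is hypothetical; the first_in_suite and last_in_suite keys are the ones visible in the job config logged by the worker above. This is only a sketch, not the change made in teuthology.

```python
from typing import Any, Mapping


def should_report_to_paddles(job_config: Mapping[str, Any]) -> bool:
    """Return False for the 'fake' first-in-suite / last-in-suite jobs.

    Hypothetical helper: regular jobs lack these keys or have them set
    to false, so they are still reported.
    """
    is_fake = bool(job_config.get("first_in_suite")
                   or job_config.get("last_in_suite"))
    return not is_fake
```

With a check like this, the fake jobs would still go into the queue, but nothing would have to be deleted from paddles afterwards.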
#3 Updated by Kyrylo Shatskyy almost 4 years ago
- Subject changed from teuthology run gets stuck with first_in_suite or laste_in_suite jobs in queued state to teuthology run gets stuck with first_in_suite or last_in_suite jobs in queued state
#4 Updated by Kyrylo Shatskyy almost 4 years ago
Probable fix https://github.com/ceph/teuthology/pull/1472
#5 Updated by Kyrylo Shatskyy almost 4 years ago
- Description updated (diff)
#6 Updated by Kyrylo Shatskyy almost 4 years ago
To resolve this correctly, we need to take two things into account.
1) In general, we do not need to submit these jobs to paddles at all.
2) If users are still using an old version of the scheduler while the workers have been switched to no longer clean up those extra jobs, the runs will hang.
So we need to proceed in stages:
1. Add a patch so that the worker does not break if there are no first-in-suite and last-in-suite jobs.
2. Restart the workers.
3. Add a patch to not schedule those jobs. Ask people to switch to a recent teuthology.
4. Delete the 'cleaning jobs' code sometime in the future.
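Stage 1 could look roughly like the sketch below: the delete call is wrapped so that a failure is logged instead of crashing the worker. The wrapper name try_delete_quietly and the delete_jobs callable are hypothetical stand-ins, not the actual patch. The broad except keeps the sketch free of third-party imports; real code would probably narrow it to the requests exception seen in the stack trace.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def try_delete_quietly(delete_jobs: Callable[[str, str], None],
                       run_name: str, job_id: str) -> bool:
    """Call delete_jobs(run_name, job_id); log and swallow any failure.

    Returns True if the delete succeeded, False otherwise.
    """
    try:
        delete_jobs(run_name, job_id)
    except Exception as exc:  # assumption: real code would catch a narrower type
        log.warning("Could not delete job %s of run %s: %s",
                    job_id, run_name, exc)
        return False
    return True
```

The worker can then carry on to the next job whether or not the fake job's record was present in paddles.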
#7 Updated by Kyrylo Shatskyy over 3 years ago
- Status changed from New to Resolved