Bug #3767
teuthology: stale jobs detected
Description
In the nightly runs, we currently see "stale jobs detected" errors. This means that teuthology's initial check that the test machine is clean (that is, that no /tmp/cephtest directory exists) failed, so the run stops there without proceeding further.
Proposed fix:
Retry the test by locking a different machine if one or more of the current test machines have not been cleaned up properly.
History
#1 Updated by Sage Weil about 11 years ago
- Project changed from Ceph to teuthology
#2 Updated by Sam Lang about 11 years ago
- Status changed from New to Resolved
This should be resolved by a set of changes made for #3782 (commit: ace4cb07b2de99644c63f3ab90c21a663a384e69), which gives each run a separate test directory based on the job name.
#3 Updated by Sam Lang about 11 years ago
The teuthology config is currently still putting everything in /tmp/cephtest, so we'll still be seeing stale jobs. Once the config changes, those errors should go away. The config change is (semi) dependent on getting IPMI tested and working on teuthology.
#4 Updated by Tamilarasi muthamizhan almost 11 years ago
- Status changed from Resolved to In Progress
- Assignee set to Sam Lang
Waiting for the config change to go in.
#5 Updated by caleb miles almost 11 years ago
Might it also be possible to archive the test in a lost+found directory somewhere and nuke the temp files? Looking for more machines might not be feasible for manual teuthology runs.
#6 Updated by Sam Lang almost 11 years ago
- Status changed from In Progress to Fix Under Review
#7 Updated by Sam Lang almost 11 years ago
- Status changed from Fix Under Review to 7
Need to change the config on teuthology and test out these changes.
#8 Updated by Sam Lang almost 11 years ago
I committed some changes to teuthology last week that set the test directory for a run submitted through teuthology-schedule to:
<testdir>/<jobid>
Where jobid is the number assigned to the job by beanstalkd.
The .teuthology.yaml config on teuthworker@teuthology was also updated to use that path template, so now if you're looking for the results of a job on a specific node, they will be located in that path. For example, the job 12748 was run on plana55, and has test dir:
/home/ubuntu/cephtest/12748
If you use teuthology directly for testing, you won't get a job id. Instead, you will get a short string that represents your job:
/home/ubuntu/cephtest/sl1304150955
which is the first two letters of your username followed by the date in the format %y%m%d%H%M.
Note that you probably have the config option 'test_path' set in your .teuthology.yaml, which overrides this setting. If you want the above, you should remove 'test_path' and add:
test_base_dir: /home/ubuntu/cephtest
This resolves #3767. Test directories will not get deleted if a job fails, but instead of causing the next run assigned to that node to fail with a 'Stale jobs detected' error, the run will proceed on that node, but display a warning that stale test sub-directories exist and need to be cleaned up.
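The warn-instead-of-fail behaviour could look roughly like the sketch below. This only illustrates the idea and is not the committed change: find_stale_test_dirs and the wording of the warning are assumptions.

```python
import logging
import os

log = logging.getLogger(__name__)


def find_stale_test_dirs(base_dir, current_name):
    """List leftover test sub-directories under base_dir (hypothetical helper).

    Any sub-directory other than the current run's own directory is treated
    as stale. A warning is logged, but the run is not aborted.
    """
    if not os.path.isdir(base_dir):
        return []
    stale = sorted(
        name
        for name in os.listdir(base_dir)
        if name != current_name and os.path.isdir(os.path.join(base_dir, name))
    )
    if stale:
        log.warning(
            "Stale test sub-directories in %s need to be cleaned up: %s",
            base_dir,
            ", ".join(stale),
        )
    return stale
```

Because the caller gets back a list rather than an exception, the run can proceed on that node while the leftovers are still reported.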
#9 Updated by Sam Lang almost 11 years ago
- Status changed from 7 to Resolved