Bug #3767

teuthology: stale jobs detected

Added by Tamilarasi muthamizhan about 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

In the nightly runs, we currently see "stale jobs detected". This means that teuthology's initial check that the test machine is clean (that is, that there is no /tmp/cephtest directory) before running a new test has failed, so the run stops right there without proceeding further.

Proposed fix:
If one or more of the current test machines have not been cleaned up properly, retry the test by locking a different machine.


Related issues

Related to teuthology - Feature #3782: cephmanager task: power cycle targets via ipmi (Resolved)
Related to teuthology - Feature #4566: add an option to force nuke and restart test when stale jobs are found (New)

History

#1 Updated by Sage Weil about 11 years ago

  • Project changed from Ceph to teuthology

#2 Updated by Sam Lang about 11 years ago

  • Status changed from New to Resolved

This should be resolved by a set of changes made for #3782 (commit: ace4cb07b2de99644c63f3ab90c21a663a384e69), which gives each run a separate test directory based on the job name.

#3 Updated by Sam Lang about 11 years ago

The teuthology config currently still puts everything in /tmp/cephtest, so we will still see stale jobs. Once the config changes, those errors should go away. The config change is (semi) dependent on getting IPMI tested and working on teuthology.

#4 Updated by Tamilarasi muthamizhan almost 11 years ago

  • Status changed from Resolved to In Progress
  • Assignee set to Sam Lang

Waiting for the config change to go in.

#5 Updated by caleb miles almost 11 years ago

Might it also be possible to archive the test in a lost+found directory somewhere and nuke the temp files? Looking for more machines might not be feasible for manual teuthology runs.

#6 Updated by Sam Lang almost 11 years ago

  • Status changed from In Progress to Fix Under Review

#7 Updated by Sam Lang almost 11 years ago

  • Status changed from Fix Under Review to 7

Need to change the config on teuthology and test out these changes.

#8 Updated by Sam Lang almost 11 years ago

I committed some changes to teuthology last week that set the test directory for a teuthology run submitted through teuthology-schedule to:

<testdir>/<jobid>

Where jobid is the number assigned to the job by beanstalkd.

The .teuthology.yaml config on teuthworker@teuthology was also updated to use that path template, so now if you're looking for the results of a job on a specific node, they will be located in that path. For example, the job 12748 was run on plana55, and has test dir:

/home/ubuntu/cephtest/12748

If you use teuthology directly for testing, you won't get a job id. Instead, you will get a short string that represents your job:

/home/ubuntu/cephtest/sl1304150955

which is the first two letters of your username followed by the date and time in the format %y%m%d%H%M.
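
As a rough illustration of the naming scheme just described, here is a minimal sketch. It is not the actual teuthology code; the function name `short_job_name` and its parameters are made up for this example.

```python
import getpass
from datetime import datetime


def short_job_name(user=None, now=None):
    """Build a short run name such as 'sl1304150955'.

    Hypothetical helper: the first two letters of the username
    followed by the timestamp formatted as %y%m%d%H%M.
    """
    user = user or getpass.getuser()   # default to the current login name
    now = now or datetime.now()        # default to the current local time
    return user[:2] + now.strftime("%y%m%d%H%M")
```

For a username starting with "sl" and a run started at 09:55 on 15 April 2013, this yields sl1304150955, which matches the example path above.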

Note that you probably have the config option 'test_path' set in your .teuthology.yaml, which overrides this setting. If you want the above, you should remove 'test_path' and add:

test_base_dir: /home/ubuntu/cephtest

This resolves #3767. Test directories are still not deleted when a job fails. However, instead of failing with a 'Stale jobs detected' error, the next run assigned to that node proceeds and displays a warning that stale test sub-directories exist and need to be cleaned up.
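
The warn-instead-of-fail behaviour could look roughly like the sketch below. This is only an illustration under assumptions: the function names are hypothetical, and it inspects a local directory, whereas teuthology performs its checks on the remote test nodes.

```python
import logging
import os

log = logging.getLogger(__name__)


def find_stale_test_dirs(test_base_dir, current_job):
    """Return sub-directories of test_base_dir left behind by other jobs.

    Hypothetical helper: anything under test_base_dir that is a
    directory and is not the current job's directory counts as stale.
    """
    if not os.path.isdir(test_base_dir):
        return []
    return sorted(
        name for name in os.listdir(test_base_dir)
        if name != current_job
        and os.path.isdir(os.path.join(test_base_dir, name))
    )


def warn_about_stale_dirs(test_base_dir, current_job):
    """Log a warning about stale directories, but let the run continue."""
    stale = find_stale_test_dirs(test_base_dir, current_job)
    if stale:
        log.warning(
            "Stale test sub-directories in %s need to be cleaned up: %s",
            test_base_dir, ", ".join(stale),
        )
    return stale
```

The point of the design is that a leftover directory from job 12700 no longer blocks job 12748 on the same node; it only produces a warning.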

#9 Updated by Sam Lang almost 11 years ago

  • Status changed from 7 to Resolved
