Feature #10945
closed
Enable teuthology to re-run only failed jobs
Description
Currently there is no simple way to do so.
Being able to do so would help a lot and would also use our resources more efficiently.
Updated by Loïc Dachary about 9 years ago
- Project changed from Ceph to teuthology
The simplest way is to use the --filter argument of teuthology-suite with the value of the description: field found in the config.yaml file. For instance, to re-run the failed rados jobs listed at http://tracker.ceph.com/issues/10641#rados:
$ ./virtualenv/bin/teuthology-suite --priority 101 --suite rados --filter 'rados/multimon/{clusters/21.yaml msgr-failures/many.yaml tasks/mon_clock_with_skews.yaml},rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/small-objects.yaml},rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/ec-small-objects.yaml},rados/verify/{1thrash/none.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml tasks/mon_recovery.yaml validater/valgrind.yaml},rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/cache-agent-small.yaml}' --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --email loic@dachary.org --owner loic@dachary.org --ceph firefly-backports
2015-02-28 15:58:08,474.474 INFO:teuthology.suite:ceph sha1: e54834bfac3c38562987730b317cb1944a96005b
2015-02-28 15:58:08,969.969 INFO:teuthology.suite:ceph version: 0.80.8-75-ge54834b-1precise
2015-02-28 15:58:09,606.606 INFO:teuthology.suite:teuthology branch: master
2015-02-28 15:58:10,407.407 INFO:teuthology.suite:ceph-qa-suite branch: firefly
2015-02-28 15:58:10,409.409 INFO:teuthology.repo_utils:Fetching from upstream into /home/loic/src/ceph-qa-suite_firefly
2015-02-28 15:58:11,522.522 INFO:teuthology.repo_utils:Resetting repo at /home/loic/src/ceph-qa-suite_firefly to branch firefly
2015-02-28 15:58:12,393.393 INFO:teuthology.suite:Suite rados in /home/loic/src/ceph-qa-suite_firefly/suites/rados generated 693 jobs (not yet filtered)
2015-02-28 15:58:12,419.419 INFO:teuthology.suite:Scheduling rados/multimon/{clusters/21.yaml msgr-failures/many.yaml tasks/mon_clock_with_skews.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783145
2015-02-28 15:58:14,199.199 INFO:teuthology.suite:Scheduling rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/cache-agent-small.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783146
2015-02-28 15:58:15,650.650 INFO:teuthology.suite:Scheduling rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/small-objects.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783147
2015-02-28 15:58:16,837.837 INFO:teuthology.suite:Scheduling rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/ec-small-objects.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783148
2015-02-28 15:58:18,421.421 INFO:teuthology.suite:Scheduling rados/verify/{1thrash/none.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml tasks/mon_recovery.yaml validater/valgrind.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783149
2015-02-28 15:58:19,729.729 INFO:teuthology.suite:Suite rados in /home/loic/src/ceph-qa-suite_firefly/suites/rados scheduled 5 jobs.
2015-02-28 15:58:19,729.729 INFO:teuthology.suite:Suite rados in /home/loic/src/ceph-qa-suite_firefly/suites/rados -- 688 jobs were filtered out.
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783150
This creates the run http://pulpito.ceph.com/loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi/ with just 5 jobs.
Updated by Andrew Schoen about 9 years ago
Nice use of --filter, Loic. I'd think we could probably make a simple call to paddles, get the jobs that have failed and then build that --filter string using their descriptions.
That json output should give us everything we'd need.
Updated by Loïc Dachary about 9 years ago
Here is a solution:
run=loic-2015-03-03_12:46:38-rgw-firefly-backports---basic-multi
eval filter=$(curl --silent http://paddles.front.sepia.ceph.com/runs/$run/ | jq '.jobs[] | select(.success == false) | .description' | while read description ; do echo -n $description, ; done | sed -e 's/,$//')
./virtualenv/bin/teuthology-suite --filter="$filter" --priority 101 --suite rgw --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --email loic@dachary.org --owner loic@dachary.org --ceph firefly-backports
This is explained in more detail at http://dachary.org/?p=3575
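The selection step of that pipeline can also be sketched in Python, which avoids the eval and quote handling. This is only a minimal sketch, assuming the run JSON has the shape the jq expression implies: a top-level jobs list whose entries carry success and description fields. The function name failed_filter and the sample descriptions are hypothetical.

```python
import json


def failed_filter(run_json):
    """Return a comma-separated --filter value for the failed jobs of a run.

    Mirrors: jq '.jobs[] | select(.success == false) | .description'
    Assumption: a top-level "jobs" list whose entries have "success" and
    "description" keys, as that jq expression implies.
    """
    run = json.loads(run_json)
    descriptions = [
        job["description"]
        for job in run["jobs"]
        # Only an explicit false counts as a failure; a null or missing
        # "success" is skipped, just as the jq select() skips it.
        if job.get("success") is False
    ]
    return ",".join(descriptions)


# Hypothetical sample data shaped like the assumed paddles response.
sample = json.dumps({
    "jobs": [
        {"description": "rgw/verify/{a.yaml b.yaml}", "success": False},
        {"description": "rgw/multifs/{c.yaml}", "success": True},
        {"description": "rgw/singleton/{d.yaml}", "success": None},
        {"description": "rgw/verify/{e.yaml f.yaml}", "success": False},
    ]
})
print(failed_filter(sample))
```

The result is the same comma-joined string the shell loop builds, ready to be passed as the value of --filter.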
Updated by Andrew Schoen about 9 years ago
Loic, if you change http://paddles.front.sepia.ceph.com/runs/$run/ to http://paddles.front.sepia.ceph.com/runs/$run/jobs/?status=fail then you won't need to do any additional filtering of the jobs in your script.
Updated by Loïc Dachary about 9 years ago
ah, great !
run=loic-2015-03-03_12:46:38-rgw-firefly-backports---basic-multi
eval filter=$(curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/?status=fail | jq '.[].description' | while read description ; do echo -n $description, ; done | sed -e 's/,$//')
./virtualenv/bin/teuthology-suite --filter="$filter" --priority 101 --suite rgw --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --email loic@dachary.org --owner loic@dachary.org --ceph firefly-backports
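For anyone wanting to script this outside the shell, here is a rough Python sketch of the same steps using the jobs/?status=fail URL. It assumes that URL returns a JSON list of job objects with a description field, as jq '.[].description' implies. The helper names are hypothetical, and the network call is kept in its own function so the rest can be tried offline.

```python
import json
from urllib.request import urlopen

# Host as used in the commands in this thread.
PADDLES = "http://paddles.front.sepia.ceph.com"


def failed_jobs_url(run):
    """Build the jobs/?status=fail URL for a run name."""
    return "{}/runs/{}/jobs/?status=fail".format(PADDLES, run)


def descriptions_from_jobs(jobs_json):
    """Extract the descriptions; mirrors jq '.[].description'."""
    return [job["description"] for job in json.loads(jobs_json)]


def suite_argv(descriptions, extra_args):
    """Build an argv list for teuthology-suite.

    Passing a list to subprocess.run() keeps the braces and spaces in the
    descriptions intact without any eval or shell quoting.
    """
    return ["teuthology-suite", "--filter=" + ",".join(descriptions)] + list(extra_args)


def fetch_failed_descriptions(run):
    """Query paddles for the failed jobs of a run (needs network access)."""
    with urlopen(failed_jobs_url(run)) as response:
        return descriptions_from_jobs(response.read().decode("utf-8"))


# Offline example with hypothetical descriptions.
sample = '[{"description": "rgw/verify/{a.yaml}"}, {"description": "rgw/multifs/{b.yaml}"}]'
print(suite_argv(descriptions_from_jobs(sample), ["--suite", "rgw", "--priority", "101"]))
```

The remaining options (--suite-branch, --machine-type and so on) would go into extra_args unchanged.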
Updated by Yuri Weinstein about 9 years ago
I tried, seemingly successfully, to filter out the jobs that passed:
ubuntu@teuthology:/a$ run=teuthology-2015-03-03_09:46:42-rados-firefly-distro-basic-multi
ubuntu@teuthology:/a$ eval filter=$(curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/?status=pass | jq '.[].description' | while read description ; do echo -n $description, ; done | sed -e 's/,$//')
ubuntu@teuthology:/a$ /home/ubuntu/bin/teuthology-suite --filter-out="$filter" --priority 90 --suite rados --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --ceph firefly
Run http://pulpito.front.sepia.ceph.com/teuthology-2015-03-03_09:46:42-rados-firefly-distro-basic-multi/
had 24 failed, 6 dead and 2 running jobs, and 32 jobs were scheduled this way in http://pulpito.front.sepia.ceph.com/ubuntu-2015-03-04_09:46:36-rados-firefly---basic-multi/
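The arithmetic behind those numbers can be checked with a small Python model. It is a sketch under stated assumptions: each job carries a status field, descriptions are unique, and --filter-out is modelled as an exact match on the description (the real matching in teuthology-suite may differ). The number of passed jobs is made up for illustration; only the 24/6/2 split comes from the run above.

```python
from collections import Counter


def make_jobs(counts):
    """Create hypothetical jobs with unique descriptions for each status."""
    jobs = []
    for status, count in counts.items():
        for index in range(count):
            jobs.append({"description": "rados/{}-{}".format(status, index),
                         "status": status})
    return jobs


def rescheduled(jobs, passed_status="pass"):
    """Return the jobs left after filtering out every description that passed."""
    passed = {job["description"] for job in jobs if job["status"] == passed_status}
    return [job for job in jobs if job["description"] not in passed]


# 100 passed jobs is an illustrative figure, not taken from the run.
jobs = make_jobs({"pass": 100, "fail": 24, "dead": 6, "running": 2})
left = rescheduled(jobs)
print(len(left), Counter(job["status"] for job in left))
```

Unlike --filter with the failed descriptions, filtering out the passed ones also picks up dead and still-running jobs, which is why 24 + 6 + 2 = 32 jobs were scheduled.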
Updated by Yuri Weinstein about 9 years ago
The straightforward scenario works fine; however, consider this:
- run #1 of a suite has 600 jobs in total, 20 of which failed
- using the steps above we rerun "failed only" and get run #2
- in run #2 we get 3 failed jobs and want to rerun only those 3
- however, the steps above would not work for this case; I'm not sure why, though (?)
Updated by Yuri Weinstein about 9 years ago
Consider the scenario where we run "failed only" on run 1, then run "failed only" on run 2, and so on.
Loic asked: "But if that was automatic and recursive, when would it stop ?"
One option would be:
- implement a manual rerun, not an automated one
- make the "failed only" run optional in the final implementation
- if "failed only" is run in a recursive, automated way, it would stop after a preset number of tries, e.g. "teuthology-suite --re-run 2" would mean doing it 2 times
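The last option, a bounded number of automatic reruns, could look roughly like the Python sketch below. Everything in it is hypothetical: get_failed and schedule stand in for "ask paddles for the failed descriptions" and "call teuthology-suite with that filter", and a real version would also have to wait for each run to finish before querying it.

```python
def rerun_until_clean(run, get_failed, schedule, max_tries):
    """Re-run failed jobs at most max_tries times.

    get_failed(run) returns the failed job descriptions of a finished run;
    schedule(descriptions) schedules them and returns the new run name.
    Both are placeholders supplied by the caller. Returns the runs created.
    """
    created = []
    for _ in range(max_tries):
        failed = get_failed(run)
        if not failed:
            break  # nothing left to rerun, stop early
        run = schedule(failed)
        created.append(run)
    return created


# Simulation of the scenario described earlier: 20 failures, then 3, then none.
failures = {
    "run-1": ["job-{}".format(i) for i in range(20)],
    "run-2": ["job-0", "job-1", "job-2"],
    "run-3": [],
}
scheduled_sizes = []


def fake_get_failed(run):
    return failures[run]


def fake_schedule(descriptions):
    scheduled_sizes.append(len(descriptions))
    return "run-{}".format(len(scheduled_sizes) + 1)


runs = rerun_until_clean("run-1", fake_get_failed, fake_schedule, max_tries=2)
print(runs, scheduled_sizes)
```

The preset limit answers the "when would it stop?" question: the loop ends either when a run has no failures or when max_tries is reached, whichever comes first.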
Updated by Yuri Weinstein over 7 years ago
- Has duplicate Bug #17439: rerunning of failed tests based on job name added
Updated by Yuri Weinstein over 7 years ago
- Related to Feature #14378: Consider adding re-run option with interactive-on-error on option added
Updated by Zack Cerza over 7 years ago
- Status changed from New to 12
- Assignee set to Zack Cerza
Updated by Zack Cerza over 7 years ago
- Status changed from 12 to In Progress
PR open, would love feedback:
https://github.com/ceph/teuthology/pull/963
Updated by Zack Cerza over 7 years ago
- Status changed from In Progress to Resolved