Feature #10945


Enable teuthology to re-run only failed jobs

Added by Yuri Weinstein about 9 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Reviewed:
Affected Versions:

Description

Currently there is no simple way to do this.
Being able to re-run only the failed jobs would help a lot and would use our resources more efficiently.


Related issues 2 (1 open, 1 closed)

Related to teuthology - Feature #14378: Consider adding re-run option with interactive-on-error on option (New, 01/14/2016)

Has duplicate teuthology - Bug #17439: rerunning of failed tests based on job name (Duplicate, Zack Cerza, 09/29/2016)
Actions #1

Updated by Loïc Dachary about 9 years ago

  • Project changed from Ceph to teuthology

The simplest way is to use the --filter argument of teuthology-suite with the value of the description: field found in the config.yaml file. For instance, running the failed rados jobs from http://tracker.ceph.com/issues/10641:

$ ./virtualenv/bin/teuthology-suite --priority 101 --suite rados --filter 'rados/multimon/{clusters/21.yaml msgr-failures/many.yaml tasks/mon_clock_with_skews.yaml},rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/small-objects.yaml},rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/ec-small-objects.yaml},rados/verify/{1thrash/none.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml tasks/mon_recovery.yaml validater/valgrind.yaml},rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/cache-agent-small.yaml}' --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --email --owner --ceph firefly-backports
2015-02-28 15:58:08,474.474 INFO:teuthology.suite:ceph sha1: e54834bfac3c38562987730b317cb1944a96005b
2015-02-28 15:58:08,969.969 INFO:teuthology.suite:ceph version: 0.80.8-75-ge54834b-1precise
2015-02-28 15:58:09,606.606 INFO:teuthology.suite:teuthology branch: master
2015-02-28 15:58:10,407.407 INFO:teuthology.suite:ceph-qa-suite branch: firefly
2015-02-28 15:58:10,409.409 INFO:teuthology.repo_utils:Fetching from upstream into /home/loic/src/ceph-qa-suite_firefly
2015-02-28 15:58:11,522.522 INFO:teuthology.repo_utils:Resetting repo at /home/loic/src/ceph-qa-suite_firefly to branch firefly
2015-02-28 15:58:12,393.393 INFO:teuthology.suite:Suite rados in /home/loic/src/ceph-qa-suite_firefly/suites/rados generated 693 jobs (not yet filtered)
2015-02-28 15:58:12,419.419 INFO:teuthology.suite:Scheduling rados/multimon/{clusters/21.yaml msgr-failures/many.yaml tasks/mon_clock_with_skews.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783145
2015-02-28 15:58:14,199.199 INFO:teuthology.suite:Scheduling rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/cache-agent-small.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783146
2015-02-28 15:58:15,650.650 INFO:teuthology.suite:Scheduling rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/morepggrow.yaml workloads/small-objects.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783147
2015-02-28 15:58:16,837.837 INFO:teuthology.suite:Scheduling rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/ec-small-objects.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783148
2015-02-28 15:58:18,421.421 INFO:teuthology.suite:Scheduling rados/verify/{1thrash/none.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml tasks/mon_recovery.yaml validater/valgrind.yaml}
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783149
2015-02-28 15:58:19,729.729 INFO:teuthology.suite:Suite rados in /home/loic/src/ceph-qa-suite_firefly/suites/rados scheduled 5 jobs.
2015-02-28 15:58:19,729.729 INFO:teuthology.suite:Suite rados in /home/loic/src/ceph-qa-suite_firefly/suites/rados -- 688 jobs were filtered out.
Job scheduled with name loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi and ID 783150

This creates the http://pulpito.ceph.com/loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi/ run with just 5 jobs.

Actions #2

Updated by Andrew Schoen about 9 years ago

Nice use of --filter, Loic. I'd think we could probably make a simple call to paddles, get the jobs that have failed and then build that --filter string using their descriptions.

http://paddles.front.sepia.ceph.com/runs/loic-2015-02-28_15:58:07-rados-firefly-backports---basic-multi/

That json output should give us everything we'd need.
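Andrew's idea can be sketched without hitting paddles at all. The JSON sample below is hypothetical, but shaped like the run output paddles returns; the jq/paste pipeline mirrors the one Loïc posts in the next comment, except that jq -r emits raw (unquoted) strings, so no eval is needed:

```shell
# Hypothetical sample of the paddles run JSON; in practice it would come from:
#   curl --silent http://paddles.front.sepia.ceph.com/runs/$run/
json='{"jobs":[{"description":"rados/a.yaml","success":false},
               {"description":"rados/b.yaml","success":true},
               {"description":"rados/c.yaml","success":false}]}'

# Select the failed jobs and join their descriptions with commas,
# producing a value suitable for teuthology-suite --filter:
filter=$(printf '%s' "$json" \
  | jq -r '.jobs[] | select(.success == false) | .description' \
  | paste -sd, -)
echo "$filter"   # → rados/a.yaml,rados/c.yaml
```

The resulting string would then be passed as --filter="$filter" to teuthology-suite, as in the command shown in comment #1.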

Actions #4

Updated by Loïc Dachary about 9 years ago

Here is a solution:

run=loic-2015-03-03_12:46:38-rgw-firefly-backports---basic-multi
eval filter=$(curl --silent http://paddles.front.sepia.ceph.com/runs/$run/ | jq '.jobs[] | select(.success == false) | .description' | while read description ; do echo -n $description, ; done | sed -e 's/,$//')
./virtualenv/bin/teuthology-suite --filter="$filter" --priority 101 --suite rgw --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --email loic@dachary.org --owner loic@dachary.org  --ceph firefly-backports

This is explained in more detail at http://dachary.org/?p=3575

Actions #5

Updated by Andrew Schoen about 9 years ago

Loic, if you change http://paddles.front.sepia.ceph.com/runs/$run/ to http://paddles.front.sepia.ceph.com/runs/$run/jobs/?status=fail then you won't need to do any additional filtering of the jobs in your script.

Actions #6

Updated by Loïc Dachary about 9 years ago

ah, great !

run=loic-2015-03-03_12:46:38-rgw-firefly-backports---basic-multi
eval filter=$(curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/?status=fail | jq '.[].description' | while read description ; do echo -n $description, ; done | sed -e 's/,$//')
./virtualenv/bin/teuthology-suite --filter="$filter" --priority 101 --suite rgw --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --email loic@dachary.org --owner loic@dachary.org  --ceph firefly-backports
Actions #7

Updated by Yuri Weinstein about 9 years ago

I tried, seemingly successfully, to filter out the jobs that passed:

ubuntu@teuthology:/a$ run=teuthology-2015-03-03_09:46:42-rados-firefly-distro-basic-multi
ubuntu@teuthology:/a$ eval filter=$(curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/?status=pass | jq '.[].description' | while read description ; do echo -n $description, ; done | sed -e 's/,$//')
ubuntu@teuthology:/a$ /home/ubuntu/bin/teuthology-suite --filter-out="$filter" --priority 90 --suite rados --suite-branch firefly --machine-type plana,burnupi,mira --distro ubuntu --ceph firefly

Run http://pulpito.front.sepia.ceph.com/teuthology-2015-03-03_09:46:42-rados-firefly-distro-basic-multi/ had 24 failed, 6 dead, and 2 running jobs, and 32 were scheduled this way: http://pulpito.front.sepia.ceph.com/ubuntu-2015-03-04_09:46:36-rados-firefly---basic-multi/
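Note that the complement approach (filtering out what passed) reschedules everything that did not pass, i.e. failed, dead, and still-running jobs alike, which is why 24 failed + 6 dead + 2 running yielded 32 rescheduled jobs. A minimal sketch of that selection, with a hypothetical JSON sample in place of the paddles response:

```shell
# Hypothetical sample of the paddles jobs JSON; in practice it would come from:
#   curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/
json='[{"description":"rados/a.yaml","status":"pass"},
       {"description":"rados/b.yaml","status":"fail"},
       {"description":"rados/c.yaml","status":"dead"},
       {"description":"rados/d.yaml","status":"running"}]'

# Build the --filter-out value from the jobs that passed;
# everything else (fail, dead, running) will be rescheduled.
filter_out=$(printf '%s' "$json" \
  | jq -r '.[] | select(.status == "pass") | .description' \
  | paste -sd, -)
echo "$filter_out"   # → rados/a.yaml
```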

Actions #8

Updated by Yuri Weinstein about 9 years ago

The straightforward scenario works fine; however, consider this:

- a run for suite #1 has 600 jobs total and 20 failed
- using the steps above we re-run "failed only" and get run #2
- in run #2 we get 3 failed jobs and want to re-run only those 3
- however, the steps above do not work for this case, and I am not sure why (?)

Actions #9

Updated by Yuri Weinstein about 9 years ago

Consider the scenario where we run "failed only" on run 1, then "failed only" on run 2, and so on.
Loic asked: "But if that was automatic and recursive, when would it stop?"

One option would be:

- implement a manual re-run, not an automated one
- make the "failed only" re-run optional in the final implementation
- have a recursive automated "failed only" re-run stop after a preset number of tries, e.g. "teuthology-suite --re-run 2" would mean doing it 2 times
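The proposed bounded re-run could be sketched as a shell loop. Here failed_filter and schedule_run are hypothetical stand-ins for the paddles query and the teuthology-suite invocation shown in the earlier comments, with canned data so the control flow is visible; note the loop also stops early once a run has no failures, which answers the "when would it stop?" question:

```shell
#!/bin/sh
# Hypothetical stand-in: pretend run-1 has two failed jobs and run-2 has none.
failed_filter() {
    case "$1" in
        run-1) echo "rados/a.yaml,rados/b.yaml" ;;
        *)     echo "" ;;
    esac
}
# Hypothetical stand-in: would call teuthology-suite --filter="$1"
# and return the name of the newly scheduled run.
schedule_run() {
    echo "run-2"
}

max_tries=2   # what "teuthology-suite --re-run 2" would mean
run=run-1
try=0
while [ "$try" -lt "$max_tries" ]; do
    filter=$(failed_filter "$run")
    if [ -z "$filter" ]; then
        break   # nothing failed in the last run: stop early
    fi
    run=$(schedule_run "$filter")
    try=$((try + 1))
done
echo "stopped after $try re-run(s); last run: $run"
```

With the canned data above, the loop schedules one re-run (run-2), finds no failures in it, and stops before exhausting the two allowed tries.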

Actions #10

Updated by Yuri Weinstein over 7 years ago

  • Has duplicate Bug #17439: rerunning of failed tests based on job name added
Actions #11

Updated by Yuri Weinstein over 7 years ago

  • Related to Feature #14378: Consider adding re-run option with interactive-on-error on option added
Actions #12

Updated by Zack Cerza over 7 years ago

  • Status changed from New to 12
  • Assignee set to Zack Cerza
Actions #13

Updated by Zack Cerza over 7 years ago

  • Status changed from 12 to In Progress

PR open, would love feedback:
https://github.com/ceph/teuthology/pull/963

Actions #14

Updated by Zack Cerza over 7 years ago

  • Status changed from In Progress to Resolved
