Bug #35951

Recently merged "$" feature broke --filter and --rerun

Added by Nathan Cutler over 5 years ago. Updated almost 4 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

teuthology-suite --ceph wip-sage3-testing-2018-09-10-1637 --machine-type smithi --dry-run --suite rados/thrash --filter="{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/crush-compat.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/bluestore-bitmap.yaml rados.yaml rocksdb.yaml supported-random-distro$/{ubuntu_16.04.yaml} thrashers/none.yaml thrashosds-health.yaml workloads/rados_api_tests.yaml}"

The same command, with newlines for readability:

teuthology-suite --ceph wip-sage3-testing-2018-09-10-1637 --machine-type smithi --dry-run 
--suite rados/thrash --filter="{0-size-min-size-overrides/2-size-2-min-size.yaml 
1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/normal.yaml 
ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/crush-compat.yaml
msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/bluestore-bitmap.yaml rados.yaml
rocksdb.yaml supported-random-distro$/{ubuntu_16.04.yaml} thrashers/none.yaml thrashosds-health.yaml
workloads/rados_api_tests.yaml}"

The above combination of --suite and --filter should match this job: http://pulpito.ceph.com/sage-2018-09-11_18:22:38-rados-wip-sage3-testing-2018-09-10-1637-distro-basic-smithi/3005949/

But it matches 0 jobs due to the presence of the magic "$".

Presumably the fix is to disable the "magicness" of "$" when it appears in the filter string.


Related issues

Duplicates teuthology - Bug #45119: problems rerunning failed jobs (Won't Fix)

History

#1 Updated by Nathan Cutler over 5 years ago

  • Description updated (diff)

#2 Updated by Zack Cerza over 5 years ago

  • Assignee set to Anonymous

#3 Updated by Yuri Weinstein over 5 years ago

  • Priority changed from Normal to Urgent

@Warren, this seems pretty annoying. Please take a look.

#4 Updated by Anonymous over 5 years ago

Yuri: I pushed a change in wip-wusui-35951. Could you test this to make sure that $ functionality still works? I think that it does but I could be wrong. Thanks.

#5 Updated by Nathan Cutler almost 4 years ago

  • Subject changed from Recently merged "$" feature broke --filter to Recently merged "$" feature broke --filter and --rerun

I looked at this again today, and I believe the bug is still present. What happens is: jobs with the "magic" $ have a random component in their names. For example, in the following job:

"rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}"

the last part ("{rhel_8.yaml}") is random. On each invocation of "--suite rados", this part of the job name might be different.

As a result, one can do something like:

teuthology-suite -k distro --ceph wip-badone-testing --machine-type smithi \
 --dry-run --suite rados/singleton \
 --filter="{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}" 

and it might fail, but the next time you issue the exact same command it might succeed, because the distro part after "supported-random-distro$" is random.
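To make the mechanism concrete, here is a minimal Python sketch. It is not teuthology code: `describe_job` and `filter_matches` are hypothetical helpers, the filter is modelled as a plain substring test, and the list of candidate distros is an assumption based on the names that appear in this ticket. It simply checks the filter against each possible outcome of the "$" pick.

```python
# Assumed set of distros that "supported-random-distro$" can expand to.
DISTROS = ["centos_8.yaml", "rhel_8.yaml", "ubuntu_latest.yaml"]

PREFIX = (
    "rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml "
    "msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml "
    "supported-random-distro$/{"
)


def describe_job(distro):
    """Build the job description for one possible outcome of the "$" pick."""
    return PREFIX + distro + "}}"


def filter_matches(description, filter_str):
    """Hypothetical stand-in for --filter: a plain substring test."""
    return filter_str in description


# The filter names the distro that one earlier run happened to pick.
wanted = "supported-random-distro$/{rhel_8.yaml}"

# For every possible pick, would the filter still match the job?
outcomes = {d: filter_matches(describe_job(d), wanted) for d in DISTROS}
print(outcomes)
```

Under these assumptions the filter matches only when the random pick happens to be rhel_8.yaml again, which is consistent with the same command succeeding on one invocation and failing on the next.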

At first I thought this wasn't so serious, but Kyr pointed out that it has ramifications for --rerun. For example, this run had 13 failures in it: bhubbard-2020-04-16_09:57:54-rados-wip-badone-testing-distro-basic-smithi

Hence, the following command was expected to create a run with 13 jobs in it:

teuthology-suite -k distro --ceph wip-badone-testing --machine-type smithi --dry-run \
 --rerun bhubbard-2020-04-16_09:57:54-rados-wip-badone-testing-distro-basic-smithi \
 -R fail

but the resulting run had only 8 jobs. I assume this is due to the "randomness" inherent in the "$". Sometimes the randomly chosen distro matches what was randomly chosen in the original run. Other times, it doesn't match.

I think the only way to fix this is to store in Paddles the random seed that is used to expand "$" yaml components and then make "--rerun" use this same seed.
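A minimal sketch of that idea, in Python, assuming the seed is available when the rerun is scheduled. The helper `expand_random_facet`, the seed value, and the candidate list are all hypothetical; how the seed would actually be stored in and fetched from Paddles is not shown.

```python
import random


def expand_random_facet(candidates, seed):
    """Pick one candidate for a "$" facet using a dedicated, seeded RNG."""
    rng = random.Random(seed)
    # Sort first so the pick does not depend on the order in which the
    # candidates happened to be listed.
    return rng.choice(sorted(candidates))


# Assumed candidates for "supported-random-distro$".
distros = ["ubuntu_latest.yaml", "centos_8.yaml", "rhel_8.yaml"]

stored_seed = 1234  # hypothetical value saved alongside the original run

original_pick = expand_random_facet(distros, stored_seed)
rerun_pick = expand_random_facet(distros, stored_seed)
print(original_pick == rerun_pick)
```

Using a separate `random.Random` instance per expansion, rather than the module-level generator, keeps the pick independent of any other code that draws random numbers in between.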

#6 Updated by Kefu Chai almost 4 years ago

https://github.com/ceph/teuthology/pull/1198 should be able to address this issue. The idea was to persist the seed and subset used to create the batch, and reuse them when rerunning the failed jobs.

I just double-checked the subset, seed and teuthology version; all of them match.

But when I reran teuthology-suite, it generated a different set of tests. For instance, it emitted:

rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{ubuntu_latest.yaml}}

rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/bluestore-low-osd-mem-target.yaml rados.yaml supported-random-distro$/{centos_8.yaml}}

The result was stable across 10 test runs,

whereas the original batch had:

rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}

rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/bluestore-low-osd-mem-target.yaml rados.yaml supported-random-distro$/{centos_8.yaml}}

I think Python's random.seed() allows us to have the same (pseudo) random number sequences, but I'm not sure why it fails in this case.
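For reference, a small Python sketch of what random.seed() does and does not guarantee. The same seed reproduces the same picks only when the generator is consumed in exactly the same way afterwards; a single additional draw leaves it in a different state. Whether something like this explains the mismatch above is not established in this ticket. The `picks` helper, the seed value, and the candidate list are illustrative assumptions.

```python
import random

# Assumed candidates for "supported-random-distro$".
candidates = ["centos_8.yaml", "rhel_8.yaml", "ubuntu_latest.yaml"]


def picks(seed, n_jobs, extra_draws=0):
    """Seed the module-level RNG, optionally burn some draws, then pick."""
    random.seed(seed)
    for _ in range(extra_draws):
        random.random()  # stands in for any unrelated use of the RNG
    state = random.getstate()  # generator state just before picking
    chosen = [random.choice(candidates) for _ in range(n_jobs)]
    return chosen, state


run_a, state_a = picks(seed=8888, n_jobs=4)
run_b, state_b = picks(seed=8888, n_jobs=4)
_, state_shifted = picks(seed=8888, n_jobs=4, extra_draws=1)

print(run_a == run_b)            # same seed, same usage
print(state_a == state_shifted)  # same seed, one extra draw beforehand
```

The first comparison holds because both calls use the generator identically. The second shows that the generator state differs after one extra draw, so the picks that follow are no longer guaranteed to line up with the original run.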

#7 Updated by Nathan Cutler almost 4 years ago

  • Duplicates Bug #45119: problems rerunning failed jobs added

#8 Updated by Nathan Cutler almost 4 years ago

  • Status changed from New to Duplicate

Over at #45119 I just learned that "--rerun" will be fixed by (or as a side effect of?) the migration to Python 3.
