Bug #35951
Recently merged "$" feature broke --filter and --rerun
Description
teuthology-suite --ceph wip-sage3-testing-2018-09-10-1637 --machine-type smithi --dry-run --suite rados/thrash --filter="{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/crush-compat.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/bluestore-bitmap.yaml rados.yaml rocksdb.yaml supported-random-distro$/{ubuntu_16.04.yaml} thrashers/none.yaml thrashosds-health.yaml workloads/rados_api_tests.yaml}"
The same command, with newlines for readability:
teuthology-suite --ceph wip-sage3-testing-2018-09-10-1637 \
    --machine-type smithi \
    --dry-run \
    --suite rados/thrash \
    --filter="{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/crush-compat.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/bluestore-bitmap.yaml rados.yaml rocksdb.yaml supported-random-distro$/{ubuntu_16.04.yaml} thrashers/none.yaml thrashosds-health.yaml workloads/rados_api_tests.yaml}"
The above combination of --suite and --filter should match this job: http://pulpito.ceph.com/sage-2018-09-11_18:22:38-rados-wip-sage3-testing-2018-09-10-1637-distro-basic-smithi/3005949/
But it matches 0 jobs due to the presence of the magic "$".
Presumably the fix is to disable the "magicness" of "$" when it appears in the filter string.
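One possible reading of that suggestion, as a rough sketch only: before comparing, drop the randomly chosen part after "$" from both the filter and the job description. The helper names below (strip_random_choice, filter_matches) are hypothetical and not part of teuthology, and the sketch assumes --filter does a plain substring match against the job description.

```python
import re

# Hypothetical helpers for illustration; not teuthology code.
# Matches "facet$/{choice.yaml}" and keeps only "facet$".
_RANDOM_FACET = re.compile(r"(\S+\$)/\{[^{}]*\}")


def strip_random_choice(description):
    """Remove the randomly picked fragment that follows a '$' facet."""
    return _RANDOM_FACET.sub(r"\1", description)


def filter_matches(filter_string, job_description):
    """Substring match that ignores whichever distro '$' happened to pick."""
    return strip_random_choice(filter_string) in strip_random_choice(job_description)


job = ("rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml "
       "rados.yaml supported-random-distro$/{rhel_8.yaml}}")
flt = ("{all/rebuild-mondb.yaml msgr-failures/few.yaml "
       "rados.yaml supported-random-distro$/{ubuntu_latest.yaml}}")

plain_match = flt in job                      # literal comparison
normalized_match = filter_matches(flt, job)   # random part ignored
```

With this normalization the filter matches the job no matter which distro was picked, at the cost of no longer being able to filter on a specific distro for "$" facets.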
Related issues
History
#1 Updated by Nathan Cutler over 5 years ago
- Description updated (diff)
#2 Updated by Zack Cerza over 5 years ago
- Assignee set to Anonymous
#3 Updated by Yuri Weinstein over 5 years ago
- Priority changed from Normal to Urgent
@Warren this seems pretty annoying pls take a look
#4 Updated by Anonymous over 5 years ago
Yuri: I pushed a change in wip-wusui-35951. Could you test this to make sure that $ functionality still works? I think that it does but I could be wrong. Thanks.
#5 Updated by Nathan Cutler almost 4 years ago
- Subject changed from Recently merged "$" feature broke --filter to Recently merged "$" feature broke --filter and --rerun
I looked at this again today, and I believe the bug is still present. What happens is: jobs with the "magic" $ have a random component in their names. For example, in the following job:
"rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}"
the last part ("{rhel_8.yaml}") is random. On each invocation of "--suite rados", this part of the job name might be different.
As a result, one can do something like:
teuthology-suite -k distro --ceph wip-badone-testing --machine-type smithi \
    --dry-run --suite rados/singleton \
    --filter="{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}"
and it might fail, but the next time you issue the exact same command, it might succeed, because the distro part after "supported-random-distro$" is random.
At first I thought this wasn't so serious, but Kyr pointed out that it has ramifications for --rerun. For example, this run had 13 failures in it: bhubbard-2020-04-16_09:57:54-rados-wip-badone-testing-distro-basic-smithi
Hence, the following command was expected to create a run with 13 jobs in it:
teuthology-suite -k distro --ceph wip-badone-testing --machine-type smithi --dry-run \
    --rerun bhubbard-2020-04-16_09:57:54-rados-wip-badone-testing-distro-basic-smithi \
    -R fail
but the resulting run had only 8 jobs. I assume this is due to the "randomness" inherent in the "$". Sometimes the randomly chosen distro matches what was randomly chosen in the original run. Other times, it doesn't match.
I think the only way to fix this is to store in Paddles the random seed that is used to expand "$" yaml components and then make "--rerun" use this same seed.
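A minimal sketch of that idea, assuming the "$" expansion boils down to a seeded random choice. The schedule() function and the dict it returns are made up for illustration; the dict only stands in for whatever record Paddles would keep, and none of this is actual teuthology or Paddles code.

```python
import random

# Illustrative candidate list; the file names are taken from this ticket.
DISTROS = ["centos_8.yaml", "rhel_8.yaml", "ubuntu_16.04.yaml", "ubuntu_latest.yaml"]


def schedule(seed=None):
    """Expand the '$' facet; return the seed alongside the result."""
    if seed is None:
        # Fresh run: pick a seed and remember it.
        seed = random.randrange(2**32)
    rng = random.Random(seed)  # private generator, no global state
    description = "supported-random-distro$/{%s}" % rng.choice(DISTROS)
    return {"seed": seed, "description": description}


original = schedule()                     # first run, seed gets stored
rerun = schedule(seed=original["seed"])   # rerun reuses the stored seed
```

With the same seed and the same candidate list, the rerun reproduces the original description, which is what "--rerun" would need in order to find the failed jobs again.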
#6 Updated by Kefu Chai almost 4 years ago
https://github.com/ceph/teuthology/pull/1198 should be able to address this issue. the idea was to persist the seed and subset used to create the batch, and reuse them when rerunning the failed job.
i just double checked the subset, seed and the teuthology version, all of them match.
and i rerun teuthology-suite, it generated a different set of tests, for instance it emitted
rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{ubuntu_latest.yaml}}
rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/bluestore-low-osd-mem-target.yaml rados.yaml supported-random-distro$/{centos_8.yaml}}
and the result is stable after i tested for 10 times.
while in the original batch
rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}
rados/singleton/{all/rebuild-mondb.yaml msgr-failures/few.yaml msgr/async-v2only.yaml objectstore/bluestore-low-osd-mem-target.yaml rados.yaml supported-random-distro$/{centos_8.yaml}}
i think python's random.seed() allows us to have the same (pseudo) random number sequences. but not sure why it fails in this case..
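For what it's worth, a same-seed generator only reproduces the same picks if it is asked the same questions in the same order. The sketch below shows one general way the result can differ despite an identical seed: the candidate list arriving in a different order. This is an illustration of the pitfall only; this ticket does not establish that it is what happens in teuthology, and the pick() helper is hypothetical.

```python
import random


def pick(candidates, seed):
    """Seeded choice: the seed fixes the index, not the element."""
    return random.Random(seed).choice(candidates)


distros = ["centos_8.yaml", "rhel_8.yaml", "ubuntu_16.04.yaml", "ubuntu_latest.yaml"]
seed = 1234

first = pick(distros, seed)
# Same seed, same elements, but listed in the opposite order.
second = pick(list(reversed(distros)), seed)
# Sorting the candidates before choosing removes the order dependence.
stable = pick(sorted(reversed(distros)), seed)
```

Here first and second differ even though the seed is identical, while stable equals first. Anything that changes the order or number of random calls between the original run and the rerun would have a similar effect.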
#7 Updated by Nathan Cutler almost 4 years ago
- Duplicates Bug #45119: problems rerunning failed jobs added
#8 Updated by Nathan Cutler almost 4 years ago
- Status changed from New to Duplicate
Over at #45119 I just learned that "--rerun" will be fixed by (or as a side effect of?) the migration to Python 3.