Bug #55815
closed
RERUN is broken, does not schedule correct number of jobs
Description
Variables:
SHA1=18d575f5af7790222ee9d36af1d4518581525769
CEPH_BRANCH=wip-yuri-testing-2022-05-31-1642-quincy
CEPH_QA_MAIL="ceph-qa@ceph.io"
CEPH_REPO=https://github.com/ceph/ceph-ci.git
SUITE_REPO=https://github.com/ceph/ceph-ci.git
LIMIT=10000
DISTRO=distro
TEUTH=master
MACHINE_NAME=smithi
PRIO=71
Rerun command line:
RERUN=yuriw-2022-06-01_02:28:14-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi
teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo $CEPH_REPO --ceph-repo $CEPH_REPO -p $PRIO -R fail,dead,running,waiting --force-priority -k $DISTRO
Output:
teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo $CEPH_REPO --ceph-repo $CEPH_REPO -p $PRIO -R fail,dead,running,waiting --force-priority -k $DISTRO
2022-06-01 14:00:31,155.155 INFO:teuthology.suite:Using random seed=7374
2022-06-01 14:00:31,156.156 INFO:teuthology.suite.run:kernel sha1: distro
2022-06-01 14:00:32,547.547 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci wip-yuri-testing-2022-05-31-1642-quincy -> 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:32,547.547 INFO:teuthology.suite.run:ceph sha1: 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:32,547.547 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro centos: arch=x86_64, release=centos/7, pkg_type=rpm
2022-06-01 14:00:32,548.548 INFO:teuthology.suite.util:container build centos/8, checking for build_complete
2022-06-01 14:00:32,548.548 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:32,890.890 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2022-06-01 14:00:32,890.890 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2022-06-01 14:00:32,891.891 DEBUG:teuthology.packaging:build: centos/8 x86_64 crimson
2022-06-01 14:00:32,891.891 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
2022-06-01 14:00:32,891.891 INFO:teuthology.suite.run:ceph version: 17.2.0-382.g18d575f5
2022-06-01 14:00:33,136.136 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci.git wip-yuri-testing-2022-05-31-1642-quincy -> 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:33,379.379 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci.git wip-yuri-testing-2022-05-31-1642-quincy -> 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:33,380.380 INFO:teuthology.suite.run:ceph-ci branch: wip-yuri-testing-2022-05-31-1642-quincy 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:33,381.381 DEBUG:teuthology.repo_utils:Setting repo remote to https://github.com/ceph/ceph-ci.git
2022-06-01 14:00:33,388.388 INFO:teuthology.repo_utils:Fetching wip-yuri-testing-2022-05-31-1642-quincy from origin
2022-06-01 14:00:33,983.983 INFO:teuthology.repo_utils:Resetting repo at /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy to origin/wip-yuri-testing-2022-05-31-1642-quincy
2022-06-01 14:00:34,076.076 DEBUG:teuthology.suite.run:Check file /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/.teuthology_branch exists
2022-06-01 14:00:34,076.076 DEBUG:teuthology.suite.run:Found teuthology branch config file /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/.teuthology_branch
2022-06-01 14:00:34,077.077 DEBUG:teuthology.suite.run:The teuthology branch is overridden with master
2022-06-01 14:00:34,298.298 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/teuthology master -> 1b30281f276b97a8186594b7f92fe1f728418ada
2022-06-01 14:00:34,298.298 INFO:teuthology.suite.run:teuthology branch: master 1b30281f276b97a8186594b7f92fe1f728418ada
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:subset = None
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:no_nested_subset = False
2022-06-01 14:07:45,177.177 INFO:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados generated 1591241 jobs (not yet filtered)
2022-06-01 14:07:50,550.550 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro centos: arch=x86_64, release=centos/7, pkg_type=rpm
2022-06-01 14:07:50,551.551 INFO:teuthology.suite.util:container build centos/8, checking for build_complete
2022-06-01 14:07:50,551.551 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:07:50,891.891 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2022-06-01 14:07:50,891.891 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2022-06-01 14:07:50,892.892 DEBUG:teuthology.packaging:build: centos/8 x86_64 crimson
2022-06-01 14:07:50,892.892 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
2022-06-01 14:08:01,180.180 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro ubuntu: arch=x86_64, release=ubuntu/16.04, pkg_type=deb
2022-06-01 14:08:01,180.180 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=ubuntu%2F20.04%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:08:04,242.242 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro centos: arch=x86_64, release=centos/7, pkg_type=rpm
2022-06-01 14:08:04,243.243 INFO:teuthology.suite.util:container build centos/8, checking for build_complete
2022-06-01 14:08:04,243.243 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:08:04,692.692 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2022-06-01 14:08:04,693.693 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2022-06-01 14:08:04,693.693 DEBUG:teuthology.packaging:build: centos/8 x86_64 crimson
2022-06-01 14:08:04,693.693 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858526
2022-06-01 14:08:18,583.583 INFO:teuthology.suite.run:Scheduling rados/monthrash/{ceph clusters/3-mons mon_election/classic msgr-failures/mon-delay msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{centos_8} thrashers/sync workloads/pool-create-delete}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858528
2022-06-01 14:08:20,069.069 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench 3-final cluster/1-node k8s/1.21 net/calico rook/1.7.2}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858530
2022-06-01 14:08:21,536.536 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/none 3-final cluster/3-node k8s/1.21 net/flannel rook/master}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858532
2022-06-01 14:08:23,119.119 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench 3-final cluster/1-node k8s/1.21 net/host rook/1.7.2}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858534
2022-06-01 14:08:24,600.600 INFO:teuthology.suite.run:Scheduling rados/cephadm/osds/{0-distro/rhel_8.4_container_tools_3.0 0-nvme-loop 1-start 2-ops/rmdir-reactivate}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858536
2022-06-01 14:08:26,090.090 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/none 3-final cluster/3-node k8s/1.21 net/calico rook/master}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858538
2022-06-01 14:08:27,560.560 INFO:teuthology.suite.run:Scheduling rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async-v1only objectstore/bluestore-comp-snappy rados tasks/mon_recovery validater/valgrind}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858540
2022-06-01 14:08:29,065.065 INFO:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados scheduled 7 jobs.
2022-06-01 14:08:29,065.065 INFO:teuthology.suite.run:1591234/1591241 jobs were filtered out.
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858545
2022-06-01 14:08:35,962.962 INFO:teuthology.suite.run:Test results viewable at http://pulpito.front.sepia.ceph.com:80/yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi/
Result:
Expected: 15 jobs, got 7
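The count of 7 can be cross-checked against the scheduler's own summary lines (generated minus filtered). A minimal sketch, assuming the two message formats seen in the output above; `summarize_schedule` is a hypothetical helper and not part of teuthology, and the sample lines are abridged (no timestamps or paths):

```python
import re

def summarize_schedule(log_text):
    """Return (generated, filtered_out, remaining) parsed from suite output."""
    # e.g. "... generated 1591241 jobs (not yet filtered)"
    generated = int(re.search(r"generated (\d+) jobs", log_text).group(1))
    # e.g. "1591234/1591241 jobs were filtered out."
    filtered = int(re.search(r"(\d+)/\d+ jobs were filtered out", log_text).group(1))
    return generated, filtered, generated - filtered

# Abridged copies of the two summary messages from the log above.
sample_log = (
    "INFO:teuthology.suite.run:Suite rados generated 1591241 jobs (not yet filtered)\n"
    "INFO:teuthology.suite.run:1591234/1591241 jobs were filtered out.\n"
)

generated, filtered, remaining = summarize_schedule(sample_log)
print(remaining)  # 7, versus the 15 jobs expected from the original run
```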
Updated by Yuri Weinstein almost 2 years ago
- Status changed from New to Closed
Closing for now, as there are still some references to the 'master' branch; more info is needed.
Updated by Yuri Weinstein almost 2 years ago
I have a wip built correctly with `main` for itself as well as for teuthology and still can't see RERUN working correctly.
RUN == http://pulpito.front.sepia.ceph.com/yuriw-2022-06-01_23:19:00-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi/
Variables:
SHA1=513a3ce033e61b54e2727a6a27915fd798082922
CEPH_BRANCH=wip-yuri8-testing-2022-06-01-1114
CEPH_QA_MAIL="ceph-qa@ceph.io"
CEPH_REPO=https://github.com/ceph/ceph-ci.git
SUITE_REPO=https://github.com/ceph/ceph-ci.git
LIMIT=10000
DISTRO=distro
TEUTH=main
MACHINE_NAME=smithi
PRIO=71
Command line:
RERUN=yuriw-2022-06-01_23:19:00-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi
teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo $CEPH_REPO --ceph-repo $CEPH_REPO -p $PRIO -R fail,dead,running,waiting --force-priority -k $DISTRO -t $TEUTH
Expected: 29 jobs, got 14
Command-line used to schedule the run:
teuthology-suite -v --ceph-repo $CEPH_REPO --suite-repo $CEPH_REPO -c $CEPH_BRANCH -m $MACHINE_NAME -s rados -k $DISTRO -p $PRIO -e $CEPH_QA_MAIL --suite-branch $CEPH_BRANCH -l $LIMIT -S $SHA1 --force-priority -t $TEUTH --subset 111/120000
Updated by Laura Flores almost 2 years ago
- Status changed from Closed to New
Re-opening this until we know that --rerun has been fixed.
Updated by Zack Cerza almost 2 years ago
I've spent some time debugging this. While I don't yet have a fix or a complete RCA, I can say that the configs generated by build_matrix() are mismatched. As one example, Yuri's run had a job with this description:
rados/singleton/{all/max-pg-per-osd.from-primary mon_election/connectivity msgr-failures/none msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{centos_8}}
In the list of 1591245 generated configs, the only one with all of these fragments:
all/max-pg-per-osd.from-primary
mon_election/connectivity
msgr-failures/none
msgr/async-v2only
objectstore/filestore-xfs
is this:
rados/singleton/{all/max-pg-per-osd.from-primary mon_election/connectivity msgr-failures/none msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{rhel_8}}
The only difference is at the very end: rhel_8 vs centos_8, so the $ operator is the issue here.
What remains confusing to me is that I used the original --seed value (5707) from Yuri's run and still reproduced this.
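The comparison described above can be sketched as a set operation on description fragments. This is only an illustration under stated assumptions: `fragments` is a hypothetical helper, not a teuthology function, and it naively splits on whitespace and braces rather than using teuthology's own description handling.

```python
import re

def fragments(description):
    """Split a job description into a set of fragment tokens."""
    # Replace braces with spaces so nested groups such as
    # supported-random-distro$/{centos_8} become separate tokens.
    return set(re.sub(r"[{}]", " ", description).split())

# Description from the original run.
original = (
    "rados/singleton/{all/max-pg-per-osd.from-primary mon_election/connectivity "
    "msgr-failures/none msgr/async-v2only objectstore/filestore-xfs rados "
    "supported-random-distro$/{centos_8}}"
)
# Closest description among the regenerated configs.
regenerated = (
    "rados/singleton/{all/max-pg-per-osd.from-primary mon_election/connectivity "
    "msgr-failures/none msgr/async-v2only objectstore/filestore-xfs rados "
    "supported-random-distro$/{rhel_8}}"
)
required = {
    "all/max-pg-per-osd.from-primary",
    "mon_election/connectivity",
    "msgr-failures/none",
    "msgr/async-v2only",
    "objectstore/filestore-xfs",
}

# The regenerated config contains every required fragment...
print(required <= fragments(regenerated))
# ...and the two descriptions differ only in the randomly chosen distro.
print(sorted(fragments(original) ^ fragments(regenerated)))  # ['centos_8', 'rhel_8']
```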
Updated by Neha Ojha almost 2 years ago
Why is subset None in the rerun? I think this is making the rerun take forever.
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:subset = None
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:no_nested_subset = False
Updated by Patrick Donnelly almost 2 years ago
- Status changed from New to Fix Under Review
- Assignee changed from Zack Cerza to Patrick Donnelly
Updated by Zack Cerza almost 2 years ago
- Status changed from Fix Under Review to New
- Assignee changed from Patrick Donnelly to Zack Cerza
I merged 1762, but it doesn't fix the bug.
Updated by Neha Ojha almost 2 years ago
Neha Ojha wrote:
Why is subset None in the rerun? I think this is making the rerun take forever.
[...]
I applied https://github.com/ceph/teuthology/pull/1762, now subset and seed are getting set correctly.
2022-06-03T14:10:49.750 INFO:teuthology.results:subset: '111/120000'
2022-06-03T14:10:49.750 INFO:teuthology.results:seed: '8384'
Before
2022-06-06 23:09:19,208.208 INFO:teuthology.report:got seed None
...
2022-06-06 23:09:19,208.208 INFO:teuthology.suite:Using random seed=7429
2022-06-06 23:07:19,308.308 DEBUG:teuthology.suite.run:subset = None
After
2022-06-06 23:15:33,149.149 INFO:teuthology.report:got seed 8384
...
2022-06-06 23:15:34,095.095 DEBUG:teuthology.suite.run:subset = (111, 120000)
Rerun of https://pulpito.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/ generates 12 jobs
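The before/after logs matter because a randomly resolved fragment can only be reproduced when the rerun reuses the original seed and subset. A minimal sketch of that idea using Python's standard `random` module; it is not teuthology's actual implementation, and `parse_subset`, `pick_distro`, and the candidate list are hypothetical:

```python
import random

def parse_subset(text):
    """Turn a stored subset string such as '111/120000' into an (index, outof) tuple."""
    index, outof = text.split("/")
    return int(index), int(outof)

def pick_distro(seed, candidates):
    """Pick one candidate using a private generator seeded with `seed`."""
    rng = random.Random(seed)
    return rng.choice(candidates)

# Hypothetical candidate list standing in for a random-distro directory.
candidates = ["centos_8", "rhel_8", "ubuntu_20.04"]

# Same seed, same candidates, same order: the choice is reproducible.
first = pick_distro(8384, candidates)
second = pick_distro(8384, candidates)
print(first == second)

# A rerun that falls back to a fresh random seed carries no such guarantee.
print(parse_subset("111/120000"))  # (111, 120000)
```

Note that the guarantee only holds if the candidate list and the order of random draws are also identical between the two runs.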
Updated by Zack Cerza almost 2 years ago
- Status changed from New to Resolved
- Assignee changed from Zack Cerza to Patrick Donnelly
Neha Ojha wrote:
Rerun of https://pulpito.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/ generates 12 jobs
Hmm, I get 12 from that one too! And, on teuthology.front I am actually getting 29 with the first run. So this:
I merged 1762, but it doesn't fix the bug.
was wrong. There is a separate issue where passing the correct --seed and --subset values behaves incorrectly, but doesn't appear to affect all runs. That'll belong in a separate ticket, though.