Bug #55815

closed

RERUN is broken: does not schedule the correct number of jobs

Added by Yuri Weinstein almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Category:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Variables:

SHA1=18d575f5af7790222ee9d36af1d4518581525769
CEPH_BRANCH=wip-yuri-testing-2022-05-31-1642-quincy
CEPH_QA_MAIL="ceph-qa@ceph.io" 
CEPH_REPO=https://github.com/ceph/ceph-ci.git
SUITE_REPO=https://github.com/ceph/ceph-ci.git
LIMIT=10000
DISTRO=distro
TEUTH=master
MACHINE_NAME=smithi
PRIO=71

Run:
http://pulpito.front.sepia.ceph.com/yuriw-2022-06-01_02:28:14-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi/

Rerun command line:

RERUN=yuriw-2022-06-01_02:28:14-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi
teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo $CEPH_REPO --ceph-repo $CEPH_REPO -p $PRIO -R fail,dead,running,waiting --force-priority -k $DISTRO

Output:

teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo $CEPH_REPO --ceph-repo $CEPH_REPO -p $PRIO -R fail,dead,running,waiting --force-priority -k $DISTRO
2022-06-01 14:00:31,155.155 INFO:teuthology.suite:Using random seed=7374
2022-06-01 14:00:31,156.156 INFO:teuthology.suite.run:kernel sha1: distro
2022-06-01 14:00:32,547.547 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci wip-yuri-testing-2022-05-31-1642-quincy -> 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:32,547.547 INFO:teuthology.suite.run:ceph sha1: 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:32,547.547 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro centos: arch=x86_64, release=centos/7, pkg_type=rpm
2022-06-01 14:00:32,548.548 INFO:teuthology.suite.util:container build centos/8, checking for build_complete
2022-06-01 14:00:32,548.548 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:32,890.890 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2022-06-01 14:00:32,890.890 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2022-06-01 14:00:32,891.891 DEBUG:teuthology.packaging:build: centos/8 x86_64 crimson
2022-06-01 14:00:32,891.891 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
2022-06-01 14:00:32,891.891 INFO:teuthology.suite.run:ceph version: 17.2.0-382.g18d575f5
2022-06-01 14:00:33,136.136 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci.git wip-yuri-testing-2022-05-31-1642-quincy -> 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:33,379.379 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci.git wip-yuri-testing-2022-05-31-1642-quincy -> 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:33,380.380 INFO:teuthology.suite.run:ceph-ci branch: wip-yuri-testing-2022-05-31-1642-quincy 18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:00:33,381.381 DEBUG:teuthology.repo_utils:Setting repo remote to https://github.com/ceph/ceph-ci.git
2022-06-01 14:00:33,388.388 INFO:teuthology.repo_utils:Fetching wip-yuri-testing-2022-05-31-1642-quincy from origin
2022-06-01 14:00:33,983.983 INFO:teuthology.repo_utils:Resetting repo at /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy to origin/wip-yuri-testing-2022-05-31-1642-quincy
2022-06-01 14:00:34,076.076 DEBUG:teuthology.suite.run:Check file /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/.teuthology_branch exists
2022-06-01 14:00:34,076.076 DEBUG:teuthology.suite.run:Found teuthology branch config file /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/.teuthology_branch
2022-06-01 14:00:34,077.077 DEBUG:teuthology.suite.run:The teuthology branch is overridden with master
2022-06-01 14:00:34,298.298 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/teuthology master -> 1b30281f276b97a8186594b7f92fe1f728418ada
2022-06-01 14:00:34,298.298 INFO:teuthology.suite.run:teuthology branch: master 1b30281f276b97a8186594b7f92fe1f728418ada
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:subset = None
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:no_nested_subset = False
2022-06-01 14:07:45,177.177 INFO:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados generated 1591241 jobs (not yet filtered)
2022-06-01 14:07:50,550.550 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro centos: arch=x86_64, release=centos/7, pkg_type=rpm
2022-06-01 14:07:50,551.551 INFO:teuthology.suite.util:container build centos/8, checking for build_complete
2022-06-01 14:07:50,551.551 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:07:50,891.891 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2022-06-01 14:07:50,891.891 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2022-06-01 14:07:50,892.892 DEBUG:teuthology.packaging:build: centos/8 x86_64 crimson
2022-06-01 14:07:50,892.892 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
2022-06-01 14:08:01,180.180 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro ubuntu: arch=x86_64, release=ubuntu/16.04, pkg_type=deb
2022-06-01 14:08:01,180.180 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=ubuntu%2F20.04%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:08:04,242.242 DEBUG:teuthology.suite.util:Defaults for machine_type smithi distro centos: arch=x86_64, release=centos/7, pkg_type=rpm
2022-06-01 14:08:04,243.243 INFO:teuthology.suite.util:container build centos/8, checking for build_complete
2022-06-01 14:08:04,243.243 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=18d575f5af7790222ee9d36af1d4518581525769
2022-06-01 14:08:04,692.692 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2022-06-01 14:08:04,693.693 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2022-06-01 14:08:04,693.693 DEBUG:teuthology.packaging:build: centos/8 x86_64 crimson
2022-06-01 14:08:04,693.693 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858526
2022-06-01 14:08:18,583.583 INFO:teuthology.suite.run:Scheduling rados/monthrash/{ceph clusters/3-mons mon_election/classic msgr-failures/mon-delay msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{centos_8} thrashers/sync workloads/pool-create-delete}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858528
2022-06-01 14:08:20,069.069 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench 3-final cluster/1-node k8s/1.21 net/calico rook/1.7.2}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858530
2022-06-01 14:08:21,536.536 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/none 3-final cluster/3-node k8s/1.21 net/flannel rook/master}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858532
2022-06-01 14:08:23,119.119 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench 3-final cluster/1-node k8s/1.21 net/host rook/1.7.2}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858534
2022-06-01 14:08:24,600.600 INFO:teuthology.suite.run:Scheduling rados/cephadm/osds/{0-distro/rhel_8.4_container_tools_3.0 0-nvme-loop 1-start 2-ops/rmdir-reactivate}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858536
2022-06-01 14:08:26,090.090 INFO:teuthology.suite.run:Scheduling rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/none 3-final cluster/3-node k8s/1.21 net/calico rook/master}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858538
2022-06-01 14:08:27,560.560 INFO:teuthology.suite.run:Scheduling rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async-v1only objectstore/bluestore-comp-snappy rados tasks/mon_recovery validater/valgrind}
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858540
2022-06-01 14:08:29,065.065 INFO:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados scheduled 7 jobs.
2022-06-01 14:08:29,065.065 INFO:teuthology.suite.run:1591234/1591241 jobs were filtered out.
Job scheduled with name yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi and ID 6858545
2022-06-01 14:08:35,962.962 INFO:teuthology.suite.run:Test results viewable at http://pulpito.front.sepia.ceph.com:80/yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi/

Result:

http://pulpito.front.sepia.ceph.com:80/yuriw-2022-06-01_14:00:31-rados-wip-yuri-testing-2022-05-31-1642-quincy-distro-default-smithi/

Expected: 15 jobs, got 7
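
The shortfall can be cross-checked against the summary lines of the log above. The following is only a minimal sketch: it parses three lines copied from the output and compares them with the expected count from this report. The names `LOG_TAIL`, `EXPECTED` and `summarize` are hypothetical helpers for illustration and are not part of teuthology.

```python
import re

# Three summary lines copied verbatim from the rerun output above.
LOG_TAIL = """\
2022-06-01 14:07:45,177.177 INFO:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados generated 1591241 jobs (not yet filtered)
2022-06-01 14:08:29,065.065 INFO:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados scheduled 7 jobs.
2022-06-01 14:08:29,065.065 INFO:teuthology.suite.run:1591234/1591241 jobs were filtered out.
"""

EXPECTED = 15  # number of jobs the reporter expected the rerun to schedule


def summarize(log_text):
    """Return (generated, scheduled, filtered) job counts found in the log text."""
    generated = int(re.search(r"generated (\d+) jobs", log_text).group(1))
    scheduled = int(re.search(r"scheduled (\d+) jobs", log_text).group(1))
    filtered = int(re.search(r"(\d+)/\d+ jobs were filtered out", log_text).group(1))
    return generated, scheduled, filtered


generated, scheduled, filtered = summarize(LOG_TAIL)
missing = EXPECTED - scheduled
print(f"generated={generated} filtered={filtered} scheduled={scheduled} missing={missing}")
```

The log is internally consistent (generated minus filtered equals scheduled), so the missing jobs were dropped by the filter rather than lost at scheduling time.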

Actions #1

Updated by Yuri Weinstein almost 2 years ago

  • Assignee set to Zack Cerza

@Zack, please take a look.

Actions #2

Updated by Yuri Weinstein almost 2 years ago

  • Status changed from New to Closed

Closing for now, as there are still some references to the 'master' branch; more info is needed.

Actions #3

Updated by Yuri Weinstein almost 2 years ago

I have a wip branch built correctly against `main`, both for itself and for teuthology, and RERUN still does not work correctly.

RUN == http://pulpito.front.sepia.ceph.com/yuriw-2022-06-01_23:19:00-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi/

Variables:

SHA1=513a3ce033e61b54e2727a6a27915fd798082922
CEPH_BRANCH=wip-yuri8-testing-2022-06-01-1114
CEPH_QA_MAIL="ceph-qa@ceph.io" 
CEPH_REPO=https://github.com/ceph/ceph-ci.git
SUITE_REPO=https://github.com/ceph/ceph-ci.git
LIMIT=10000
DISTRO=distro
TEUTH=main
MACHINE_NAME=smithi
PRIO=71

Command line:

RERUN=yuriw-2022-06-01_23:19:00-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi
teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo $CEPH_REPO --ceph-repo $CEPH_REPO -p $PRIO -R fail,dead,running,waiting --force-priority -k $DISTRO -t $TEUTH
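
For anyone scripting this rerun, the same command line can be assembled from the variables above. This is a minimal sketch rather than an official wrapper: `build_rerun_argv` is a hypothetical helper, and it uses only the flags that already appear in the command above.

```python
import shlex

# Variables copied from this comment.
env = {
    "CEPH_BRANCH": "wip-yuri8-testing-2022-06-01-1114",
    "CEPH_REPO": "https://github.com/ceph/ceph-ci.git",
    "DISTRO": "distro",
    "TEUTH": "main",
    "MACHINE_NAME": "smithi",
    "PRIO": 71,
}
rerun = "yuriw-2022-06-01_23:19:00-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi"


def build_rerun_argv(env, rerun, statuses=("fail", "dead", "running", "waiting")):
    """Assemble the rerun argv using only the flags shown in this ticket."""
    argv = [
        "teuthology-suite", "-v",
        "-c", env["CEPH_BRANCH"],
        "-m", env["MACHINE_NAME"],
        "-r", rerun,
        "--suite-repo", env["CEPH_REPO"],
        "--ceph-repo", env["CEPH_REPO"],
        "-p", str(env["PRIO"]),
        "-R", ",".join(statuses),
        "--force-priority",
        "-k", env["DISTRO"],
    ]
    # -t is only passed when a teuthology branch is set, as in this comment.
    if env.get("TEUTH"):
        argv += ["-t", env["TEUTH"]]
    return argv


argv = build_rerun_argv(env, rerun)
print(shlex.join(argv))  # shell-quoted form, suitable for pasting into a terminal
```

Passing the list form to `subprocess.run(argv)` avoids shell quoting issues with the colons in the run name.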

Result => http://pulpito.front.sepia.ceph.com/yuriw-2022-06-02_20:34:56-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi

Expected 29, got 14

Command-line used to schedule the run:

teuthology-suite -v --ceph-repo $CEPH_REPO --suite-repo $CEPH_REPO -c $CEPH_BRANCH -m $MACHINE_NAME -s rados -k $DISTRO -p $PRIO -e $CEPH_QA_MAIL --suite-branch $CEPH_BRANCH -l $LIMIT -S $SHA1 --force-priority -t $TEUTH --subset 111/120000

Actions #4

Updated by Laura Flores almost 2 years ago

  • Status changed from Closed to New

Re-opening this until we know that --rerun has been fixed.

Actions #5

Updated by Zack Cerza almost 2 years ago

I've spent some time debugging this. While I don't yet have a fix or a complete RCA, I can say that the configs generated by build_matrix() are mismatched. As one example, Yuri's run had a job with this description:
rados/singleton/{all/max-pg-per-osd.from-primary mon_election/connectivity msgr-failures/none msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{centos_8}}
In the list of 1591245 generated configs, the only one that contains all of these fragments:
all/max-pg-per-osd.from-primary mon_election/connectivity msgr-failures/none msgr/async-v2only objectstore/filestore-xfs
is this one:
rados/singleton/{all/max-pg-per-osd.from-primary mon_election/connectivity msgr-failures/none msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{rhel_8}}
The only difference is at the very end, rhel_8 vs. centos_8, so the $ operator is the issue here.

What remains confusing to me is that I used the original --seed value (5707) from Yuri's run and still reproduced this.
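
To illustrate why a different random pick makes the rerun miss jobs, here is a toy model and not teuthology's actual implementation. It assumes the rerun matches regenerated jobs against the original descriptions, which is what the rhel_8 vs. centos_8 mismatch above suggests. `DISTROS`, `describe_jobs` and `rerun_candidates` are hypothetical names, and the toy does not explain why reusing the original seed still reproduced the problem.

```python
import random

# Hypothetical set of choices behind the `$` (random pick) operator.
DISTROS = ["centos_8", "rhel_8", "ubuntu_20.04"]


def describe_jobs(bases, seed):
    """Toy stand-in for matrix generation: one seeded random distro pick per job."""
    rng = random.Random(seed)
    return [
        f"{base} supported-random-distro$/{{{rng.choice(DISTROS)}}}"
        for base in bases
    ]


def rerun_candidates(original, regenerated):
    """Keep only regenerated jobs whose description matches an original job exactly."""
    wanted = set(original)
    return [desc for desc in regenerated if desc in wanted]


bases = [f"rados/singleton/all/test-{i}" for i in range(20)]  # made-up job names
original = describe_jobs(bases, seed=5707)
same_seed = describe_jobs(bases, seed=5707)
other_seed = describe_jobs(bases, seed=7374)

# With the same seed and the same traversal, every description is reproduced.
print(len(rerun_candidates(original, same_seed)))
# With another seed, any job whose distro pick changed no longer matches.
print(len(rerun_candidates(original, other_seed)))
```

In this toy, a job drops out of the rerun as soon as its random pick differs from the original run, even though every other fragment is identical.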

Actions #6

Updated by Neha Ojha almost 2 years ago

Why is subset None in the rerun? I think this is making the rerun take forever.

2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:Suite rados in /home/yuriw/src/github.com_ceph_ceph-c_wip-yuri-testing-2022-05-31-1642-quincy/qa/suites/rados
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:subset = None
2022-06-01 14:00:34,313.313 DEBUG:teuthology.suite.run:no_nested_subset = False

Actions #7

Updated by Patrick Donnelly almost 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Zack Cerza to Patrick Donnelly

Actions #8

Updated by Zack Cerza almost 2 years ago

  • Status changed from Fix Under Review to New
  • Assignee changed from Patrick Donnelly to Zack Cerza

I merged 1762, but it doesn't fix the bug.

Actions #9

Updated by Neha Ojha almost 2 years ago

Neha Ojha wrote:

Why is subset None in the rerun? I think this is making the rerun take forever.

[...]

I applied https://github.com/ceph/teuthology/pull/1762; now subset and seed are being set correctly.

Using https://pulpito.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/

2022-06-03T14:10:49.750 INFO:teuthology.results:subset: '111/120000'
2022-06-03T14:10:49.750 INFO:teuthology.results:seed: '8384'

Before

2022-06-06 23:09:19,208.208 INFO:teuthology.report:got seed None
...
2022-06-06 23:09:19,208.208 INFO:teuthology.suite:Using random seed=7429
2022-06-06 23:07:19,308.308 DEBUG:teuthology.suite.run:subset = None

After

2022-06-06 23:15:33,149.149 INFO:teuthology.report:got seed 8384
...
2022-06-06 23:15:34,095.095 DEBUG:teuthology.suite.run:subset = (111, 120000)

Rerun of https://pulpito.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/ generates 12 jobs
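
To check by hand which values a rerun should inherit, the subset and seed can be pulled out of the original run's log lines quoted above. This is a minimal sketch only; `parse_subset_and_seed` is a hypothetical helper and is not how teuthology itself retrieves these values.

```python
import re

# The two lines from the original run quoted above.
RESULTS_LOG = """\
2022-06-03T14:10:49.750 INFO:teuthology.results:subset: '111/120000'
2022-06-03T14:10:49.750 INFO:teuthology.results:seed: '8384'
"""


def parse_subset_and_seed(log_text):
    """Return ((index, outof), seed); either part is None when not found."""
    subset_match = re.search(r"subset: '(\d+)/(\d+)'", log_text)
    seed_match = re.search(r"seed: '(\d+)'", log_text)
    subset = tuple(int(g) for g in subset_match.groups()) if subset_match else None
    seed = int(seed_match.group(1)) if seed_match else None
    return subset, seed


subset, seed = parse_subset_and_seed(RESULTS_LOG)
print(subset, seed)
```

The parsed values correspond to the `subset = (111, 120000)` and `got seed 8384` lines in the "After" output.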

Actions #10

Updated by Zack Cerza almost 2 years ago

  • Status changed from New to Resolved
  • Assignee changed from Zack Cerza to Patrick Donnelly

Neha Ojha wrote:

Rerun of https://pulpito.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/ generates 12 jobs

Hmm, I get 12 from that one too! And, on teuthology.front I am actually getting 29 with the first run. So this:

I merged 1762, but it doesn't fix the bug.

was wrong. There is a separate issue where passing the correct --seed and --subset values behaves incorrectly, but doesn't appear to affect all runs. That'll belong in a separate ticket, though.
