Bug #58914 (closed)

[ FAILED ] TestClsRbd.group_snap_list_max_read in upgrade:quincy-x-reef

Added by Yuri Weinstein about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: https://pulpito.ceph.com/yuriw-2023-03-03_17:40:19-upgrade:quincy-x-reef-distro-default-smithi/
Jobs: '7192577', '7192593', '7192571', '7192565'
Logs: /a/yuriw-2023-03-03_17:40:19-upgrade:quincy-x-reef-distro-default-smithi/7192565/teuthology.log

2023-03-03T19:53:32.658 INFO:tasks.workunit.client.0.smithi171.stdout:[  FAILED  ] 1 test, listed below:
2023-03-03T19:53:32.658 INFO:tasks.workunit.client.0.smithi171.stdout:[  FAILED  ] TestClsRbd.group_snap_list_max_read
2023-03-03T19:53:32.658 INFO:tasks.workunit.client.0.smithi171.stdout:
2023-03-03T19:53:32.658 INFO:tasks.workunit.client.0.smithi171.stdout: 1 FAILED TEST


Related issues 1 (0 open, 1 closed)

Related to rbd - Bug #58265: TestClsRbd.group_snap_list_max_read failure during upgrade/parallel tests (status: Rejected, assignee: Ilya Dryomov)

Actions #1

Updated by Yuri Weinstein about 1 year ago

  • Related to Bug #58265: TestClsRbd.group_snap_list_max_read failure during upgrade/parallel tests added
Actions #2

Updated by Laura Flores about 1 year ago

  • Tags set to test-failure

/a/yuriw-2023-03-15_21:14:59-upgrade:pacific-x-quincy-release-distro-default-smithi/7209141

2023-03-15T23:06:47.167 INFO:tasks.workunit.client.0.smithi043.stdout:[ RUN      ] TestClsRbd.group_snap_list_max_read
2023-03-15T23:06:47.167 INFO:journalctl@ceph.mon.c.smithi043.stdout:Mar 15 23:06:21 smithi043 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon.c[109215]: cluster 2023-03-15T23:06:19.806318+0000 mgr.y (mgr.14148) 711 : cluster [DBG] pgmap v671: 105 pgs: 9 creating+peering, 19 unknown, 77 active+clean; 2.3 KiB data, 46 MiB used, 715 GiB / 715 GiB avail; 23 KiB/s rd, 7.8 KiB/s wr, 24 op/s
2023-03-15T23:06:47.167 INFO:journalctl@ceph.mon.a.smithi043.stdout:Mar 15 23:06:21 smithi043 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon-a[105119]: cluster 2023-03-15T23:06:19.806318+0000 mgr.y (mgr.14148) 711 : cluster [DBG] pgmap v671: 105 pgs: 9 creating+peering, 19 unknown, 77 active+clean; 2.3 KiB data, 46 MiB used, 715 GiB / 715 GiB avail; 23 KiB/s rd, 7.8 KiB/s wr, 24 op/s
2023-03-15T23:06:47.168 INFO:journalctl@ceph.mon.b.smithi121.stdout:Mar 15 23:06:41 smithi121 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon.b[105182]: cluster 2023-03-15T23:06:39.811417+0000 mgr.y (mgr.14148) 725 : cluster [DBG] pgmap v684: 73 pgs: 73 active+clean; 2.3 KiB data, 80 MiB used, 715 GiB / 715 GiB avail
2023-03-15T23:06:47.168 INFO:journalctl@ceph.mon.b.smithi121.stdout:Mar 15 23:06:43 smithi121 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon.b[105182]: cluster 2023-03-15T23:06:41.812127+0000 mgr.y (mgr.14148) 726 : cluster [DBG] pgmap v685: 73 pgs: 73 active+clean; 2.3 KiB data, 80 MiB used, 715 GiB / 715 GiB avail
2023-03-15T23:06:47.168 INFO:journalctl@ceph.mon.b.smithi121.stdout:Mar 15 23:06:45 smithi121 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon.b[105182]: cluster 2023-03-15T23:06:43.812581+0000 mgr.y (mgr.14148) 727 : cluster [DBG] pgmap v686: 73 pgs: 73 active+clean; 2.3 KiB data, 80 MiB used, 715 GiB / 715 GiB avail
2023-03-15T23:06:47.168 INFO:journalctl@ceph.mon.b.smithi121.stdout:Mar 15 23:06:45 smithi121 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon.b[105182]: audit 2023-03-15T23:06:44.976050+0000 mon.a (mon.0) 848 : audit [INF] from='mgr.14148 172.21.15.43:0/2120567852' entity='mgr.y' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/y/mirror_snapshot_schedule"}]: dispatch
2023-03-15T23:06:47.168 INFO:journalctl@ceph.mon.b.smithi121.stdout:Mar 15 23:06:45 smithi121 ceph-e076d75a-c382-11ed-9afb-001a4aab830c-mon.b[105182]: audit 2023-03-15T23:06:44.976273+0000 mon.a (mon.0) 849 : audit [INF] from='mgr.14148 172.21.15.43:0/2120567852' entity='mgr.y' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/y/trash_purge_schedule"}]: dispatch
2023-03-15T23:06:47.169 INFO:tasks.workunit.client.0.smithi043.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11-292-gfad62dd9/rpm/el8/BUILD/ceph-16.2.11-292-gfad62dd9/src/test/cls_rbd/test_cls_rbd.cc:2768: Failure
2023-03-15T23:06:47.169 INFO:tasks.workunit.client.0.smithi043.stdout:Expected equality of these values:
2023-03-15T23:06:47.169 INFO:tasks.workunit.client.0.smithi043.stdout:  150U
2023-03-15T23:06:47.169 INFO:tasks.workunit.client.0.smithi043.stdout:    Which is: 150
2023-03-15T23:06:48.313 INFO:tasks.workunit.client.0.smithi043.stdout:  snapshots.size()
2023-03-15T23:06:48.313 INFO:tasks.workunit.client.0.smithi043.stdout:    Which is: 500
2023-03-15T23:06:48.314 INFO:tasks.workunit.client.0.smithi043.stdout:[  FAILED  ] TestClsRbd.group_snap_list_max_read (653 ms)

Actions #3

Updated by Ilya Dryomov about 1 year ago

Looking at the original occurrence (/a/yuriw-2023-03-03_17:40:19-upgrade:quincy-x-reef-distro-default-smithi/7192565/teuthology.log), the issue is that this job ends up running a new test case (one which asserts that a bug -- https://tracker.ceph.com/issues/57066 -- is fixed) against OSDs that don't have the fix.

2023-03-03T19:22:15.292 INFO:teuthology.orchestra.run.smithi171.stdout:osd.0                     smithi171                    running (112s)    10s ago  13m    74.6M    2500M  18.0.0-2694-g33b4b31b  7d5fb8cf3e4b  9fe157abaac6
2023-03-03T19:22:15.292 INFO:teuthology.orchestra.run.smithi171.stdout:osd.1                     smithi171                    running (79s)     10s ago  13m    64.1M    2500M  18.0.0-2694-g33b4b31b  7d5fb8cf3e4b  2eeca83a4812
2023-03-03T19:22:15.292 INFO:teuthology.orchestra.run.smithi171.stdout:osd.2                     smithi171                    running (47s)     10s ago  12m    59.5M    2500M  18.0.0-2694-g33b4b31b  7d5fb8cf3e4b  4660cd6f7fe6
2023-03-03T19:22:15.292 INFO:teuthology.orchestra.run.smithi171.stdout:osd.3                     smithi171                    running (15s)     10s ago  12m    13.9M    2500M  18.0.0-2694-g33b4b31b  7d5fb8cf3e4b  0d2f8957a57e
2023-03-03T19:22:15.293 INFO:teuthology.orchestra.run.smithi171.stdout:osd.4                     smithi181                    running (12m)     42s ago  12m     206M    2500M  17.2.5                 c724f2de2337  3977e8815a43
2023-03-03T19:22:15.293 INFO:teuthology.orchestra.run.smithi171.stdout:osd.5                     smithi181                    running (11m)     42s ago  11m     210M    2500M  17.2.5                 c724f2de2337  126152f454e3
2023-03-03T19:22:15.293 INFO:teuthology.orchestra.run.smithi171.stdout:osd.6                     smithi181                    running (11m)     42s ago  11m     143M    2500M  17.2.5                 c724f2de2337  b4bcd4335a84
2023-03-03T19:22:15.293 INFO:teuthology.orchestra.run.smithi171.stdout:osd.7                     smithi181                    running (11m)     42s ago  11m     172M    2500M  17.2.5                 c724f2de2337  d4901dacb95f

Half of the OSDs are on the 17.2.5 tag (which doesn't have the fix); the other half are on the tip of the reef branch. The test binary comes from the tip of the quincy branch:

2023-03-03T19:53:31.147 INFO:tasks.workunit.client.0.smithi171.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5-1212-ga1b220fc/rpm/el8/BUILD/ceph-17.2.5-1212-ga1b220fc/src/test/cls_rbd/test_cls_rbd.cc:2768: Failure

If the object in question happens to reside on any of the 17.2.5 OSDs, the test fails; otherwise it succeeds. Under such conditions, this is expected.

I'm not sure what versions were actually intended to be tested here, but this upgrade suite seems broken to me. Putting the upgrade part aside, one can't run a newer test (17.2.5-1212-ga1b220fc) against older OSDs (17.2.5) and expect that to work. The upgrade part then just introduces some randomness.

Yuri, can you clarify what the intent is? If it's to test an upgrade from the tip of the quincy branch, then the job definition needs to be fixed to deploy the respective OSDs. If it's to test an upgrade from the most recent quincy tag, then there is nothing we can do about this failure (but it's still wrong to use a mismatching test binary).

Actions #4

Updated by Laura Flores about 1 year ago

Ilya, the idea of stress-split tests is to run tests against a cluster with mixed daemon versions, to mimic real clusters where daemons are upgraded gradually. However, if it doesn't make sense to run this RBD test against mixed versions, perhaps it should be run before and after the upgrade sequence, not during.

What do you think?

Actions #5

Updated by Ilya Dryomov about 1 year ago

The issue is not that this test is run during the upgrade (i.e. against a half-upgraded cluster). The way this job is defined, the test would fail the "before" attempt and pass the "after" attempt with your proposal.

What I'm looking for is the precise definition of "mixed versions", or rather just the start version (the end version is the tip of the reef branch, there is no confusion there). Is the start version intended to be the tip of the quincy branch or the most recent quincy tag?

Actions #6

Updated by Laura Flores about 1 year ago

Ilya Dryomov wrote:

The issue is not that this test is run during the upgrade (i.e. against a half-upgraded cluster). The way this job is defined, the test would fail the "before" attempt and pass the "after" attempt with your proposal.

What I'm looking for is the precise definition of "mixed versions", or rather just the start version (the end version is the tip of the reef branch, there is no confusion there). Is the start version intended to be the tip of the quincy branch or the most recent quincy tag?

Ah, I understand. The start version is meant to be the most recent quincy tag on quay.io (see https://github.com/ceph/ceph/blob/eab8a6d1e2c0f0943f27f6872d02fd0eef32b210/qa/suites/upgrade/quincy-x/stress-split/1-start.yaml#L8).
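For illustration, the cephadm-based start step has roughly this shape (a sketch only: the keys and the exact image reference are assumptions, and the linked 1-start.yaml is authoritative). Packages, and therefore the cls_rbd test binary, come from the quincy branch build, while the containers come from a released tag:

tasks:
- install:
    branch: quincy                     # packages / test binaries: tip of the quincy branch
- cephadm:
    image: quay.io/ceph/ceph:v17.2.5   # container daemons: most recent released quincy tag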

Actions #7

Updated by Yuri Weinstein about 1 year ago

IIRC the way we run these tests is the same as we've been doing for all other releases. What makes this different?

Actions #9

Updated by Ilya Dryomov about 1 year ago

Laura Flores wrote:

Ah, I understand. The start version is meant to be the most recent quincy tag on quay.io (see https://github.com/ceph/ceph/blob/eab8a6d1e2c0f0943f27f6872d02fd0eef32b210/qa/suites/upgrade/quincy-x/stress-split/1-start.yaml#L8).

Forgive me for being thick, but is that what it's meant to be, or did you look at the job definition and just put what happens to be there into English words? Because the same job definition installs test binaries from the tip of the quincy branch, and that is wrong.

I keep asking and stressing the difference between the intent and the current job definition for two reasons:

1. Like I said, regardless of everything else, one can't run a newer test (tip of the branch) against older OSDs (some tag on the branch) and expect that to work in all cases. If a bug gets fixed in the interim and a corresponding test gets updated or a new test gets added, it's going to fail.
2. This (the start version being the most recent tag as opposed to the tip of the branch) is not what we did before. If you look at older upgrade suites, e.g. mimic -> nautilus (https://github.com/ceph/ceph/blob/nautilus/qa/suites/upgrade/mimic-x/stress-split/1-ceph-install/mimic.yaml), it installed the tip of mimic (see the sketch below).

I think this issue crept in with the move to cephadm.
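For comparison, the pre-cephadm install step looked roughly like this (a sketch based on the linked mimic.yaml, not a verbatim copy); daemons and test binaries both came from the tip of the same branch, so they could never disagree:

tasks:
- install:
    branch: mimic   # daemons and test binaries both built from the branch tip
- ceph:             # cluster deployed from the same packages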

Actions #10

Updated by Ilya Dryomov about 1 year ago

Yuri Weinstein wrote:

IIRC the way we run these tests is the same as we've been doing for all other releases. What makes this different?

A bug in one of the RBD class methods was tracked down. The fix, along with an updated test which asserts that the bug is fixed, was backported to quincy for 17.2.6 (https://tracker.ceph.com/issues/58152). You are running this new (post-17.2.5) test against old (17.2.5) OSDs that don't have the fix.

Actions #11

Updated by Yuri Weinstein about 1 year ago

Ilya Dryomov wrote:

A bug in one of the RBD class methods was tracked down. The fix, along with an updated test which asserts that the bug is fixed, was backported to quincy for 17.2.6 (https://tracker.ceph.com/issues/58152). You are running this new (post-17.2.5) test against old (17.2.5) OSDs that don't have the fix.

I see, thx @Ilya

Then depending on our desire we can a) keep these failures as "known" and do nothing, or b) backport the fix to 17.2.5

Actions #12

Updated by Ilya Dryomov about 1 year ago

Yuri Weinstein wrote:

Then depending on our desire we can a) keep these failures as "known" and do nothing, or b) backport the fix to 17.2.5

We can't backport anything to 17.2.5 because it's already been released. We can:

a) keep these failures as "known" and do nothing (they would disappear when 17.2.6 is released), or
b) restore the previous upgrade suite behavior where the start version was the tip of the branch.

Needless to say, b) would also prevent similar failures in the future.
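In the cephadm-based suite, b) would roughly amount to bootstrapping from a branch build rather than a released tag. A sketch only, with an assumed ceph-ci image reference rather than the actual change:

tasks:
- install:
    branch: quincy
- cephadm:
    image: quay.ceph.io/ceph-ci/ceph:quincy   # branch-tip build, matching the installed test binaries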

Actions #13

Updated by Yuri Weinstein about 1 year ago

After @Ilya described the wrong `tags` issue I was confused.
But with @Josh Jones' help, here is the summary.

The issue is present on the `main`, `quincy`, and `reef` branches.

When we pull images, they must be pulled from `ceph-ci`, simply because in `ceph` the "latest" versions (e.g. "latest-*") point to the latest released tag.

In addition, pulling and installing images like "image: docker.io/ceph/daemon-base:latest-octopus" is incorrect, as the docker.io repos aren't even updated anymore. We had to use "quay.ceph.io" instead.
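For example, an image reference of the first form below would need to become something like the second (illustrative only; the registry path of the replacement is an assumption, and the linked PRs contain the actual changes):

image: docker.io/ceph/daemon-base:latest-octopus   # stale docker.io repo, no longer updated
image: quay.ceph.io/ceph-ci/ceph:octopus           # pulled from the quay.ceph.io (ceph-ci) registry instead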

Also, a typo was found in the lines:

workunit:
     branch: octopus

Please see:
https://github.com/ceph/ceph/pull/50747
https://github.com/ceph/ceph/pull/50758
https://github.com/ceph/ceph/pull/50759

Actions #14

Updated by Yuri Weinstein about 1 year ago

  • Assignee set to Yuri Weinstein
Actions #15

Updated by Yuri Weinstein about 1 year ago

  • Status changed from New to Resolved