Bug #55853

closed

test_cls_rgw.sh: failures in 'cls_rgw.index_list' and 'cls_rgw.index_list_delimited'

Added by Laura Flores almost 2 years ago. Updated about 1 year ago.

Status: Can't reproduce
Priority: High
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/6862540

2022-06-03T15:45:00.767 INFO:tasks.workunit.client.0.smithi033.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-84-g40883f98/rpm/el8/BUILD/ceph-16.2.9-84-g40883f98/src/test/cls_rgw/test_cls_rgw.cc:439: Failure
2022-06-03T15:45:00.832 INFO:tasks.workunit.client.0.smithi033.stdout:Expected equality of these values:
2022-06-03T15:45:00.833 INFO:tasks.workunit.client.0.smithi033.stdout:  4u
2022-06-03T15:45:00.833 INFO:tasks.workunit.client.0.smithi033.stdout:    Which is: 4
2022-06-03T15:45:00.833 INFO:tasks.workunit.client.0.smithi033.stdout:  m.size()
2022-06-03T15:45:00.833 INFO:tasks.workunit.client.0.smithi033.stdout:    Which is: 0
2022-06-03T15:45:00.833 INFO:tasks.workunit.client.0.smithi033.stdout:[  FAILED  ] cls_rgw.index_list (27 ms)
2022-06-03T15:45:00.833 INFO:tasks.workunit.client.0.smithi033.stdout:[ RUN      ] cls_rgw.index_list_delimited

...

2022-06-03T15:45:41.035 INFO:tasks.workunit.client.0.smithi033.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-84-g40883f98/rpm/el8/BUILD/ceph-16.2.9-84-g40883f98/src/test/cls_rgw/test_cls_rgw.cc:524: Failure
2022-06-03T15:45:41.036 INFO:tasks.workunit.client.0.smithi033.stdout:Expected equality of these values:
2022-06-03T15:45:41.036 INFO:tasks.workunit.client.0.smithi033.stdout:  48u
2022-06-03T15:45:41.036 INFO:tasks.workunit.client.0.smithi033.stdout:    Which is: 48
2022-06-03T15:45:41.036 INFO:tasks.workunit.client.0.smithi033.stdout:  id_entry_map.size()
2022-06-03T15:45:41.037 INFO:tasks.workunit.client.0.smithi033.stdout:    Which is: 0
2022-06-03T15:45:41.037 INFO:tasks.workunit.client.0.smithi033.stdout:We should get 40 top-level entries and the tops of 8 "subdirectories".
2022-06-03T15:45:41.037 INFO:tasks.workunit.client.0.smithi033.stdout:[  FAILED  ] cls_rgw.index_list_delimited (40269 ms)

Actions #1

Updated by Casey Bodley almost 2 years ago

  • Status changed from New to Need More Info

was this run based on a recent main branch, or something else?

we're not seeing failures in the rgw suite. is it possible that the branch being tested had omap listing regressions in it?

Actions #2

Updated by Laura Flores almost 2 years ago

Casey Bodley wrote:

was this run based on a recent main branch, or something else?

we're not seeing failures in the rgw suite. is it possible that the branch being tested had omap listing regressions in it?

Yes, this was based on a recent main branch. The testing trello card is here: https://trello.com/c/MaWPkMXi/1544-wip-yuri7-testing-2022-06-02-1633

  • Edited to add that this was one of Yuri's test branches based on main. There was only one PR added to it though (linked in the Trello card) that would have had no influence on this failure, as it was a change to the telemetry module.
Actions #3

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-09_22:06:32-rados-wip-yuri3-testing-2022-06-09-1314-distro-default-smithi/6871566
/a/yuriw-2022-06-09_22:06:32-rados-wip-yuri3-testing-2022-06-09-1314-distro-default-smithi/6871409

Actions #4

Updated by Laura Flores almost 2 years ago

earliest bad run: http://pulpito.front.sepia.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/
last good run: http://pulpito.front.sepia.ceph.com/yuriw-2022-06-01_23:19:00-rados-wip-yuri8-testing-2022-06-01-1114-distro-default-smithi/

Was not able to reproduce this locally (this was done on the most updated version of main):

ninja vstart -j$(nproc)
ninja -j$(nproc) ceph_test_cls_rgw
RGW=2 ../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph_test_cls_rgw --gtest_filter=*index_list*

Running main() from gmock_main.cc
Note: Google Test filter = *index_list*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from cls_rgw
[ RUN      ] cls_rgw.index_list
[       OK ] cls_rgw.index_list (32 ms)
[ RUN      ] cls_rgw.index_list_delimited
[       OK ] cls_rgw.index_list_delimited (35440 ms)
[----------] 2 tests from cls_rgw (35472 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (37563 ms total)
[  PASSED  ] 2 tests.

Actions #5

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-13_16:36:31-rados-wip-yuri7-testing-2022-06-13-0706-distro-default-smithi/6876615

Actions #7

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-06-23_14:17:25-rados-wip-yuri6-testing-2022-06-22-1419-distro-default-smithi/6894628

Actions #8

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-06-23_14:17:25-rados-wip-yuri6-testing-2022-06-22-1419-distro-default-smithi/6894633

Actions #9

Updated by Laura Flores almost 2 years ago

  • Priority changed from Normal to High
Actions #10

Updated by Casey Bodley almost 2 years ago

it looks like this is coming from an upgrade test. can someone please identify the ceph versions of both this ceph_test_cls_rgw test and the osd(s) it's talking to?

Actions #11

Updated by Laura Flores almost 2 years ago

Kamoltat Sirivadhna wrote:

/a/yuriw-2022-06-23_14:17:25-rados-wip-yuri6-testing-2022-06-22-1419-distro-default-smithi/6894633

Based on this failed test:

2022-06-23T14:58:47.343 INFO:tasks.workunit.client.0.smithi162.stdout:[ RUN      ] cls_rgw.index_suggest_complete
2022-06-23T14:58:47.346 INFO:tasks.workunit.client.0.smithi162.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-336-g3515edfe/rpm/el8/BUILD/ceph-16.2.9-336-g3515edfe/src/test/cls_rgw/test_cls_rgw.cc:406: Failure
2022-06-23T14:58:47.346 INFO:tasks.workunit.client.0.smithi162.stdout:Expected equality of these values:
2022-06-23T14:58:47.347 INFO:tasks.workunit.client.0.smithi162.stdout:  1
2022-06-23T14:58:47.347 INFO:tasks.workunit.client.0.smithi162.stdout:  entries.size()
2022-06-23T14:58:47.347 INFO:tasks.workunit.client.0.smithi162.stdout:    Which is: 0
2022-06-23T14:58:47.347 INFO:tasks.workunit.client.0.smithi162.stdout:[  FAILED  ] cls_rgw.index_suggest_complete (4 ms)

Specifically this line:

2022-06-23T14:58:47.346 INFO:tasks.workunit.client.0.smithi162.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-336-g3515edfe/rpm/el8/BUILD/ceph-16.2.9-336-g3515edfe/src/test/cls_rgw/test_cls_rgw.cc:406: Failure

It looks like the tests are version 16.2.9.

A bit earlier in the teuthology log, we can see that the OSDs are already upgraded to 17.0.0:

2022-06-23T14:48:52.795 INFO:teuthology.orchestra.run.smithi162.stdout:{
2022-06-23T14:48:52.795 INFO:teuthology.orchestra.run.smithi162.stdout:    "mon": {
2022-06-23T14:48:52.795 INFO:teuthology.orchestra.run.smithi162.stdout:        "ceph version 17.0.0-13216-gfad4b1c2 (fad4b1c200ee6a758bd948f031903dd98c630b4c) quincy (dev)": 3
2022-06-23T14:48:52.795 INFO:teuthology.orchestra.run.smithi162.stdout:    },
2022-06-23T14:48:52.796 INFO:teuthology.orchestra.run.smithi162.stdout:    "mgr": {
2022-06-23T14:48:52.796 INFO:teuthology.orchestra.run.smithi162.stdout:        "ceph version 17.0.0-13216-gfad4b1c2 (fad4b1c200ee6a758bd948f031903dd98c630b4c) quincy (dev)": 2
2022-06-23T14:48:52.796 INFO:teuthology.orchestra.run.smithi162.stdout:    },
2022-06-23T14:48:52.796 INFO:teuthology.orchestra.run.smithi162.stdout:    "osd": {
2022-06-23T14:48:52.797 INFO:teuthology.orchestra.run.smithi162.stdout:        "ceph version 17.0.0-13216-gfad4b1c2 (fad4b1c200ee6a758bd948f031903dd98c630b4c) quincy (dev)": 8
2022-06-23T14:48:52.797 INFO:teuthology.orchestra.run.smithi162.stdout:    },
2022-06-23T14:48:52.797 INFO:teuthology.orchestra.run.smithi162.stdout:    "mds": {
2022-06-23T14:48:52.797 INFO:teuthology.orchestra.run.smithi162.stdout:        "ceph version 17.0.0-13216-gfad4b1c2 (fad4b1c200ee6a758bd948f031903dd98c630b4c) quincy (dev)": 2
2022-06-23T14:48:52.797 INFO:teuthology.orchestra.run.smithi162.stdout:    },
2022-06-23T14:48:52.798 INFO:teuthology.orchestra.run.smithi162.stdout:    "overall": {
2022-06-23T14:48:52.798 INFO:teuthology.orchestra.run.smithi162.stdout:        "ceph version 17.0.0-13216-gfad4b1c2 (fad4b1c200ee6a758bd948f031903dd98c630b4c) quincy (dev)": 15
2022-06-23T14:48:52.798 INFO:teuthology.orchestra.run.smithi162.stdout:    }
2022-06-23T14:48:52.798 INFO:teuthology.orchestra.run.smithi162.stdout:}

The mismatched versions could be the issue.

The other recorded instances seem to be following the same pattern.
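
For anyone triaging similar reports, here is a minimal sketch of the comparison described above: take the release embedded in the test binary's build path and compare it with the OSD versions reported in the `ceph versions` JSON. The helper names (`binary_major_from_path`, `osd_majors`, `is_mismatched`) are hypothetical and not part of teuthology or Ceph, and the sketch assumes the build path contains a `ceph-X.Y.Z` component as in the log lines quoted in this comment.

```python
import json
import re

# Hypothetical helpers for triage; not part of teuthology or Ceph.
# Assumes the build path contains a component like "ceph-16.2.9-336-g3515edfe".
BUILD_VERSION_RE = re.compile(r"ceph-(\d+)\.(\d+)\.(\d+)")
# Matches version strings such as "ceph version 17.0.0-13216-gfad4b1c2 (...)".
DAEMON_VERSION_RE = re.compile(r"ceph version (\d+)\.(\d+)\.(\d+)")


def binary_major_from_path(log_line):
    """Return the major version found in the build path, or None if absent."""
    match = BUILD_VERSION_RE.search(log_line)
    return int(match.group(1)) if match else None


def osd_majors(versions_json):
    """Return the set of major versions listed under "osd" in `ceph versions` output."""
    report = json.loads(versions_json)
    majors = set()
    for version_string in report.get("osd", {}):
        match = DAEMON_VERSION_RE.search(version_string)
        if match:
            majors.add(int(match.group(1)))
    return majors


def is_mismatched(log_line, versions_json):
    """True when the test binary's major version differs from any OSD's."""
    test_major = binary_major_from_path(log_line)
    return test_major is not None and osd_majors(versions_json) != {test_major}


# Sample inputs trimmed from the log excerpts quoted in this comment.
failure_line = (
    "BUILD/ceph-16.2.9-336-g3515edfe/src/test/cls_rgw/test_cls_rgw.cc:406: Failure"
)
versions_json = (
    '{"osd": {"ceph version 17.0.0-13216-gfad4b1c2 '
    '(fad4b1c200ee6a758bd948f031903dd98c630b4c) quincy (dev)": 8}}'
)

print(is_mismatched(failure_line, versions_json))  # prints True
```

Comparing only the major version is a simplification chosen for this sketch; it flags a pacific (16.x) test binary talking to quincy (17.x) OSDs but would not notice differences between point releases.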

Actions #12

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-06-30_14:20:05-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/[6907404, 6907413]

Actions #13

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-06-29_13:30:16-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/6905612

Actions #14

Updated by Sridhar Seshasayee almost 2 years ago

/a/yuriw-2022-06-29_18:22:37-rados-wip-yuri2-testing-2022-06-29-0820-distro-default-smithi/6906109
/a/yuriw-2022-06-29_18:22:37-rados-wip-yuri2-testing-2022-06-29-0820-distro-default-smithi/6906268

Actions #15

Updated by Casey Bodley almost 2 years ago

  • Status changed from Need More Info to New
  • Assignee set to J. Eric Ivancich
Actions #16

Updated by Laura Flores almost 2 years ago

This time, index_suggest_complete failed in addition to the other two.

/a/nojha-2022-07-14_21:55:41-upgrade:pacific-x-snapshot_key_conversion-distro-default-smithi/6931111

2022-07-14T23:55:55.783 INFO:tasks.workunit.client.0.smithi186.stdout:/build/ceph-16.2.9-490-ge27cc18f/src/test/cls_rgw/test_cls_rgw.cc:406: Failure
2022-07-14T23:55:55.785 INFO:tasks.workunit.client.0.smithi186.stdout:Expected equality of these values:
2022-07-14T23:55:55.785 INFO:tasks.workunit.client.0.smithi186.stdout:  1
2022-07-14T23:55:55.785 INFO:tasks.workunit.client.0.smithi186.stdout:  entries.size()
2022-07-14T23:55:55.785 INFO:tasks.workunit.client.0.smithi186.stdout:    Which is: 0
2022-07-14T23:55:55.786 INFO:tasks.workunit.client.0.smithi186.stdout:[  FAILED  ] cls_rgw.index_suggest_complete (6 ms)

Actions #17

Updated by Aishwarya Mathuria almost 2 years ago

/a/yuriw-2022-07-13_19:41:18-rados-wip-yuri7-testing-2022-07-11-1631-distro-default-smithi/6929404

index_suggest_complete failed here too


2022-07-14T04:31:35.586 INFO:tasks.workunit.client.0.smithi163.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-488-gc5e2739f/rpm/el8/BUILD/ceph-16.2.9-488-gc5e2739f/src/test/cls_rgw/test_cls_rgw.cc:406: Failure
2022-07-14T04:31:35.587 INFO:tasks.workunit.client.0.smithi163.stdout:Expected equality of these values:
2022-07-14T04:31:35.587 INFO:tasks.workunit.client.0.smithi163.stdout:  1
2022-07-14T04:31:35.587 INFO:tasks.workunit.client.0.smithi163.stdout:  entries.size()
2022-07-14T04:31:35.588 INFO:tasks.workunit.client.0.smithi163.stdout:    Which is: 0
2022-07-14T04:31:35.588 INFO:tasks.workunit.client.0.smithi163.stdout:[  FAILED  ] cls_rgw.index_suggest_complete (3 ms)
2022-07-14T04:31:35.588 INFO:tasks.workunit.client.0.smithi163.stdout:[ RUN      ] cls_rgw.index_list
2022-07-14T04:31:35.623 INFO:tasks.workunit.client.0.smithi163.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-488-gc5e2739f/rpm/el8/BUILD/ceph-16.2.9-488-gc5e2739f/src/test/cls_rgw/test_cls_rgw.cc:504: Failure
2022-07-14T04:31:35.623 INFO:tasks.workunit.client.0.smithi163.stdout:Expected equality of these values:
2022-07-14T04:31:35.624 INFO:tasks.workunit.client.0.smithi163.stdout:  4u
2022-07-14T04:31:35.624 INFO:tasks.workunit.client.0.smithi163.stdout:    Which is: 4
2022-07-14T04:31:35.625 INFO:tasks.workunit.client.0.smithi163.stdout:  m.size()
2022-07-14T04:31:35.625 INFO:tasks.workunit.client.0.smithi163.stdout:    Which is: 0
2022-07-14T04:31:35.626 INFO:tasks.workunit.client.0.smithi163.stdout:[  FAILED  ] cls_rgw.index_list (21 ms)
2022-07-14T04:32:16.633 INFO:tasks.workunit.client.0.smithi163.stdout:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-488-gc5e2739f/rpm/el8/BUILD/ceph-16.2.9-488-gc5e2739f/src/test/cls_rgw/test_cls_rgw.cc:589: Failure
2022-07-14T04:32:16.634 INFO:tasks.workunit.client.0.smithi163.stdout:Expected equality of these values:
2022-07-14T04:32:16.635 INFO:tasks.workunit.client.0.smithi163.stdout:  48u
2022-07-14T04:32:16.635 INFO:tasks.workunit.client.0.smithi163.stdout:    Which is: 48
2022-07-14T04:32:16.636 INFO:tasks.workunit.client.0.smithi163.stdout:  id_entry_map.size()
2022-07-14T04:32:16.636 INFO:tasks.workunit.client.0.smithi163.stdout:    Which is: 0
2022-07-14T04:32:16.636 INFO:tasks.workunit.client.0.smithi163.stdout:We should get 40 top-level entries and the tops of 8 "subdirectories".
2022-07-14T04:32:16.637 INFO:tasks.workunit.client.0.smithi163.stdout:[  FAILED  ] cls_rgw.index_list_delimited (41026 ms)
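
To summarize which cases failed in a teuthology log like the one above, a small sketch along these lines may help. The function name `failed_tests` is hypothetical, and the pattern assumes the per-test gtest result format seen in this ticket (`[  FAILED  ] name (N ms)`); result lines without a duration are deliberately not matched.

```python
import re

# Matches per-test gtest result lines such as
# "[  FAILED  ] cls_rgw.index_list (21 ms)".
# Lines without a "(N ms)" duration are intentionally skipped.
FAILED_RE = re.compile(r"\[\s+FAILED\s+\]\s+(\S+)\s+\((\d+) ms\)")


def failed_tests(log_text):
    """Return (test_name, duration_ms) for each failed gtest case, in log order."""
    return [(m.group(1), int(m.group(2))) for m in FAILED_RE.finditer(log_text)]


# Sample lines copied from the log excerpt in this comment.
sample_log = """\
2022-07-14T04:31:35.588 INFO:tasks.workunit.client.0.smithi163.stdout:[  FAILED  ] cls_rgw.index_suggest_complete (3 ms)
2022-07-14T04:31:35.588 INFO:tasks.workunit.client.0.smithi163.stdout:[ RUN      ] cls_rgw.index_list
2022-07-14T04:31:35.626 INFO:tasks.workunit.client.0.smithi163.stdout:[  FAILED  ] cls_rgw.index_list (21 ms)
2022-07-14T04:32:16.637 INFO:tasks.workunit.client.0.smithi163.stdout:[  FAILED  ] cls_rgw.index_list_delimited (41026 ms)
"""

for name, duration_ms in failed_tests(sample_log):
    print(f"{name}: {duration_ms} ms")
```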

Actions #18

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943905

Actions #19

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6944371/

Actions #20

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-07-24_15:38:21-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6946367/

Actions #21

Updated by Casey Bodley over 1 year ago

Laura Flores wrote:

It looks like the tests are version 16.2.9.

A bit earlier in the teuthology log, we can see that the OSDs are already upgraded to 17.0.0:
[...]

The mismatched versions could be the issue.

thanks Laura. when we fix bugs in cls/rgw, we update ceph_test_cls_rgw to test that updated behavior. so in general, we just can't expect test cases from one release to pass against another release

i looked into these upgrade suites to see how they're running ceph_test_cls_rgw, and it seems to be packaged with other tests in the 'cls' workunit:

https://github.com/ceph/ceph/blob/e9d361f6/qa/suites/upgrade/octopus-x/parallel/workload/rados_api.yaml#L11

we shouldn't be trying to run these during the upgrade; instead, we might run them before the upgrade, then again after

Actions #22

Updated by Laura Flores over 1 year ago

@Casey, I see what you mean. The issue here does seem to be that the RGW workload is running during the upgrade, which is causing the version mismatch problem. I checked all upgrade/parallel tests though, even back on stable branches, and it seems like we have always run workloads in parallel with the upgrade sequence. This pattern never changes among upgrade/parallel tests:

https://github.com/ceph/ceph/blob/main/qa/suites/upgrade/pacific-x/parallel/1-tasks.yaml

- print: "**** done start parallel" 
- parallel:
    - workload
    - upgrade-sequence
- print: "**** done end parallel" 

This implies to me that the workloads are supposed to work when run during the upgrade sequence. The only way around it that I can see is by setting up a sequential task, such as:

- print: "**** done start parallel" 
- sequential:
    - workload
    - upgrade-sequence
- print: "**** done end parallel" 

But that would defeat the purpose of the parallel test.

Actions #23

Updated by Casey Bodley over 1 year ago

  • Pull request ID set to 47482
Actions #24

Updated by Casey Bodley over 1 year ago

sharing what i could find about the history here:

the cls_rgw.index_suggest_complete test was added for https://tracker.ceph.com/issues/54528, which has been backported to octopus. it isn't clear why that would fail, unless we were running a pacific version of ceph_test_cls_rgw against an octopus osd before that octopus backport was applied

cls_rgw.index_list_delimited was added for https://tracker.ceph.com/issues/41051, which i believe merged before octopus. i don't see any significant changes to cls_rgw.index_list there. unclear why either test would fail in upgrade suites

Actions #25

Updated by Laura Flores over 1 year ago

@Casey if it would help, here is the last good run I could find, and the earliest bad run:

last good run: earliest bad run:

Running `git log --pretty=oneline --no-merges 513a3ce033e61b54e2727a6a27915fd798082922..9c982c6b65fc320a11d31aced63ff0af50067d91 src/rgw` shows a lot of commits, but most (if not all) are coming from https://github.com/ceph/ceph/pull/39002. You'd know best, but this seems like it was maybe a big feature that was just intended for Quincy?

Point being, if this would be a minor change we could backport, it may make more sense to go that route. But if the change is coming from a large Quincy feature, it definitely wouldn't make sense to run the workload during the upgrade.

Actions #26

Updated by Ernesto Puerta over 1 year ago

  • Tags set to test-failure
Actions #27

Updated by Laura Flores over 1 year ago

/a/lflores-2022-08-25_17:56:48-rados-wip-yuri11-testing-2022-08-24-0658-distro-default-smithi/6993001

Actions #28

Updated by Casey Bodley about 1 year ago

  • Status changed from New to Can't reproduce