Bug #55444
test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test
0%
Description
Description: rados/upgrade/nautilus-x-singleton/{0-cluster/{openstack start} 1-install/nautilus 2-partial-upgrade/firsthalf 3-thrash/default 4-workload/{rbd-cls rbd-import-export readwrite snaps-few-objects} 5-workload/{radosbench rbd_api} 6-finish-upgrade 7-pacific 8-workload/{rbd-python snaps-many-objects} bluestore-bitmap mon_election/classic thrashosds-health ubuntu_18.04}
/a/lflores-2022-04-22_20:48:19-rados-wip-55324-pacific-backport-distro-default-smithi/6801098
2022-04-23T08:54:27.447 INFO:tasks.workunit.client.0.smithi084.stdout:[ RUN ] TestClsRbd.directory_methods
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:297: Failure
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout: Expected: -16
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: dir_state_set(&ioctx, oid, cls::rbd::DIRECTORY_STATE_ADD_DISABLED)
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout: Which is: 0
2022-04-23T08:54:27.466 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.directory_methods (18 ms)
...
2022-04-23T08:54:27.633 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:750: Failure
2022-04-23T08:54:27.633 INFO:tasks.workunit.client.0.smithi084.stdout: Expected: 0
2022-04-23T08:54:27.633 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: get_parent(&ioctx, oid, 10, &pspec, &size)
2022-04-23T08:54:27.634 INFO:tasks.workunit.client.0.smithi084.stdout: Which is: -22
2022-04-23T08:54:27.634 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.parents_v1 (45 ms)
...
2022-04-23T08:54:27.729 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:1008: Failure
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout: Expected: 1u
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout: Which is: 1
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: snapc.snaps.size()
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout: Which is: 0
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.snapshots (6 ms)
...
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:1437: Failure
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout: Expected: 2U
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout: Which is: 2
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: pairs.size()
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout: Which is: 0
2022-04-23T08:54:27.779 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.metadata (6 ms)
... + 22 more failed tests
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[==========] 67 tests from 1 test case ran. (22012 ms total)
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ PASSED ] 41 tests.
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] 26 tests, listed below:
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.directory_methods
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.parents_v1
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.snapshots
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.metadata
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.mirror
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.mirror_image
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.mirror_image_status
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.mirror_image_map
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_dir_list
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_dir_add
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.dir_add_already_existing
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_dir_rename
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_dir_remove
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_dir_remove_missing
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_image_add
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_image_remove
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_image_list
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_image_clean
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.image_group_add
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_snap_set_duplicate_name
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_snap_set
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_snap_list
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.group_snap_remove
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.trash_methods
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.clone_child
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[ FAILED ] TestClsRbd.namespace_methods
Related issues
History
#1 Updated by Laura Flores almost 2 years ago
- Subject changed from TestClsRbd: multiple test failures during upgrade test to test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test
#2 Updated by Yuri Weinstein almost 2 years ago
- Priority changed from Normal to Urgent
This is for 16.2.8 release
Seeing in https://pulpito.ceph.com/yuriw-2022-04-27_14:24:25-upgrade:octopus-x-pacific-distro-default-smithi/
#3 Updated by Ilya Dryomov almost 2 years ago
This may sound crazy but the only explanation that I'm able to come up with for these failures is that the octopus -> pacific upgraded OSDs are not persisting omap updates. I reran another, simpler, job from Yuri's run 5 times and the failures are exactly the same:
https://pulpito.ceph.com/dis-2022-04-28_17:15:07-upgrade:octopus-x-pacific-distro-default-smithi/
2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[==========] 27 tests from 1 test suite ran. (2842 ms total) 2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[ PASSED ] 20 tests. 2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] 7 tests, listed below: 2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.ClientRegister 2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.ClientUnregisterPruneTags 2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.ClientCommit 2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.ClientList 2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.TagCreate 2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.TagCreatePrunesTags 2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[ FAILED ] TestClsJournal.TagList
Honing in on TestClsJournal.ClientRegister failure, it is really simple:
[ RUN ] TestClsJournal.ClientRegister /build/ceph-15.2.16/src/test/cls_journal/test_cls_journal.cc:249: Failure Expected equality of these values: expected_clients Which is: { [id=id1, commit_position=[positions=[]], state=connected] } clients Which is: {} [ FAILED ] TestClsJournal.ClientRegister (23 ms)
TEST_F(TestClsJournal, ClientRegister) { librados::IoCtx ioctx; ASSERT_EQ(0, _rados.ioctx_create(_pool_name.c_str(), ioctx)); std::string oid = get_temp_image_name(); ASSERT_EQ(0, client::create(ioctx, oid, 2, 4, ioctx.get_id())); // create image1 object, set some keys on it ASSERT_EQ(0, client::client_register(ioctx, oid, "id1", bufferlist())); // set client_id1 key on the same object std::set<Client> clients; ASSERT_EQ(0, client::client_list(ioctx, oid, &clients)); // read client_id1 key on the same object -- it appears that such key doesn't exist! // and the keys that were supposed to be set by client::create don't seem to be there either... std::set<Client> expected_clients = {Client("id1", bufferlist())}; ASSERT_EQ(expected_clients, clients); // assert fires because |clients| set remained empty }
#4 Updated by Ilya Dryomov almost 2 years ago
- Assignee set to Neha Ojha
#5 Updated by Ilya Dryomov almost 2 years ago
Yup, omap is definitely involved. LibRadosAio.OmapPP passes against fresh pacific OSDs and fails against upgraded OSDs:
$ ceph_test_rados_api_aio_pp --gtest_filter=LibRadosAio.OmapPP Running main() from gmock_main.cc Note: Google Test filter = LibRadosAio.OmapPP [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from LibRadosAio [ RUN ] LibRadosAio.OmapPP /build/ceph-15.2.16/src/test/librados/aio_cxx.cc:1090: Failure Expected equality of these values: set_got.size() Which is: 0 (unsigned)1 Which is: 1 [ FAILED ] LibRadosAio.OmapPP (1725 ms) [----------] 1 test from LibRadosAio (1726 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test suite ran. (1726 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] LibRadosAio.OmapPP 1 FAILED TEST
#6 Updated by Neha Ojha almost 2 years ago
Ilya, can you run https://github.com/ceph/ceph-ci/commits/wip-55444-pacific through your reproducer?
#7 Updated by Ilya Dryomov almost 2 years ago
- Regression changed from No to Yes
All green:
(noting that wip-55444-pacific reverts https://github.com/ceph/ceph/pull/45963)
#8 Updated by Ilya Dryomov almost 2 years ago
- Related to Bug #55324: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones added
#9 Updated by Ilya Dryomov almost 2 years ago
Ilya Dryomov wrote:
This may sound crazy but the only explanation that I'm able to come up with for these failures is that the octopus -> pacific upgraded OSDs are not persisting omap updates.
Looking at https://github.com/ceph/ceph/pull/45963, omap updates are probably persisted correctly but listomapkeys just refuses to list anything after the fact.
#10 Updated by Neha Ojha almost 2 years ago
Can you check if the same test also fails in master, which includes https://github.com/ceph/ceph/pull/45904? I'm going to go ahead with merging the Pacific revert, but would like to understand if we also need to revert it in master.
#11 Updated by Neha Ojha almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 46092
#12 Updated by Neha Ojha almost 2 years ago
Ilya Dryomov wrote:
Yup, omap is definitely involved. LibRadosAio.OmapPP passes against fresh pacific OSDs and fails against upgraded OSDs:
[...]
What confuses me is that this test passed in the rados run of the https://github.com/ceph/ceph/pull/45963,
2022-04-23T08:00:25.014 INFO:tasks.workunit.client.0.smithi059.stdout: api_aio_pp: [ RUN ] LibRadosAio.OmapPP 2022-04-23T08:00:25.014 INFO:tasks.workunit.client.0.smithi059.stdout: api_aio_pp: [ OK ] LibRadosAio.OmapPP (2939 ms)
#13 Updated by Ilya Dryomov almost 2 years ago
It is triggered only with upgraded OSDs.
#14 Updated by Neha Ojha almost 2 years ago
Ilya Dryomov wrote:
It is triggered only with upgraded OSDs.
Ah, that makes sense.
#15 Updated by Adam Kupczyk almost 2 years ago
Neha Ojha wrote:
Ilya Dryomov wrote:
It is triggered only with upgraded OSDs.
Ah, that makes sense.
Not for me, at least not now.
#16 Updated by Neha Ojha almost 2 years ago
In /a/yuriw-2022-04-27_14:24:25-upgrade:octopus-x-pacific-distro-default-smithi/6808913, we are using an octopus version of the workunit
- workunit: branch: octopus clients: client.0: - cls/test_cls_rbd.sh - print: '**** done cls/test_cls_rbd.sh 5-workload'
Half of the OSDs were upgraded
2022-04-27T18:55:29.564 INFO:teuthology.orchestra.run.smithi006.stdout:{ 2022-04-27T18:55:29.564 INFO:teuthology.orchestra.run.smithi006.stdout: "mon": { 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 3 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: }, 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "mgr": { 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 2 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: }, 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "osd": { 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 15.2.16-79-gb51d8665 (b51d866559c7cde15d3b75ee17bbba8fdf2584be) octopus (stable)": 4, 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 4 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: }, 2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout: "mds": { 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 15.2.16-79-gb51d8665 (b51d866559c7cde15d3b75ee17bbba8fdf2584be) octopus (stable)": 2 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout: }, 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout: "overall": { 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 15.2.16-79-gb51d8665 (b51d866559c7cde15d3b75ee17bbba8fdf2584be) octopus (stable)": 6, 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout: "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 9 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout: } 2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:}
2022-04-27T19:24:46.094 INFO:tasks.workunit.client.0.smithi006.stdout:[ RUN ] TestClsRbd.parents_v1 2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16/rpm/el8/BUILD/ceph-15.2.16/src/test/cls_rbd/test_cls_rbd.cc:826: Failure 2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:Expected equality of these values: 2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout: 0 2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout: get_parent(&ioctx, oid, 10, &pspec, &size) 2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout: Which is: -22 2022-04-27T19:24:46.118 INFO:tasks.workunit.client.0.smithi006.stdout:[ FAILED ] TestClsRbd.parents_v1 (23 ms)
#17 Updated by Ilya Dryomov almost 2 years ago
Half of the OSDs were upgraded
Just noting that the "half" part is irrelevant -- in my simpler run linked in https://tracker.ceph.com/issues/55444#note-3 all OSDs were upgraded (which is one of the reasons I chose that job to focus on).
#18 Updated by Ilya Dryomov almost 2 years ago
Neha Ojha wrote:
Can you check if the same test also fails in master, which includes https://github.com/ceph/ceph/pull/45904? I'm going to go ahead with merging the Pacific revert, but would like to understand if we also need to revert it in master.
Upgrading octopus OSDs to master also triggers it -- exactly the same as upgrading octopus OSDs to pacific-with-pr-45963.
#19 Updated by Ilya Dryomov almost 2 years ago
... however upgrading pacific-without-pr-45963 OSDs to master does NOT trigger it.
Could this be related to the switch to per-PG omap in pacific? Perhaps Cory's improvement just doesn't consider the old naming scheme?
#20 Updated by Neha Ojha almost 2 years ago
- Assignee changed from Neha Ojha to Adam Kupczyk
#21 Updated by Neha Ojha almost 2 years ago
- Project changed from rbd to bluestore
#22 Updated by Yuri Weinstein almost 2 years ago
#24 Updated by Ilya Dryomov almost 2 years ago
- Pull request ID changed from 46092 to 46095
https://github.com/ceph/ceph/pull/46092 was the pacific revert, changing to Adam's fix PR
#25 Updated by Ilya Dryomov almost 2 years ago
- Backport changed from pacific to pacific,quincy
#26 Updated by Neha Ojha almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
#27 Updated by Backport Bot almost 2 years ago
- Copied to Backport #55517: quincy: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test added
#28 Updated by Backport Bot almost 2 years ago
- Copied to Backport #55518: pacific: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test added
#29 Updated by Backport Bot over 1 year ago
- Tags set to backport_processed