Project

General

Profile

Bug #55444

test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test

Added by Laura Flores 5 months ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/upgrade/nautilus-x-singleton/{0-cluster/{openstack start} 1-install/nautilus 2-partial-upgrade/firsthalf 3-thrash/default 4-workload/{rbd-cls rbd-import-export readwrite snaps-few-objects} 5-workload/{radosbench rbd_api} 6-finish-upgrade 7-pacific 8-workload/{rbd-python snaps-many-objects} bluestore-bitmap mon_election/classic thrashosds-health ubuntu_18.04}

/a/lflores-2022-04-22_20:48:19-rados-wip-55324-pacific-backport-distro-default-smithi/6801098

2022-04-23T08:54:27.447 INFO:tasks.workunit.client.0.smithi084.stdout:[ RUN      ] TestClsRbd.directory_methods
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:297: Failure
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout:      Expected: -16
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: dir_state_set(&ioctx, oid, cls::rbd::DIRECTORY_STATE_ADD_DISABLED)
2022-04-23T08:54:27.465 INFO:tasks.workunit.client.0.smithi084.stdout:      Which is: 0
2022-04-23T08:54:27.466 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.directory_methods (18 ms)

...

2022-04-23T08:54:27.633 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:750: Failure
2022-04-23T08:54:27.633 INFO:tasks.workunit.client.0.smithi084.stdout:      Expected: 0
2022-04-23T08:54:27.633 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: get_parent(&ioctx, oid, 10, &pspec, &size)
2022-04-23T08:54:27.634 INFO:tasks.workunit.client.0.smithi084.stdout:      Which is: -22
2022-04-23T08:54:27.634 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.parents_v1 (45 ms)

...

2022-04-23T08:54:27.729 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:1008: Failure
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:      Expected: 1u
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:      Which is: 1
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: snapc.snaps.size()
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:      Which is: 0
2022-04-23T08:54:27.730 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.snapshots (6 ms)

...

2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:/build/ceph-14.2.22/src/test/cls_rbd/test_cls_rbd.cc:1437: Failure
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:      Expected: 2U
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:      Which is: 2
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:To be equal to: pairs.size()
2022-04-23T08:54:27.778 INFO:tasks.workunit.client.0.smithi084.stdout:      Which is: 0
2022-04-23T08:54:27.779 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.metadata (6 ms)

... + 22 more failed tests

2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[==========] 67 tests from 1 test case ran. (22012 ms total)
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  PASSED  ] 41 tests.
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] 26 tests, listed below:
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.directory_methods
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.parents_v1
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.snapshots
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.metadata
2022-04-23T08:54:39.861 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.mirror
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.mirror_image
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.mirror_image_status
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.mirror_image_map
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_dir_list
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_dir_add
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.dir_add_already_existing
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_dir_rename
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_dir_remove
2022-04-23T08:54:39.862 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_dir_remove_missing
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_image_add
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_image_remove
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_image_list
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_image_clean
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.image_group_add
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_snap_set_duplicate_name
2022-04-23T08:54:39.863 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_snap_set
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_snap_list
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.group_snap_remove
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.trash_methods
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.clone_child
2022-04-23T08:54:39.864 INFO:tasks.workunit.client.0.smithi084.stdout:[  FAILED  ] TestClsRbd.namespace_methods


Related issues

Related to bluestore - Bug #55324: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones Pending Backport
Copied to bluestore - Backport #55517: quincy: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test New
Copied to bluestore - Backport #55518: pacific: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test Resolved

History

#1 Updated by Laura Flores 5 months ago

  • Subject changed from TestClsRbd: multiple test failures during upgrade test to test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test

#2 Updated by Yuri Weinstein 5 months ago

  • Priority changed from Normal to Urgent

#3 Updated by Ilya Dryomov 5 months ago

This may sound crazy but the only explanation that I'm able to come up with for these failures is that the octopus -> pacific upgraded OSDs are not persisting omap updates. I reran another, simpler, job from Yuri's run 5 times and the failures are exactly the same:

https://pulpito.ceph.com/dis-2022-04-28_17:15:07-upgrade:octopus-x-pacific-distro-default-smithi/

2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[==========] 27 tests from 1 test suite ran. (2842 ms total)
2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[  PASSED  ] 20 tests.
2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] 7 tests, listed below:
2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.ClientRegister
2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.ClientUnregisterPruneTags
2022-04-28T18:14:46.196 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.ClientCommit
2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.ClientList
2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.TagCreate
2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.TagCreatePrunesTags
2022-04-28T18:14:46.197 INFO:tasks.workunit.client.0.smithi156.stdout:[  FAILED  ] TestClsJournal.TagList

Honing in on TestClsJournal.ClientRegister failure, it is really simple:

[ RUN      ] TestClsJournal.ClientRegister
/build/ceph-15.2.16/src/test/cls_journal/test_cls_journal.cc:249: Failure
Expected equality of these values:
  expected_clients
    Which is: { [id=id1, commit_position=[positions=[]], state=connected] }
  clients
    Which is: {}
[  FAILED  ] TestClsJournal.ClientRegister (23 ms)
TEST_F(TestClsJournal, ClientRegister) {
  librados::IoCtx ioctx;
  ASSERT_EQ(0, _rados.ioctx_create(_pool_name.c_str(), ioctx));

  std::string oid = get_temp_image_name();
  ASSERT_EQ(0, client::create(ioctx, oid, 2, 4, ioctx.get_id()));                      // create image1 object, set some keys on it

  ASSERT_EQ(0, client::client_register(ioctx, oid, "id1", bufferlist()));              // set client_id1 key on the same object

  std::set<Client> clients;
  ASSERT_EQ(0, client::client_list(ioctx, oid, &clients));                             // read client_id1 key on the same object -- it appears that such key doesn't exist!
                                                                                       //   and the keys that were supposed to be set by client::create don't seem to be there either...

  std::set<Client> expected_clients = {Client("id1", bufferlist())};
  ASSERT_EQ(expected_clients, clients);                                                // assert fires because |clients| set remained empty
}

#4 Updated by Ilya Dryomov 5 months ago

  • Assignee set to Neha Ojha

#5 Updated by Ilya Dryomov 5 months ago

Yup, omap is definitely involved. LibRadosAio.OmapPP passes against fresh pacific OSDs and fails against upgraded OSDs:

$ ceph_test_rados_api_aio_pp --gtest_filter=LibRadosAio.OmapPP
Running main() from gmock_main.cc
Note: Google Test filter = LibRadosAio.OmapPP
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibRadosAio
[ RUN      ] LibRadosAio.OmapPP
/build/ceph-15.2.16/src/test/librados/aio_cxx.cc:1090: Failure
Expected equality of these values:
  set_got.size()
    Which is: 0
  (unsigned)1
    Which is: 1
[  FAILED  ] LibRadosAio.OmapPP (1725 ms)
[----------] 1 test from LibRadosAio (1726 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1726 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] LibRadosAio.OmapPP

 1 FAILED TEST

#6 Updated by Neha Ojha 5 months ago

Ilya, can you run https://github.com/ceph/ceph-ci/commits/wip-55444-pacific through your reproducer?

#8 Updated by Ilya Dryomov 5 months ago

  • Related to Bug #55324: rocksdb omap iterators become extremely slow in the presence of large delete range tombstones added

#9 Updated by Ilya Dryomov 5 months ago

Ilya Dryomov wrote:

This may sound crazy but the only explanation that I'm able to come up with for these failures is that the octopus -> pacific upgraded OSDs are not persisting omap updates.

Looking at https://github.com/ceph/ceph/pull/45963, omap updates are probably persisted correctly but listomapkeys just refuses to list anything after the fact.

#10 Updated by Neha Ojha 5 months ago

Can you check if the same test also fails in master, which includes https://github.com/ceph/ceph/pull/45904? I'm going to go ahead with merging the Pacific revert, but would like to understand if we also need to revert it in master.

#11 Updated by Neha Ojha 5 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46092

#12 Updated by Neha Ojha 5 months ago

Ilya Dryomov wrote:

Yup, omap is definitely involved. LibRadosAio.OmapPP passes against fresh pacific OSDs and fails against upgraded OSDs:

[...]

What confuses me is that this test passed in the rados run of the https://github.com/ceph/ceph/pull/45963,

http://pulpito.front.sepia.ceph.com/lflores-2022-04-22_20:48:19-rados-wip-55324-pacific-backport-distro-default-smithi/6801057/

2022-04-23T08:00:25.014 INFO:tasks.workunit.client.0.smithi059.stdout:               api_aio_pp: [ RUN      ] LibRadosAio.OmapPP
2022-04-23T08:00:25.014 INFO:tasks.workunit.client.0.smithi059.stdout:               api_aio_pp: [       OK ] LibRadosAio.OmapPP (2939 ms)

#13 Updated by Ilya Dryomov 5 months ago

It is triggered only with upgraded OSDs.

#14 Updated by Neha Ojha 5 months ago

Ilya Dryomov wrote:

It is triggered only with upgraded OSDs.

Ah, that makes sense.

#15 Updated by Adam Kupczyk 5 months ago

Neha Ojha wrote:

Ilya Dryomov wrote:

It is triggered only with upgraded OSDs.

Ah, that makes sense.

Not for me, at least not now.

#16 Updated by Neha Ojha 5 months ago

In /a/yuriw-2022-04-27_14:24:25-upgrade:octopus-x-pacific-distro-default-smithi/6808913, we are using an octopus version of the workunit

  - workunit:
      branch: octopus
      clients:
        client.0:
        - cls/test_cls_rbd.sh
  - print: '**** done cls/test_cls_rbd.sh 5-workload'

Half of the OSDs were upgraded

2022-04-27T18:55:29.564 INFO:teuthology.orchestra.run.smithi006.stdout:{
2022-04-27T18:55:29.564 INFO:teuthology.orchestra.run.smithi006.stdout:    "mon": {
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 3
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:    },
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:    "mgr": {
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 2
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:    },
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:    "osd": {
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 15.2.16-79-gb51d8665 (b51d866559c7cde15d3b75ee17bbba8fdf2584be) octopus (stable)": 4,
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 4
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:    },
2022-04-27T18:55:29.565 INFO:teuthology.orchestra.run.smithi006.stdout:    "mds": {
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 15.2.16-79-gb51d8665 (b51d866559c7cde15d3b75ee17bbba8fdf2584be) octopus (stable)": 2
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:    },
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:    "overall": {
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 15.2.16-79-gb51d8665 (b51d866559c7cde15d3b75ee17bbba8fdf2584be) octopus (stable)": 6,
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:        "ceph version 16.2.7-969-g4fa079ba (4fa079ba14503defa8dc257d7c2d506ebefcfe6d) pacific (stable)": 9
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:    }
2022-04-27T18:55:29.566 INFO:teuthology.orchestra.run.smithi006.stdout:}
2022-04-27T19:24:46.094 INFO:tasks.workunit.client.0.smithi006.stdout:[ RUN      ] TestClsRbd.parents_v1
2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16/rpm/el8/BUILD/ceph-15.2.16/src/test/cls_rbd/test_cls_rbd.cc:826: Failure
2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:Expected equality of these values:
2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:  0
2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:  get_parent(&ioctx, oid, 10, &pspec, &size)
2022-04-27T19:24:46.117 INFO:tasks.workunit.client.0.smithi006.stdout:    Which is: -22
2022-04-27T19:24:46.118 INFO:tasks.workunit.client.0.smithi006.stdout:[  FAILED  ] TestClsRbd.parents_v1 (23 ms)

#17 Updated by Ilya Dryomov 5 months ago

Half of the OSDs were upgraded

Just noting that the "half" part is irrelevant -- in my simpler run linked in https://tracker.ceph.com/issues/55444#note-3 all OSDs were upgraded (which is one of the reasons I chose that job to focus on).

#18 Updated by Ilya Dryomov 5 months ago

Neha Ojha wrote:

Can you check if the same test also fails in master, which includes https://github.com/ceph/ceph/pull/45904? I'm going to go ahead with merging the Pacific revert, but would like to understand if we also need to revert it in master.

Upgrading octopus OSDs to master also triggers it -- exactly the same as upgrading octopus OSDs to pacific-with-pr-45963.

#19 Updated by Ilya Dryomov 5 months ago

... however upgrading pacific-without-pr-45963 OSDs to master does NOT trigger it.

Could this be related to the switch to per-PG omap in pacific? Perhaps Cory's improvement just doesn't consider the old naming scheme?

#20 Updated by Neha Ojha 5 months ago

  • Assignee changed from Neha Ojha to Adam Kupczyk

#21 Updated by Neha Ojha 5 months ago

  • Project changed from rbd to bluestore

#24 Updated by Ilya Dryomov 5 months ago

  • Pull request ID changed from 46092 to 46095

https://github.com/ceph/ceph/pull/46092 was the pacific revert, changing to Adam's fix PR

#25 Updated by Ilya Dryomov 5 months ago

  • Backport changed from pacific to pacific,quincy

#26 Updated by Neha Ojha 5 months ago

  • Status changed from Fix Under Review to Pending Backport

#27 Updated by Backport Bot 5 months ago

  • Copied to Backport #55517: quincy: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test added

#28 Updated by Backport Bot 5 months ago

  • Copied to Backport #55518: pacific: test_cls_rbd.sh: multiple TestClsRbd failures during upgrade test added

#29 Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed

Also available in: Atom PDF