Bug #64514

LibRadosTwoPoolsPP.PromoteSnapScrub test failed

Added by Aishwarya Mathuria 3 months ago. Updated about 2 months ago.

Status:
Duplicate
Priority:
Normal
Category:
Tiering
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In rados_api_tests:

2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsPP.PromoteSnapScrub
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: my_snaps [3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: my_snaps [4,3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: my_snaps [5,4,3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: my_snaps [6,5,4,3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: promoting some heads
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: promoting from clones for snap 6
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: promoting from clones for snap 5
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: promoting from clones for snap 4
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: promoting from clones for snap 3
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: ./src/test/librados/tier_cxx.cc:673: Failure
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: Expected equality of these values:
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp:   1
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp:   ioctx.read(o, bl, 1, 0)
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp:     Which is: -2
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout:              api_tier_pp: [  FAILED  ] LibRadosTwoPoolsPP.PromoteSnapScrub (298035 ms)

/a/yuriw-2024-02-14_14:58:57-rados-wip-yuri4-testing-2024-02-13-1546-distro-default-smithi/7559884


Related issues 1 (1 open, 0 closed)

Is duplicate of RADOS - Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps) (Pending Backport, Matan Breizman)

Actions #1

Updated by Nitzan Mordechai 2 months ago

  • Category set to Tiering
Actions #2

Updated by Radoslaw Zarzynski 2 months ago

  • Assignee set to Matan Breizman

Hmm, it seems to happen before the scrub part:

  for (unsigned snap = 0; snap < my_snaps.size(); ++snap) {
    cout << "promoting from clones for snap " << my_snaps[snap] << std::endl;
    ioctx.snap_set_read(my_snaps[snap]);

    // read some snaps, semi-randomly
    for (int i=0; i<50; ++i) {
      bufferlist bl;
      string o = string("foo") + stringify((snap * i * 137) % 80);
      //cout << o << std::endl;
      ASSERT_EQ(1, ioctx.read(o, bl, 1, 0));
    }
  }

  // ok, stop and scrub this pool (to make sure scrub can handle
  // missing clones in the cache tier).
  {
    IoCtx cache_ioctx;
    ASSERT_EQ(0, cluster.ioctx_create(cache_pool_name.c_str(), cache_ioctx));
    for (int i=0; i<10; ++i) {
      do {
        ostringstream ss;
        ss << "{\"prefix\": \"pg scrub\", \"pgid\": \"" 
           << cache_ioctx.get_id() << "." << i
           << "\"}";
        int r = cluster.mon_command(ss.str(), inbl, NULL, NULL);
        if (r == -ENOENT ||  // in case mgr osdmap is stale
            r == -EAGAIN) {
          sleep(5);
          continue;
        }
      } while (false);
    }

Matan, would you mind taking a quick look? I'm not sure whether it's specific to cache tiering (which is a deprecated feature).

Actions #3

Updated by Kamoltat (Junior) Sirivadhna 2 months ago

Hi guys,

This bug came up a few weeks ago, and I asked one of the PR authors of the run I was reviewing to take a look. The author did some analysis and made a fix on the QA side of things (no PRs filed, just testing against his branch on teuthology) to get the test passing. Here is what he said:

https://github.com/ceph/ceph/pull/53222#issuecomment-1961064275

Hope this helps with patching the issue.

Actions #4

Updated by Radoslaw Zarzynski 2 months ago

Bump up.

Actions #5

Updated by Matan Breizman 2 months ago

  • Pull request ID set to 55562

This may be related to the bug fixed in https://tracker.ceph.com/issues/64347. However, the outcome here is different, which is possible.
The author reported that the wrong snapshots were being removed, which is exactly the result of that bug.
Moreover, from the PR:

It looks like the failure is not related to this PR.
The failure occurred when reading foo21:5 because of ENOENT (according to ceph-osd.6.log).
But snapshot 5 was already removed earlier by snap trim; this was triggered by the SnapMapper because
it adds snapids to stray in SnapMapper::Scrubber::run().

...

2024-02-10T09:50:04.750+0000 7f4eaabfd640 10 snap_mapper.record_purged_snaps purged_snaps {94={100=[2~1],101=[e~1]}}
2024-02-10T09:50:04.750+0000 7f4eaabfd640 10 bluestore(/var/lib/ceph/osd/ceph-6) get_omap_iterator meta #-1:c1a3fc6e:::purged_snaps:0#
2024-02-10T09:50:04.750+0000 7f4eaabfd640 20 bluestore.onode(0x5616f5245600).flush flush done
2024-02-10T09:50:04.750+0000 7f4eaabfd640 10 bluestore(/var/lib/ceph/osd/ceph-6) get_omap_iterator has_omap = 1
2024-02-10T09:50:04.750+0000 7f4eaabfd640 20 bluestore.OmapIteratorImpl(0x5616f6c2d180) lower_bound to PSN__100_0000000000000001 key 0x0000000000000000C1A3FC6E0000000000000403'.PSN__100_0000000000000001'
2024-02-10T09:50:04.750+0000 7f4eaabfd640 20 bluestore.OmapIteratorImpl(0x5616f6c2d180) valid is at 0x0000000000000000C1A3FC6E0000000000000403'.PSN__101_000000000000000a'

Looking for a snap in pool 100 but returning a snap from pool 101.
Sounds exactly like the RCA in https://tracker.ceph.com/issues/64347#note-3.

The fix is currently in testing and should be merged to main. Let's see if we hit this again with the patch included.

Actions #6

Updated by Matan Breizman 2 months ago

  • Status changed from New to In Progress
Actions #7

Updated by Matan Breizman 2 months ago

  • Related to Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps) added
Actions #8

Updated by Radoslaw Zarzynski about 2 months ago

Bump up.

Actions #9

Updated by Matan Breizman about 2 months ago

  • Status changed from In Progress to Duplicate

Closing as this is a duplicate.

Actions #10

Updated by Matan Breizman about 2 months ago

  • Related to deleted (Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps))
Actions #11

Updated by Matan Breizman about 2 months ago

  • Is duplicate of Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps) added