Bug #64514
LibRadosTwoPoolsPP.PromoteSnapScrub test failed
Status: Closed
Description
In rados_api_tests:
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsPP.PromoteSnapScrub
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: my_snaps [3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: my_snaps [4,3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: my_snaps [5,4,3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: my_snaps [6,5,4,3]
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: promoting some heads
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: promoting from clones for snap 6
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: promoting from clones for snap 5
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: promoting from clones for snap 4
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: promoting from clones for snap 3
2024-02-15T23:22:45.242 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: ./src/test/librados/tier_cxx.cc:673: Failure
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: Expected equality of these values:
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp:   1
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp:   ioctx.read(o, bl, 1, 0)
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp:     Which is: -2
2024-02-15T23:22:45.243 INFO:tasks.workunit.client.0.smithi012.stdout: api_tier_pp: [  FAILED  ] LibRadosTwoPoolsPP.PromoteSnapScrub (298035 ms)
/a/yuriw-2024-02-14_14:58:57-rados-wip-yuri4-testing-2024-02-13-1546-distro-default-smithi/7559884
Updated by Radoslaw Zarzynski 2 months ago
- Assignee set to Matan Breizman
Hmm, it seems to happen before the scrub part:
for (unsigned snap = 0; snap < my_snaps.size(); ++snap) {
  cout << "promoting from clones for snap " << my_snaps[snap] << std::endl;
  ioctx.snap_set_read(my_snaps[snap]);
  // read some snaps, semi-randomly
  for (int i=0; i<50; ++i) {
    bufferlist bl;
    string o = string("foo") + stringify((snap * i * 137) % 80);
    //cout << o << std::endl;
    ASSERT_EQ(1, ioctx.read(o, bl, 1, 0));
  }
}
// ok, stop and scrub this pool (to make sure scrub can handle
// missing clones in the cache tier).
{
  IoCtx cache_ioctx;
  ASSERT_EQ(0, cluster.ioctx_create(cache_pool_name.c_str(), cache_ioctx));
  for (int i=0; i<10; ++i) {
    do {
      ostringstream ss;
      ss << "{\"prefix\": \"pg scrub\", \"pgid\": \""
         << cache_ioctx.get_id() << "." << i
         << "\"}";
      int r = cluster.mon_command(ss.str(), inbl, NULL, NULL);
      if (r == -ENOENT || // in case mgr osdmap is stale
          r == -EAGAIN) {
        sleep(5);
        continue;
      }
    } while (false);
  }
Matan, would you mind taking a quick look? I'm not sure whether it's specific to cache tiering (which is a deprecated feature).
Updated by Kamoltat (Junior) Sirivadhna 2 months ago
Hi guys,
this bug came up a few weeks ago, and I asked one of the PR authors of the run I was reviewing to take a look. He did some analysis and made a fix on the QA side of things (no PR filed, just testing against his branch on teuthology) to get the test passing. Here is what he said:
https://github.com/ceph/ceph/pull/53222#issuecomment-1961064275
Hope this helps with patching this issue.
Updated by Matan Breizman 2 months ago
- Pull request ID set to 55562
This may be related to the bug fixed in https://tracker.ceph.com/issues/64347, even though the outcome here is different, which is possible: the author reported that the wrong snapshots were being removed, which is exactly the result of that bug.
Moreover, from the PR:
It looks like the failure is not related to this PR.
The failure occurred when reading foo21:5 because of ENOENT (according to ceph-osd.6.log).
But snapshot 5 had already been removed earlier by snap trim; this was triggered by SnapMapper, because
it adds snapids to stray in SnapMapper::Scrubber::run().
...
2024-02-10T09:50:04.750+0000 7f4eaabfd640 10 snap_mapper.record_purged_snaps purged_snaps {94={100=[2~1],101=[e~1]}}
2024-02-10T09:50:04.750+0000 7f4eaabfd640 10 bluestore(/var/lib/ceph/osd/ceph-6) get_omap_iterator meta #-1:c1a3fc6e:::purged_snaps:0#
2024-02-10T09:50:04.750+0000 7f4eaabfd640 20 bluestore.onode(0x5616f5245600).flush flush done
2024-02-10T09:50:04.750+0000 7f4eaabfd640 10 bluestore(/var/lib/ceph/osd/ceph-6) get_omap_iterator has_omap = 1
2024-02-10T09:50:04.750+0000 7f4eaabfd640 20 bluestore.OmapIteratorImpl(0x5616f6c2d180) lower_bound to PSN__100_0000000000000001 key 0x0000000000000000C1A3FC6E0000000000000403'.PSN__100_0000000000000001'
2024-02-10T09:50:04.750+0000 7f4eaabfd640 20 bluestore.OmapIteratorImpl(0x5616f6c2d180) valid is at 0x0000000000000000C1A3FC6E0000000000000403'.PSN__101_000000000000000a'
Looking for snap in pool 100 and returning snap in pool 101.
Sounds exactly like the RCA in https://tracker.ceph.com/issues/64347#note-3.
The fix is currently in testing and should be merged to main. Let's see if we hit this again once the patch is included.
Updated by Matan Breizman 2 months ago
- Related to Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps) added
Updated by Matan Breizman about 2 months ago
- Status changed from In Progress to Duplicate
Closing as this is a duplicate.
Updated by Matan Breizman about 2 months ago
- Related to deleted (Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps))
Updated by Matan Breizman about 2 months ago
- Is duplicate of Bug #64347: src/osd/PG.cc: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps) added