Bug #21557
closedosd.6 found snap mapper error on pg 2.0 oid 2:0e781f33:::smithi14431805-379 ... :187 snaps missing in mapper, should be: 175...repaired
0%
Description
2017-09-26 05:24:42.628342 7fb32fccd700 20 snap_mapper.add_oid 2:0e781f33:::smithi14431805-379 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:187 175 2017-09-26 05:24:42.628372 7fb32fccd700 20 snap_mapper.set_snaps 2:0e781f33:::smithi14431805-379 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:187 175 ...and next instance is the repair... 2017-09-26 05:27:11.968496 7f7b77beb700 20 snap_mapper.add_oid 2:0e781f33:::smithi14431805-379 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:187 175
/a/sage-2017-09-26_02:21:24-rados-wip-sage-testing-2017-09-25-1440-distro-basic-smithi/1673099
unfortunately this run had bluestore debug logs turned way down, so it's hard to tell what happened.
Updated by Sage Weil over 6 years ago
saw this again,
/a/sage-2017-12-05_16:19:46-rados-mimic-dev1-distro-basic-smithi/1933230
still confused. it looks to me like the snapmapper backend get_keys() returned an empty result, which happen if
1. the in_progress has an empty value (prior txn didnt' clean up?), or
2. the backend omap didn't return the key
in this case it was bluestore. hmm.
Updated by Kefu Chai over 6 years ago
https://github.com/ceph/ceph/pull/19366 is merged for more verbose logs
Updated by Sage Weil over 6 years ago
/a/sage-2017-12-21_07:24:12-rados-wip-sage3-testing-2017-12-20-2253-distro-basic-smithi/1989672
but didn't have the verbose debug output (it was mimic-dev1)
Updated by Igor Fedotov over 6 years ago
Not 100% sure if that's the same issue but we have a customer who faces an assert in SnapMapper::get_snaps()
2018-01-08 18:35:33.808458 7f57266f1700 1 /home/abuild/rpmbuild/BUILD/ceph-12.2.1+git.1510221942.af9ea5e715/src/osd/SnapMapper.cc: In function 'int SnapMapper::get_snaps(const hobject_t&, SnapMapper::object_snaps*)' thread 7f57266f1700 time 2018-01-08 18:35:33.802898>snaps.empty())
/home/abuild/rpmbuild/BUILD/ceph-12.2.1+git.1510221942.af9ea5e715/src/osd/SnapMapper.cc: 154: FAILED assert(!out
ceph version v12.2.1-383-gaf9ea5e715 (af9ea5e71523250936836ac9ded7019b8c8eeab2) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x5622801cd84e]
2: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0x436) [0x56227fe79386]
3: (SnapMapper::get_snaps(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> >)+0xc8) [0x56227fe794a8]
4: (PG::_scan_snaps(ScrubMap&)+0x72a) [0x56227fd198da]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x210) [0x56227fd1a840]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x39f) [0x56227fd4449f]
7: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0x472) [0x56227fd45f32]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d)+0x1011) [0x56227fc970b1]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x829) [0x5622801d2ff9]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5622801d4f40]
11: (()+0x8744) [0x7f5743c49744]
12: (clone()+0x6d) [0x7f5742cd0aad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
We managed to locate and retrieve the broken omap record via ceph_objectstore_tool and it contains empty snaps set indeed. While being present and decodable.
So the issue is rather somewhere in the write path.
That's luminous v12.2.1. Cache tier AFAIK. Upgrade from filestore and some previous release ...
Updated by Sage Weil over 6 years ago
/a/yuriw-2018-01-09_21:50:35-rados-wip-yuri2-testing-2018-01-09-1813-distro-basic-smithi/2050823
another one.
2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.set_snaps 2:4ff434e3:::smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:15e 144,14b,152 2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.set_snaps set OBJ_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo.. 2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.add_oid set MAP_0000000000000144_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo.. 2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.add_oid set MAP_000000000000014B_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo.. 2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.add_oid set MAP_0000000000000152_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo.. ... 2018-01-10 04:26:54.462 7f7cf711a700 20 snap_mapper.update_snaps 2:4ff434e3:::smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:15e 144 was 2018-01-10 04:26:54.462 7f7cf711a700 20 snap_mapper.get_snaps 2:4ff434e3:::smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:15e got.empty()
Updated by Sage Weil over 5 years ago
Another one:
2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.remove_oid 2:29937b58:::smithi03738207-262:39 2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper._remove_oid 2:29937b58:::smithi03738207-262:39 2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty() 2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.add_oid 2:29937b58:::smithi03738207-262:39 1e 2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty() 2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.set_snaps 2:29937b58:::smithi03738207-262:39 1e ... 2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty() ...repair... 2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.add_oid 2:29937b58:::smithi03738207-262:39 1e 2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty() 2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.set_snaps 2:29937b58:::smithi03738207-262:39 1e
/a/sage-2018-09-01_19:12:15-rados-wip-sage2-testing-2018-08-31-1246-distro-basic-smithi/2970428
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/upmap.yaml msgr-failures/osd-delay.yaml msgr/simple.yaml objectstore/filestore-xfs.yaml rados.yaml rocksdb.yaml supported-random-distro$/{centos_latest.yaml}
Updated by Sage Weil over 5 years ago
/a/sage-2019-01-01_04:27:00-rados-wip-sage-testing-2018-12-31-1546-distro-basic-smithi/3410885
2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.remove_oid 2:88f1cc54:::smithi00112822-38:c5 2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper._remove_oid 2:88f1cc54:::smithi00112822-38:c5 2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty() 2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.add_oid 2:88f1cc54:::smithi00112822-38:c5 c4 2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty() 2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.set_snaps 2:88f1cc54:::smithi00112822-38:c5 c4 ... 2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty() 2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.add_oid 2:88f1cc54:::smithi00112822-38:c5 c4 2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty() 2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.set_snaps 2:88f1cc54:::smithi00112822-38:c5 c4
and no intervening events except an osd restart ...
Updated by Josh Durgin over 4 years ago
- Status changed from Need More Info to Can't reproduce