Project

General

Profile

Bug #21557

osd.6 found snap mapper error on pg 2.0 oid 2:0e781f33:::smithi14431805-379 ... :187 snaps missing in mapper, should be: 175...repaired

Added by Sage Weil almost 2 years ago. Updated 7 months ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
-
Start date:
09/26/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2017-09-26 05:24:42.628342 7fb32fccd700 20 snap_mapper.add_oid 2:0e781f33:::smithi14431805-379 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:187 175
2017-09-26 05:24:42.628372 7fb32fccd700 20 snap_mapper.set_snaps 2:0e781f33:::smithi14431805-379 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:187 175
...and next instance is the repair...
2017-09-26 05:27:11.968496 7f7b77beb700 20 snap_mapper.add_oid 2:0e781f33:::smithi14431805-379 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:187 175

/a/sage-2017-09-26_02:21:24-rados-wip-sage-testing-2017-09-25-1440-distro-basic-smithi/1673099

unfortunately this run had bluestore debug logs turned way down, so it's hard to tell what happened.

History

#1 Updated by Sage Weil over 1 year ago

saw this again,
/a/sage-2017-12-05_16:19:46-rados-mimic-dev1-distro-basic-smithi/1933230

still confused. it looks to me like the snapmapper backend get_keys() returned an empty result, which happen if
1. the in_progress has an empty value (prior txn didnt' clean up?), or
2. the backend omap didn't return the key

in this case it was bluestore. hmm.

#2 Updated by Sage Weil over 1 year ago

  • Priority changed from Normal to High

#3 Updated by Kefu Chai over 1 year ago

https://github.com/ceph/ceph/pull/19366 is merged for more verbose logs

#4 Updated by Sage Weil over 1 year ago

/a/sage-2017-12-21_07:24:12-rados-wip-sage3-testing-2017-12-20-2253-distro-basic-smithi/1989672

but didn't have the verbose debug output (it was mimic-dev1)

#5 Updated by Igor Fedotov over 1 year ago

Not 100% sure if that's the same issue but we have a customer who faces an assert in SnapMapper::get_snaps()
2018-01-08 18:35:33.808458 7f57266f1700 -1 /home/abuild/rpmbuild/BUILD/ceph-12.2.1+git.1510221942.af9ea5e715/src/osd/SnapMapper.cc: In function 'int SnapMapper::get_snaps(const hobject_t&, SnapMapper::object_snaps*)' thread 7f57266f1700 time 2018-01-08 18:35:33.802898
/home/abuild/rpmbuild/BUILD/ceph-12.2.1+git.1510221942.af9ea5e715/src/osd/SnapMapper.cc: 154: FAILED assert(!out->snaps.empty())

ceph version v12.2.1-383-gaf9ea5e715 (af9ea5e71523250936836ac9ded7019b8c8eeab2) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x5622801cd84e]
2: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0x436) [0x56227fe79386]
3: (SnapMapper::get_snaps(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> >)+0xc8) [0x56227fe794a8]
4: (PG::_scan_snaps(ScrubMap&)+0x72a) [0x56227fd198da]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x210) [0x56227fd1a840]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x39f) [0x56227fd4449f]
7: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0x472) [0x56227fd45f32]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d
)+0x1011) [0x56227fc970b1]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x829) [0x5622801d2ff9]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5622801d4f40]
11: (()+0x8744) [0x7f5743c49744]
12: (clone()+0x6d) [0x7f5742cd0aad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

We managed to locate and retrieve the broken omap record via ceph_objectstore_tool and it contains empty snaps set indeed. While being present and decodable.
So the issue is rather somewhere in the write path.
That's luminous v12.2.1. Cache tier AFAIK. Upgrade from filestore and some previous release ...

#6 Updated by Sage Weil over 1 year ago

/a/yuriw-2018-01-09_21:50:35-rados-wip-yuri2-testing-2018-01-09-1813-distro-basic-smithi/2050823

another one.

2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.set_snaps 2:4ff434e3:::smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:15e 144,14b,152
2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.set_snaps set OBJ_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo..
2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.add_oid set MAP_0000000000000144_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo..
2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.add_oid set MAP_000000000000014B_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo..
2018-01-10 04:25:18.957 7ff83144a700 20 snap_mapper.add_oid set MAP_0000000000000152_0000000000000002.2FF2C27C.15e.smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo..
...
2018-01-10 04:26:54.462 7f7cf711a700 20 snap_mapper.update_snaps 2:4ff434e3:::smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:15e 144 was 
2018-01-10 04:26:54.462 7f7cf711a700 20 snap_mapper.get_snaps 2:4ff434e3:::smithi20229270-23 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:15e got.empty()

#7 Updated by Sage Weil 11 months ago

Another one:

2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.remove_oid 2:29937b58:::smithi03738207-262:39
2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper._remove_oid 2:29937b58:::smithi03738207-262:39
2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty()
2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.add_oid 2:29937b58:::smithi03738207-262:39 1e
2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty()
2018-09-01 20:34:41.728 7fed04ff0700 20 snap_mapper.set_snaps 2:29937b58:::smithi03738207-262:39 1e
...
2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty()
...repair...
2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.add_oid 2:29937b58:::smithi03738207-262:39 1e
2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.get_snaps 2:29937b58:::smithi03738207-262:39 got.empty()
2018-09-01 20:35:23.148 7fbca81fb700 20 snap_mapper.set_snaps 2:29937b58:::smithi03738207-262:39 1e

/a/sage-2018-09-01_19:12:15-rados-wip-sage2-testing-2018-08-31-1246-distro-basic-smithi/2970428
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/{default.yaml} backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/upmap.yaml msgr-failures/osd-delay.yaml msgr/simple.yaml objectstore/filestore-xfs.yaml rados.yaml rocksdb.yaml supported-random-distro$/{centos_latest.yaml}

#8 Updated by Sage Weil 7 months ago

/a/sage-2019-01-01_04:27:00-rados-wip-sage-testing-2018-12-31-1546-distro-basic-smithi/3410885

2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.remove_oid 2:88f1cc54:::smithi00112822-38:c5
2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper._remove_oid 2:88f1cc54:::smithi00112822-38:c5
2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty()
2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.add_oid 2:88f1cc54:::smithi00112822-38:c5 c4
2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty()
2019-01-01 06:29:44.959 7f869c93b700 20 snap_mapper.set_snaps 2:88f1cc54:::smithi00112822-38:c5 c4
...
2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty()
2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.add_oid 2:88f1cc54:::smithi00112822-38:c5 c4
2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.get_snaps 2:88f1cc54:::smithi00112822-38:c5 got.empty()
2019-01-01 06:34:29.724 7fe928412700 20 snap_mapper.set_snaps 2:88f1cc54:::smithi00112822-38:c5 c4

and no intervening events except an osd restart ...

Also available in: Atom PDF