Bug #21827
closed
OSD crashed while repairing inconsistent PG
Description
There was one inconsistent PG (2.2fc). Running ceph repair on it ends with the primary OSD crashing every time it tries to repair the PG.
2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56236c8ff3f2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x56236c476213]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >, ObjectStore::Transaction)+0x693) [0x56236c60d4d3]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x56236c521aa0]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x56236c3091a9]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x56236c5a2ae7]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x56236c9041e4]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
13: (()+0x76ba) [0x7f2366be96ba]
14: (clone()+0x6d) [0x7f2365c603dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The OSD received a push op containing a snapshot that it does not think should exist. I also see that there is a comment "// hmm, should we warn?" on that assert.
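To make the failing check concrete, here is a minimal sketch of the lookup pattern behind assert(p != recovery_info.ss.clone_snaps.end()). It assumes clone_snaps is a map from a clone's snap id to the list of snapshots that clone covers. SnapId, SnapSetSketch and snaps_for_clone are hypothetical stand-ins written for this note, not the real Ceph types or functions.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for Ceph's snapid_t (assumption, not the real type).
using SnapId = std::uint64_t;

// Hypothetical stand-in for the snap set carried in the recovery info.
struct SnapSetSketch {
    // clone snap id -> snapshots covered by that clone
    std::map<SnapId, std::vector<SnapId>> clone_snaps;
};

// Look up the snapshots for a pushed clone.
// Returns nullptr when the clone is not listed in clone_snaps, which is the
// situation the assert in the backtrace treats as fatal.
const std::vector<SnapId>* snaps_for_clone(const SnapSetSketch& ss, SnapId clone) {
    auto p = ss.clone_snaps.find(clone);
    if (p == ss.clone_snaps.end()) {
        return nullptr;  // pushed clone is unknown to the snap set
    }
    return &p->second;
}
```

With a snap set that only lists clone 4, a push for clone 7 would return nullptr from this helper. In the OSD the equivalent condition aborts the process, which matches the crash seen on every repair attempt.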
We uploaded the OSD log, captured with debug osd = 20. Reference: 6e4dba6f-2c15-4920-b591-fe380bbca200