Bug #21827

closed

OSD crashed while repairing inconsistent PG

Added by Ana Aviles over 6 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There was one inconsistent PG (2.2fc). Running a repair on it makes the primary OSD crash every time it tries to repair the PG.

2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56236c8ff3f2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x56236c476213]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x56236c521aa0]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x56236c3091a9]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x56236c5a2ae7]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x56236c9041e4]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
13: (()+0x76ba) [0x7f2366be96ba]
14: (clone()+0x6d) [0x7f2365c603dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The OSD got a push op containing a snapshot it doesn't think should exist. I also see that there's a comment "// hmm, should we warn?" on that assert.
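To make the failure mode concrete, here is a minimal sketch of the lookup pattern the failed assert implies. It is not the Ceph source. SnapSetSketch and snaps_for_clone are hypothetical names, and the layout of clone_snaps (a map from clone id to the snaps it covers) is an assumption inferred from the .end() comparison in the assert message. The sketch returns a null pointer for an unknown clone where the OSD asserts, which is one possible reading of the "should we warn?" comment.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; not the real Ceph types.
using snapid_t = std::uint64_t;

struct SnapSetSketch {
    // Assumed layout: clone id -> snapshots covered by that clone.
    std::map<snapid_t, std::vector<snapid_t>> clone_snaps;
};

// Look up the snaps recorded for `clone`.
// Returns nullptr when the clone is not present in the snap set, which is
// the situation in which the assert reported above fires.
const std::vector<snapid_t>* snaps_for_clone(const SnapSetSketch& ss,
                                             snapid_t clone) {
    auto p = ss.clone_snaps.find(clone);
    if (p == ss.clone_snaps.end()) {
        return nullptr;  // caller could log a warning and skip this object
    }
    return &p->second;
}
```

With a snap set that only records clone 4, snaps_for_clone(ss, 4) yields its snaps, while snaps_for_clone(ss, 7) yields nullptr. The latter corresponds to a pushed clone that the primary does not think should exist.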

We uploaded the OSD log, captured with debug osd = 20. Reference: 6e4dba6f-2c15-4920-b591-fe380bbca200
