Project

General

Profile

Bug #20843

assert(i->prior_version == last) when a MODIFY entry follows an ERROR entry

Added by Jeegn Chen over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
Peering
Target version:
% Done:

0%

Source:
Tags:
Backport:
kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We encountered a core dump of ceph-osd. According to the following information from gdb, the problem was that the prior_version of PGLog entry 8336'960 was pointing to 8336'957 while the code expected for 8336'959.

According to the output of "ceph-objectstore-tool ... --skip-journal-replay --pgid 5.1 --op log", we knew that entries 8336'957, 8336'958, 8336'959 and 8336'960 were all available. But 8336'957 was MODIFY, 8336'958 and 8336'959 were were ERROR, 8336'960 were MODIFY.

After some googling, we noticed https://github.com/ceph/ceph/pull/15649/commits/7879efdd6b14770b287c672641dc2461e491f9b0 might be related. But further investigation showed that the commit would not help in our case since the commit only work when i is pointing to an ERROR entry, which is different from our 8336'960 MODIFY case.

An intuitive solution for me is that we execute "last = i->version" when "! i->is_error()".
Or the consistency check (such as "i->prior_version == last") should be done against "entries" in 7879efdd6b14770b287c672641dc2461e491f9b0 instead of "orig_entries".

PS: to address the problem before the code change, we used ceph-objectstore-tool to remove the problematic PG (since we still had enough valid replicas) and started the OSD successfully.

@
-------------gdb begin-------------
(gdb) bt
#0 0x00007f83c68e123b in raise () from /lib64/libpthread.so.0
#1 0x00007f83c9592e75 in reraise_fatal (signum=6) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/global/signal_handler.cc:72
#2 handle_fatal_signal (signum=6) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/global/signal_handler.cc:134
#3 <signal handler called>
#4 0x00007f83c56ff1d7 in raise () from /lib64/libc.so.6
#5 0x00007f83c57008c8 in abort () from /lib64/libc.so.6
#6 0x00007f83c9734d37 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f83c99ab265 "i->prior_version == last", file=file@entry=0x7f83c9975408 "/root/GIT/Dawa/rpmbuild/BUILD/ceph-11.2.0-4-gf658c05/src/osd/PGLog.h", line=line@entry=773, func=func@entry=0x7f83c99ab9a0 <void PGLog::_merge_object_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, hobject_t const&, std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)4, pg_log_entry_t> > const&, pg_info_t const&, eversion_t, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)::__PRETTY_FUNCTION__> "static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, missing_type&, PGLog::LogEntryHandler*"...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/common/assert.cc:78
Python Exception <type 'exceptions.IndexError'> list index out of range:
#7 0x00007f83c91dc6b4 in PGLog::_merge_object_divergent_entries<pg_missing_set<true> > (log=..., hoid=..., entries=std::list, info=..., olog_can_rollback_to=..., missing=..., rollbacker=rollbacker@entry=0x7f83a0b82270, dpp=dpp@entry=0x7f83d712dd30) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PGLog.h:773
Python Exception <type 'exceptions.IndexError'> list index out of range:
#8 0x00007f83c91de61c in PGLog::_merge_divergent_entries<pg_missing_set<true> > (log=..., entries=empty std::list, oinfo=..., olog_can_rollback_to=..., omissing=..., rollbacker=rollbacker@entry=0x7f83a0b82270, dpp=dpp@entry=0x7f83d712dd30) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PGLog.h:939
#9 0x00007f83c91d67f4 in PGLog::merge_log (this=this@entry=0x7f83d712dd30, t=..., oinfo=..., olog=..., fromosd=..., info=..., rollbacker=rollbacker@entry=0x7f83a0b82270, dirty_info=@0x7f83d712d585: false, dirty_big_info=@0x7f83d712d586: false) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PGLog.cc:387
#10 0x00007f83c91415e5 in PG::merge_log (this=this@entry=0x7f83d712d000, t=..., oinfo=..., olog=..., from=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PG.cc:449
#11 0x00007f83c915b885 in PG::proc_master_log (this=this@entry=0x7f83d712d000, t=..., oinfo=..., olog=..., omissing=..., from=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PG.cc:319
#12 0x00007f83c9178e7c in PG::RecoveryState::GetLog::react (this=this@entry=0x7f83eb6eafa0) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PG.cc:7605
#13 0x00007f83c91cf992 in react<PG::RecoveryState::GetLog, boost::statechart::event_base, void const*> (eventType=<synthetic pointer>, evt=..., stt=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/custom_reaction.hpp:42
#14 local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::GotLog>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::transition<PG::RecoveryState::IsIncomplete, PG::RecoveryState::Incomplete> >, boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>, evt=..., stt=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:816
#15 local_react<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::GotLog>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::transition<PG::RecoveryState::IsIncomplete, PG::RecoveryState::Incomplete> > > (eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>, evt=..., this=0x7f83eb6eafa0) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:851
#16 local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::RecoveryState::GotLog>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::transition<PG::RecoveryState::IsIncomplete, PG::RecoveryState::Incomplete> >, boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>, evt=..., stt=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:820
#17 local_react<boost::mpl::list4<boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::RecoveryState::GotLog>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::transition<PG::RecoveryState::IsIncomplete, PG::RecoveryState::Incomplete> > > (eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>, evt=..., this=0x7f83eb6eafa0) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:851
#18 local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::RecoveryState::GotLog>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::transition<PG::RecoveryState::IsIncomplete, PG::RecoveryState::Incomplete> >, boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>, evt=..., stt=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:820
#19 local_react<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::RecoveryState::GotLog>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::transition<PG::RecoveryState::IsIncomplete, PG::RecoveryState::Incomplete> > > (eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>, evt=..., this=0x7f83eb6eafa0) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:851
#20 boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl (this=0x7f83eb6eafa0, evt=..., eventType=0x7f83c9e1bf90 <boost::statechart::detail::id_holder<PG::RecoveryState::GotLog>::idProvider_>) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/simple_state.hpp:489
#21 0x00007f83c91ac2bb in operator() (this=<synthetic pointer>) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/state_machine.hpp:87
#22 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (this=0x7f83d712efe0, action=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/null_exception_translator.hpp:33
#23 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event (this=0x7f83d712ef88, evt=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/state_machine.hpp:889
#24 0x00007f83c91ac411 in boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events (this=this@entry=0x7f83d712ef88) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/state_machine.hpp:910
#25 0x00007f83c91798b5 in process_event (evt=..., this=0x7f83d712ef88) at /usr/src/debug/ceph-11.2.0-4-gf658c05/build/boost/include/boost/statechart/state_machine.hpp:280
#26 handle_event (rctx=0x7f83a0b828a0, evt=..., this=0x7f83d712ef88) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PG.h:2058
#27 PG::handle_peering_event (this=0x7f83d712d000, evt=..., rctx=0x7f83a0b828a0) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PG.cc:5655
Python Exception <type 'exceptions.IndexError'> list index out of range:
#28 0x00007f83c90da214 in OSD::process_peering_events (this=0x7f83d3f52000, pgs=std::list, handle=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/OSD.cc:9132
Python Exception <type 'exceptions.IndexError'> list index out of range:
#29 0x00007f83c912a5b7 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=std::list, handle=...) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/OSD.h:1933
#30 0x00007f83c973b969 in ThreadPool::worker (this=0x7f83d3f527b8, wt=0x7f83e96529f0) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/common/WorkQueue.cc:132
#31 0x00007f83c973c940 in ThreadPool::WorkThread::entry (this=<optimized out>) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/common/WorkQueue.h:446
#32 0x00007f83c68d9dc5 in start_thread () from /lib64/libpthread.so.0
#33 0x00007f83c57c174d in clone () from /lib64/libc.so.6
(gdb) frame 7
Python Exception <type 'exceptions.IndexError'> list index out of range:
#7 0x00007f83c91dc6b4 in PGLog::_merge_object_divergent_entries<pg_missing_set<true> > (log=..., hoid=..., entries=std::list, info=..., olog_can_rollback_to=..., missing=..., rollbacker=rollbacker@entry=0x7f83a0b82270, dpp=dpp@entry=0x7f83d712dd30) at /usr/src/debug/ceph-11.2.0-4-gf658c05/src/osd/PGLog.h:773
773 assert(i->prior_version == last);
(gdb) l 760,777
760
761 // entries is non-empty
762 assert(!entries.empty());
763 eversion_t last;
764 for (list<pg_log_entry_t>::const_iterator i = entries.begin();
765 i != entries.end();
766 ++i) {
767 // all entries are on hoid
768 assert(i->soid == hoid);
769 if (i != entries.begin() && i->prior_version != eversion_t()) {
770 // in increasing order of version
771 assert(i->version > last);
772 // prior_version correct
773 assert(i->prior_version == last);
774 }
775 last = i->version;
776 }
777
(gdb) p last
$3 = {
version = 959,
epoch = 8336,
__pad = <optimized out>
}
(gdb) p i
$4 = {
mod_desc = {
can_local_rollback = false,
rollback_info_completed = false,
max_required_version = 1 '\001',
bl = {
_buffers = empty std::list,
_len = 0,
_memcopy_count = 0,
append_buffer = {
_raw = ,
_off = 0,
_len = 0
},
last_p = {
<ceph::buffer::list::iterator_impl<false>> = {
<std::iterator<std::forward_iterator_tag, char, long, char*, char&>> = {<No data fields>},
members of ceph::buffer::list::iterator_impl<false>:
bl = ,
ls = ,
off = 0,
p = {
_raw = ,
_off = 0,
_len = 0
},
p_off = 0
}, <No data fields>},
static CLAIM_DEFAULT = 0,
static CLAIM_ALLOW_NONSHAREABLE = 1
}
},
snaps = {
_buffers = empty std::list,
_len = 0,
_memcopy_count = 0,
append_buffer = {
_raw = ,
_off = 0,
_len = 0
},
last_p = {
<ceph::buffer::list::iterator_impl<false>> = {
<std::iterator<std::forward_iterator_tag, char, long, char*, char&>> = {<No data fields>},
members of ceph::buffer::list::iterator_impl<false>:
bl = ,
ls = ,
off = 0,
p = {
_raw = ,
_off = 0,
_len = 0
},
p_off = 0
}, <No data fields>},
static CLAIM_DEFAULT = 0,
static CLAIM_ALLOW_NONSHAREABLE = 1
},
soid = {
oid = {
name = "notify.7"
},
snap = {
val = 18446744073709551614
},
hash = 2225973193,
max = false,
nibblewise_key_cache = 2625296968,
hash_reverse_bits = 2481304865,
static POOL_META = 1,
static POOL_TEMP_START = -2,
pool = 5,
nspace = "",
key = ""
},
reqid = {
name = {
_type = 0 '\000',
_num = 0,
static TYPE_MON = 1,
static TYPE_MDS = 2,
static TYPE_OSD = 4,
static TYPE_CLIENT = 8,
static TYPE_MGR = 16,
static NEW = -1
},
tid = 0,
inc = 0
},
extra_reqids = std::vector of length 0, capacity 0,
version = {
version = 960,
epoch = 8336,
__pad = 0
},
prior_version = {
version = 957,
epoch = 8336,
__pad = 0
},
reverting_to = {
version = 0,
epoch = 0,
__pad = 0
},
user_version = 0,
mtime = {
tv = {
tv_sec = 1501059332,
tv_nsec = 585029087
}
},
return_code = 0,
op = 1,
invalid_hash = false,
invalid_pool = false
}
-------------gdb end------------

-------------ceph-objectstore-tool --op log begin------------- {
"op": "modify ",
"object": "5:93e5b521:::notify.7:head",
"version": "8336'957",
--
"mod_desc": {
"object_mod_desc": {
"can_local_rollback": false,
"rollback_info_completed": false,
"ops": []
}
}
}, {
"op": "error ",
"object": "5:93e5b521:::notify.7:head",
"version": "8336'958",
--
"mod_desc": {
"object_mod_desc": {
"can_local_rollback": true,
"rollback_info_completed": false,
"ops": []
}
}
}, {
"op": "error ",
"object": "5:93e5b521:::notify.7:head",
"version": "8336'959",
--
"mod_desc": {
"object_mod_desc": {
"can_local_rollback": true,
"rollback_info_completed": false,
"ops": []
}
}
}, {
"op": "modify ",
"object": "5:93e5b521:::notify.7:head",
"version": "8336'960",

-------------ceph-objectstore-tool --op log end-------------
@


Related issues

Copied to RADOS - Backport #20930: kraken: assert(i->prior_version == last) when a MODIFY entry follows an ERROR entry Rejected

History

#1 Updated by Sage Weil over 6 years ago

  • Status changed from New to Fix Under Review

#2 Updated by Sage Weil over 6 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Sage Weil over 6 years ago

  • Priority changed from Urgent to Immediate

#4 Updated by Josh Durgin over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to kraken

Backport only needed for kraken, jewel does not have error log entries.

#5 Updated by Loïc Dachary over 6 years ago

  • Copied to Backport #20930: kraken: assert(i->prior_version == last) when a MODIFY entry follows an ERROR entry added

#6 Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF