Bug #43489
PG.cc: 953: FAILED assert(0 == "past_interval start interval mismatch")
Status: Open
Description
Upgrade runs from Jewel to Luminous and Luminous to Mimic
yuriw-2019-12-23_19:53:50-rados-wip-yuri3-testing-2019-12-23-1538-luminous-distro-basic-smithi/4628818
yuriw-2019-12-05_23:20:37-upgrade:luminous-x-mimic-distro-basic-smithi/4571297
NOT RELATED TO https://tracker.ceph.com/issues/24866
41-gece61ff/src/osd/PG.cc: 953: FAILED assert(0 == "past_interval start interval mismatch")
ceph version 12.2.12-841-gece61ff (ece61ff7fd951e05ed05bc710a5fbe60550c780a) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5645663c6ad0]
2: (PG::check_past_interval_bounds() const+0xa6e) [0x564565eaa3ce]
3: (PG::RecoveryState::GetInfo::GetInfo(boost::statechart::state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x151) [0x564565ed9511]
4: (boost::statechart::state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<PG::RecoveryState::Started> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x10a) [0x564565f331ba]
5: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Start, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Primary, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0x84) [0x564565f3d2c4]
6: (boost::statechart::simple_state<PG::RecoveryState::Start, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc2) [0x564565f3eef2]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x564565f1a90b]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x91) [0x564565f1aa61]
9: (PG::handle_create(PG::RecoveryCtx*)+0x125) [0x564565ee0a65]
10: (OSD::handle_pg_peering_evt(spg_t, pg_history_t const&, PastIntervals const&, unsigned int, std::shared_ptr<PG::CephPeeringEvt>)+0xf78) [0x564565e2a658]
11: (OSD::handle_pg_notify(boost::intrusive_ptr<OpRequest>)+0x512) [0x564565e2c862]
12: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0xc9) [0x564565e2db29]
13: (OSD::_dispatch(Message*)+0x3bc) [0x564565e2e64c]
14: (OSD::ms_dispatch(Message*)+0x87) [0x564565e2e987]
15: (DispatchQueue::entry()+0x792) [0x56456665b1a2]
16: (DispatchQueue::DispatchThread::entry()+0xd) [0x56456645315d]
17: (()+0x7e25) [0x7f13dcd7fe25]
18: (clone()+0x6d) [0x7f13dbe70bad]
Updated by Steve Taylor about 3 years ago
This has been observed a handful of times at AT&T over the last six months or so. I'm afraid I don't have logs, but there have been enough commonalities among the occurrences that I might be able to give some helpful clues.
All of the instances observed at AT&T have been with Nautilus 14.2.9 and may have the same root cause as https://tracker.ceph.com/issues/49689, but the call stack looks more like this one.
In every AT&T occurrence, the issue has been related to host reboots and problems getting those hosts back up. It has mainly been observed during resiliency testing as hosts are rebooted, but it has not been successfully reproduced in a controlled environment. The sequence is as follows:
1. The cluster is deployed on a single bootstrap host initially
2. Additional hosts are deployed shortly thereafter
3. The cluster is healthy and populated with some amount of data
4. The initial bootstrap host goes down. Sometimes other hosts go down and come back up first and recovery completes before the initial host goes down. Other times no other hosts go down first.
5. The bootstrap host experiences some issues and stays down for a while
6. The OSDs on the bootstrap host get marked out automatically
7. PGs are remapped, recovery commences and completes
8. The cluster is HEALTH_OK with the bootstrap host still down
9. Sometime later the bootstrap host comes back up
10. OSDs on the bootstrap host start successfully and peers on other hosts start crashing with this past_interval assert
In these cases the cluster has been recovered by removing the affected PGs from the asserting OSDs, after which those OSDs are able to come up and backfill the PGs successfully. My concern with this workaround is that newer PG replicas may be removed and overwritten with older ones, but the cluster does recover.
My theory is that this is related to the remapped state of the PGs when the bootstrap host's OSDs are marked out. If the primary is a bootstrap-host OSD, and any of the original peers is among the remapped peers when a PG recovers on the remapped set, then the remapped peers record a new last_clean_epoch that disagrees with the original primary's. When the original primary comes back up and the PG maps back to the original peer set, which includes one or more OSDs that served as remapped peers while the primary was down, peering fails due to the mismatch.
I will provide additional details as they become available.