Bug #10780
"FAILED assert(infoevt.info.history.last_epoch_started ...)" in upgrade:giant-x-hammer-distro-basic-multi run
Status: Closed
Description
Run: http://pulpito.front.sepia.ceph.com/teuthology-2015-02-05_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/
Job: 741180
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-05_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/741180/
From scrape:
Assertion: osd/PG.cc: 6380: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
ceph version 0.87-86-g1ccd73a (1ccd73a16e1829b5519ec5b83b2554af173ad052)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb8282b]
 2: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x28b) [0x7c7f7b]
 3: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x296) [0x813f36]
 4: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xbd) [0x81702d]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7feb4b]
 6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7ae6ae]
 7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c0) [0x698eb0]
 8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6eeb58]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb739b6]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xb74a60]
 11: (()+0x8182) [0x7ffc7a8ca182]
 12: (clone()+0x6d) [0x7ffc78e3630d]
Updated by Yuri Weinstein over 9 years ago
Also in run:
http://pulpito.ceph.com/teuthology-2015-02-08_17:05:02-upgrade:giant-x-hammer-distro-basic-multi/
Job: 744225
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-08_17:05:02-upgrade:giant-x-hammer-distro-basic-multi/744225/
Assertion: osd/PG.cc: 6380: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
ceph version 0.87-86-g1ccd73a (1ccd73a16e1829b5519ec5b83b2554af173ad052)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa02ef]
 2: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x2f9) [0x772679]
 3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list6<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MNotifyRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Backfilled>, boost::statechart::custom_reaction<PG::AllReplicasActivated> >, boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x100) [0x7c6fd0]
 4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x8d) [0x7c709d]
 5: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x11d) [0x7c60cd]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7a6cdb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x7a6fde]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x75d933]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x274) [0x657fb4]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6ab812]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8ff1e]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xa92cb0]
 13: (()+0x7e9a) [0x7f72a5a9ee9a]
 14: (clone()+0x6d) [0x7f72a446931d]
Updated by Yuri Weinstein over 9 years ago
Also in:
Run: http://pulpito.ceph.com/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/
Job: 743622
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/
From /a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log/ceph-osd.13.log.gz:
 0> 2015-02-08 08:18:43.496594 7f8c5af38700 -1 osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::MInfoRec&)' thread 7f8c5af38700 time 2015-02-08 08:18:43.492375
osd/PG.cc: 6380: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
 ceph version 0.87-86-g1ccd73a (1ccd73a16e1829b5519ec5b83b2554af173ad052)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa02ef]
 2: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x2f9) [0x772679]
 3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list6<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MNotifyRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Backfilled>, boost::statechart::custom_reaction<PG::AllReplicasActivated> >, boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x100) [0x7c6fd0]
 4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x8d) [0x7c709d]
 5: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x11d) [0x7c60cd]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7a6cdb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x7a6fde]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x75d933]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x274) [0x657fb4]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6ab812]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8ff1e]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xa92cb0]
 13: (()+0x7e9a) [0x7f8c745f9e9a]
 14: (clone()+0x6d) [0x7f8c72fab31d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Yuri Weinstein over 9 years ago
Updated by David Zafman about 9 years ago
- Assignee set to David Zafman
Looking at job 741180, I see that osd.5 (hammer) sent a PG::MInfoRec peering message to osd.12 (giant) at around 2015-02-05 19:08:15. I'm suspicious of the following change:
commit 818b91b7e2af56d9564717a9cef2d353cf42cf3d
Author: Samuel Just <sam.just@inktank.com>
Date:   Fri Nov 7 15:20:02 2014 -0800

    osd/: go "peered" instead of "active" when < min_size

    In the case of a replicated pool, the pg will transition to "peered"
    rather than "active", where it can perform backfill bringing itself
    up to min_size peers. To that end, most operations now require
    is_peered rather than is_active (OSDOps being the primary exception).

    Also, rather than using the query_epoch on the activation message as
    the activation epoch (for last_epoch_started) on the replica, we
    instead use the last_epoch_started in the info sent by the primary.
    This allows the primary to not advance last_epoch_started past the
    last known actual activation. This will prevent later peering epochs
    from requiring the last_update from a peered epoch to go active
    (might be divergent).

    Fixes: #7862
    Signed-off-by: Samuel Just <sam.just@inktank.com>
Updated by Samuel Just about 9 years ago
- Assignee changed from David Zafman to Samuel Just
Agreed, though there is a feature flag which should have prevented it.
Updated by Samuel Just about 9 years ago
- Assignee changed from Samuel Just to Yuri Weinstein
Updated by Samuel Just about 9 years ago
wip-sam-testing should have a fix for this.