Bug #10780


"FAILED assert(infoevt.info.history.last_epoch_started".. in upgrade:giant-x-hammer-distro-basic-multi run

Added by Yuri Weinstein about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.front.sepia.ceph.com/teuthology-2015-02-05_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/
Job: 741180
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-05_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/741180/

From scrape:

Assertion: osd/PG.cc: 6380: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
ceph version 0.87-86-g1ccd73a (1ccd73a16e1829b5519ec5b83b2554af173ad052)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb8282b]
 2: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x28b) [0x7c7f7b]
 3: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x296) [0x813f36]
 4: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xbd) [0x81702d]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7feb4b]
 6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7ae6ae]
 7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c0) [0x698eb0]
 8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6eeb58]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb739b6]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xb74a60]
 11: (()+0x8182) [0x7ffc7a8ca182]
 12: (clone()+0x6d) [0x7ffc78e3630d]

Actions #1

Updated by Jason Dillaman about 9 years ago

  • Project changed from rbd to Ceph
Actions #2

Updated by Yuri Weinstein about 9 years ago

Also in run:

http://pulpito.ceph.com/teuthology-2015-02-08_17:05:02-upgrade:giant-x-hammer-distro-basic-multi/
Job: 744225
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-08_17:05:02-upgrade:giant-x-hammer-distro-basic-multi/744225/

Assertion: osd/PG.cc: 6380: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
ceph version 0.87-86-g1ccd73a (1ccd73a16e1829b5519ec5b83b2554af173ad052)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa02ef]
 2: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x2f9) [0x772679]
 3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list6<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MNotifyRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Backfilled>, boost::statechart::custom_reaction<PG::AllReplicasActivated> >, boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x100) [0x7c6fd0]
 4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x8d) [0x7c709d]
 5: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x11d) [0x7c60cd]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7a6cdb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x7a6fde]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x75d933]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x274) [0x657fb4]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6ab812]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8ff1e]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xa92cb0]
 13: (()+0x7e9a) [0x7f72a5a9ee9a]
 14: (clone()+0x6d) [0x7f72a446931d]
Actions #3

Updated by Yuri Weinstein about 9 years ago

Also in run:
Run: http://pulpito.ceph.com/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/
Job: 743622
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/

/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743351-     0> 2015-02-08 08:18:43.496594 7f8c5af38700 -1 osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::MInfoRec&)' thread 7f8c5af38700 time 2015-02-08 08:18:43.492375
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743559-osd/PG.cc: 6380: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743671-
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743672: ceph version 0.87-86-g1ccd73a (1ccd73a16e1829b5519ec5b83b2554af173ad052)
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743746- 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa02ef]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743837- 2: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x2f9) [0x772679]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896743914- 3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list6<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MNotifyRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Backfilled>, boost::statechart::custom_reaction<PG::AllReplicasActivated> >, boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x100) [0x7c6fd0]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896744876- 4: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x8d) [0x7c709d]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896745115- 5: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x11d) [0x7c60cd]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896745545- 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7a6cdb]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896745782- 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x7a6fde]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746022- 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x75d933]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746130- 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x274) [0x657fb4]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746249- 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6ab812]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746365- 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8ff1e]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746433- 12: (ThreadPool::WorkThread::entry()+0x10) [0xa92cb0]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746488- 13: (()+0x7e9a) [0x7f8c745f9e9a]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746522- 14: (clone()+0x6d) [0x7f8c72fab31d]
/a/teuthology-2015-02-07_17:05:03-upgrade:giant-x-hammer-distro-basic-multi/743622/remote/plana21/log//ceph-osd.13.log.gz:896746559- NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by Samuel Just about 9 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by David Zafman about 9 years ago

  • Assignee set to David Zafman

Looking at job 741180, I see that osd.5 (hammer) sent a PG::MInfoRec peering message to osd.12 (giant) at around 2015-02-05 19:08:15. I'm suspicious of the following change:


commit 818b91b7e2af56d9564717a9cef2d353cf42cf3d
Author: Samuel Just <sam.just@inktank.com>
Date:   Fri Nov 7 15:20:02 2014 -0800

    osd/: go "peered" instead of "active" when < min_size

    In the case of a replicated pool, the pg will transition to "peered" 
    rather than "active", where it can perform backfill bringing itself up
    to min_size peers.  To that end, most operations now require
    is_peered rather than is_active (OSDOps being the primary exception).
    Also, rather than using the query_epoch on the activation message as the
    activation epoch (for last_epoch_started) on the replica, we instead
    use the last_epoch_started in the info sent by the primary.  This
    allows the primary to not advance last_epoch_started past the last
    known actual activation.  This will prevent later peering epochs from
    requiring the last_update from a peered epoch to go active (might be
    divergent).

    Fixes: #7862
    Signed-off-by: Samuel Just <sam.just@inktank.com>
Actions #7

Updated by Samuel Just about 9 years ago

  • Assignee changed from David Zafman to Samuel Just

Agreed, though there is a feature flag which should have prevented it.

Actions #8

Updated by Samuel Just about 9 years ago

  • Status changed from New to 7
Actions #9

Updated by Samuel Just about 9 years ago

  • Assignee changed from Samuel Just to Yuri Weinstein
Actions #10

Updated by Samuel Just about 9 years ago

wip-sam-testing should have a fix for this.

Actions #12

Updated by Sage Weil about 9 years ago

  • Status changed from 7 to Resolved