Bug #2068: osd: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since) - Ceph - Ceph

Actions

Copy link

Bug #2068

closed

osd: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status:

Resolved

Priority:

High

Assignee:

Category:

OSD

Target version:

v0.43

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::RecoveryState::MInfoRec&)' thread 7f4748a1a700 time 2012-02-15 03:32:23.134232
osd/PG.cc: 4033: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
 ceph version 0.41-350-ge32668f (commit:e32668f8b83abad74e858d9e9fffbd456968a918)
 1: (PG::RecoveryState::Active::react(PG::RecoveryState::MInfoRec const&)+0x293) [0x6a78a3]
 2: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x237) [0x6c5a67]
 3: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x6bedfb]
 4: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x157) [0x69b1b7]
 5: (OSD::handle_pg_info(OpRequest*)+0x4b9) [0x55b7d9]
 6: (OSD::_dispatch(Message*)+0x1cd) [0x5743fd]
 7: (OSD::ms_dispatch(Message*)+0x1af) [0x574b0f]
 8: (SimpleMessenger::dispatch_entry()+0x89a) [0x5bb68a]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6abc]
 10: (()+0x7971) [0x7f47556a5971]
 11: (clone()+0x6d) [0x7f4753d3092d]
 ceph version 0.41-350-ge32668f (commit:e32668f8b83abad74e858d9e9fffbd456968a918)

on ubuntu@teuthology:/a/nightly_coverage_2012-02-15-a/12095

Actions

Copy link

Updated by Sage Weil about 12 years ago

looking at the core file.
- we are primary
- replica is sending us an info message, with one record. it is therefore coming from _activate_committed.
- the last_epoch_started is 29, that's teh epoch_t e argument to _activate_committed
- the m->epoch == 31.
- the primary same_interval_since is 30.. so this pg changed in epoch 30, and the replica shouldn't have sent the message
- it looks like the e < last_peering_reset check isn't right, although it appears to be called at the appropriate places in the state machine... additionally checking for e < info.history.same_interval_since will work, but it's not clear whether last_peering_reset should have been > 29 in this case... :/

(gdb) p infoevt.info.history
$2 = {
  epoch_created = 1, 
  last_epoch_started = 29, 
  last_epoch_clean = 22, 
  last_epoch_split = 26, 
  same_up_since = 27, 
  same_interval_since = 29, 
  same_primary_since = 29, 
  last_scrub = {
    version = 0, 
    epoch = 0, 
    __pad = 0
  }, 
  last_scrub_stamp = {
    tv = {
      tv_sec = 1329305246, 
      tv_nsec = 275315000
    }
  }
}
(gdb) p pg->info.history
$3 = {
  epoch_created = 1, 
  last_epoch_started = 26, 
  last_epoch_clean = 22, 
  last_epoch_split = 26, 
  same_up_since = 30, 
  same_interval_since = 30, 
  same_primary_since = 29, 
  last_scrub = {
    version = 0, 
    epoch = 0, 
    __pad = 0
  }, 
  last_scrub_stamp = {
    tv = {
      tv_sec = 1329305246, 
      tv_nsec = 275315000
    }
  }
}

Actions

Copy link