Project

General

Profile

Actions

Bug #2068

closed

osd: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::RecoveryState::MInfoRec&)' thread 7f4748a1a700 time 2012-02-15 03:32:23.134232
osd/PG.cc: 4033: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
 ceph version 0.41-350-ge32668f (commit:e32668f8b83abad74e858d9e9fffbd456968a918)
 1: (PG::RecoveryState::Active::react(PG::RecoveryState::MInfoRec const&)+0x293) [0x6a78a3]
 2: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x237) [0x6c5a67]
 3: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x6bedfb]
 4: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x157) [0x69b1b7]
 5: (OSD::handle_pg_info(OpRequest*)+0x4b9) [0x55b7d9]
 6: (OSD::_dispatch(Message*)+0x1cd) [0x5743fd]
 7: (OSD::ms_dispatch(Message*)+0x1af) [0x574b0f]
 8: (SimpleMessenger::dispatch_entry()+0x89a) [0x5bb68a]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6abc]
 10: (()+0x7971) [0x7f47556a5971]
 11: (clone()+0x6d) [0x7f4753d3092d]
 ceph version 0.41-350-ge32668f (commit:e32668f8b83abad74e858d9e9fffbd456968a918)

on ubuntu@teuthology:/a/nightly_coverage_2012-02-15-a/12095
Actions #1

Updated by Sage Weil about 12 years ago

looking at the core file.
- we are primary
- replica is sending us an info message, with one record. it is therefore coming from _activate_committed.
- the last_epoch_started is 29, that's teh epoch_t e argument to _activate_committed
- the m->epoch == 31.
- the primary same_interval_since is 30.. so this pg changed in epoch 30, and the replica shouldn't have sent the message
- it looks like the e < last_peering_reset check isn't right, although it appears to be called at the appropriate places in the state machine... additionally checking for e < info.history.same_interval_since will work, but it's not clear whether last_peering_reset should have been > 29 in this case... :/

(gdb) p infoevt.info.history
$2 = {
  epoch_created = 1, 
  last_epoch_started = 29, 
  last_epoch_clean = 22, 
  last_epoch_split = 26, 
  same_up_since = 27, 
  same_interval_since = 29, 
  same_primary_since = 29, 
  last_scrub = {
    version = 0, 
    epoch = 0, 
    __pad = 0
  }, 
  last_scrub_stamp = {
    tv = {
      tv_sec = 1329305246, 
      tv_nsec = 275315000
    }
  }
}
(gdb) p pg->info.history
$3 = {
  epoch_created = 1, 
  last_epoch_started = 26, 
  last_epoch_clean = 22, 
  last_epoch_split = 26, 
  same_up_since = 30, 
  same_interval_since = 30, 
  same_primary_since = 29, 
  last_scrub = {
    version = 0, 
    epoch = 0, 
    __pad = 0
  }, 
  last_scrub_stamp = {
    tv = {
      tv_sec = 1329305246, 
      tv_nsec = 275315000
    }
  }
}
Actions #2

Updated by Sage Weil about 12 years ago

Oh, i see the problem.. the osdmap ref is taken by lock().. this pg hasn't seen the new map yet.

just need to tag the message with e, not the current osdmap epoch.

Actions #3

Updated by Sage Weil about 12 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF