Bug #2068

osd: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)

Added by Sage Weil over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::Active::react(const PG::RecoveryState::MInfoRec&)' thread 7f4748a1a700 time 2012-02-15 03:32:23.134232
osd/PG.cc: 4033: FAILED assert(infoevt.info.history.last_epoch_started >= pg->info.history.same_interval_since)
 ceph version 0.41-350-ge32668f (commit:e32668f8b83abad74e858d9e9fffbd456968a918)
 1: (PG::RecoveryState::Active::react(PG::RecoveryState::MInfoRec const&)+0x293) [0x6a78a3]
 2: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x237) [0x6c5a67]
 3: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x6bedfb]
 4: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x157) [0x69b1b7]
 5: (OSD::handle_pg_info(OpRequest*)+0x4b9) [0x55b7d9]
 6: (OSD::_dispatch(Message*)+0x1cd) [0x5743fd]
 7: (OSD::ms_dispatch(Message*)+0x1af) [0x574b0f]
 8: (SimpleMessenger::dispatch_entry()+0x89a) [0x5bb68a]
 9: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a6abc]
 10: (()+0x7971) [0x7f47556a5971]
 11: (clone()+0x6d) [0x7f4753d3092d]
 ceph version 0.41-350-ge32668f (commit:e32668f8b83abad74e858d9e9fffbd456968a918)

on ubuntu@teuthology:/a/nightly_coverage_2012-02-15-a/12095

Associated revisions

Revision 4b3bb5ab (diff)
Added by Sage Weil over 8 years ago

osd: fix _activate_committed replica->primary message

Normally we take a fresh map reference in PG::lock(). However,
_activate_committed needs to make sure the map hasn't changed significantly
before acting. In the case of #2068, the OSD map has moved forward and
the mapping has changed, but the PG hasn't processed that yet, and thus
mis-tags the MOSDPGInfo message.

Tag the message with epoch e, and also pass down the primary's address
so the message is sent to the right location.

Fixes: #2068
Signed-off-by: Sage Weil <>

Revision 95633b9b (diff)
Added by Sage Weil over 8 years ago

osd: fix _activate_committed replica->primary message

Normally we take a fresh map reference in PG::lock(). However,
_activate_committed needs to make sure the map hasn't changed significantly
before acting. In the case of #2068, the OSD map has moved forward and
the mapping has changed, but the PG hasn't processed that yet, and thus
mis-tags the MOSDPGInfo message.

Tag the message with epoch e, and also pass down the primary's address
so the message is sent to the right location.

Fixes: #2068
Signed-off-by: Sage Weil <>
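
A minimal sketch of the change the two commits above describe, using hypothetical, simplified types and helpers (InfoMsg, Address, send_to, and the function signature are stand-ins, not the real Ceph classes): the activation epoch e and the primary's address are captured when the activation is queued, so a newer OSDMap reference obtained later through PG::lock() can no longer mis-tag or mis-route the info message.

  #include <cstdint>
  #include <iostream>
  #include <string>

  // Stand-in for entity_inst_t: where the info message should be delivered.
  struct Address {
    std::string name;
    int port;
  };

  // Stand-in for MOSDPGInfo: only the epoch tag matters for this sketch.
  struct InfoMsg {
    uint32_t epoch;   // epoch the message is tagged with
    // pg_info_t payload would follow in the real message
  };

  // Stand-in for the messenger send path.
  void send_to(const Address& dest, const InfoMsg& m) {
    std::cout << "send info tagged with epoch " << m.epoch
              << " to " << dest.name << ":" << dest.port << "\n";
  }

  // Sketch of the fixed behaviour: both the epoch tag and the destination come
  // from the arguments captured when the activation was queued, not from the
  // (possibly newer) map reference taken in PG::lock().
  void activate_committed(uint32_t e, const Address& primary_at_e) {
    InfoMsg m;
    m.epoch = e;               // tag with e, not the current map epoch
    send_to(primary_at_e, m);  // send to the primary recorded for epoch e
  }

  int main() {
    // Values for illustration only: e == 29 is the activation epoch from this
    // bug; the address is made up.
    activate_committed(29, Address{"osd.0", 6800});
  }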

History

#1 Updated by Sage Weil over 8 years ago

Looking at the core file:
- we are the primary
- the replica is sending us an info message with one record; it is therefore coming from _activate_committed
- last_epoch_started is 29, which is the epoch_t e argument to _activate_committed
- m->epoch == 31
- the primary's same_interval_since is 30, so this pg changed in epoch 30 and the replica shouldn't have sent the message
- it looks like the e < last_peering_reset check isn't right, although it appears to be called at the appropriate places in the state machine. Additionally checking for e < info.history.same_interval_since would work, but it's not clear whether last_peering_reset should have been > 29 in this case... :/

(gdb) p infoevt.info.history
$2 = {
  epoch_created = 1, 
  last_epoch_started = 29, 
  last_epoch_clean = 22, 
  last_epoch_split = 26, 
  same_up_since = 27, 
  same_interval_since = 29, 
  same_primary_since = 29, 
  last_scrub = {
    version = 0, 
    epoch = 0, 
    __pad = 0
  }, 
  last_scrub_stamp = {
    tv = {
      tv_sec = 1329305246, 
      tv_nsec = 275315000
    }
  }
}
(gdb) p pg->info.history
$3 = {
  epoch_created = 1, 
  last_epoch_started = 26, 
  last_epoch_clean = 22, 
  last_epoch_split = 26, 
  same_up_since = 30, 
  same_interval_since = 30, 
  same_primary_since = 29, 
  last_scrub = {
    version = 0, 
    epoch = 0, 
    __pad = 0
  }, 
  last_scrub_stamp = {
    tv = {
      tv_sec = 1329305246, 
      tv_nsec = 275315000
    }
  }
}
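
These values line up with the failing assert: infoevt.info.history.last_epoch_started is 29 while pg->info.history.same_interval_since is 30, and 29 >= 30 is false. What the epoch tag is meant to feed into is the primary's interval-staleness check on incoming peering messages; a minimal sketch of that idea, with hypothetical names and simplified epoch types (the real code path differs):

  #include <cstdint>

  // A peering message sent from an epoch older than the start of the primary's
  // current interval should be discarded rather than processed.
  bool message_is_stale(uint32_t msg_epoch, uint32_t same_interval_since) {
    return msg_epoch < same_interval_since;
  }

  // In this bug the replica tagged the message with the current map epoch (31),
  // so message_is_stale(31, 30) is false and the info was processed, tripping
  // the assert. Tagged with the activation epoch e == 29 instead, as the fix
  // does, message_is_stale(29, 30) is true and the message would simply have
  // been dropped.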

#2 Updated by Sage Weil over 8 years ago

Oh, I see the problem: the osdmap ref is taken by lock(), and this pg hasn't seen the new map yet.

We just need to tag the message with e, not the current osdmap epoch.
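
In terms of the sketch under the associated revisions above (hypothetical names, reusing the InfoMsg stand-in), the change is essentially one line:

  // e is the epoch argument to _activate_committed.
  void tag_info_msg(InfoMsg& m, uint32_t e) {
    // Buggy: m.epoch was set from the map reference taken in PG::lock()
    // (epoch 31 here), which this PG had not actually processed yet.
    m.epoch = e;   // fixed: tag with the activation epoch (29 here)
  }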

#3 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved
