Bug #1958
osd: crash during peering due to receiving an info msg in WaitActingChange
% Done:
0%
Regression:
No
Severity:
3 - minor
Description
This happened during a teuthology run with thrashing and reads/writes/deletes.
Logs are in vit:~joshd/bug_1958
2012-01-20 15:18:26.757104 11c4f700 osd.1 91 pg[0.f( v 21'319 (21'118,21'319] n=4 ec=1 les/c 65/65 81/85/3) [1,2]/[1,2,0] r=0 lpr=85 bft=2 mlcod 0'0 active] log audit: log(21'118,21'319] handle_info 0.f( v 21'319 (21'119,21'319] n=4 ec=1 les/c 85/65 81/85/3) from osd.0
2012-01-20 15:18:26.759722 11c4f700 osd.1 91 pg[0.f( v 21'319 (21'118,21'319] n=4 ec=1 les/c 65/65 81/85/3) [1,2]/[1,2,0] r=0 lpr=85 bft=2 mlcod 0'0 active] log audit: log(21'118,21'319] exit Started/Primary/Peering/WaitActingChange 0.008219 1 0.011454
2012-01-20 15:18:26.760536 11c4f700 osd.1 91 pg[0.f( v 21'319 (21'118,21'319] n=4 ec=1 les/c 65/65 81/85/3) [1,2]/[1,2,0] r=0 lpr=85 bft=2 mlcod 0'0 active] log audit: log(21'118,21'319] exit Started/Primary 0.009844 0 0.000000
2012-01-20 15:18:26.761300 11c4f700 osd.1 91 pg[0.f( v 21'319 (21'118,21'319] n=4 ec=1 les/c 65/65 81/85/3) [1,2]/[1,2,0] r=0 lpr=85 bft=2 mlcod 0'0 active] log audit: log(21'118,21'319] exit Started 35.573870 0 0.000000
2012-01-20 15:18:26.764141 11c4f700 osd.1 91 pg[0.f( v 21'319 (21'118,21'319] n=4 ec=1 les/c 65/65 81/85/3) [1,2]/[1,2,0] r=0 lpr=85 bft=2 mlcod 0'0 active] log audit: log(21'118,21'319] enter Crashed
osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0u>::my_context)', in thread '11c4f700'
osd/PG.cc: 3744: FAILED assert(0 == "we got a bad state machine event")
ceph version 0.40-185-g75004db (commit:75004dbe4063baf8211b41e2da45d8bb7861e1f6)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xfd) [0x662b1d]
2: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x6a2236]
3: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd4) [0x6a2eb4]
4: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa1) [0x6a9631]
5: (boost::statechart::simple_state<PG::RecoveryState::WaitActingChange, PG::RecoveryState::Primary, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x66) [0x6a24d6]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x6a3e3b]
7: (PG::RecoveryState::handle_info(int, PG::Info&, PG::RecoveryCtx*)+0x157) [0x676c27]
8: (OSD::handle_pg_info(MOSDPGInfo*)+0x468) [0x55b308]
9: (OSD::_dispatch(Message*)+0x5fd) [0x56b1fd]
10: (OSD::ms_dispatch(Message*)+0x19f) [0x56c03f]
11: (SimpleMessenger::dispatch_entry()+0x883) [0x5b7863]
12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a31dc]
13: (()+0x7971) [0x4e35971]
14: (clone()+0x6d) [0x659d92d]
Associated revisions
osd: ignore MInfoRec, MNotifyRec in WaitActingChange
We should ignore logs, infos, and notifies while we are waiting for the
map to change. Peering has reached a dead-end (we need acting to change)
and we will redo our work when that happens. That includes the replicas
resending notifies.
Fixes: #1958
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
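The behaviour change described in the commit message can be illustrated with a small standalone model. This is only a sketch and not the Ceph code: the types Event and Outcome and the functions wait_acting_change_react, before_fix and after_fix are hypothetical names invented for illustration. Only the event names (MInfoRec, MNotifyRec, and log messages, called MLogRec here by assumption) and the state names come from this report. The model assumes that an event WaitActingChange does not handle is passed up to its outer states and ends in Crashed, which is the path frames 5 through 1 of the backtrace appear to show.

```cpp
#include <cassert>
#include <set>

// Hypothetical, simplified model; none of these types exist in Ceph.
// Event names follow the commit message; MLogRec is an assumed name for "logs".
enum class Event { MInfoRec, MNotifyRec, MLogRec };
enum class Outcome { Discarded, Crashed };

// Models WaitActingChange reacting to a peer message.
// Events in `ignored` are dropped in place. Anything else is assumed to
// propagate to the outer states (Primary, then Started), which enter Crashed.
Outcome wait_acting_change_react(const std::set<Event>& ignored, Event ev) {
    if (ignored.count(ev) > 0) {
        return Outcome::Discarded;
    }
    return Outcome::Crashed;
}

// Assumption: before the fix nothing was ignored. The report only confirms
// the MInfoRec case; the empty set is a simplification for the other events.
std::set<Event> before_fix() {
    return {};
}

// After the fix, per the commit message: logs, infos, and notifies are
// ignored while waiting for the acting set to change.
std::set<Event> after_fix() {
    return {Event::MInfoRec, Event::MNotifyRec, Event::MLogRec};
}

// Small demonstration of both behaviours.
void demo() {
    assert(wait_acting_change_react(before_fix(), Event::MInfoRec) == Outcome::Crashed);
    assert(wait_acting_change_react(after_fix(), Event::MInfoRec) == Outcome::Discarded);
    assert(wait_acting_change_react(after_fix(), Event::MNotifyRec) == Outcome::Discarded);
}
```

Dropping these messages is safe in this model for the reason the commit message gives: peering restarts once the acting set changes, and the replicas resend their notifies at that point.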
History
#1 Updated by Sage Weil almost 12 years ago
- Priority changed from Normal to High
#2 Updated by Sage Weil almost 12 years ago
- Status changed from New to 4
- Assignee set to Sage Weil
fix pushed to commit:2f6205e57c7b8a21da72f0af8f1edd38a5989149
#3 Updated by Sage Weil almost 12 years ago
- Status changed from 4 to Resolved