Bug #8863

closed

osd: second reservation rejection -> crash

Added by shaojun ruan almost 10 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
firefly
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I found that bug #7624 (http://tracker.ceph.com/issues/7642) resolved this problem, and the source code of OSDMonitor in 0.80.4 includes this fix.
But it still occurs...
-----------------------------------------------------------------------------------
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-osd() [0x98aa3a]
2: (()+0xfcb0) [0x7ffe1b62bcb0]
3: (gsignal()+0x35) [0x7ffe1a216445]
4: (abort()+0x17b) [0x7ffe1a219bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7ffe1ab6469d]
6: (()+0xb5846) [0x7ffe1ab62846]
7: (()+0xb5873) [0x7ffe1ab62873]
8: (()+0xb596e) [0x7ffe1ab6296e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0xa6adcf]
10: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xdf) [0x74706f]
11: /usr/bin/ceph-osd() [0x772de6]
12: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xc3) [0x79a6e3]
13: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc0) [0x79a840]
14: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc9) [0x79b3b9]
15: (boost::statechart::simple_state<PG::RecoveryState::RepRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x10b) [0x79109b]
16: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x77f8eb]
17: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x77fc41]
18: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x736e73]
19: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c6) [0x64d1f6]
20: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x69c782]
21: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa5b5e6]
22: (ThreadPool::WorkThread::entry()+0x10) [0xa5d3f0]
23: (()+0x7e9a) [0x7ffe1b623e9a]
24: (clone()+0x6d) [0x7ffe1a2d24bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.22.log
--- end dump of recent events ---


Files

ceph-osd.15.log.gz (252 KB) ceph-osd.15.log.gz shaojun ruan, 07/28/2014 12:37 AM
ceph-mon.a.log.gz (1.93 MB) ceph-mon.a.log.gz shaojun ruan, 07/28/2014 12:37 AM
ceph-osd.21.log._ (424 KB) ceph-osd.21.log._ Sahana Lokeshappa, 09/17/2014 10:52 PM
health (24.1 KB) health Sahana Lokeshappa, 09/18/2014 02:08 AM
Actions #1

Updated by shaojun ruan over 9 years ago

We've tracked this problem and captured some logs:
==================================================
core:
(gdb) p evt._M_ptr->evt.px
$14 = (PG::RequestBackfillPrio *) 0x1c2fb2a0
-----------------------------------------------
log:
-9> 2014-07-24 13:02:42.053071 7f5f185df700 5 osd.8 pg_epoch: 13577 pg[0.7fa( v 9961'14746 (312'11213,9961'14746] lb ddeca7fa/10000003c89.00000000/head//0 local-les=0 n=19 ec=1 les/c 11623/5096 11121/11622/11622) [1,8]/[10] r=-1 lpr=11622 pi=324-11621/124 luod=0'0 crt=9961'14746 lcod 0'0 active+remapped] exit Started/ReplicaActive/RepRecovering 58386.873226 3698 0.007132
-8> 2014-07-24 13:02:42.053296 7f5f185df700 5 osd.8 pg_epoch: 13577 pg[0.7fa( v 9961'14746 (312'11213,9961'14746] lb ddeca7fa/10000003c89.00000000/head//0 local-les=0 n=19 ec=1 les/c 11623/5096 11121/11622/11622) [1,8]/[10] r=-1 lpr=11622 pi=324-11621/124 luod=0'0 crt=9961'14746 lcod 0'0 active+remapped] exit Started/ReplicaActive 71673.667168 0 0.000000
-7> 2014-07-24 13:02:42.053321 7f5f185df700 5 osd.8 pg_epoch: 13577 pg[0.7fa( v 9961'14746 (312'11213,9961'14746] lb ddeca7fa/10000003c89.00000000/head//0 local-les=0 n=19 ec=1 les/c 11623/5096 11121/11622/11622) [1,8]/[10] r=-1 lpr=11622 pi=324-11621/124 luod=0'0 crt=9961'14746 lcod 0'0 active+remapped] exit Started 71674.754978 0 0.000000
-6> 2014-07-24 13:02:42.053340 7f5f185df700 5 osd.8 pg_epoch: 13577 pg[0.7fa( v 9961'14746 (312'11213,9961'14746] lb ddeca7fa/10000003c89.00000000/head//0 local-les=0 n=19 ec=1 les/c 11623/5096 11121/11622/11622) [1,8]/[10] r=-1 lpr=11622 pi=324-11621/124 luod=0'0 crt=9961'14746 lcod 0'0 active+remapped] enter Crashed

0> 2014-07-24 13:02:42.222904 7f5f185df700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f5f185df700 time 2014-07-24 13:02:42.061761

Actions #2

Updated by Sage Weil over 9 years ago

Can you attach the full osd log please? Thanks!

Also, are all mons upgraded to 0.80.4 at this point? And were tunables adjusted? This might be #8738.

Actions #3

Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Sage Weil over 9 years ago

  • Priority changed from Urgent to High
Actions #5

Updated by shaojun ruan over 9 years ago

Last week we created a new cluster (all components at v0.80.4) and continuously wrote data until space was full, then killed all OSDs in one OSD server to trigger cluster recovery. After almost 1 hour we saw some OSDs crash; we tried to restart these crashed OSDs, but some OSDs kept crashing again after a while.
Here is one monitor log and one crashed OSD log for reference. Thanks.

Actions #6

Updated by shaojun ruan over 9 years ago

Scenario:
1. 3-replica
2. space is nearly full (some OSDs >96%)

We guess the cause is an OSD receiving two "MBackfillReserve" messages in a row:
the first converts the "Backfilling" state to "NotBackfilling",
and the OSD then crashes when the second arrives because it is still in "NotBackfilling".
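
To make the suspected mechanism concrete, below is a minimal standalone boost::statechart sketch (a toy model against plain Boost, not Ceph's actual PG code; the state and event names merely mirror the real ones). Like PG::RecoveryState, the outer Started state carries a catch-all transition to Crashed, whose constructor asserts, so any event that no inner state handles is fatal:

#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <boost/statechart/transition.hpp>
#include <cassert>

namespace sc = boost::statechart;

struct RemoteReservationRejected : sc::event<RemoteReservationRejected> {};

struct Started;
struct Machine : sc::state_machine<Machine, Started> {};

struct Crashed;
struct Backfilling;
struct NotBackfilling;

// Outer state: as in PG::RecoveryState::Started, any event the inner
// states do not handle hits this catch-all transition into Crashed.
struct Started : sc::simple_state<Started, Machine, Backfilling> {
  typedef sc::transition<sc::event_base, Crashed> reactions;
};

struct Crashed : sc::simple_state<Crashed, Machine> {
  Crashed() { assert(0 == "we got a bad state machine event"); }
};

// The first rejection is handled: cancellations go out, the PG goes idle.
struct Backfilling : sc::simple_state<Backfilling, Started> {
  typedef sc::transition<RemoteReservationRejected, NotBackfilling> reactions;
};

// Pre-fix: no reaction for RemoteReservationRejected here, so a second
// rejection falls through to Started's catch-all and lands in Crashed.
struct NotBackfilling : sc::simple_state<NotBackfilling, Started> {};

int main() {
  Machine m;
  m.initiate();
  m.process_event(RemoteReservationRejected()); // Backfilling -> NotBackfilling
  m.process_event(RemoteReservationRejected()); // unhandled -> Crashed, assert fires
}

Note how the exit trace in the logs below ("exit ... NotBackfilling", "exit Started", "enter Crashed") matches this shape: the transition to Crashed leaves the whole Started hierarchy.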

gdb info:
(gdb) p evt._M_ptr->evt.px
$4 = (PG::RemoteReservationRejected *) 0x2bc42d80

stack info:
#0 0x00007faad75dcb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x000000000099650e in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:105
#3 <signal handler called>
#4 0x00007faad5b1e425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007faad5b21b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007faad647169d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007faad646f846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007faad646f873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007faad646f96e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000000000a778ff in ceph::__ceph_assert_fail (assertion=0xb88e28 "0 == \"we got a bad state machine event\"",
file=<optimized out>, line=5325,
func=0xb8e4c0 "PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)") at common/assert.cc:77
#11 0x00000000007583df in PG::RecoveryState::Crashed::Crashed (this=<optimized out>, ctx=...) at osd/PG.cc:5325
#12 0x00000000007863d6 in shallow_construct (outermostContextBase=..., pContext=<optimized out>)
at /usr/include/boost/statechart/state.hpp:89
#13 deep_construct (outermostContextBase=..., pContext=<optimized out>) at /usr/include/boost/statechart/state.hpp:79
#14 boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct (outermostContextBase=..., pContext=<optimized out>)
at /usr/include/boost/statechart/detail/constructor.hpp:93
#15 0x00000000007adca3 in transit_impl<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function> (this=0x19bff6e0, transitionAction=...) at /usr/include/boost/statechart/simple_state.hpp:798
#16 transit<PG::RecoveryState::Crashed> (this=0x19bff6e0) at /usr/include/boost/statechart/simple_state.hpp:314
#17 react_without_action (stt=...) at /usr/include/boost/statechart/transition.hpp:38
#18 react (stt=...) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:47
#19 react (stt=..., evt=...) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:68
#20 react (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:109
#21 react<PG::RecoveryState::Started, boost::statechart::event_base, void const*> (stt=..., evt=...,
eventType=<optimized out>) at /usr/include/boost/statechart/transition.hpp:59
#22 local_react_impl<boost::mpl::list1<boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=<optimized out>)
at /usr/include/boost/statechart/simple_state.hpp:816
#23 local_react<boost::mpl::list1<boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (this=0x19bff6e0, evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/simple_state.hpp:851
#24 boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=<optimized out>)
at /usr/include/boost/statechart/simple_state.hpp:820
#25 0x00000000007ade00 in local_react<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xf47c80, evt=..., this=0x19bff6e0)
at /usr/include/boost/statechart/simple_state.hpp:851
#26 local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> > (eventType=0xf47c80, evt=..., stt=...) at /usr/include/boost/statechart/simple_state.hpp:820
#27 local_react<boost::mpl::list3<boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xf47c80,
evt=..., this=0x19bff6e0) at /usr/include/boost/statechart/simple_state.hpp:851
#28 local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> > (eventType=0xf47c80, evt=..., stt=...)
at /usr/include/boost/statechart/simple_state.hpp:820
#29 local_react<boost::mpl::list4<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (this=0x19bff6e0, eventType=0xf47c80, evt=...)
at /usr/include/boost/statechart/simple_state.hpp:851
#30 local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> >, boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0> > (
eventType=0xf47c80, evt=..., stt=...) at /usr/include/boost/statechart/simple_state.hpp:820
#31 local_react<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xf47c80, evt=..., this=0x19bff6e0)
at /usr/include/boost/statechart/simple_state.hpp:851
#32 boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl (this=0x19bff6e0, evt=..., eventType=0xf47c80)
at /usr/include/boost/statechart/simple_state.hpp:489
#33 0x00000000007b300c in react_impl (eventType=0xf47c80, evt=..., this=0x244fc8c0)
at /usr/include/boost/statechart/simple_state.hpp:498
#34 boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl (this=0x244fc8c0, evt=..., eventType=0xf47c80)
at /usr/include/boost/statechart/simple_state.hpp:482
#35 0x00000000007b1a79 in react_impl (eventType=0xf47c80, evt=..., this=0x34811840)
at /usr/include/boost/statechart/simple_state.hpp:498
#36 boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl (this=0x34811840, evt=..., eventType=0xf47c80)
at /usr/include/boost/statechart/simple_state.hpp:482
#37 0x00000000007afe78 in react_impl (eventType=0xf47c80, evt=..., this=0x244fc910)
at /usr/include/boost/statechart/simple_state.hpp:498
#38 boost::statechart::simple_state<PG::RecoveryState::NotBackfilling, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl (
this=0x244fc910, evt=..., eventType=0xf47c80) at /usr/include/boost/statechart/simple_state.hpp:482
#39 0x00000000007933ab in operator() (this=<synthetic pointer>) at /usr/include/boost/statechart/state_machine.hpp:87
#40 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, const void*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (action=..., this=<optimized out>)
at /usr/include/boost/statechart/null_exception_translator.hpp:33
#41 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event (this=0x1afa5930, evt=...)
at /usr/include/boost/statechart/state_machine.hpp:885
#42 0x0000000000793701 in boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event (this=0x1afa5930, evt=...)
at /usr/include/boost/statechart/state_machine.hpp:275
#43 0x000000000074a263 in handle_event (rctx=0x7faac40aa8d0,
evt=<error reading variable: access outside bounds of object referenced via synthetic pointer>, this=0x1afa5930)
at osd/PG.h:1877
#44 PG::handle_peering_event (this=0x1afa4000, evt=..., rctx=0x7faac40aa8d0) at osd/PG.cc:5170
#45 0x0000000000661a76 in OSD::process_peering_events (this=0x30b2000, pgs=..., handle=...) at osd/OSD.cc:7770
#46 0x00000000006b0e62 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:1202
#47 0x0000000000a68116 in ThreadPool::worker (this=0x30b2470, wt=0x19694260) at common/WorkQueue.cc:125
#48 0x0000000000a69f20 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:317
#49 0x00007faad75d4e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#50 0x00007faad5bdc3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6

log info:
ceph-osd.6.log.1: -158> 2014-07-29 05:12:50.037217 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+wait_backfill] exit Started/Primary/Active/WaitRemoteBackfillReserved 15008.755999 3339 0.017972
ceph-osd.6.log.1: -157> 2014-07-29 05:12:50.037255 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+wait_backfill] enter Started/Primary/Active/Backfilling
ceph-osd.6.log.1: -54> 2014-07-29 05:12:51.242714 7faac38aa700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfilling] backfill_pos is 0//0//-1
ceph-osd.6.log.1: -46> 2014-07-29 05:12:51.243402 7faac48ac700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfilling+backfill_toofull] exit Started/Primary/Active/Backfilling 1.206147 1 0.000079
ceph-osd.6.log.1: -44> 2014-07-29 05:12:51.243452 7faac48ac700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfill_toofull] enter Started/Primary/Active/NotBackfilling
ceph-osd.6.log.1: -20> 2014-07-29 05:12:51.265706 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfill_toofull] exit Started/Primary/Active/NotBackfilling 0.022255 1 0.000096
ceph-osd.6.log.1: -17> 2014-07-29 05:12:51.265755 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfill_toofull] exit Started/Primary/Active 15015.343863 0 0.000000
ceph-osd.6.log.1: -16> 2014-07-29 05:12:51.265853 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped] exit Started/Primary 15018.581516 0 0.000000
ceph-osd.6.log.1: -15> 2014-07-29 05:12:51.265878 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped] exit Started 15018.581619 0 0.000000
ceph-osd.6.log.1: -14> 2014-07-29 05:12:51.265899 7faac40ab700 5 osd.6 pg_epoch: 3914 pg[0.862( v 166'25838 (166'22837,166'25838] local-les=2799 n=412 ec=1 les/c 2799/1506 1507/2797/2797) [8,20,10]/[6,15,19] r=0 lpr=2797 pi=549-2796/18 rops=1 bft=8,10,20 crt=0'0 lcod 0'0 mlcod 0'0 active+remapped] enter Crashed

Actions #7

Updated by shaojun ruan over 9 years ago

An OSD rejects the other OSD's backfill reservation request twice, probably because space is full; the requesting OSD then crashes.
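
For context on the rejection itself: a replica refuses a backfill reservation when its disk usage crosses the configured backfill-full threshold (osd_backfill_full_ratio, which defaults to 0.85 in this era), consistent with the ">96% full" scenario above. A hypothetical simplification of that predicate (the function name and accounting here are assumptions for illustration, not Ceph's actual code):

#include <cstdint>

// Hypothetical sketch: reject a backfill reservation once usage reaches
// the backfill-full ratio. With some OSDs >96% full, every reservation
// attempt is rejected, producing the repeated rejections seen above.
bool should_reject_backfill(uint64_t used_bytes, uint64_t total_bytes,
                            double backfill_full_ratio = 0.85) {
  return static_cast<double>(used_bytes) >=
         backfill_full_ratio * static_cast<double>(total_bytes);
}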

Actions #8

Updated by Sage Weil over 9 years ago

Try this:

ceph osd getcrushmap -o cm
ceph osd setcrushmap -i cm

and then see if you can reproduce it after that. If not, then this is (was) #8738.

Actions #9

Updated by shaojun ruan over 9 years ago

I used these commands to reimport the crushmap, but the OSD still crashes.

Actions #10

Updated by Samuel Just over 9 years ago

  • Priority changed from High to Urgent
Actions #11

Updated by Sage Weil over 9 years ago

  • Subject changed from some osds crash when recovery to osd: second reservation rejection -> crash
  • Category set to OSD
  • Status changed from Need More Info to Fix Under Review

see wip-8863

Actions #12

Updated by Sage Weil over 9 years ago

  • Target version deleted (0.84 cont.)
Actions #13

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #14

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved
  • Backport set to firefly
Actions #15

Updated by Sahana Lokeshappa over 9 years ago

I also got the above crash, when a few OSDs were in a nearfull situation.

Snippet of logs:

2014-09-17 17:29:41.696752 7f45ec1c3700 10 osd.21 3605 dequeue_op 0x3893100 finish
2014-09-17 17:29:41.698437 7f45f01cb700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f45f01cb700 time 2014-09-17 17:29:41.686110
osd/PG.cc: 5272: FAILED assert(0 == "we got a bad state machine event")

ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db]
2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x11b) [0x7c67fb]
3: /usr/bin/ceph-osd() [0x7f2536]
4: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x109) [0x81d059]
5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x161) [0x81ab41]
6: (boost::statechart::simple_state<PG::RecoveryState::RepRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xbc) [0x81130c]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x805d6b]
8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7b5b8e]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2b0) [0x6a3140]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6f83c8]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb68866]
12: (ThreadPool::WorkThread::entry()+0x10) [0xb69910]
13: (()+0x8182) [0x7f460ce64182]
14: (clone()+0x6d) [0x7f460b24a30d]
Actions #16

Updated by Sahana Lokeshappa over 9 years ago

Two OSDs were down and out due to that crash, and I was not able to start those OSDs again. So I removed those OSDs and added them back to the cluster. After that, my cluster is not coming back to a healthy state.

sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
health HEALTH_WARN 7 pgs backfilling; 69 pgs stuck unclean; recovery 6/3189311 objects degraded (0.000%); 7357/3189311 objects misplaced (0.231%)
monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2006, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
osdmap e9860: 64 osds: 64 up, 64 in
pgmap v45876: 2112 pgs, 2 pools, 4147 GB data, 1036 kobjects
12507 GB used, 10969 GB / 23476 GB avail
6/3189311 objects degraded (0.000%); 7357/3189311 objects misplaced (0.231%)
2041 active+clean
64 active+remapped
7 active+remapped+backfilling

Attached output of osd dump and health detail

Actions #17

Updated by Sage Weil over 9 years ago

Actions #18

Updated by Sahana Lokeshappa over 9 years ago

Yes Sage, it is included.

commit 2b13de16c522754e30a0a55fb9d072082dac455e
Author: Sage Weil <>
Date: Wed Aug 27 06:19:12 2014 -0700

osd/PG: fix crash from second backfill reservation rejection
If we get more than one reservation rejection we should ignore them; when
we got the first we already sent out cancellations. More importantly, we
should not crash.
Fixes: #8863
Backport: firefly, dumpling
Signed-off-by: Sage Weil <>
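
In state-machine terms, the fix's shape is to give NotBackfilling an explicit reaction that swallows further rejections instead of letting them fall through to the catch-all transition into Crashed. A sketch of that idea, continuing the toy machine from Actions #6 above (an illustration of the approach the commit message describes, not the literal patch):

#include <boost/statechart/custom_reaction.hpp>

// Toy-model version of the fix: NotBackfilling now reacts to
// RemoteReservationRejected itself, so a duplicate rejection never
// reaches Started's catch-all transition into Crashed.
struct NotBackfilling : sc::simple_state<NotBackfilling, Started> {
  typedef sc::custom_reaction<RemoteReservationRejected> reactions;
  sc::result react(const RemoteReservationRejected&) {
    // Cancellations already went out when we left Backfilling;
    // a repeated rejection is harmless, so just discard it.
    return discard_event();
  }
};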
Actions #19

Updated by Sahana Lokeshappa over 9 years ago

Hi Sage,

We are still getting this issue, even though the commit is included in our build. Any updates?
