Project

General

Profile

Bug #3816

osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())

Added by Sage Weil over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
-
Start date:
01/16/2013
Due date:
% Done:

0%

Spent time:
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I had two of these systems running, but yesterday I wanted to add a third one.

So I had 8 OSDs (one per disk) running on 0.56.1 and I added one host bringing
the total to 12.

The cluster came into a degraded state (about 50%) and it started to recover
until it reached somewhere about 48%

In a manner of about 5 minutes all the original 8 OSDs had crashed with the same
backtrace:

    -1> 2013-01-15 17:20:29.058426 7f95a0fd8700 10 --
[2a00:f10:113:0:6051:e06c:df3:f374]:6803/4913 reaper done
     0> 2013-01-15 17:20:29.061054 7f959cfd0700 -1 osd/OSD.cc: In function 'void
OSD::do_waiters()' thread 7f959cfd0700 time 2013-01-15 17:20:29.057714
osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (OSD::do_waiters()+0x2c3) [0x6251f3]
 2: (OSD::ms_dispatch(Message*)+0x1c4) [0x62d714]
 3: (DispatchQueue::entry()+0x349) [0x8ba289]
 4: (DispatchQueue::DispatchThread::entry()+0xd) [0x8137cd]
 5: (()+0x7e9a) [0x7f95a95dae9a]
 6: (clone()+0x6d) [0x7f95a805ecbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

from Wido on ML

ceph-osd.0.log.1.gz (2.3 MB) Wido den Hollander, 01/16/2013 12:43 PM

ceph-osd.1.log.1.gz (2.29 MB) Wido den Hollander, 01/16/2013 12:43 PM

ceph-osd.2.log.1.gz (2.25 MB) Wido den Hollander, 01/16/2013 12:43 PM

ceph-osd.3.log.1.gz (2.27 MB) Wido den Hollander, 01/16/2013 12:43 PM

ceph-osd.0.log.gz - osd.0 log (6.85 MB) Wido den Hollander, 02/27/2013 09:56 AM

ceph-osd.3.log.gz - osd.3 log (9.43 MB) Wido den Hollander, 02/27/2013 09:56 AM

ceph-osd.0.log.1.gz - osd.0 crash 06-03-2013 (3.63 MB) Wido den Hollander, 03/07/2013 05:57 AM

ceph-osd.32.log.1.gz - osd.32 crash 06-03-2013 (3.03 MB) Wido den Hollander, 03/07/2013 05:57 AM

ceph-osd.26.log.1.gz - osd.26 crash 06-03-2013 (4.3 MB) Wido den Hollander, 03/07/2013 05:57 AM

ceph-osd-logs.tar.gz - OSD logs from March 15th (1.91 MB) Wido den Hollander, 03/15/2013 11:57 AM

ceph-osd-0-3.tar.gz - osd 0 - 3 March 15th (6.82 MB) Wido den Hollander, 03/15/2013 12:31 PM


Related issues

Related to Ceph - Bug #4556: OSDs crash with OSD::handle_op during recovery Resolved 03/26/2013

Associated revisions

Revision 98a76312 (diff)
Added by Sage Weil over 6 years ago

osd: leave osd_lock locked in shutdown()

No callers expect the lock to be dropped.

Fixes: #3816
Signed-off-by: Sage Weil <>

Revision 8c6f5221 (diff)
Added by Sage Weil over 6 years ago

osd: leave osd_lock locked in shutdown()

No callers expect the lock to be dropped.

Fixes: #3816
Signed-off-by: Sage Weil <>
(cherry picked from commit 98a763123240803741ac9f67846b8f405f1b005b)

History

#1 Updated by Sage Weil over 6 years ago

  • Subject changed from osd: to osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())

#2 Updated by Wido den Hollander over 6 years ago

Like requested on the mailinglist I'm attaching the logfiles from osd.0 to osd.3

There is indeed a osd_map logline just prior to crashing:

-5710> 2013-01-15 17:18:24.498263 7f5a10cec700  3 osd.0 97 handle_osd_map epochs [90,101], i have 97, src has [1,101]
-306> 2013-01-15 17:18:26.678317 7f5a084db700 10 -- [2a00:f10:113:0:6051:e06c:df3:f374]:6800/4644 >> [2a00:f10:113::1]:6789/0 pipe(0x64b2b40 sd=65 :44364 pgs=91 cs=1 l=1).reader got message 16 0xcf0c600 osd_map(90..101 src has 1..101) v3

Just 4 seconds later the OSD crashes. You can find this in the logs.

#3 Updated by Sage Weil over 6 years ago

  • Status changed from Verified to Need Review

-5678> 2013-01-15 17:18:24.509093 7f5a10cec700 1 accepter.accepter.rebind avoid 6812
-5677> 2013-01-15 17:18:24.509954 7f5a10cec700 -1 accepter.accepter.bind unable to bind to [2a00:f10:113:0:6051:e06c:df3:f374]:7100 on any port in range 6800-7100: Cannot assign requested address

is the trigger. you can configure those ranges via 'ms bind port min/max', btw. pushing a fix for this crash to wip-3816.

#4 Updated by Wido den Hollander over 6 years ago

Interesting, but where did this actually get from?

And why didn't it get triggered when I started the OSDs again? Just trying to understand what actually happened.

#5 Updated by Sage Weil over 6 years ago

  • Status changed from Need Review to Resolved

#6 Updated by Sage Weil over 6 years ago

98a763123240803741ac9f67846b8f405f1b005b

When the osd does a "mark myself back up" it takes care to rebind to ports it didn't just use. It may be that after it restarts there are just enough ports free that it doesn't retrigger, because you're right at the threshold...

#7 Updated by Wido den Hollander over 6 years ago

  • Status changed from Resolved to Testing

I just tried the fix on another cluster where I was seeing this, but still see OSDs crash:

     0> 2013-02-27 15:28:40.944067 7f85ea119700 -1 *** Caught signal (Aborted) **
 in thread 7f85ea119700

 ceph version 0.56.3-19-g8c6f522 (8c6f52215240f48b5e4d5bb99a5f2f451e7ce70a)
 1: /usr/bin/ceph-osd() [0x78b0ba]
 2: (()+0xfcb0) [0x7f85f60a9cb0]
 3: (gsignal()+0x35) [0x7f85f437d425]
 4: (abort()+0x17b) [0x7f85f4380b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f85f4ccf69d]
 6: (()+0xb5846) [0x7f85f4ccd846]
 7: (()+0xb5873) [0x7f85f4ccd873]
 8: (()+0xb596e) [0x7f85f4ccd96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8366ff]
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x62d688]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x635bc9]
 12: (OSD::_dispatch(Message*)+0x26e) [0x63e51e]
 13: (OSD::ms_dispatch(Message*)+0x1ba) [0x63ec8a]
 14: (DispatchQueue::entry()+0x349) [0x8c8aa9]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x821eed]
 16: (()+0x7e9a) [0x7f85f60a1e9a]
 17: (clone()+0x6d) [0x7f85f443acbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This is with the commit in the bobtail branch.

#8 Updated by Sage Weil over 6 years ago

Can you capture a bit of log leading up to the crash? debug osd = 0/20

#9 Updated by Sage Weil over 6 years ago

  • Status changed from Testing to Need More Info

#10 Updated by Wido den Hollander over 6 years ago

I've attached the log of osd.0 and osd.3 which both crashed again.

#11 Updated by Sage Weil over 6 years ago

Hi Wido-

The crash in those last 2 logs is different. I pushed backported patch to wip-3816-2...

The first patch adds a debug line that should print just prior to the crash. If you can run with just that patch and reproduce, that will let us confirm why you are hitting the assert (it seems strange that you would be submitting "bad" rados request... what is the workload?). The second patch backports a fix that's in master and (if the theory is correct) fixes it.

Thanks! Sorry I missed these logs earlier.

#12 Updated by Wido den Hollander over 6 years ago

I tried but it failed again.

The backtrace of osd.0 looks kind of similar, but not exactly the same:

    -3> 2013-03-06 19:18:58.478542 7f32e1723700  2 -- [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6813/9527 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/4393 pipe(0x5fa7280 sd=118 :6813 s=2 pgs=138 cs=1 l=1).fault 0: Success
    -2> 2013-03-06 19:19:01.819963 7f32e131f700  2 -- [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6813/9527 >> [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:0/15557 pipe(0x5fa7500 sd=120 :6813 s=2 pgs=112 cs=1 l=1).reader couldn't read tag, Success
    -1> 2013-03-06 19:19:01.820084 7f32e131f700  2 -- [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6813/9527 >> [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:0/15557 pipe(0x5fa7500 sd=120 :6813 s=2 pgs=112 cs=1 l=1).fault 0: Success
     0> 2013-03-06 19:19:02.550284 7f32f42e5700 -1 *** Caught signal (Aborted) **
 in thread 7f32f42e5700

 ceph version 0.56.3-24-g2106ae5 (2106ae57a214bbd3cdef4a18242a8825ecc3904b)
 1: /usr/bin/ceph-osd() [0x78b09a]
 2: (()+0xfcb0) [0x7f3301277cb0]
 3: (gsignal()+0x35) [0x7f32ff54b425]
 4: (abort()+0x17b) [0x7f32ff54eb8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f32ffe9d69d]
 6: (()+0xb5846) [0x7f32ffe9b846]
 7: (()+0xb5873) [0x7f32ffe9b873]
 8: (()+0xb596e) [0x7f32ffe9b96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8366df]
 10: (Pipe::register_pipe()+0x2f0) [0x8dce40]
 11: (SimpleMessenger::connect_rank(entity_addr_t const&, int, Connection*, Message*)+0x187) [0x81ee57]
 12: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0xd03) [0x820573]
 13: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0xaa) [0x820a5a]
 14: (OSD::send_incremental_map(unsigned int, Connection*)+0x77) [0x61f177]
 15: (OSD::require_same_or_newer_map(std::tr1::shared_ptr<OpRequest>, unsigned int)+0x5a8) [0x61fc58]
 16: (OSD::handle_pg_query(std::tr1::shared_ptr<OpRequest>)+0x235) [0x630265]
 17: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x190) [0x635d60]
 18: (OSD::_dispatch(Message*)+0x26e) [0x63e60e]
 19: (OSD::ms_dispatch(Message*)+0x1ba) [0x63ed7a]
 20: (DispatchQueue::entry()+0x349) [0x8c8a89]
 21: (DispatchQueue::DispatchThread::entry()+0xd) [0x821ecd]
 22: (()+0x7e9a) [0x7f330126fe9a]
 23: (clone()+0x6d) [0x7f32ff608cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The logs of osd.0 are attached.

I have a bit more OSDs which stayed alive, but the health is far from OK:

root@monitor:~# ceph -s
   health HEALTH_ERR 6617 pgs down; 24 pgs incomplete; 1 pgs inconsistent; 6641 pgs peering; 51 pgs stale; 6665 pgs stuck inactive; 51 pgs stuck stale; 6678 pgs stuck unclean; 2 scrub errors
   monmap e3: 1 mons at {pri=[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0}, election epoch 1, quorum 0 pri
   osdmap e13374: 40 osds: 23 up, 23 in
    pgmap v2552573: 7744 pgs: 10 active, 1065 active+clean, 3 stale+active, 15 peering, 5818 down+peering, 6 stale+peering, 18 stale+down+peering, 3 remapped+peering, 781 down+remapped+peering, 1 active+clean+inconsistent, 24 stale+incomplete; 45348 MB data, 164 GB used, 40938 GB / 42849 GB avail
   mdsmap e1: 0/0/1 up

root@monitor:~#

Only osd.0 gave a similar backtrace when crashing. Other crashes I saw were:

' thread 7f5b1fdfa700 time 2013-03-06 19:23:39.348234
osd/PG.cc: 5193: FAILED assert(0 == "we got a bad state machine event")

 ceph version 0.56.3-24-g2106ae5 (2106ae57a214bbd3cdef4a18242a8825ecc3904b)
 1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xab) [0x69106b]
 2: /usr/bin/ceph-osd() [0x6bc436]
 3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb3) [0x6da603]
 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb4) [0x6da754]
 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x4e) [0x6da88e]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6c2e7b]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6c3151]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x683b87]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c8) [0x62e228]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x10) [0x663d60]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x82bb06]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x82d930]
 13: (()+0x7e9a) [0x7f5b3098fe9a]
 14: (clone()+0x6d) [0x7f5b2ed28cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The bad state machine again? #3467 ?

The log for osd.32 is also attached, but debugging wasn't that high.

I saw the same backtrace on osd.26, but that log also showed:

    -1> 2013-03-06 19:19:57.756479 7f3b24c02700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3b163e5700' had timed out after 15
     0> 2013-03-06 19:19:57.825239 7f3b24c02700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3b16be6700' had timed out after 15

I don't see any disk issues, but I'm using btrfs with the 3.8 kernel. Might be related? Log of osd.26 is also attached.

I hope this helps, I know it's a lot of things all at once...

#13 Updated by Sage Weil over 6 years ago

were you running with both or just 1 of those patches? I fyou could try with just the first patch that would be great, since it'd tell us how the original crash triggered.

#14 Updated by Sage Weil over 6 years ago

  • Priority changed from Urgent to High

#15 Updated by Wido den Hollander over 6 years ago

I'm trying without the second patch, only with: "osd: print op prior to calling methods to check for caps"

Some disks in the OSDs are giving troubles so I'm seeing all kinds of crashes now which seem to relate to the failing disks.

#16 Updated by Wido den Hollander over 6 years ago

So, running with a1f85e8344849dcb1bb6ab5eb18fdc6d30b9b524 the cluster won't recover at all.

It seems I'm still seeing something similar to #3467. That one is closed, but the backtraces I'm seeing are:

     0> 2013-03-15 19:42:11.003853 7ffd8a38c700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7ffd8a38c700 time 2013-03-15 19:42:10.971521
osd/PG.cc: 5193: FAILED assert(0 == "we got a bad state machine event")

 ceph version 0.56.3-23-ga1f85e8 (a1f85e8344849dcb1bb6ab5eb18fdc6d30b9b524)
 1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xab) [0x690ffb]
 2: /usr/bin/ceph-osd() [0x6bc2d6]
 3: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb3) [0x6da6f3]
 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &(boost::statechart::detail::no_context<boost::statechart::event_base>::no_function(boost::statechart::event_base const&))> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xb4) [0x6da844]
 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x4e) [0x6da96e]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6c2ccb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6c2fa1]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x683b47]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c8) [0x62e138]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x10) [0x663c70]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x82bc36]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x82da60]
 13: (()+0x7e9a) [0x7ffd9ab1de9a]
 14: (clone()+0x6d) [0x7ffd98eb6cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I tarred up a bunch of logs since about 50% of the OSDs went down with the same backtrace.

One note, the OSDs running with debug on 20 didn't crash this time, so I don't have that much information on them.

The original backtrace from this issue isn't coming back at this moment, so it could well be that we're seeing a different crash now.

#17 Updated by Wido den Hollander over 6 years ago

After some time even more OSDs crashed with another backtrace already seen.

This time it were the OSDs running with debug on 20

They (osd.0 - osd.3) all had the same backtrace:

    -1> 2013-03-15 19:56:43.728981 7f2b910de700 10 osd.3 pg_epoch: 17791 pg[1.850( empty local-les=0 n=0 ec=1 les/c 6940/7092 17788/17788/17788) [3] r=0 lpr=17788 pi=6785-17787/135 mlcod 0'0 peering]  Prio
rSet: build_prior interval(7903-7904 [13,18,38]/[13,18,38])
     0> 2013-03-15 19:56:43.709858 7f2b960e8700 -1 *** Caught signal (Aborted) **
 in thread 7f2b960e8700

 ceph version 0.56.3-23-ga1f85e8 (a1f85e8344849dcb1bb6ab5eb18fdc6d30b9b524)
 1: /usr/bin/ceph-osd() [0x78b1ca]
 2: (()+0xfcb0) [0x7f2ba2078cb0]
 3: (gsignal()+0x35) [0x7f2ba034c425]
 4: (abort()+0x17b) [0x7f2ba034fb8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2ba0c9e69d]
 6: (()+0xb5846) [0x7f2ba0c9c846]
 7: (()+0xb5873) [0x7f2ba0c9c873]
 8: (()+0xb596e) [0x7f2ba0c9c96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x83680f]
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x62d688]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x635bc9]
 12: (OSD::_dispatch(Message*)+0x26e) [0x63e51e]
 13: (OSD::ms_dispatch(Message*)+0x1ba) [0x63ec8a]
 14: (DispatchQueue::entry()+0x349) [0x8c8bb9]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x821ffd]
 16: (()+0x7e9a) [0x7f2ba2070e9a]
 17: (clone()+0x6d) [0x7f2ba0409cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Their logs are attached as well.

The current cluster state:

osdmap e18884: 40 osds: 11 up, 12 in

#18 Updated by Sage Weil over 6 years ago

  • Status changed from Need More Info to Resolved

This second set of crashes in handle_op is #4556, now fixed.

The osd_lock issue has been fixed earlier as well. There is a backtrace in here with the state machine assert, but i think we should see if that comes up again so we can make sense of the situation.

Also available in: Atom PDF