Bug #9731

closed

Ceph 0.80.6 OSD crashes

Added by Brad House over 9 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I received 2 different crashes on 2 different OSDs on different nodes within 30 seconds of each other on 0.80.6. I just upgraded from 0.80.5, where I never saw these kinds of issues.


    2014-10-10 12:06:32.502375 7f4bb3a3f700  5 -- op tracker -- , seq: 9606, time: 2014-10-10 12:06:32.502375, event: waiting_for_osdmap, request: osd_sub_op_reply(client.4443832.0:6639 3.29 d5d4c329/rbd_data.440302ae8944a.0000000000001173/head//3 [] ondisk, result = 0) v2
         0> 2014-10-10 12:06:32.506607 7f4bae234700 -1 *** Caught signal (Segmentation fault) **
     in thread 7f4bae234700

     ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
     1: /usr/bin/ceph-osd() [0xab5882]
     2: (()+0xf030) [0x7f4bc613c030]
     3: (ceph::buffer::ptr::ptr(ceph::buffer::ptr const&)+0) [0xb94360]
     4: /usr/bin/ceph-osd() [0x7372d2]
     5: (ceph::buffer::list::list(ceph::buffer::list const&)+0x1e) [0x737e2e]
     6: (pg_log_entry_t::pg_log_entry_t(pg_log_entry_t const&)+0xba) [0x7b354a]
     7: /usr/bin/ceph-osd() [0x75196f]
     8: (PG::PGLogEntryHandler::trim(pg_log_entry_t const&)+0xc) [0x7b35ec]
     9: (PGLog::IndexedLog::advance_rollback_info_trimmed_to(eversion_t, PGLog::LogEntryHandler*)+0x83) [0x824d73]
     10: (PG::append_log(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, eversion_t, eversion_t, ObjectStore::Transaction&, bool)+0x833) [0x890bb3]
     11: (ReplicatedPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, boost::optional<pg_hit_set_history_t>&, eversion_t const&, eversion_t const&, bool, ObjectStore::Transaction*)+0xca) [0x96036a]
     12: (ReplicatedBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::tr1::shared_ptr<OpRequest>)+0x634) [0xa38294]
     13: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x33b) [0x8f194b]
     14: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xf77) [0x94c757]
     15: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x2471) [0x94f511]
     16: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x52b) [0x8e64ab]
     17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x7428c6]
     18: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1ea) [0x75f78a]
     19: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x79be4e]
     20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb81b1a]
     21: (ThreadPool::WorkThread::entry()+0x10) [0xb82d60]
     22: (()+0x6b50) [0x7f4bc6133b50]
     23: (clone()+0x6d) [0x7f4bc4d57e6d]

        -1> 2014-10-10 12:06:56.493260 7f1d47702700  5 osd.2 pg_epoch: 1752 pg[3.ee( v 1702'5694 (175'2693,1702'5694] local-les=1752 n=117 ec=56 les/c 1750/1750 1751/1751/1751) [2] r=0 lpr=1751 pi=1742-1750/4 crt=1702'5692 lcod 0'0 mlcod 0'0 degraded] enter Started/Primary/Active/Activating
         0> 2014-10-10 12:06:56.493182 7f1d47f03700 -1 *** Caught signal (Segmentation fault) **
     in thread 7f1d47f03700

     ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
     1: /usr/bin/ceph-osd() [0xab5882]
     2: (()+0xf030) [0x7f1d5fd05030]
     3: (PG::log_weirdness()+0x1b0) [0x865800]
     4: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >*, PG::RecoveryCtx*)+0x395) [0x89cf45]
     5: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x52e) [0x8a05ce]
     6: (boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Primary> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x5c) [0x8db1fc]
     7: (boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<PG::RecoveryState::Primary> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x16) [0x8dc596]
     8: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0x8d) [0x8dc66d]
     9: (boost::statechart::transition<PG::Activate, PG::RecoveryState::Active, boost::statechart::detail::no_context<PG::Activate>, &boost::statechart::detail::no_context<PG::Activate>::no_function>::reactions<PG::RecoveryState::Peering>::react_without_action(PG::RecoveryState::Peering&)+0x12) [0x8dc702]
     10: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::transition<PG::Activate, PG::RecoveryState::Active, boost::statechart::detail::no_context<PG::Activate>, &boost::statechart::detail::no_context<PG::Activate>::no_function>, boost::statechart::custom_reaction<PG::AdvMap> >, boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x7b) [0x8dc78b]
     11: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::transition<PG::Activate, PG::RecoveryState::Active, boost::statechart::detail::no_context<PG::Activate>, &boost::statechart::detail::no_context<PG::Activate>::no_function>, boost::statechart::custom_reaction<PG::AdvMap>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x57) [0x8dc7f7]
     12: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x21) [0x8dc871]
     13: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x137) [0x8bca17]
     14: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x26) [0x8bcff6]
     15: (PG::handle_activate_map(PG::RecoveryCtx*)+0xb4) [0x873304]
     16: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x77f) [0x775b1f]
     17: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x31c) [0x77635c]
     18: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x7d3494]
     19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb81b1a]
     20: (ThreadPool::WorkThread::entry()+0x10) [0xb82d60]
     21: (()+0x6b50) [0x7f1d5fcfcb50]
     22: (clone()+0x6d) [0x7f1d5e920e6d]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I know there have been other reported issues with assertions being triggered in 0.80.6 (#9696), but I haven't seen any with actual segmentation faults.


Files

ceph-osd.2.log.xz (16.9 MB) ceph-osd.2.log.xz log from osd crash with debugging Brad House, 10/10/2014 11:44 AM
core_ceph-osd.2990.xz (34 MB) core_ceph-osd.2990.xz core file from crash Brad House, 10/14/2014 07:13 AM
core_ceph-osd.2964.xz (41 MB) core_ceph-osd.2964.xz Brad House, 10/14/2014 07:32 AM
ceph-osd.0.log.xz (111 KB) ceph-osd.0.log.xz Brad House, 10/16/2014 06:47 AM
core_ceph-osd.2970.xz (29.8 MB) core_ceph-osd.2970.xz Brad House, 10/16/2014 06:47 AM
ceph-osd.1.log.xz (6.65 MB) ceph-osd.1.log.xz Brad House, 10/16/2014 12:48 PM
core_ceph-osd.2974.xz (42 MB) core_ceph-osd.2974.xz Brad House, 10/16/2014 01:49 PM
valgrind.supp (4.4 KB) valgrind.supp Samuel Just, 10/17/2014 02:45 PM
ceph-osd.1.log.xz (178 KB) ceph-osd.1.log.xz Brad House, 10/22/2014 08:01 AM
ceph-osd.2.log.xz (104 KB) ceph-osd.2.log.xz Brad House, 10/22/2014 08:16 AM
core_ceph-osd.2989.xz (50.4 MB) core_ceph-osd.2989.xz Brad House, 10/23/2014 10:19 AM
Actions #1

Updated by Ian Colle over 9 years ago

  • Assignee set to Samuel Just
Actions #2

Updated by Samuel Just over 9 years ago

Can you reproduce either of these with logging?

debug osd = 20
debug filestore = 20
debug ms = 1
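
For reference, these typically go in the [osd] section of /etc/ceph/ceph.conf on each node and are picked up on restart; a minimal sketch (the section placement is general Ceph convention, not something specific to this ticket):

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1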

Actions #3

Updated by Brad House over 9 years ago

I will add those to my configuration and restart ceph on each node.

Luckily this is just my test environment.

Actions #4

Updated by Brad House over 9 years ago

Attached ceph OSD log from crash with debugging turned on.

Actions #5

Updated by Samuel Just over 9 years ago

I think this is a bug in PGLog::IndexedLog::trim(). Making patch.

Actions #6

Updated by Samuel Just over 9 years ago

wip-firefly-9696-9731 should have a fix for this as well as 9696. Let me know whether that helps.

Actions #7

Updated by Wang Qiang over 9 years ago

I already saw the commit in the branch. The status of this issue should no longer be New.

Actions #8

Updated by Samuel Just over 9 years ago

  • Status changed from New to 7
Actions #9

Updated by Brad House over 9 years ago

The Debian Wheezy build server doesn't seem to be online yet, so I haven't been able to test your patch.

However, I did get a core file from one of the server crashes (871MB uncompressed)... attached.

Actions #10

Updated by Brad House over 9 years ago

And here is another core file from another server. The backtrace in the log looks like a different path to me.

Actions #11

Updated by Samuel Just over 9 years ago

We included the patch that I think should fix this in 0.80.7. Let me know what happens.

Actions #12

Updated by Brad House over 9 years ago

I just upgraded to 0.80.7 and got a crash on startup of one of my OSDs. I'll grab the log and core dump and attach them in a few.

Actions #13

Updated by Brad House over 9 years ago

0.80.7 segfault core file and log. Happened immediately at startup after rebooting after update.

Actions #14

Updated by Samuel Just over 9 years ago

Can you reproduce with
debug osd = 20
debug ms = 1
debug filestore = 20
?

Actions #15

Updated by Brad House over 9 years ago

backtrace from last core file:

Core was generated by `/usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd1366c4efb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fd1366c4efb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000ab5549 in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  operator++ (this=<optimized out>) at /usr/include/c++/4.7/bits/stl_list.h:236
#5  __distance<std::_List_const_iterator<pg_log_entry_t> > (__last=..., __first=...) at /usr/include/c++/4.7/bits/stl_iterator_base_funcs.h:82
#6  distance<std::_List_const_iterator<pg_log_entry_t> > (__last=..., __first=...) at /usr/include/c++/4.7/bits/stl_iterator_base_funcs.h:118
#7  size (this=<optimized out>) at /usr/include/c++/4.7/bits/stl_list.h:855
#8  PG::log_weirdness (this=this@entry=0x15bc4c00) at osd/PG.cc:2849
#9  0x000000000089cc05 in PG::activate (this=this@entry=0x15bc4c00, t=..., query_epoch=query_epoch@entry=2087, tfin=..., query_map=..., activator_map=activator_map@entry=0x0, ctx=ctx@entry=0x0) at osd/PG.cc:1491
#10 0x000000000089f9d3 in PG::RecoveryState::ReplicaActive::react (this=this@entry=0x1b7b5270, actevt=...) at osd/PG.cc:6641
#11 0x00000000008da006 in react<PG::RecoveryState::ReplicaActive, boost::statechart::event_base, void const*> (evt=..., stt=..., eventType=<optimized out>) at /usr/include/boost/statechart/custom_reaction.hpp:42
#12 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list1<boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0x1057b30)
    at /usr/include/boost/statechart/simple_state.hpp:816
#13 0x00000000008da087 in local_react<boost::mpl::list1<boost::statechart::custom_reaction<PG::Activate> > > (eventType=0x1057b30, evt=..., this=0x1b7b5270) at /usr/include/boost/statechart/simple_state.hpp:851
#14 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> > (stt=..., evt=..., 
    eventType=eventType@entry=0x1057b30) at /usr/include/boost/statechart/simple_state.hpp:820
#15 0x00000000008da137 in local_react<boost::mpl::list2<boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> > > (eventType=0x1057b30, evt=..., this=0x1b7b5270) at /usr/include/boost/statechart/simple_state.hpp:851
#16 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> >
    (stt=..., evt=..., eventType=eventType@entry=0x1057b30) at /usr/include/boost/statechart/simple_state.hpp:820
#17 0x00000000008da1e7 in local_react<boost::mpl::list3<boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> > > (eventType=0x1057b30, evt=..., this=0x1b7b5270)
    at /usr/include/boost/statechart/simple_state.hpp:851
#18 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0x1057b30) at /usr/include/boost/statechart/simple_state.hpp:820
#19 0x00000000008da297 in local_react<boost::mpl::list4<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> > > (
    eventType=0x1057b30, evt=..., this=0x1b7b5270) at /usr/include/boost/statechart/simple_state.hpp:851
#20 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0x1057b30) at /usr/include/boost/statechart/simple_state.hpp:820
#21 0x00000000008da347 in local_react<boost::mpl::list5<boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> > > (eventType=0x1057b30, evt=..., this=0x1b7b5270) at /usr/include/boost/statechart/simple_state.hpp:851
#22 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0x1057b30) at /usr/include/boost/statechart/simple_state.hpp:820
#23 0x00000000008da3c1 in local_react<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> > > (eventType=0x1057b30, evt=..., this=0x1b7b5270) at /usr/include/boost/statechart/simple_state.hpp:851
#24 boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl (this=0x1b7b5270, evt=..., eventType=0x1057b30)
    at /usr/include/boost/statechart/simple_state.hpp:489
#25 0x00000000008bc6d7 in operator() (this=<synthetic pointer>) at /usr/include/boost/statechart/state_machine.hpp:87
#26 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, const void*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (action=..., this=<optimized out>) at /usr/include/boost/statechart/null_exception_translator.hpp:33
#27 send_event (evt=..., this=0x15bc6558) at /usr/include/boost/statechart/state_machine.hpp:885
#28 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events (this=this@entry=0x15bc6558) at /usr/include/boost/statechart/state_machine.hpp:910
#29 0x00000000008bccb6 in boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event (this=this@entry=0x15bc6558, evt=...)
    at /usr/include/boost/statechart/state_machine.hpp:280
#30 0x00000000008bcd41 in PG::RecoveryState::handle_event (this=this@entry=0x15bc6558, evt=..., rctx=rctx@entry=0x7fd11ea53a30) at osd/PG.h:1980
#31 0x0000000000872a08 in PG::handle_peering_event (this=0x15bc4c00, evt=..., rctx=0x7fd11ea53a30) at osd/PG.cc:5269
#32 0x000000000077619c in OSD::process_peering_events (this=0x2ac4000, pgs=..., handle=...) at osd/OSD.cc:7842
#33 0x00000000007d31e4 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:1242
#34 0x0000000000b8173a in ThreadPool::worker (this=0x2ac4470, wt=0x2a38540) at common/WorkQueue.cc:128
#35 0x0000000000b82980 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#36 0x00007fd1366bcb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#37 0x00007fd1352e0e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#38 0x0000000000000000 in ?? ()
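
A backtrace like the one above can presumably be regenerated from the attached cores with something along these lines (paths are illustrative, and the ceph-osd binary must be the exact build that produced the core):

xz -d core_ceph-osd.2990.xz
gdb /usr/bin/ceph-osd core_ceph-osd.2990
(gdb) bt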

Actions #16

Updated by Brad House over 9 years ago

Another backtrace from a different machine, definitely different:

#0  0x00007ffda6702efb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000ab5549 in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  operator-- (this=<synthetic pointer>) at /usr/include/c++/4.7/bits/stl_list.h:171
#5  operator* (this=0x17ae5570) at /usr/include/c++/4.7/bits/stl_iterator.h:165
#6  operator-> (this=0x17ae5570) at /usr/include/c++/4.7/bits/stl_iterator.h:175
#7  PGLog::IndexedLog::advance_rollback_info_trimmed_to (this=0x17ae54b0, to=..., h=0x7ffd8ecc0550) at osd/PGLog.cc:38
#8  0x00000000008908e3 in PG::append_log (this=0x17ae4c00, logv=..., trim_to=..., trim_rollback_to=..., t=..., transaction_applied=true) at osd/PG.cc:2685
#9  0x000000000096010a in ReplicatedPG::log_operation (this=0x17ae4c00, logv=..., hset_history=..., trim_to=..., trim_rollback_to=..., transaction_applied=true, t=0x7ffd8ecc0940) at osd/ReplicatedPG.h:357
#10 0x0000000000a37ea4 in ReplicatedBackend::submit_transaction (this=0x169c72c0, soid=..., at_version=..., _t=<optimized out>, trim_to=..., trim_rollback_to=..., log_entries=..., hset_history=..., on_local_applied_sync=0xe444440, on_all_acked=0x1a4d6ca0, 
    on_all_commit=0x1a42d320, tid=3271, reqid=..., orig_op=...) at osd/ReplicatedBackend.cc:560
#11 0x00000000008f179b in ReplicatedPG::issue_repop (this=this@entry=0x17ae4c00, repop=repop@entry=0x1a5949c0, now=..., now@entry=...) at osd/ReplicatedPG.cc:6778
#12 0x000000000094c577 in ReplicatedPG::execute_ctx (this=this@entry=0x17ae4c00, ctx=ctx@entry=0x1b0b9200) at osd/ReplicatedPG.cc:1874
#13 0x000000000094f331 in ReplicatedPG::do_op (this=this@entry=0x17ae4c00, op=...) at osd/ReplicatedPG.cc:1556
#14 0x00000000008e62fb in ReplicatedPG::do_request (this=this@entry=0x17ae4c00, op=..., handle=...) at osd/ReplicatedPG.cc:1129
#15 0x0000000000742616 in OSD::dequeue_op (this=0x22ca000, pg=..., op=..., handle=...) at osd/OSD.cc:7779
#16 0x000000000075f4da in OSD::OpWQ::_process (this=0x22cae58, pg=..., handle=...) at osd/OSD.cc:7749
#17 0x000000000079bb9e in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x22cae58, handle=...) at ./common/WorkQueue.h:191
#18 0x0000000000b8173a in ThreadPool::worker (this=0x22ca470, wt=0x208d260) at common/WorkQueue.cc:128
#19 0x0000000000b82980 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#20 0x00007ffda66fab50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#21 0x00007ffda531ee6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x0000000000000000 in ?? ()

Actions #17

Updated by Brad House over 9 years ago

Another crash from another node, this time with debugging increased. Will attach the log; here is the backtrace from gdb:

#0  0x00007facb6f14efb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000ab5549 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x00007facb5a871a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007facb5a8a420 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007facb62dd89d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007facb62db996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007facb62db9c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007facb62dbc36 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000007516e6 in std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> >::_M_create_node (__x=..., this=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at /usr/include/c++/4.7/bits/stl_list.h:491
#11 0x00000000007b333c in _M_insert (__x=..., __position=..., this=<optimized out>) at /usr/include/c++/4.7/bits/stl_list.h:1525
#12 push_back (__x=..., this=0x7fac9da72598) at /usr/include/c++/4.7/bits/stl_list.h:997
#13 PG::PGLogEntryHandler::trim (this=0x7fac9da72550, entry=...) at ./osd/PG.h:866
#14 0x0000000000824ac3 in PGLog::IndexedLog::advance_rollback_info_trimmed_to (this=0x127c50b0, to=..., h=0x7fac9da72550) at osd/PGLog.cc:42
#15 0x00000000008908e3 in PG::append_log (this=0x127c4800, logv=..., trim_to=..., trim_rollback_to=..., t=..., transaction_applied=true) at osd/PG.cc:2685
#16 0x000000000096010a in ReplicatedPG::log_operation (this=0x127c4800, logv=..., hset_history=..., trim_to=..., trim_rollback_to=..., transaction_applied=true, t=0x7fac9da72940) at osd/ReplicatedPG.h:357
#17 0x0000000000a37ea4 in ReplicatedBackend::submit_transaction (this=0x1259d540, soid=..., at_version=..., _t=<optimized out>, trim_to=..., trim_rollback_to=..., log_entries=..., hset_history=..., on_local_applied_sync=0x19ed8f40, on_all_acked=0x1a02cf40, 
    on_all_commit=0x1a08c800, tid=40, reqid=..., orig_op=...) at osd/ReplicatedBackend.cc:560
#18 0x00000000008f179b in ReplicatedPG::issue_repop (this=this@entry=0x127c4800, repop=repop@entry=0x19ecdc80, now=..., now@entry=...) at osd/ReplicatedPG.cc:6778
#19 0x000000000094c577 in ReplicatedPG::execute_ctx (this=this@entry=0x127c4800, ctx=ctx@entry=0x1a12b200) at osd/ReplicatedPG.cc:1874
#20 0x000000000094f331 in ReplicatedPG::do_op (this=this@entry=0x127c4800, op=...) at osd/ReplicatedPG.cc:1556
#21 0x00000000008e62fb in ReplicatedPG::do_request (this=this@entry=0x127c4800, op=..., handle=...) at osd/ReplicatedPG.cc:1129
#22 0x0000000000742616 in OSD::dequeue_op (this=0x1e7c000, pg=..., op=..., handle=...) at osd/OSD.cc:7779
#23 0x000000000075f4da in OSD::OpWQ::_process (this=0x1e7ce58, pg=..., handle=...) at osd/OSD.cc:7749
#24 0x000000000079bb9e in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x1e7ce58, handle=...) at ./common/WorkQueue.h:191
#25 0x0000000000b8173a in ThreadPool::worker (this=0x1e7c470, wt=0x1e46000) at common/WorkQueue.cc:128
#26 0x0000000000b82980 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#27 0x00007facb6f0cb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#28 0x00007facb5b30e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#29 0x0000000000000000 in ?? ()

Actions #18

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to 12
Actions #19

Updated by Brad House over 9 years ago

core file for last crash as requested by Samuel Just

Actions #20

Updated by Samuel Just over 9 years ago

Would it be possible to run the osds in question under valgrind?

Actions #21

Updated by Brad House over 9 years ago

Sure, just tell me the best command line to use, as I haven't ever tried to run ceph-osd outside of the standard init script, and I'll start all the OSDs in the pool under valgrind on each machine and let it run. Also, do you need the mega debugging turned on for that? I don't want to slow it down too much in case it is something race-related.

Actions #22

Updated by Sage Weil over 9 years ago

Brad House wrote:

Sure, just tell me the best command line to use, as I haven't ever tried to run ceph-osd outside of the standard init script, and I'll start all the OSDs in the pool under valgrind on each machine and let it run. Also, do you need the mega debugging turned on for that? I don't want to slow it down too much in case it is something race-related.

valgrind --tool=memcheck ceph-osd -i NNN

should do it. It will go much slower, though; you may need to disable logging.

Actions #23

Updated by Samuel Just over 9 years ago

Sage also pushed a wip-9731 branch based on 0.80.7 with a piece of debugging that would be handy. Reproducing with that would also be helpful.

Actions #24

Updated by Brad House over 9 years ago

valgrind appears to detach from the console when running with ceph-osd; is there some other flag I need to pass to ceph-osd? Perhaps it is daemonizing. It does still appear to output messages, but they are all something about a syscall not being implemented by valgrind. I haven't done much testing with valgrind yet because of this.

Finally, regarding wip-9731, is there any chance the gitbuilder for wheezy will be back online for me to test that?

Actions #25

Updated by Samuel Just over 9 years ago

yeah, -f I think.
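
Combining that with the earlier command, a minimal foreground invocation would presumably be something like (osd id 0 is only an example; -f keeps ceph-osd from daemonizing so valgrind stays attached):

valgrind --tool=memcheck /usr/bin/ceph-osd -i 0 -f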

Actions #26

Updated by Samuel Just over 9 years ago

wheezy gitbuilder should be working.

Actions #27

Updated by Samuel Just over 9 years ago

Looks like in our testing we invoke valgrind as:

valgrind --suppressions=<suppression_file> --num-callers=50 --xml=yes --xml-file=<valgrind_output_file> --time-stamp=yes ceph-osd -i <osdid> -f

where <suppression_file> should be the valgrind.supp file I am attaching with this comment, and <valgrind_output_file> should be whatever path you want the output to go to. Probably /var/log/ceph/ceph-osd.<id>.valgrind or something.
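
Filled in for osd.0 with illustrative paths (the suppressions file can live wherever you saved the attachment; /etc/ceph/valgrind.supp is just an example), that would be something like:

valgrind --suppressions=/etc/ceph/valgrind.supp --num-callers=50 --xml=yes --xml-file=/var/log/ceph/ceph-osd.0.valgrind --time-stamp=yes ceph-osd -i 0 -f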

Actions #28

Updated by Samuel Just over 9 years ago

Oh, and the

--00:00:06:05.108 2312-- WARNING: unhandled syscall: 306
--00:00:06:05.108 2312-- You may be able to write your own handler.
--00:00:06:05.108 2312-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--00:00:06:05.108 2312-- Nevertheless we consider this a bug. Please report
--00:00:06:05.108 2312-- it at http://valgrind.org/support/bug_reports.html.

is harmless.

Actions #29

Updated by Samuel Just over 9 years ago

Just to check, there isn't anything interesting in dmesg, right?

Actions #30

Updated by Sage Weil over 9 years ago

Any update?

Actions #31

Updated by Brad House over 9 years ago

Sorry, I only have access during the week to the test system, and I'm out sick today. Hopefully I'll be able to continue testing tomorrow.

Actions #32

Updated by Sage Weil over 9 years ago

Brad House wrote:

Sorry, I only have access during the week to the test system, and I'm out sick today. Hopefully I'll be able to continue testing tomorrow.

Hi Brad, any chance you can give the branch a go today?

Actions #33

Updated by Samuel Just over 9 years ago

  • Priority changed from Urgent to Immediate
Actions #34

Updated by Brad House over 9 years ago

Sorry, still out sick today. Hoping to be in tomorrow.

Actions #35

Updated by Brad House over 9 years ago

I've got ceph updated to the wip-9731 branch and am attempting to start the OSDs under valgrind. However, the first one appears to be taking many, many minutes to join while running under valgrind.

Actions #36

Updated by Brad House over 9 years ago

After installing wip-9731 but before running under valgrind, I received a crash at 2014-10-22 10:44:42.326583; log attached.

Actions #37

Updated by Brad House over 9 years ago

The 3rd OSD won't join; it is now always aborting at startup. Log attached. Perhaps all the starting/stopping has caused corruption?

Actions #38

Updated by Samuel Just over 9 years ago

(Last I heard, 2/3 were running under valgrind and the cluster is healthy.)

Question: what version are the clients?

Actions #39

Updated by Brad House over 9 years ago

Right, I couldn't get 3/3 under valgrind to ever come up to good health, probably because of the load on it. However, I let the other 2 run under valgrind until 5 before I restarted them as normal. Nothing showed up under valgrind, but unfortunately that doesn't mean anything.

The clients are actually the same physical nodes as the ceph nodes, just using RBD through QEMU/KVM.

Actions #40

Updated by Samuel Just over 9 years ago

We probably want to let them run under valgrind overnight if possible.

Actions #41

Updated by Sage Weil over 9 years ago

So far, no luck replicating this with:

machine_type: plana,burnupi,mira
nuke-on-error: true
overrides:
  admin_socket: {branch: master}
  ceph:
    conf:
      global: {ms inject socket failures: 5000, osd heartbeat grace: 40}
      mon: {debug mon: 20, debug ms: 1, debug paxos: 20}
      osd: {debug refs: 10, debug filestore: 20, debug journal: 20, debug ms: 1, debug osd: 20, osd op thread timeout: 60,
        osd sloppy crc: true}
    fs: xfs
    log-whitelist: [slow request]
    valgrind:
      mds: [--tool=memcheck]
      mon: [--tool=memcheck, --leak-check=full, --show-reachable=yes]
      osd: [--tool=memcheck, --leak-check=full, --show-reachable=yes]
  install:
    ceph: {branch: wip-9731}
  rgw: {ec-data-pool: true}
roles:
- [mon.a, osd.0, osd.1, osd.2, client.0]
- [mds.a, osd.3, osd.4, osd.5, client.1]
suite: rgw
tasks:
- {install: null}
- {ceph: null}
- thrashosds:
- pexec:
    client.0:
      - rbd create foo --size 100
      - rbd bench-write foo --io-pattern rand --io-total 40960000
    client.1:
      - rbd create bar --size 100
      - rbd bench-write bar --io-pattern rand --io-total 40960000

Actions #42

Updated by Samuel Just over 9 years ago

We don't really want leak-check; it is likely slowing down the OSDs more than necessary.

Actions #43

Updated by Brad House over 9 years ago

Forgot to attach the latest core file from the crash prior to testing with valgrind while running wip-9731.

Actions #44

Updated by Brad House over 9 years ago

backtrace from last core

#0  0x00007ffc29da6efb in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000ab42f9 in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x00007ffc291cae2b in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000000000079347b in hobject_t::hobject_t (this=0x2382f8d8) at ./common/hobject.h:32
#6  0x0000000000846056 in pg_log_entry_t (this=0x2382f8d0) at osd/osd_types.h:1980
#7  construct (__val=..., __p=0x2382f8d0, this=<optimized out>) at /usr/include/c++/4.7/ext/new_allocator.h:120
#8  _M_create_node (__x=..., this=<optimized out>) at /usr/include/c++/4.7/bits/stl_list.h:485
#9  _M_insert (__x=..., __position=..., this=<optimized out>) at /usr/include/c++/4.7/bits/stl_list.h:1525
#10 push_front (__x=..., this=0x20127538) at /usr/include/c++/4.7/bits/stl_list.h:956
#11 pg_log_t::copy_up_to (this=0x201274f8, other=..., max=3000) at osd/osd_types.cc:2800
#12 0x000000000089c474 in PG::activate (this=this@entry=0x18882c00, t=..., query_epoch=query_epoch@entry=2983, tfin=..., query_map=..., activator_map=activator_map@entry=0x12644690, ctx=ctx@entry=0x188846f0) at osd/PG.cc:1560
#13 0x000000000089f4de in PG::RecoveryState::Active::Active (this=0x1c84f550, ctx=...) at osd/PG.cc:6374
#14 0x00000000008d9c5c in boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::shallow_construct (pContext=..., outermostContextBase=...)
    at /usr/include/boost/statechart/state.hpp:89
#15 0x00000000008daff6 in boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::deep_construct (pContext=..., outermostContextBase=...) at /usr/include/boost/statechart/state.hpp:79
#16 0x00000000008db0cd in construct (outermostContextBase=..., pContext=...) at /usr/include/boost/statechart/detail/constructor.hpp:93
#17 boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function> (this=0x1c093ae0, transitionAction=...) at /usr/include/boost/statechart/simple_state.hpp:798
#18 0x00000000008db162 in transit<PG::RecoveryState::Active> (this=<optimized out>) at /usr/include/boost/statechart/simple_state.hpp:314
#19 boost::statechart::transition<PG::Activate, PG::RecoveryState::Active, boost::statechart::detail::no_context<PG::Activate>, &boost::statechart::detail::no_context<PG::Activate>::no_function>::reactions<PG::RecoveryState::Peering>::react_without_action (stt=...)
    at /usr/include/boost/statechart/transition.hpp:38
#20 0x00000000008db1eb in react (stt=...) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:47
#21 react (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:92
#22 react (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/detail/reaction_dispatcher.hpp:109
#23 react<PG::RecoveryState::Peering, boost::statechart::event_base, void const*> (stt=..., evt=..., eventType=<optimized out>) at /usr/include/boost/statechart/transition.hpp:59
#24 boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list2<boost::statechart::transition<PG::Activate, PG::RecoveryState::Active, boost::statechart::detail::no_context<PG::Activate>, &boost::statechart::detail::no_context<PG::Activate>::no_function>, boost::statechart::custom_reaction<PG::AdvMap> >, boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0x10567b0) at /usr/include/boost/statechart/simple_state.hpp:816
#25 0x00000000008db257 in local_react<boost::mpl::list2<boost::statechart::transition<PG::Activate, PG::RecoveryState::Active>, boost::statechart::custom_reaction<PG::AdvMap> > > (eventType=0x10567b0, evt=..., this=0x1c093ae0)
    at /usr/include/boost/statechart/simple_state.hpp:851
#26 boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::transition<PG::Activate, PG::RecoveryState::Active, boost::statechart::detail::no_context<PG::Activate>, &boost::statechart::detail::no_context<PG::Activate>::no_function>, boost::statechart::custom_reaction<PG::AdvMap>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0x10567b0) at /usr/include/boost/statechart/simple_state.hpp:820
#27 0x00000000008db2d1 in local_react<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::transition<PG::Activate, PG::RecoveryState::Active>, boost::statechart::custom_reaction<PG::AdvMap> > > (eventType=0x10567b0, evt=..., this=0x1c093ae0)
    at /usr/include/boost/statechart/simple_state.hpp:851
#28 boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl (this=0x1c093ae0, evt=..., eventType=0x10567b0) at /usr/include/boost/statechart/simple_state.hpp:489
#29 0x00000000008bb417 in operator() (this=<synthetic pointer>) at /usr/include/boost/statechart/state_machine.hpp:87
#30 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, const void*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (action=..., this=<optimized out>) at /usr/include/boost/statechart/null_exception_translator.hpp:33
#31 send_event (evt=..., this=0x18884558) at /usr/include/boost/statechart/state_machine.hpp:885
#32 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events (this=this@entry=0x18884558) at /usr/include/boost/statechart/state_machine.hpp:910
#33 0x00000000008bb9f6 in boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event (this=this@entry=0x18884558, evt=...)
    at /usr/include/boost/statechart/state_machine.hpp:280
#34 0x00000000008726e4 in handle_event (rctx=0x7ffc11f95a30, evt=..., this=0x18884558) at osd/PG.h:1973
#35 PG::handle_activate_map (this=0x18882c00, rctx=0x7ffc11f95a30) at osd/PG.cc:5365
#36 0x0000000000774f2f in OSD::advance_pg (this=this@entry=0x24be000, osd_epoch=2983, pg=pg@entry=0x18882c00, handle=..., rctx=rctx@entry=0x7ffc11f95a30, new_pgs=new_pgs@entry=0x7ffc11f95a00) at osd/OSD.cc:5817
#37 0x000000000077576c in OSD::process_peering_events (this=0x24be000, pgs=..., handle=...) at osd/OSD.cc:7837
#38 0x00000000007d28a4 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:1242
#39 0x0000000000b804ea in ThreadPool::worker (this=0x24be470, wt=0x257bf80) at common/WorkQueue.cc:128
#40 0x0000000000b81730 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#41 0x00007ffc29d9eb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#42 0x00007ffc289c2e6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#43 0x0000000000000000 in ?? ()

Actions #45

Updated by Brad House over 9 years ago

Other details, as per sjustwork on IRC:

  • 3-node ceph cluster, 2 OSDs per node (1 SSD, 1 HDD). All SSDs are assigned to the ssd pool, and all HDDs are assigned to the platter pool. Currently only images exist on the ssd pool.
  • All OSDs run XFS.
  • Prior to starting to experience crashes in ceph, 'nobarrier' appears to have been mistakenly added to these nodes even though they do not use RAID w/BBU and SuperCAP-enabled SSDs. This has been removed since we started experiencing the crash. However, power outage simulations were run while using 'nobarrier'.
  • There are about 24 active (but mostly idle) test KVM/QEMU VMs running through Proxmox VE on this. The nodes double as VM hosts.
  • Activity seems to cause the crash; I've only seen it crash once when I wasn't here, so leaving it with an 'idle workload' overnight, without any IT staff playing with the VM guests sitting on top, doesn't tend to cause any issues.
  • We have experienced crashes on all 3 nodes; sometimes 2 nodes would fail within a minute or two of each other.
  • Yesterday, sjustwork had me move a file out of a ceph directory to get an OSD to start; we have not experienced a crash since.
  • This is a test lab, not production.
  • wip-9731 does not seem to be crashing as much; the last backtrace and core file are the only crash we've gotten from wip-9731.
Actions #46

Updated by Sage Weil over 9 years ago

  • Priority changed from Immediate to Urgent
Actions #47

Updated by Samuel Just over 9 years ago

Still no crashes under valgrind? How many OSDs are running under valgrind? We should probably leave it running for another night. Next, I've pushed a new version of wip-9731 with tcmalloc disabled. If you run the OSDs with MALLOC_CHECK_=3, we may get lucky and the glibc malloc may notice the problem (or crash earlier).
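
A sketch of what that could look like when starting a single OSD by hand in the foreground (the osd id is illustrative; MALLOC_CHECK_=3 makes glibc print a diagnostic and abort as soon as it detects heap corruption):

MALLOC_CHECK_=3 /usr/bin/ceph-osd -i 0 -f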

Actions #48

Updated by Brad House over 9 years ago

Nothing reported from valgrind. Also, I haven't seen crashes lately. At this point I'm thinking the issues were corruption-related. I haven't yet attempted MALLOC_CHECK_=3.

I want to test a bit today without running it in any special way, to see if I can get it to crash again, since if it was corruption-related and the corruption has been cleared up, I may not be able to reproduce anything.

Actions #49

Updated by Samuel Just over 9 years ago

Ok, let me know what happens.

Actions #50

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to Can't reproduce