Bug #8229

0.80~rc1: OSD crash (domino effect)

Added by Dmitry Smirnov almost 10 years ago. Updated over 9 years ago.

Status: Closed
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Situation: a degraded cluster, ~20% recovering/remapping after replacing some OSDs.
During recovery I rebooted two servers hosting MONs/OSDs, so the cluster temporarily had no copies of some (unfound) objects while those servers restarted.
The remaining machines had 6 OSDs up in total.
Some time later I noticed that all OSDs were down (crashed) due to some sort of domino effect.
Here are the log and backtrace from an OSD that was up during the restart of the other machines.
This OSD was expected to stay operational, but it crashed while the other servers/OSDs were restarting.

    -7> 2014-04-27 15:14:33.155598 7fd1503e2700 10 monclient: tick
    -6> 2014-04-27 15:14:33.155616 7fd1503e2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-04-27 15:14:03.155615)
    -5> 2014-04-27 15:14:33.155628 7fd1503e2700 10 monclient: renew subs? (now: 2014-04-27 15:14:33.155627; renew after: 2014-04-27 15:14:53.153509) -- no
    -4> 2014-04-27 15:14:33.164726 7fd1523e6700  1 -- 192.168.0.7:6802/4596 <== osd.0 192.168.0.250:0/5715 1009 ==== osd_ping(ping e22127 stamp 2014-04-27 15:14:33.166041) v2 ==== 47+0+0 (295349861 0 0) 0x7fd18f1d1500 con 0x7fd182c566e0
    -3> 2014-04-27 15:14:33.164751 7fd1523e6700  1 -- 192.168.0.7:6802/4596 --> 192.168.0.250:0/5715 -- osd_ping(ping_reply e22127 stamp 2014-04-27 15:14:33.166041) v2 -- ?+0 0x7fd18f44a000 con 0x7fd182c566e0
    -2> 2014-04-27 15:14:33.164825 7fd1533e8700  1 -- 192.168.0.7:6803/4596 <== osd.0 192.168.0.250:0/5715 1009 ==== osd_ping(ping e22127 stamp 2014-04-27 15:14:33.166041) v2 ==== 47+0+0 (295349861 0 0) 0x7fd1950e6bc0 con 0x7fd193891080
    -1> 2014-04-27 15:14:33.164843 7fd1533e8700  1 -- 192.168.0.7:6803/4596 --> 192.168.0.250:0/5715 -- osd_ping(ping_reply e22127 stamp 2014-04-27 15:14:33.166041) v2 -- ?+0 0x7fd1947cd340 con 0x7fd193891080
     0> 2014-04-27 15:14:33.188354 7fd14f3e0700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fd14f3e0700 time 2014-04-27 15:14:33.113457
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c50) [0x7fd16a4dad30]
 2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x7fd16a46035c]
 3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x482) [0x7fd16a49a352]
 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x7fd16a4ce594]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7fd16a4bb87b]
 6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1c0) [0x7fd16a46cbc0]
 7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28c) [0x7fd16a38d7dc]
 8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x7fd16a3d4a78]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fd16a78abb1]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7fd16a78baa0]
 11: (()+0x8062) [0x7fd16986a062]
 12: (clone()+0x6d) [0x7fd167d92a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
2014-04-27 15:14:33.188381 7fd14ebdf700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fd14ebdf700 time 2014-04-27 15:14:33.107709
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c50) [0x7fd16a4dad30]
 2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x7fd16a46035c]
 3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x482) [0x7fd16a49a352]
 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x7fd16a4ce594]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7fd16a4bb87b]
 6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1c0) [0x7fd16a46cbc0]
 7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28c) [0x7fd16a38d7dc]
 8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x7fd16a3d4a78]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fd16a78abb1]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7fd16a78baa0]
 11: (()+0x8062) [0x7fd16986a062]
 12: (clone()+0x6d) [0x7fd167d92a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -4> 2014-04-27 15:14:33.188381 7fd14ebdf700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fd14ebdf700 time 2014-04-27 15:14:33.107709
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c50) [0x7fd16a4dad30]
 2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x7fd16a46035c]
 3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x482) [0x7fd16a49a352]
 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x7fd16a4ce594]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7fd16a4bb87b]
 6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1c0) [0x7fd16a46cbc0]
 7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28c) [0x7fd16a38d7dc]
 8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x7fd16a3d4a78]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fd16a78abb1]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x7fd16a78baa0]
 11: (()+0x8062) [0x7fd16986a062]
 12: (clone()+0x6d) [0x7fd167d92a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -3> 2014-04-27 15:14:33.188407 7fd1523e6700  1 -- 192.168.0.7:6802/4596 <== osd.6 192.168.0.204:0/14861 3676 ==== osd_ping(ping e22127 stamp 2014-04-27 15:14:33.190958) v2 ==== 47+0+0 (1266741472 0 0) 0x7fd190931500 con 0x7fd18e9d3440
    -2> 2014-04-27 15:14:33.188407 7fd1533e8700  1 -- 192.168.0.7:6803/4596 <== osd.6 192.168.0.204:0/14861 3676 ==== osd_ping(ping e22127 stamp 2014-04-27 15:14:33.190958) v2 ==== 47+0+0 (1266741472 0 0) 0x7fd1738e1340 con 0x7fd18e9d3b20
    -1> 2014-04-27 15:14:33.188447 7fd1533e8700  1 -- 192.168.0.7:6803/4596 --> 192.168.0.204:0/14861 -- osd_ping(ping_reply e22127 stamp 2014-04-27 15:14:33.190958) v2 -- ?+0 0x7fd1950e6bc0 con 0x7fd18e9d3b20
     0> 2014-04-27 15:14:33.188518 7fd1523e6700  1 -- 192.168.0.7:6802/4596 --> 192.168.0.204:0/14861 -- osd_ping(ping_reply e22127 stamp 2014-04-27 15:14:33.190958) v2 -- ?+0 0x7fd18f1d1500 con 0x7fd18e9d3440
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---

From crash dump:

Core was generated by `/usr/bin/ceph-osd -i 9 --pid-file /var/run/ceph/osd.9.pid -c /etc/ceph/ceph.con'.                                                                                
Program terminated with signal 6, Aborted.                                                                                                                                              
#0  0x00007fd167ce23a9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0  0x00007fd167ce23a9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fd167ce54c8 in __GI_abort () at abort.c:89
#2  0x00007fd1685cf5e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fd1685cd746 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fd1685cd773 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fd1685cd9b2 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fd16a799cf2 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7fd16a8aed68 "log.head >= olog.tail && olog.head >= log.tail", 
    file=file@entry=0x7fd16a8ae56a "osd/PGLog.cc", line=line@entry=512, 
    func=func@entry=0x7fd16a8af140 <PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)::__PRETTY_FUNCTION__> "void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)") at common/assert.cc:77
#7  0x00007fd16a4dad30 in PGLog::merge_log (this=this@entry=0x7fd18868c818, t=..., oinfo=..., olog=..., fromosd=..., info=..., rollbacker=rollbacker@entry=0x7fd14ebde540, 
    dirty_info=@0x7fd18868c4a9: false, dirty_big_info=@0x7fd18868c4aa: false) at osd/PGLog.cc:512
#8  0x00007fd16a46035c in PG::merge_log (this=this@entry=0x7fd18868c000, t=..., oinfo=..., olog=..., from=...) at osd/PG.cc:370
#9  0x00007fd16a49a352 in PG::RecoveryState::Stray::react (this=this@entry=0x7fd16eb6c480, logevt=...) at osd/PG.cc:6584
#10 0x00007fd16a4ce594 in react<PG::RecoveryState::Stray, boost::statechart::event_base, void const*> (eventType=<synthetic pointer>, evt=..., stt=...)
    at /usr/include/boost/statechart/custom_reaction.hpp:42
#11 local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::RecoveryDone> >, boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (eventType=0x7fd16ac3a790 <boost::statechart::detail::id_holder<PG::MLogRec>::idProvider_>, evt=..., stt=...)
    at /usr/include/boost/statechart/simple_state.hpp:816
#12 local_react<boost::mpl::list4<boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::RecoveryDone> > > (eventType=0x7fd16ac3a790 <boost::statechart::detail::id_holder<PG::MLogRec>::idProvider_>, evt=..., this=0x7fd16eb6c480)
    at /usr/include/boost/statechart/simple_state.hpp:851
#13 local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::RecoveryDone> >, boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (eventType=0x7fd16ac3a790 <boost::statechart::detail::id_holder<PG::MLogRec>::idProvider_>, evt=..., 
    stt=...) at /usr/include/boost/statechart/simple_state.hpp:820
#14 local_react<boost::mpl::list<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::RecoveryDone> > > (
    eventType=0x7fd16ac3a790 <boost::statechart::detail::id_holder<PG::MLogRec>::idProvider_>, evt=..., this=0x7fd16eb6c480) at /usr/include/boost/statechart/simple_state.hpp:851
#15 boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl (
    this=0x7fd16eb6c480, evt=..., eventType=0x7fd16ac3a790 <boost::statechart::detail::id_holder<PG::MLogRec>::idProvider_>) at /usr/include/boost/statechart/simple_state.hpp:489
#16 0x00007fd16a4bb87b in operator() (this=<synthetic pointer>) at /usr/include/boost/statechart/state_machine.hpp:87
#17 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (this=0x7fd18868d988, action=...)
    at /usr/include/boost/statechart/null_exception_translator.hpp:33
#18 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event (
    this=0x7fd18868d930, evt=...) at /usr/include/boost/statechart/state_machine.hpp:889
#19 0x00007fd16a46cbc0 in process_event (evt=..., this=0x7fd18868d930) at /usr/include/boost/statechart/state_machine.hpp:275
#20 handle_event (rctx=0x7fd14ebdea60, evt=..., this=0x7fd18868d930) at osd/PG.h:1864
#21 PG::handle_peering_event (this=this@entry=0x7fd18868c000, evt=..., rctx=rctx@entry=0x7fd14ebdea60) at osd/PG.cc:5152
#22 0x00007fd16a38d7dc in OSD::process_peering_events (this=0x7fd16d599000, pgs=..., handle=...) at osd/OSD.cc:7722
#23 0x00007fd16a3d4a78 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=..., handle=...) at osd/OSD.h:1199
#24 0x00007fd16a78abb1 in ThreadPool::worker (this=0x7fd16d599470, wt=0x7fd175ddf620) at common/WorkQueue.cc:125
#25 0x00007fd16a78baa0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:317
#26 0x00007fd16986a062 in start_thread (arg=0x7fd14ebdf700) at pthread_create.c:312
#27 0x00007fd167d92a3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
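For context, the failed assertion in PGLog::merge_log requires that the local and the incoming PG log cover overlapping version ranges; the restarts described above can leave two peers with disjoint log ranges, which trips the assert and aborts the OSD. Below is a minimal illustrative sketch of that invariant (not Ceph code: plain (epoch, version) tuples stand in for eversion_t, and the function name is made up for this example):

```python
# Sketch of the invariant behind:
#   FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
# Each PG log covers a contiguous range of versions (tail, head].
# merge_log() can only reconcile two logs if their ranges overlap.
# (epoch, version) tuples model Ceph's eversion_t, which compares
# lexicographically by epoch, then version.

from collections import namedtuple

PGLog = namedtuple("PGLog", ["tail", "head"])  # (epoch, version) tuples

def logs_overlap(log, olog):
    """True iff the two log ranges share at least one version.

    Mirrors the asserted condition:
    log.head >= olog.tail && olog.head >= log.tail
    """
    return log.head >= olog.tail and olog.head >= log.tail

# Normal peering: the stray's log overlaps the authoritative log.
local = PGLog(tail=(22000, 100), head=(22127, 450))
other = PGLog(tail=(22100, 300), head=(22127, 460))
assert logs_overlap(local, other)

# After enough divergence, a peer can present a log whose range is
# entirely ahead of the local one -- disjoint ranges, so the real
# merge_log() would assert and the daemon would abort.
stale = PGLog(tail=(21000, 10), head=(21500, 90))
ahead = PGLog(tail=(22000, 200), head=(22127, 450))
assert not logs_overlap(stale, ahead)
```

In the real code the non-overlapping case cannot be merged entry-by-entry, so the daemon treats it as an unrecoverable inconsistency rather than guessing, which is why every OSD that peered with the divergent PG went down in turn.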


Files

kaa-101.log.gz (21.6 MB) kaa-101.log.gz Log of crashing OSD Markus Blank-Burian, 06/03/2014 03:42 PM
ceph.osd.2.logs (42.5 MB) ceph.osd.2.logs Sahana Lokeshappa, 07/22/2014 02:40 AM

Related issues 1 (0 open, 1 closed)

Copied to Ceph - Bug #8532: 0.80.1: OSD crash (domino effect), same as Bug #8229 (status: Can't reproduce, assignee: Samuel Just, 04/27/2014)

Actions #1

Updated by Ian Colle almost 10 years ago

  • Assignee set to David Zafman
  • Priority changed from Normal to High
Actions #2

Updated by Dmitry Smirnov almost 10 years ago

An interesting experiment: on the degraded cluster I took down one OSD and brought it up on another host.
Three other OSDs crashed as soon as the relocated OSD came up (replicated size 4, min_size 3):

    -6> 2014-04-29 00:27:51.410423 7fe3a0fbc700  5 -- op tracker -- , seq: 196687, time: 2014-04-29 00:27:51.410422, event: started, request: pg_notify(0.2b(194) epoch 23658) v5
    -5> 2014-04-29 00:27:51.411285 7fe3a6f87700  1 -- 192.168.0.7:6801/22454 --> osd.7 192.168.0.2:6812/4690 -- pg_trim(0.28 to 23473'457919 e23658) v1 -- ?+0 0x7fe3de021340
    -4> 2014-04-29 00:27:51.411302 7fe3a6f87700  1 -- 192.168.0.7:6801/22454 --> osd.7 192.168.0.2:6812/4690 -- pg_trim(0.28 to 23473'457919 e23658) v1 -- ?+0 0x7fe3de020a80
    -3> 2014-04-29 00:27:51.411497 7fe3a7788700  1 -- 192.168.0.7:6801/22454 --> osd.4 192.168.0.250:6812/7809 -- pg_info(1 pgs e23658:1.59) v4 -- ?+0 0x7fe3d952aa80
    -2> 2014-04-29 00:27:51.411579 7fe3a7788700  1 -- 192.168.0.7:6801/22454 --> osd.4 192.168.0.250:6812/7809 -- pg_info(1 pgs e23658:0.a) v4 -- ?+0 0x7fe3e3c8c8c0
    -1> 2014-04-29 00:27:51.413415 7fe3a7788700  1 -- 192.168.0.7:6801/22454 --> 192.168.0.2:6812/4690 -- MOSDPGPushReply(0.28 23658 [PushReplyOp(b7264fa8/100003380a3.00000000/head//0),PushReplyOp(3b664fa8/100003fa5f8.00000000/head//0)]) v2 -- ?+0 0x7fe3e37bdc00 con 0x7fe3d8c64420
     0> 2014-04-29 00:27:51.431896 7fe39afb0700 -1 osd/PG.h: In function 'void PG::MissingLoc::add_active_missing(const pg_missing_t&)' thread 7fe39afb0700 time 2014-04-29 00:27:51.411051
osd/PG.h: 378: FAILED assert(i->second.need == j->second.need)

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (PG::MissingLoc::add_active_missing(pg_missing_t const&)+0xb4) [0x7fe3b5ab2e74]
 2: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_
t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >*, PG::RecoveryCtx*)+0x1ffc) [0x7fe3b5a95b9c]
 3: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x334) [0x7fe3b5a966f4]
 4: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xa8) [0x7fe3b5ac8ab8]
 5: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x7fe3b5ac8f2a]
 6: (boost::statechart::simple_state<PG::RecoveryState::GetMissing, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc0) [0x7fe3b5ac96a0]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7fe3b5ab387b]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7fe3b5ab39f4]
 9: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x7fe3b5a64bd1]
 10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28c) [0x7fe3b59857dc]
 11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x7fe3b59cca78]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fe3b5d82bb1]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x7fe3b5d83aa0]
 14: (()+0x8062) [0x7fe3b4e62062]
 15: (clone()+0x6d) [0x7fe3b338aa3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
2014-04-29 00:27:51.530913 7fe39afb0700 -1 *** Caught signal (Aborted) **
 in thread 7fe39afb0700

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (()+0x5a5daf) [0x7fe3b5cb0daf]
 2: (()+0xf880) [0x7fe3b4e69880]
 3: (gsignal()+0x39) [0x7fe3b32da3a9]
 4: (abort()+0x148) [0x7fe3b32dd4c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe3b3bc75e5]
 6: (()+0x5e746) [0x7fe3b3bc5746]
 7: (()+0x5e773) [0x7fe3b3bc5773]
 8: (()+0x5e9b2) [0x7fe3b3bc59b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fe3b5d91cf2]
 10: (PG::MissingLoc::add_active_missing(pg_missing_t const&)+0xb4) [0x7fe3b5ab2e74]
 11: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_
interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >*, PG::RecoveryCtx*)+0x1ffc) [0x7fe3b5a95b9c]
 12: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x334) [0x7fe3b5a966f4]
 13: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xa8) [0x7fe3b5ac8ab8]
 14: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x7fe3b5ac8f2a]
 15: (boost::statechart::simple_state<PG::RecoveryState::GetMissing, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc0) [0x7fe3b5ac96a0]
 16: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7fe3b5ab387b]
 17: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7fe3b5ab39f4]
 18: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x7fe3b5a64bd1]
 19: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28c) [0x7fe3b59857dc]
 20: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x7fe3b59cca78]
 21: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fe3b5d82bb1]
 22: (ThreadPool::WorkThread::entry()+0x10) [0x7fe3b5d83aa0]
 23: (()+0x8062) [0x7fe3b4e62062]
 24: (clone()+0x6d) [0x7fe3b338aa3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2014-04-29 00:27:51.530913 7fe39afb0700 -1 *** Caught signal (Aborted) **
 in thread 7fe39afb0700

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (()+0x5a5daf) [0x7fe3b5cb0daf]
 2: (()+0xf880) [0x7fe3b4e69880]
 3: (gsignal()+0x39) [0x7fe3b32da3a9]
 4: (abort()+0x148) [0x7fe3b32dd4c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe3b3bc75e5]
 6: (()+0x5e746) [0x7fe3b3bc5746]
 7: (()+0x5e773) [0x7fe3b3bc5773]
 8: (()+0x5e9b2) [0x7fe3b3bc59b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7fe3b5d91cf2]
 10: (PG::MissingLoc::add_active_missing(pg_missing_t const&)+0xb4) [0x7fe3b5ab2e74]
 11: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_
interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >*, PG::RecoveryCtx*)+0x1ffc) [0x7fe3b5a95b9c]
 12: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x334) [0x7fe3b5a966f4]
 13: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xa8) [0x7fe3b5ac8ab8]
 14: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13a) [0x7fe3b5ac8f2a]
 15: (boost::statechart::simple_state<PG::RecoveryState::GetMissing, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc0) [0x7fe3b5ac96a0]
 16: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7fe3b5ab387b]
 17: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7fe3b5ab39f4]
 18: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1d1) [0x7fe3b5a64bd1]
 19: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28c) [0x7fe3b59857dc]
 20: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x7fe3b59cca78]
 21: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7fe3b5d82bb1]
 22: (ThreadPool::WorkThread::entry()+0x10) [0x7fe3b5d83aa0]
 23: (()+0x8062) [0x7fe3b4e62062]
 24: (clone()+0x6d) [0x7fe3b338aa3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---

This may be a different issue, but its manifestation is very similar, so I am reporting it here...

Actions #3

Updated by Samuel Just almost 10 years ago

  • Assignee changed from David Zafman to Samuel Just
  • Priority changed from High to Urgent

Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1

on all OSDs?
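For readers reproducing this, a hedged sketch of where those debug settings usually go in ceph.conf (section placement is the common convention; adjust to your deployment):

```ini
# Hypothetical ceph.conf fragment raising OSD logging for reproduction.
# Place under [osd] (or [global]) on every OSD host, then restart the OSDs.
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```

The same levels can also be injected at runtime without a restart, e.g. `ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'` (assuming an admin keyring is available).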

Actions #4

Updated by Dmitry Smirnov almost 10 years ago

Sorry, I can't reproduce this on a healthy cluster, and users won't forgive me for the potential downtime...
I think it could be reproduced by removing (taking out) one or two OSDs and then, during recovery, stopping another OSD and bringing it up on another server.

Actions #5

Updated by Samuel Just almost 10 years ago

What do you mean by "brought up on another server"?

Actions #6

Updated by Dmitry Smirnov almost 10 years ago

I mean stopped on "host1", physically moved to "host2", and started there. The CRUSH map will be updated automatically and the OSD will change its place in the "osd tree". Rebalancing will then move some data around...

Actions #7

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Can't reproduce
Actions #8

Updated by Markus Blank-Burian almost 10 years ago

I have the same issue as Dmitry with ceph 0.80.1. It happened after I restarted some groups of OSDs, each group simultaneously. There was also read/write activity via cephfs during the restarts. Now only ~20 of our 70 OSDs come up. If more OSDs are started, then either the new one or one of the already started OSDs crashes with the assert:
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

Restarting the mons did not help, nor did shutting down all OSDs and restarting them one by one. A complete log of one OSD at the requested debug levels is attached. Feel free to send me a patch for testing/debugging.
I hope you can help me bring the storage cluster up again.
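For readers following along, a minimal sketch (in Python, not Ceph source) of the invariant behind this assert: each PG log covers a contiguous range of versions from tail (oldest) to head (newest), and merging two logs is only possible when the two ranges overlap; versions are simplified to integers here.

```python
# Hedged illustration of the condition checked by the failed assert in
# PGLog::merge_log: log.head >= olog.tail && olog.head >= log.tail.
# Names and integer versions are simplifications for illustration only.

def logs_overlap(log_tail, log_head, olog_tail, olog_head):
    """Return True when the ranges [log_tail, log_head] and
    [olog_tail, olog_head] share at least one version, so the
    logs can be merged; False models the crashing condition."""
    return log_head >= olog_tail and olog_head >= log_tail

# Overlapping ranges can be merged:
assert logs_overlap(10, 20, 15, 30)
# Disjoint ranges model the state that trips the assert:
assert not logs_overlap(10, 20, 25, 30)
```

Intuitively, the crash reports above suggest some OSDs came up with logs so far apart (after the restarts and rebalancing) that no common version remained between them.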

Actions #9

Updated by Sahana Lokeshappa almost 10 years ago

I also got the same assert on one of the OSDs, after removing one OSD from each node in a Ceph cluster of 3 nodes (5 OSDs each) with 3 monitor nodes.

-6> 2014-07-21 19:33:26.690216 7f8b6ee64700 5 -- op tracker -- , seq: 1214638, time: 2014-07-21 19:33:26.690213, event: started, op: pg_backfill(progress 4.1a e 650/650 lb c291001a/rb.0.10e6.238e1f29.000000030470/head//4)
-5> 2014-07-21 19:33:26.690259 7f8b6ee64700 10 log is not dirty
-4> 2014-07-21 19:33:26.690261 7f8b6ee64700 5 filestore(/var/lib/ceph/osd/ceph-2) queue_transactions existing osr(4.1a 0x899f200)/0x899f200
-3> 2014-07-21 19:33:26.690270 7f8b6ee64700 5 filestore(/var/lib/ceph/osd/ceph-2) queue_transactions (writeahead) 785637 0x18d69980
-2> 2014-07-21 19:33:26.690279 7f8b6ee64700 10 osd.2 650 dequeue_op 0x3e72d00 finish
-1> 2014-07-21 19:33:26.690288 7f8b6ee64700 5 -- op tracker -- , seq: 1214638, time: 2014-07-21 19:33:26.690285, event: done, op: pg_backfill(progress 4.1a e 650/650 lb c291001a/rb.0.10e6.238e1f29.000000030470/head//4)
0> 2014-07-21 19:33:26.708142 7f8b6fe66700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f8b6fe66700 time 2014-07-21 19:33:26.544445
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c50) [0x7a2500]
2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x72981c]
3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x482) [0x762342]
4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x795ea4]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7832fb]
6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1c0) [0x735d40]
7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x294) [0x64afa4]
8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x69da98]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa4b351]
10: (ThreadPool::WorkThread::entry()+0x10) [0xa4c440]
11: (()+0x8182) [0x7f8b891ae182]
12: (clone()+0x6d) [0x7f8b8754f30d]

Attached logs of osd.2.

Actions #10

Updated by Samuel Just over 9 years ago

This bug describes a whole bunch of unrelated problems; can you open a fresh bug?

Actions #11

Updated by Dmitry Smirnov over 9 years ago

  • Status changed from Can't reproduce to Closed

Closing: nothing left to track here; I did not have this problem with 0.80.4.
