Bug #8584: OSD Crashing on firefly - Timeouts on starting again
Status: Closed
% Done: 0%
Source: Community (user)
Severity: 2 - major
Description
Besides the crashing of OSDs on firefly with the following error in their logfiles:
 0> 2014-06-11 13:49:11.814960 7f0798acf700 -1 *** Caught signal (Aborted) **
 in thread 7f0798acf700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: /usr/bin/ceph-osd() [0xaa0e82]
 2: (()+0xf030) [0x7f07af79e030]
 3: (gsignal()+0x35) [0x7f07ae0c1475]
 4: (abort()+0x180) [0x7f07ae0c46f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f07ae91789d]
 6: (()+0x63996) [0x7f07ae915996]
 7: (()+0x639c3) [0x7f07ae9159c3]
 8: (()+0x63bee) [0x7f07ae915bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb7948a]
 10: (PG::fulfill_info(pg_shard_t, pg_query_t const&, std::pair<pg_shard_t, pg_info_t>&)+0x5a) [0x86f30a]
 11: (PG::RecoveryState::Stray::react(PG::MQuery const&)+0xef) [0x88181f]
 12: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::RecoveryDone>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x86) [0x8bf856]
 13: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x21) [0x8bf8a1]
 14: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x8a67bb]
 15: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x19) [0x8a6849]
 16: (PG::RecoveryState::handle_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x31) [0x8a68e1]
 17: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x338) [0x85d8f8]
 18: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x404) [0x76e8f4]
 19: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x7cb334]
 20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb6b24a]
 21: (ThreadPool::WorkThread::entry()+0x10) [0xb6c4a0]
 22: (()+0x6b50) [0x7f07af795b50]
 23: (clone()+0x6d) [0x7f07ae16b0ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.1324.log
--- end dump of recent events ---
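As an aside, the raw frame addresses in the trace above can be resolved to source locations once the matching binary and its debug symbols are at hand. A minimal sketch (the trace line is copied from the dump above; the binary path is the usual Debian location and may differ on your install):

```shell
# Sketch: pull the bracketed return address out of one crash-trace frame so
# it can be fed to addr2line. The trace line is copied from the report above;
# the ceph-osd path below is an assumption for a Debian-style install.
TRACE_LINE='9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb7948a]'

# Extract the bracketed address with sed.
ADDR=$(printf '%s\n' "$TRACE_LINE" | sed -n 's/.*\[\(0x[0-9a-f]*\)\].*/\1/p')
echo "$ADDR"   # → 0xb7948a

# With the matching ceph-osd binary and debug symbols installed, map the
# address to a source file and line (not run here, needs the binary):
#   addr2line -Cfe /usr/bin/ceph-osd "$ADDR"
```

Note that the addresses are only meaningful against the exact build that crashed (here 0.80.1, a38fe116), hence the NOTE in the dump about needing a copy of the executable.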
I get a lot of timeouts when I try to start such an OSD again:
== osd.821 === failed: 'timeout 120 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.821 --keyring=/var/lib/ceph/osd/ceph-821/keyring osd crush create-or-move -- 821 3.64 host=csliveeubs-u01b03 root=default'
The overall state of our test cluster is pretty bad after we upgraded from Emperor to Firefly and set the CRUSH tunables to optimal:
2014-06-11 16:24:21.656959 mon.1 [INF] pgmap v435414: 131072 pgs: 4351 inactive, 498 down+remapped+peering, 17 active, 2 stale+degraded+remapped, 1 stale+down, 39 stale+active+degraded+remapped, 3602 active+clean, 288 stale+incomplete, 62765 peering, 41 stale+down+peering, 156 stale+remapped, 283 degraded+remapped, 89 stale+active+remapped, 12 down, 480 active+degraded+remapped, 5152 incomplete, 2 stale+active+clean+scrubbing+deep, 694 stale+remapped+peering, 1 stale+down+incomplete, 1316 down+peering, 7825 remapped, 8 stale+degraded, 1 active+clean+scrubbing, 1610 active+remapped, 229 stale+active+degraded, 1 stale+down+remapped, 16 active+clean+scrubbing+deep, 13 stale+remapped+incomplete, 35149 remapped+peering, 3 down+incomplete, 198 stale, 902 degraded, 8 stale+down+remapped+peering, 3089 active+degraded, 1 down+remapped, 222 stale+active+clean, 316 remapped+incomplete, 1692 stale+peering; 180 TB data, 694 TB used, 4271 TB / 4966 TB avail; 13333126/546076160 objects degraded (2.442%)
Updated by Samuel Just almost 10 years ago
- Assignee set to Samuel Just
Can you reproduce with the following debug settings?

debug osd = 20
debug filestore = 20
debug ms = 1
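For reference, these debug levels can be set in ceph.conf on the affected OSD host; a sketch, assuming the standard ceph.conf layout:

```ini
; /etc/ceph/ceph.conf on the affected OSD host
[osd]
debug osd = 20
debug filestore = 20
debug ms = 1
```

Restart the OSD afterwards, or inject the settings into a running daemon with `ceph tell osd.N injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`.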
Updated by Sage Weil almost 10 years ago
- Status changed from New to Duplicate
This looks like it was #8738.