Bug #9369

init: ceph-osd (...) main process (...) killed by ABRT signal

Added by Jamin Collins over 9 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All storage nodes are running the same (firefly) version:
$ ceph --version
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)

My cluster was originally set up with two storage nodes, with "min_size" set to 1 and "size" set to 2 for each of the pools. The monitor for the cluster runs on the more powerful of the two storage nodes.

Recently, I added a third storage node by setting the "size" on the main pool to 3 and adding the additional OSDs.
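For reference, the pool change amounted to something like the following (the pool name "rbd" here is only an example; the actual pool names aren't recorded in this report):

$ ceph osd pool set rbd size 3      # replicate the main pool across all three nodes
$ ceph osd pool get rbd min_size    # still 1 from the original two-node setup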

While the cluster has been recovering, I've noticed that OSDs on all three nodes are fairly frequently being marked down and automatically coming back up. Checking dmesg on each storage node shows messages like the following:

[108467.872884] init: ceph-osd (ceph/11) main process (5834) killed by ABRT signal
[108467.872900] init: ceph-osd (ceph/11) main process ended, respawning
[108469.984765] init: ceph-osd (ceph/10) main process (5904) killed by ABRT signal
[108469.984781] init: ceph-osd (ceph/10) main process ended, respawning
[108509.084961] init: ceph-osd (ceph/11) main process (6409) killed by ABRT signal
[108509.084979] init: ceph-osd (ceph/11) main process ended, respawning
[108528.288805] init: ceph-osd (ceph/10) main process (6481) killed by ABRT signal
[108528.288823] init: ceph-osd (ceph/10) main process ended, respawning
[108540.898544] init: ceph-osd (ceph/11) main process (7036) killed by ABRT signal
[108540.898561] init: ceph-osd (ceph/11) main process ended, respawning
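The flapping can also be confirmed from the monitor side with the standard status commands; roughly:

$ ceph -w          # cluster log shows the repeated "marked down" / "boot" events
$ ceph osd tree    # lists which OSDs are currently down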

This also appears to lead to rather rapid log file growth. While checking the log file, this entry appeared to be of interest:

2014-09-06 01:45:37.721783 7ff524517700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7ff524517700 time 2014-09-06 01:45:37.719467
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
 1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c50) [0x6e4cc0]
 2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x71fd2c]
 3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x482) [0x75a182]
 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x78d5c4]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x77a7fb]
 6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1c0) [0x72c530]
 7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x27e) [0x643a7e]
 8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x68aed8]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa43c61]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xa44b50]
 11: (()+0x8182) [0x7ff53d573182]
 12: (clone()+0x6d) [0x7ff53bce6fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I see similar reports here:
http://tracker.ceph.com/issues/8229
http://tracker.ceph.com/issues/2462

History

#1 Updated by Jamin Collins over 9 years ago

I've reverted the main pool "size" to 2 in an attempt to get the storage cluster back to a completely healthy state, but am still seeing OSDs go down periodically, along with output like the above in the ceph-osd logs.

#2 Updated by Jamin Collins over 9 years ago

If it would be of any use, I have a paired log file and core dump for one of the occurrences.
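If a symbolized backtrace from the core would be easier, I can pull one with something along these lines (the ceph-dbg package name assumes the Ubuntu/upstart packaging implied by the init messages above; the core path is a placeholder):

$ apt-get install ceph-dbg
$ gdb /usr/bin/ceph-osd /path/to/core
(gdb) thread apply all bt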

#3 Updated by Jamin Collins over 9 years ago

I eventually decided to take the problematic OSDs down and try reweighting them to 0, in an attempt to get the cluster to move on without them as much as possible without fully removing them. The hope was that I could then bring the problematic OSDs back online without them flapping, and the cluster would pull copies of the remaining PGs from them.

This appeared to work. After reweighting the OSDs to 0, the cluster began recovering as much as it could. Once recovery halted with only "down" PGs remaining, I brought the three problematic OSDs back online one by one. They remained stable and did not flap, and the cluster brought the "down" PGs up and continued recovery.
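The sequence was roughly the following (OSD id 11 is just an example taken from the dmesg output above; the upstart job names match those log lines):

$ stop ceph-osd id=11        # take the flapping OSD down
$ ceph osd reweight 11 0     # drain it so the cluster recovers around it
# ... wait for recovery to halt with only "down" PGs remaining ...
$ start ceph-osd id=11       # then bring the OSD back online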

Once all PGs were "active" again, I began checking my VMs that are backed by this Ceph cluster. At that point I found that nearly every one of them had suffered some form of extensive filesystem corruption. I'm now in the process of reverting to pre-Ceph backups of the VMs where possible.

#4 Updated by Samuel Just over 9 years ago

This looks like some kind of local fs corruption. What OS/filesystem are you using?

#5 Updated by Jamin Collins over 9 years ago

A mixture of ext3, ext4, and XFS. The RBD volume that completely lost its partition table was GPT-partitioned with an XFS filesystem.

#6 Updated by Samuel Just over 9 years ago

  • Status changed from New to Can't reproduce
