Bug #8974

closed

osd crashed with merge_log assert due to removal of osds

Added by Sahana Lokeshappa over 9 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I also hit the same assert on one of the OSDs after removing one OSD from each node in a Ceph cluster of 3 OSD nodes (5 OSDs on each).

-6> 2014-07-21 19:33:26.690216 7f8b6ee64700 5 -- op tracker -- , seq: 1214638, time: 2014-07-21 19:33:26.690213, event: started, op: pg_backfill(progress 4.1a e 650/650 lb c291001a/rb.0.10e6.238e1f29.000000030470/head//4)
-5> 2014-07-21 19:33:26.690259 7f8b6ee64700 10 log is not dirty
-4> 2014-07-21 19:33:26.690261 7f8b6ee64700 5 filestore(/var/lib/ceph/osd/ceph-2) queue_transactions existing osr(4.1a 0x899f200)/0x899f200
-3> 2014-07-21 19:33:26.690270 7f8b6ee64700 5 filestore(/var/lib/ceph/osd/ceph-2) queue_transactions (writeahead) 785637 0x18d69980
-2> 2014-07-21 19:33:26.690279 7f8b6ee64700 10 osd.2 650 dequeue_op 0x3e72d00 finish
-1> 2014-07-21 19:33:26.690288 7f8b6ee64700 5 -- op tracker -- , seq: 1214638, time: 2014-07-21 19:33:26.690285, event: done, op: pg_backfill(progress 4.1a e 650/650 lb c291001a/rb.0.10e6.238e1f29.000000030470/head//4)
0> 2014-07-21 19:33:26.708142 7f8b6fe66700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f8b6fe66700 time 2014-07-21 19:33:26.544445
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c50) [0x7a2500]
2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x9c) [0x72981c]
3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x482) [0x762342]
4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x795ea4]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7832fb]
6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1c0) [0x735d40]
7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x294) [0x64afa4]
8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x69da98]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa4b351]
10: (ThreadPool::WorkThread::entry()+0x10) [0xa4c440]
11: (()+0x8182) [0x7f8b891ae182]
12: (clone()+0x6d) [0x7f8b8754f30d]
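
For readers unfamiliar with the failed assert, here is a minimal sketch of the condition it checks, written in Python rather than the C++ of osd/PGLog.cc. The names LogRange and logs_overlap are hypothetical, and the sketch assumes that log versions compare as (epoch, version) pairs with the epoch compared first. It only illustrates the check itself (the local log and the incoming log must overlap or at least touch); it does not explain why the two logs diverged on this cluster.

```python
from typing import NamedTuple, Tuple

# Assumption: a log version is an (epoch, version) pair, ordered by epoch
# first and version second. Python tuples compare exactly that way.
Eversion = Tuple[int, int]


class LogRange(NamedTuple):
    """Hypothetical stand-in for the tail/head bounds of a PG log."""
    tail: Eversion  # oldest version covered by the log
    head: Eversion  # newest version covered by the log


def logs_overlap(log: LogRange, olog: LogRange) -> bool:
    """Mirror of the asserted condition:
    log.head >= olog.tail && olog.head >= log.tail
    """
    return log.head >= olog.tail and olog.head >= log.tail


local = LogRange(tail=(600, 100), head=(650, 200))

# The incoming log starts inside the local range: the check passes.
overlapping = LogRange(tail=(640, 150), head=(650, 260))
print(logs_overlap(local, overlapping))  # True

# The incoming log starts after the local head: the check fails, which is
# the situation in which the assert would fire.
disjoint = LogRange(tail=(650, 300), head=(650, 400))
print(logs_overlap(local, disjoint))  # False
```

Two logs that merely touch (the incoming tail equal to the local head) still pass, because both comparisons are non-strict.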

Attached logs of osd.2.


Files

ceph.osd.2.logs (42.5 MB), Sahana Lokeshappa, 07/30/2014 01:46 PM
Actions #1

Updated by Samuel Just over 9 years ago

We can probably make some progress if you reproduce with

debug ms = 1
debug osd = 20
debug filestore = 20

on all osds and attach a tarball of all of the logs.
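
Collecting the logs into a single tarball could be scripted roughly as follows. This is only a sketch: bundle_osd_logs is a hypothetical helper, and the file name pattern ceph-osd.*.log* is an assumption about how the OSD logs are named on the node, so adjust both the directory and the pattern to match the actual setup. It would need to be run on each OSD node after reproducing the crash with the debug settings above.

```python
import tarfile
from pathlib import Path


def bundle_osd_logs(log_dir, out_path, pattern="ceph-osd.*.log*"):
    """Pack every regular file in log_dir matching pattern into a gzip tarball.

    Returns the sorted list of file names that were added.
    The default pattern is an assumption about the log naming scheme.
    """
    log_files = sorted(p for p in Path(log_dir).glob(pattern) if p.is_file())
    with tarfile.open(str(out_path), "w:gz") as tar:
        for path in log_files:
            # Store only the file name so the archive has no absolute paths.
            tar.add(str(path), arcname=path.name)
    return [p.name for p in log_files]
```

For example, bundle_osd_logs("/var/log/ceph", "/tmp/osd-logs.tar.gz") would archive the matching files, assuming the logs live under /var/log/ceph on that node.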

Actions #2

Updated by Samuel Just over 9 years ago

  • Status changed from New to Need More Info
Actions #3

Updated by Samuel Just over 9 years ago

  • Status changed from Need More Info to Can't reproduce