Bug #11016

After restarting an OSD, another OSD crashes with error "FAILED assert(log.head >= olog.tail && olog.head >= log.tail)"

Added by YingHsun Kao about 9 years ago. Updated almost 9 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a cluster with 195 OSDs configured across 9 different OSD nodes, originally running version 0.80.5.
After a datacenter issue, at least 5 OSD nodes rebooted within a very short timeframe. After the reboot, not all of the OSDs came back up, which triggered a lot of recovery, and many PGs went into a dead / incomplete state.

We then tried to restart the OSDs and found that they kept crashing with "FAILED assert(log.head >= olog.tail && olog.head >= log.tail)".

So we upgraded to 0.80.7, which includes the fix for #9482; however, we still see the error, with different behavior:
0.80.5: once an OSD crashes with this error, every attempt to restart it eventually ends in the same crash
0.80.7: the OSD can be restarted, but after some time another OSD crashes with the same error

So the cluster is stuck, and we cannot bring more OSDs back up.

An OSD log is attached; the activity right before the crash looks like peering:
2015-03-04 10:49:58.016561 7f85694fc700 0 -- 10.137.36.30:6826/29409 >> 10.137.36.32:6806/22359 pipe(0x178fd280 sd=24 :55830 s=1 pgs=26207 cs=2 l=0 c=0x179770c0).connect claims to be 10.137.36.32:6806/45099 not 10.137.36.32:6806/22359 - wrong node!
2015-03-04 10:49:58.021770 7f85694fc700 0 -- 10.137.36.30:6826/29409 >> 10.137.36.32:6806/22359 pipe(0x178fd280 sd=24 :55837 s=1 pgs=26207 cs=2 l=0 c=0x179770c0).connect claims to be 10.137.36.32:6806/45099 not 10.137.36.32:6806/22359 - wrong node!
2015-03-04 10:49:58.059801 7f857483f700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f857483f700 time 2015-03-04 10:49:58.057956
osd/PGLog.cc: 545: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1bd2) [0x6f1a12]
2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0xac) [0x72f69c]
3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x3a0) [0x76aab0]
4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x214) [0x7a4704]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x78f94b]
6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1df) [0x73cbcf]
7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2a4) [0x64a8f4]
8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x6941d8]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa77cb0]
10: (ThreadPool::WorkThread::entry()+0x10) [0xa78ba0]
11: (()+0x7df3) [0x7f858ce72df3]
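
For context, the assert at osd/PGLog.cc:545 checks that the two logs being merged overlap: the ranges [log.tail, log.head] and [olog.tail, olog.head] must share at least one version, otherwise there is no common entry from which to merge. Below is a minimal standalone sketch of that invariant; eversion_t, pg_log_sketch, and merge_log_sketch are simplified stand-ins for illustration, not the actual Ceph types:

    // Minimal sketch (not Ceph source) of the overlap invariant that
    // osd/PGLog.cc:545 asserts before merging two PG logs.
    #include <cassert>
    #include <iostream>
    #include <tuple>

    struct eversion_t {
        unsigned epoch;
        unsigned version;
        bool operator>=(const eversion_t& o) const {
            return std::tie(epoch, version) >= std::tie(o.epoch, o.version);
        }
    };

    struct pg_log_sketch {
        eversion_t tail;  // oldest entry still in the log
        eversion_t head;  // newest entry in the log
    };

    void merge_log_sketch(const pg_log_sketch& log, const pg_log_sketch& olog) {
        // Both conditions must hold, or the intervals [tail, head] are disjoint
        // and there is no common entry to anchor the merge.
        assert(log.head >= olog.tail && olog.head >= log.tail);
        std::cout << "logs overlap, merge can proceed\n";
    }

    int main() {
        pg_log_sketch log  = {{10, 5},  {12, 40}};  // local log spans 10'5 .. 12'40
        pg_log_sketch olog = {{12, 30}, {13, 2}};   // peer log spans 12'30 .. 13'2
        merge_log_sketch(log, olog);                // overlap: 12'30 .. 12'40

        // A peer whose whole log is ahead of ours, e.g. 14'0 .. 14'9 (as a lost
        // batch of acknowledged writes can produce), fails log.head >= olog.tail
        // and would hit the abort seen in the backtrace above.
        return 0;
    }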

ceph-osd.49.log.7z (770 KB) YingHsun Kao, 03/04/2015 08:27 AM

History

#1 Updated by YingHsun Kao about 9 years ago

#2 Updated by Samuel Just about 9 years ago

This just about always means that the filestore is inconsistent on the OSD. You want to verify that your storage stack under each OSD is honoring barriers correctly.
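
To make the barrier point concrete: FileStore depends on flush requests (fsync/fdatasync, which the kernel turns into device cache flushes) to make its journal records durable before the data writes they describe. The sketch below is a hypothetical illustration of that ordering contract, not Ceph code; if a RAID controller or disk acknowledges the flush but loses or reorders its cached writes on power loss, the on-disk PG log can end up non-contiguous with its peers', which is exactly the state the assert catches.

    // Hypothetical sketch of the ordering that "honoring barriers" must
    // guarantee: a journal record is durable before the data it describes.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    bool durable_append(int fd, const char* buf, size_t len) {
        if (write(fd, buf, len) != (ssize_t)len)
            return false;
        // fdatasync is the "barrier": the kernel asks the device to flush
        // its volatile cache. If the stack lies about this completing, the
        // journal entry may reach the platter AFTER later data writes.
        return fdatasync(fd) == 0;
    }

    int main() {
        int journal = open("journal.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);
        int data    = open("data.bin",    O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (journal < 0 || data < 0) { perror("open"); return 1; }

        const char rec[] = "log entry 12'40";
        // 1) the journal entry must be durable first ...
        if (!durable_append(journal, rec, sizeof(rec))) { perror("journal"); return 1; }

        // 2) ... only then may the data write proceed.
        const char payload[] = "object payload";
        if (write(data, payload, sizeof(payload)) != (ssize_t)sizeof(payload)) {
            perror("data"); return 1;
        }

        close(journal);
        close(data);
        return 0;
    }

Note that actually verifying barrier behavior requires power-cut testing against the real hardware; reviewing mount options and controller cache settings (e.g. write-back cache without a battery) is the usual starting point.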

#3 Updated by Samuel Just almost 9 years ago

  • Status changed from New to Can't reproduce
  • Regression set to No
