Bug #11016
After restarting an OSD, another OSD crashes with "FAILED assert(log.head >= olog.tail && olog.head >= log.tail)"
Description
We have a cluster with 195 OSDs on 9 OSD nodes, originally running version 0.80.5.
After a datacenter issue, at least 5 OSD nodes rebooted within a very short timeframe. After the reboot, not all OSDs came back up, which triggered a lot of recovery, and many PGs went into dead/incomplete state.
When we then tried to restart the OSDs, we found they kept crashing with "FAILED assert(log.head >= olog.tail && olog.head >= log.tail)".
So we upgraded to 0.80.7, which includes the fix for #9482; however, we still see the error, with different behavior:
0.80.5: once an OSD crashes with this error, every attempt to restart it ends in the same crash
0.80.7: the OSD can be restarted, but after some time another OSD crashes with this error
So the cluster is stuck and we cannot bring more OSDs back up.
Some OSD logs are attached; the activity before the crash looks like peering:
2015-03-04 10:49:58.016561 7f85694fc700 0 -- 10.137.36.30:6826/29409 >> 10.137.36.32:6806/22359 pipe(0x178fd280 sd=24 :55830 s=1 pgs=26207 cs=2 l=0 c=0x179770c0).connect claims to be 10.137.36.32:6806/45099 not 10.137.36.32:6806/22359 - wrong node!
2015-03-04 10:49:58.021770 7f85694fc700 0 -- 10.137.36.30:6826/29409 >> 10.137.36.32:6806/22359 pipe(0x178fd280 sd=24 :55837 s=1 pgs=26207 cs=2 l=0 c=0x179770c0).connect claims to be 10.137.36.32:6806/45099 not 10.137.36.32:6806/22359 - wrong node!
2015-03-04 10:49:58.059801 7f857483f700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f857483f700 time 2015-03-04 10:49:58.057956
osd/PGLog.cc: 545: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1bd2) [0x6f1a12]
2: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0xac) [0x72f69c]
3: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x3a0) [0x76aab0]
4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x214) [0x7a4704]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x78f94b]
6: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1df) [0x73cbcf]
7: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2a4) [0x64a8f4]
8: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x6941d8]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa77cb0]
10: (ThreadPool::WorkThread::entry()+0x10) [0xa78ba0]
11: (()+0x7df3) [0x7f858ce72df3]
History
#1 Updated by YingHsun Kao about 9 years ago
- File ceph-osd.49.log.7z added
#2 Updated by Samuel Just about 9 years ago
This just about always means that the filestore is inconsistent on the OSD. You want to verify that the storage stack under each OSD is honoring barriers correctly.
#3 Updated by Samuel Just almost 9 years ago
- Status changed from New to Can't reproduce
- Regression set to No