Bug #9481

osd/PGLog.h: 87: FAILED assert(rollback_info_trimmed_to == head)

Added by Samuel Just over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Support
Tags:
Backport: firefly,giant
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The bug is that PGLog::claim_log_and_clear_rollback_info sets rollback_info_trimmed_to before setting head, so rollback_info_trimmed_to ends up with the old head value instead of the new one.
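
To make the ordering problem concrete, here is a minimal sketch. It is not the actual Ceph code: MiniLog, claim_buggy, claim_fixed and invariant_holds are hypothetical names, and plain integers stand in for the real version type. It only illustrates why copying head into rollback_info_trimmed_to before updating head leaves the two out of sync, which is the condition the failed assert checks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified stand-in for a PG log.
// These are NOT Ceph's real types; integers replace real log versions.
struct MiniLog {
  std::uint64_t head = 0;
  std::uint64_t rollback_info_trimmed_to = 0;

  // Ordering described in this report: the marker is copied from the
  // OLD head, because head has not been updated yet.
  void claim_buggy(const MiniLog& o) {
    rollback_info_trimmed_to = head;  // still the old head here
    head = o.head;
  }

  // Corrected ordering: update head first, then let the marker follow it.
  void claim_fixed(const MiniLog& o) {
    head = o.head;
    rollback_info_trimmed_to = head;
  }

  // The condition checked by assert(rollback_info_trimmed_to == head).
  bool invariant_holds() const { return rollback_info_trimmed_to == head; }
};
```

With the buggy ordering, the invariant breaks whenever the claimed log's head differs from the old head, so a later check of that condition on the same log can fail.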

ceph-osd.42.log (1.95 MB) Sahana Lokeshappa, 09/18/2014 09:16 PM

Associated revisions

Revision 0769310c (diff)
Added by Samuel Just over 9 years ago

PGLog::claim_log_and_clear_rollback_info: fix rollback_info_trimmed_to

We have been setting it to the old head value. This is usually
harmless since the new head will virtually always be ahead of the
old head for claim_log_and_clear_rollback_info, but can cause trouble
in some edge cases.

Fixes: #9481
Backport: firefly
Signed-off-by: Samuel Just <>

Revision c4685075 (diff)
Added by Samuel Just over 9 years ago

PGLog::claim_log_and_clear_rollback_info: fix rollback_info_trimmed_to

We have been setting it to the old head value. This is usually
harmless since the new head will virtually always be ahead of the
old head for claim_log_and_clear_rollback_info, but can cause trouble
in some edge cases.

Fixes: #9481
Backport: firefly
Signed-off-by: Samuel Just <>
(cherry picked from commit 0769310ccd4e0dceebd8ea601e8eb5c0928e0603)

History

#1 Updated by Samuel Just over 9 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Samuel Just

#2 Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Pending Backport

#3 Updated by Sahana Lokeshappa over 9 years ago

In a Ceph cluster with 8 OSD nodes, each hosting 64 OSDs, a few OSDs crashed with this assert. One node had a timestamp issue, which took a few OSDs down; that triggered backfilling, and other OSDs then hit this assert.
2014-09-18 18:08:28.330059 7f4c97241700 0 log_channel(default) log [INF] : 2.410 restarting backfill on osd.6 from (0'0,0'0] MAX to 6840'31228
2014-09-18 18:08:30.680342 7f4c97241700 0 log_channel(default) log [INF] : 2.178 restarting backfill on osd.20 from (0'0,0'0] MAX to 6874'18440
2014-09-18 18:08:30.681038 7f4c96a40700 0 log_channel(default) log [INF] : 2.6f8 restarting backfill on osd.35 from (0'0,0'0] MAX to 6788'24342
2014-09-18 18:08:30.684715 7f4c97241700 0 log_channel(default) log [INF] : 2.179 restarting backfill on osd.35 from (0'0,0'0] MAX to 6752'29654
2014-09-18 18:08:30.688104 7f4c97241700 0 log_channel(default) log [INF] : 2.410 restarting backfill on osd.56 from (0'0,0'0] MAX to 6840'31228
2014-09-18 18:08:30.688533 7f4c96a40700 0 log_channel(default) log [INF] : 2.11b restarting backfill on osd.9 from (0'0,0'0] MAX to 6775'28078
2014-09-18 18:08:30.975106 7f4ca853b700 0 -- 10.242.42.178:6807/2483 >> 10.242.42.164:6854/39317 pipe(0x10b64940 sd=127 :6807 s=2 pgs=32 cs=1 l=0 c=0x12b3aec0).fault with nothing to send, going to standby
2014-09-18 18:08:30.975610 7f4c71984700 0 -- 10.242.42.178:0/2483 >> 10.242.42.164:6856/39317 pipe(0x13d50680 sd=199 :0 s=1 pgs=0 cs=0 l=1 c=0x23a746e0).fault
2014-09-18 18:08:30.975635 7f4c790fb700 0 -- 10.242.42.178:0/2483 >> 10.242.42.164:6855/39317 pipe(0x10b64680 sd=29 :0 s=1 pgs=0 cs=0 l=1 c=0x23a75760).fault
2014-09-18 18:08:31.338216 7f4c773de700 0 -- 10.242.42.178:6807/2483 >> 10.242.42.164:6854/39317 pipe(0x10b64940 sd=93 :6807 s=1 pgs=32 cs=2 l=0 c=0x12b3aec0).fault
2014-09-18 18:08:32.435228 7f4c97241700 0 log_channel(default) log [INF] : 2.519 restarting backfill on osd.4 from (0'0,0'0] MAX to 6868'25546
2014-09-18 18:08:36.959506 7f4c8e22f700 0 -- 10.242.42.178:6807/2483 >> 10.242.42.164:6857/39817 pipe(0x10b659c0 sd=59 :6807 s=2 pgs=32 cs=1 l=0 c=0x12b3b2e0).fault with nothing to send, going to standby
2014-09-18 18:08:36.959913 7f4c773de700 0 -- 10.242.42.178:0/2483 >> 10.242.42.164:6859/39817 pipe(0x14b50100 sd=70 :0 s=1 pgs=0 cs=0 l=1 c=0x23a77b20).fault
2014-09-18 18:08:36.959917 7f4c71984700 0 -- 10.242.42.178:0/2483 >> 10.242.42.164:6858/39817 pipe(0x14b4e000 sd=54 :0 s=1 pgs=0 cs=0 l=1 c=0x23a779c0).fault
2014-09-18 18:08:37.077672 7f4c775e0700 0 -- 10.242.42.178:6807/2483 >> 10.242.42.164:6857/39817 pipe(0x10b659c0 sd=59 :6807 s=1 pgs=32 cs=2 l=0 c=0x12b3b2e0).fault
2014-09-18 18:08:37.691722 7f4c97241700 0 log_channel(default) log [INF] : 2.2d8 restarting backfill on osd.2 from (0'0,0'0] MAX to 6825'27712
2014-09-18 18:08:37.691823 7f4c96a40700 0 log_channel(default) log [INF] : 2.204 restarting backfill on osd.0 from (0'0,0'0] MAX to 6810'26722
2014-09-18 18:08:37.717934 7f4c96a40700 -1 osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7f4c96a40700 time 2014-09-18 18:08:37.701817
osd/PGLog.h: 87: FAILED assert(rollback_info_trimmed_to == head)
ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db]
2: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x1f9) [0x7e1d19]
3: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x81a944]
4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x805d6b]
5: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7b5b8e]
6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2b0) [0x6a3140]
7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x6f83c8]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb68866]
9: (ThreadPool::WorkThread::entry()+0x10) [0xb69910]
10: (()+0x8182) [0x7f4cb21da182]
11: (clone()+0x6d) [0x7f4cb05c030d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#4 Updated by Samuel Just over 9 years ago

wip-sam-testing-firefly

#5 Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved
