Bug #12536

closed

"FAILED assert(!log.null() || olog.tail == eversion_t())"

Added by Yuri Weinstein over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/hammer-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a single test run to validate the fix for #12410.

Command line:

teuthology-suite -v -c wip-12410 -k distro -m vps -s upgrade/hammer-x ~/vps.yaml --suite-dir ~/yuriw2/ceph-qa-suite --filter="upgrade:hammer-x/stress-split/{0-cluster/start.yaml 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}" -p 70

Modified test is in teuthology@teuthology:~/yuriw2/ceph-qa-suite/suites/upgrade/hammer-x/stress-split$

Run: http://pulpito.ceph.com/teuthology-2015-07-30_07:20:58-upgrade:hammer-x-wip-12410-distro-basic-vps/
Job: 992087
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-07-30_07:20:58-upgrade:hammer-x-wip-12410-distro-basic-vps/992087/teuthology.log

Assertion: osd/PGLog.cc: 564: FAILED assert(!log.null() || olog.tail == eversion_t())
ceph version 9.0.2-682-g1320e29 (1320e29dfaee9995409a6d99b9ccaa748dc67b90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xafed0f]
 2: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x13fd) [0x73d76d]
 3: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0xa9) [0x7914d9]
 4: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x3ed) [0x7bdbed]
 5: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x182) [0x7ee542]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7d25ab]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x7d28ae]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x788453]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x260) [0x673270]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6cb642]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xaedbee]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xaf0a50]
 13: (()+0x7e9a) [0x7f68d07f7e9a]
 14: (clone()+0x6d) [0x7f68ced9438d]


Related issues 4 (0 open, 4 closed)

Related to Ceph - Bug #12410: OSDMonitor::preprocess_get_osdmap: must send the last map as well (Resolved, Samuel Just, 07/20/2015)

Related to Ceph - Bug #12427: OSD dies after a couple of seconds (Can't reproduce, 07/21/2015)

Related to Ceph - Bug #12613: hammer needs to set pool in hobject_t::get_boundary (Resolved, Samuel Just, 08/04/2015)

Copied to Ceph - Backport #12571: "FAILED assert(!log.null() || olog.tail == eversion_t())" (Resolved, 07/30/2015)
Actions #1

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent

Not actually related to the above bug, just revealed by the same test. Bah, it's from wip-temp: hobject_t::min() differs between hammer and current master with wip-temp.

master osd receiving log from hammer primary:

-646> 2015-07-30 17:02:30.882331 7f68b769b700 10 osd.3 pg_epoch: 2618 pg[31.35( empty local-les=0 n=0 ec=1556 les/c 2597/2598 2616/2617/2616) [9,3]/[9,8] r=-1 lpr=2618 pi=2410-2616/6 crt=0'0 inactive] state<Started/Stray>: got info+log from osd.9 31.35( v 2614'3213 (1692'254,2614'3213] lb -1/0//0 local-les=2618 n=0 ec=1556 les/c 2597/2598 2616/2617/2616) log((1692'254,2614'3213], crt=2614'3209)

lb -1/0//0 doesn't appear to be hobject_t::min() in master.

Actions #2

Updated by Samuel Just over 8 years ago

Ah, we switched to using INT64_MIN (from -1) for pool for min() with wip-temp. We need to make sure that messages which contain hobject_t which may be min are adjusted correctly to and from hammer osds. One option would be to add another field to the hobject_t encoding new_pool and stick -1 in the old field for min(). Or, we could audit all messages and handle it case-by-case. How do we handle transactions on the wire which reference objects with pool <-1?
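To make the mismatch concrete, here is a minimal sketch. It assumes a simplified stand-in struct `HObj` (hypothetical, not the real hobject_t) and hypothetical helpers `hammer_min()`, `master_min()` and `master_is_min()`. It only illustrates why a min sentinel built with pool -1 no longer compares equal to one built with pool INT64_MIN.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-in for hobject_t (hypothetical; only the fields that
// define the "minimum object" sentinel are modelled here).
struct HObj {
    std::string name;
    uint64_t snap = 0;
    uint32_t hash = 0;
    bool max = false;
    int64_t pool = -1;
};

// Field-by-field equality over the modelled fields.
inline bool same(const HObj& a, const HObj& b) {
    return a.name == b.name && a.snap == b.snap && a.hash == b.hash &&
           a.max == b.max && a.pool == b.pool;
}

// Sentinel as a hammer OSD would build it: pool == -1.
inline HObj hammer_min() { return HObj(); }

// Sentinel as a master OSD (with wip-temp) would build it: pool == INT64_MIN.
inline HObj master_min() {
    HObj o;
    o.pool = INT64_MIN;
    return o;
}

// A master OSD comparing a received value against its own sentinel.
inline bool master_is_min(const HObj& o) { return same(o, master_min()); }
```

With these definitions, `master_is_min(hammer_min())` is false, which matches the observation that `lb -1/0//0` is not recognised as min on the master side.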

Actions #3

Updated by Sage Weil over 8 years ago

  • Assignee set to Sage Weil

Samuel Just wrote:

Ah, we switched to using INT64_MIN (from -1) for pool for min() with wip-temp. We need to make sure that messages which contain hobject_t which may be min are adjusted correctly to and from hammer osds. One option would be to add another field to the hobject_t encoding new_pool and stick -1 in the old field for min(). Or, we could audit all messages and handle it case-by-case. How do we handle transactions on the wire which reference objects with pool <-1?

Hrm. I think it's best to handle it in the encode/decode.. we'll never get all the message instances.
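One way to read "handle it in the encode/decode" is a fixup applied right after the fields are decoded on the newer side. The following is a sketch under stated assumptions: `HObj` is a simplified stand-in for hobject_t, and `looks_like_hammer_min()` and `fixup_after_decode()` are hypothetical helper names, not Ceph's actual decode code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-in for hobject_t (hypothetical).
struct HObj {
    std::string name;
    uint64_t snap = 0;
    uint32_t hash = 0;
    bool max = false;
    int64_t pool = -1;
};

// True when the decoded fields match the old-style min sentinel (pool == -1,
// everything else zero/empty).
inline bool looks_like_hammer_min(const HObj& o) {
    return o.pool == -1 && o.snap == 0 && o.hash == 0 && !o.max &&
           o.name.empty();
}

// Applied on the newer side after decoding: rewrite the old sentinel into the
// new one so that later comparisons against min() succeed.
inline void fixup_after_decode(HObj& o) {
    if (looks_like_hammer_min(o)) {
        o.pool = INT64_MIN;
    }
}
```

Doing the translation at the decode boundary means every message type that carries an hobject_t gets the fix, which is the point of not auditing messages one by one. Objects that merely live in pool -1 but have a name are left alone.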

As for the transactions on temp objects... yeah, I don't think we do anything to handle that properly. Crap, I thought I ran an upgrade test on this before merge.

I think we need to change the PG code to choose unique objects that aren't in the temp namespace if actingbackfill_features doesn't indicate all osds have wip-temp. We can't guarantee we won't collide with an existing object, but we can be safe enough by using a namespace and obscure object name.

Actions #4

Updated by Yuri Weinstein over 8 years ago

  • Description updated (diff)
Actions #5

Updated by Sage Weil over 8 years ago

  • Description updated (diff)

Okay, I think it's not as bad as I thought.. just the min issue after all.

Actions #6

Updated by Sage Weil over 8 years ago

A simpler fix (with less long-term cruft) would be to patch hammer to correctly interpret INT64_MIN as hobject_t::get_min(), and require that users upgrade to 0.94.3 (or later) before going to infernalis/jewel. Given that we're still a couple months away from infernalis that won't be an additional burden for most users, but would need to be very well documented.
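The hammer-side variant of that idea could look roughly like the sketch below. It assumes the same simplified stand-in struct `HObj` (hypothetical) and a hypothetical helper `hammer_fixup_after_decode()`; the real patch may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-in for hobject_t (hypothetical).
struct HObj {
    std::string name;
    uint64_t snap = 0;
    uint32_t hash = 0;
    bool max = false;
    int64_t pool = -1;
};

// hammer's own notion of the min sentinel: pool == -1, everything else empty.
inline bool hammer_is_min(const HObj& o) {
    return o.pool == -1 && o.snap == 0 && o.hash == 0 && !o.max &&
           o.name.empty();
}

// Applied on the hammer side after decoding: a sentinel sent by a newer OSD
// (pool == INT64_MIN) is rewritten into hammer's own representation.
inline void hammer_fixup_after_decode(HObj& o) {
    if (o.pool == INT64_MIN && o.snap == 0 && o.hash == 0 && !o.max &&
        o.name.empty()) {
        o.pool = -1;
    }
}
```

This keeps the newer release free of compatibility code for this direction, at the cost of requiring the patched hammer release to be installed before the upgrade.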

Actions #7

Updated by Loïc Dachary over 8 years ago

  • Backport set to hammer
Actions #9

Updated by Sage Weil over 8 years ago

  • Status changed from New to 7
Actions #10

Updated by Sage Weil over 8 years ago

  • Status changed from 7 to Resolved
Actions #12

Updated by Yuri Weinstein over 8 years ago

Note: https://github.com/ceph/ceph-qa-suite/pull/525 needs to be merged after this ticket is fixed/backported.
See #12625

Actions #13

Updated by Yuri Weinstein over 8 years ago

  • Release set to next
  • ceph-qa-suite upgrade/hammer-x added
Actions #14

Updated by Loïc Dachary over 8 years ago

  • Status changed from Resolved to Pending Backport
Actions #15

Updated by Sage Weil over 8 years ago

  • Assignee deleted (Sage Weil)
Actions #16

Updated by Sage Weil over 8 years ago

  • Status changed from Pending Backport to Resolved