Bug #12536: FAILED assert(!log.null() || olog.tail == eversion_t())
Status: Closed
Description
This is a single test run to validate the fix for #12410.
Command line:
teuthology-suite -v -c wip-12410 -k distro -m vps -s upgrade/hammer-x ~/vps.yaml --suite-dir ~/yuriw2/ceph-qa-suite --filter="upgrade:hammer-x/stress-split/{0-cluster/start.yaml 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_12.04.yaml}" -p 70
The modified test is in teuthology@teuthology:~/yuriw2/ceph-qa-suite/suites/upgrade/hammer-x/stress-split
Run: http://pulpito.ceph.com/teuthology-2015-07-30_07:20:58-upgrade:hammer-x-wip-12410-distro-basic-vps/
Job: 992087
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-07-30_07:20:58-upgrade:hammer-x-wip-12410-distro-basic-vps/992087/teuthology.log
Assertion: osd/PGLog.cc: 564: FAILED assert(!log.null() || olog.tail == eversion_t())
ceph version 9.0.2-682-g1320e29 (1320e29dfaee9995409a6d99b9ccaa748dc67b90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xafed0f]
 2: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x13fd) [0x73d76d]
 3: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0xa9) [0x7914d9]
 4: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x3ed) [0x7bdbed]
 5: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x182) [0x7ee542]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7d25ab]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x7d28ae]
 8: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x788453]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x260) [0x673270]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6cb642]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xaedbee]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xaf0a50]
 13: (()+0x7e9a) [0x7f68d07f7e9a]
 14: (clone()+0x6d) [0x7f68ced9438d]
Updated by Samuel Just almost 9 years ago
- Priority changed from Normal to Urgent
Not actually related to the above bug, just revealed by the same test. Bah, it's from wip-temp: a difference in hobject_t::min() between hammer and current master with wip-temp.
master osd receiving log from hammer primary:
-646> 2015-07-30 17:02:30.882331 7f68b769b700 10 osd.3 pg_epoch: 2618 pg[31.35( empty local-les=0 n=0 ec=1556 les/c 2597/2598 2616/2617/2616) [9,3]/[9,8] r=-1 lpr=2618 pi=2410-2616/6 crt=0'0 inactive] state<Started/Stray>: got info+log from osd.9 31.35( v 2614'3213 (1692'254,2614'3213] lb -1/0//0 local-les=2618 n=0 ec=1556 les/c 2597/2598 2616/2617/2616) log((1692'254,2614'3213], crt=2614'3209)
lb -1/0//0 doesn't appear to be hobject_t::min() in master.
Updated by Samuel Just almost 9 years ago
Ah, we switched to using INT64_MIN (from -1) for the pool in min() with wip-temp. We need to make sure that messages containing an hobject_t which may be min are adjusted correctly to and from hammer osds. One option would be to add another field, new_pool, to the hobject_t encoding and stick -1 in the old field for min(). Or we could audit all messages and handle it case-by-case. How do we handle transactions on the wire which reference objects with pool < -1?
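The first option above (a compat new_pool field, with -1 written into the legacy field for min()) could be sketched roughly as follows. This is a simplified illustration, not the real hobject_t layout: hobj, encode, and decode are hypothetical stand-ins, and the real Ceph code uses bufferlist-based versioned encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for hobject_t's pool field.
struct hobj {
    int64_t pool;
    static constexpr int64_t NEW_MIN = INT64_MIN;  // wip-temp min() pool
    static constexpr int64_t OLD_MIN = -1;         // hammer min() pool
};

// Encode: keep the legacy field hammer-compatible by writing -1 for min(),
// and carry the real value in a new trailing new_pool field.
void encode(const hobj &o, std::vector<int64_t> &out) {
    out.push_back(o.pool == hobj::NEW_MIN ? hobj::OLD_MIN : o.pool);  // legacy field
    out.push_back(o.pool);                                            // new_pool field
}

// Decode: a new OSD prefers new_pool; a hammer OSD reads only the legacy
// field and still sees -1, its own notion of min().
hobj decode(const std::vector<int64_t> &in, bool peer_has_new_field) {
    return hobj{peer_has_new_field ? in[1] : in[0]};
}
```

For non-min objects both fields carry the same value, so the scheme is transparent either way; only min() needs the translation.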
Updated by Sage Weil almost 9 years ago
- Assignee set to Sage Weil
Samuel Just wrote:
Ah, we switched to using INT64_MIN (from -1) for the pool in min() with wip-temp. We need to make sure that messages containing an hobject_t which may be min are adjusted correctly to and from hammer osds. One option would be to add another field, new_pool, to the hobject_t encoding and stick -1 in the old field for min(). Or we could audit all messages and handle it case-by-case. How do we handle transactions on the wire which reference objects with pool < -1?
Hrm. I think it's best to handle it in encode/decode; we'll never catch all the message instances.
As for the transactions on temp objects... yeah, I don't think we do anything to handle that properly. Crap, I thought I ran an upgrade test on this before merge.
I think we need to change the PG code to choose unique objects that aren't in the temp namespace if actingbackfill_features doesn't indicate all osds have wip-temp. We can't guarantee we won't collide with an existing object, but we can be safe enough by using a namespace and obscure object name.
Updated by Sage Weil almost 9 years ago
- Description updated (diff)
Okay, I think it's not as bad as I thought: it's just the min issue after all.
Updated by Sage Weil almost 9 years ago
A simpler fix (with less long-term cruft) would be to patch hammer to correctly interpret INT64_MIN as hobject_t::get_min(), and require that users upgrade to 0.94.3 (or later) before going to infernalis/jewel. Given that we're still a couple of months away from infernalis, that won't be an additional burden for most users, but it would need to be very well documented.
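The hammer-side patch described here would boil down to a small translation on decode. A minimal sketch, assuming the fix is applied where the pool field is read off the wire; decode_pool_hammer is an illustrative name, not the real code path:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the proposed 0.94.3 (hammer) decode-side fix:
// when a post-hammer peer sends hobject_t::min() with pool == INT64_MIN,
// translate it to hammer's own min() representation (-1) so comparisons
// against hobject_t::get_min() behave correctly.
int64_t decode_pool_hammer(int64_t wire_pool) {
    if (wire_pool == INT64_MIN)
        return -1;  // hammer's hobject_t::min() pool value
    return wire_pool;
}
```

Everything else passes through unchanged, which is why this carries less long-term cruft than an extra encoded field.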
Updated by Loïc Dachary over 8 years ago
Updated by Sage Weil over 8 years ago
Updated by Yuri Weinstein over 8 years ago
Note: https://github.com/ceph/ceph-qa-suite/pull/525 needs to be merged after this ticket is fixed/backported.
See #12625
Updated by Yuri Weinstein over 8 years ago
- Release set to next
- ceph-qa-suite upgrade/hammer-x added
Updated by Loïc Dachary over 8 years ago
- Status changed from Resolved to Pending Backport
Updated by Sage Weil over 8 years ago
- Status changed from Pending Backport to Resolved