Project

General

Profile

Bug #19947

osd/PGLog.h: 1235: FAILED assert(miter->second.have == eversion_t())

Added by Sage Weil 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
Start date:
05/16/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

2017-05-16T00:22:20.018 INFO:tasks.ceph.osd.1.smithi046.stderr:/build/ceph-12.0.1-2399-gaf38d31/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvi
der*, std::set<std::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]' thread 7fca0c9abb80 time 2017-05-16 00:22:20.022564
2017-05-16T00:22:20.019 INFO:tasks.ceph.osd.1.smithi046.stderr:/build/ceph-12.0.1-2399-gaf38d31/src/osd/PGLog.h: 1235: FAILED assert(miter->second.have == eversion_t())
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: ceph version 12.0.1-2399-gaf38d31 (af38d31471040ca88dfad94f7850904bfd3a49bd)
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7fca0c10b47e]
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, boo
l*, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, bool)+0x2afa) [0x7fca0bce912a]
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x4de) [0x7fca0bca1e9e]
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 4: (OSD::load_pgs()+0x9c4) [0x7fca0bbf7a74]
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 5: (OSD::init()+0x1f14) [0x7fca0bc0e824]
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 6: (main()+0x2ca3) [0x7fca0bb19ca3]
2017-05-16T00:22:20.020 INFO:tasks.ceph.osd.1.smithi046.stderr: 7: (__libc_start_main()+0xf5) [0x7fca08c5cf45]
2017-05-16T00:22:20.021 INFO:tasks.ceph.osd.1.smithi046.stderr: 8: (()+0x46cd76) [0x7fca0bbb2d76]

/a/yuriw-2017-05-15_22:59:10-rados-wip-yuri-testing_2017_5_16-distro-basic-smithi/1181489

also

/a/sage-2017-05-16_02:42:05-rados-wip-sage-testing2---basic-smithi/1182354

gdb.txt View (563 KB) Sage Weil, 05/26/2017 01:49 PM

History

#1 Updated by Sage Weil 6 months ago

/a/sage-2017-05-16_22:22:40-rados-wip-sage-testing2---basic-smithi/1186082

#2 Updated by Kefu Chai 6 months ago

/a/kchai-2017-05-21_04:27:34-rados-wip-bigbang---basic-mira/1203867/remote/mira104/log/ceph-osd.0.log.gz

#3 Updated by Sage Weil 6 months ago

/a/sage-2017-05-23_19:52:01-rados-wip-sage-testing2---basic-smithi/1221342

#4 Updated by Sage Weil 6 months ago

attached gdb dump

#5 Updated by Sage Weil 6 months ago

/a/sage-2017-05-26_06:56:21-rados-wip-sage-testing---basic-smithi/1231195

core file in test dir along with a copy of the centos7 packages that were installed.

#6 Updated by Sage Weil 6 months ago

/a/sage-2017-05-27_03:43:09-rados-wip-sage-testing2---basic-smithi/1235284

#7 Updated by Kefu Chai 6 months ago

/a/kchai-2017-05-27_07:55:19-rados-wip-mgr-stats-kefu---basic-mira/1237260/remote/mira100/log/ceph-osd.2.log.gz

#8 Updated by Sage Weil 6 months ago

/a/sage-2017-05-28_05:00:18-rados-wip-sage-testing---basic-smithi/1238213

rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/osd-delay.yaml msgr/simple.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/default.yaml workloads/cache-pool-snaps.yaml}

#9 Updated by Sage Weil 6 months ago

/a/sage-2017-05-30_17:22:40-rados-wip-sage-testing---basic-smithi/1244707

rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/osd-delay.yaml msgr/async.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/pggrow.yaml workloads/pool-snaps-few-objects.yaml}

#10 Updated by Sage Weil 6 months ago

/a/sage-2017-06-01_02:27:12-rados-wip-sage-testing2---basic-smithi/1249818

rados/thrash-erasure-code/{ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml fast/fast.yaml leveldb.yaml msgr-failures/few.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/mapgap.yaml workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}

#11 Updated by Sage Weil 6 months ago

/a/sage-2017-06-02_08:32:01-rados-wip-sage-testing-distro-basic-smithi/1255551

rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/osd-delay.yaml msgr/async.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/morepggrow.yaml workloads/pool-snaps-few-objects.yaml}

#12 Updated by Sage Weil 6 months ago

/a/sage-2017-06-05_14:47:27-rados-wip-sage-testing-distro-basic-smithi/1260429

rados/thrash-erasure-code-overwrites/{bluestore.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml fast/normal.yaml leveldb.yaml msgr-failures/osd-delay.yaml rados.yaml thrashers/default.yaml workloads/ec-small-objects-fast-read-overwrites.yaml}

#13 Updated by Sage Weil 6 months ago

/a/sage-2017-06-05_18:36:01-rados-wip-sage-testing2-distro-basic-smithi/1261877
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering_and_degraded.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/osd-delay.yaml msgr/simple.yaml objectstore/filestore-xfs.yaml rados.yaml rocksdb.yaml thrashers/pggrow.yaml workloads/pool-snaps-few-objects.yaml}

#14 Updated by Sage Weil 6 months ago

/a/sage-2017-06-05_18:36:01-rados-wip-sage-testing2-distro-basic-smithi/1261940

rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering_and_degraded.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/osd-delay.yaml msgr/simple.yaml objectstore/filestore-xfs.yaml rados.yaml rocksdb.yaml thrashers/default.yaml workloads/cache-pool-snaps.yaml}

#15 Updated by Sage Weil 6 months ago

trim_object() returns an OpContextUPtr that we feed to simple_opc_submit(), but we aren't calling finish_ctx() like everyone else. As a result, when we trim a whiteout, the oi isn't getting cleared, and the version field isn't cleared. This means the prior_version is screwed up for subsequent events.

#16 Updated by Sage Weil 6 months ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

#17 Updated by Sage Weil 6 months ago

  • Status changed from In Progress to Need Review

#18 Updated by Sage Weil 6 months ago

Guessing this is related, since it's also an incorrect prior_version:

2017-06-07T18:59:40.728 INFO:tasks.ceph.osd.2.smithi037.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.2-2485-gc8340cd/rpm/el7/BUILD/ceph-12.0.2-2485-gc8340cd/src/osd/PGLog.h: In function 'static void PGLog::_merg
e_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, missing_type&, PGLog::LogEntryHandler*, const DoutPrefixProvider*) [with missing_type = pg_missing_set<true>; mempool::osd::list<pg_log_entry_t> = std::list<pg_log_entry_t, mempo
ol::pool_allocator<(mempool::pool_index_t)12u, pg_log_entry_t> >]' thread 7f6961877700 time 2017-06-07 18:59:40.734025
2017-06-07T18:59:40.728 INFO:tasks.ceph.osd.2.smithi037.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.2-2485-gc8340cd/rpm/el7/BUILD/ceph-12.0.2-2485-gc8340cd/src/osd/PGLog.h: 775: FAILED assert(i->prior_version =
= last)
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: ceph version  12.0.2-2485-gc8340cd (c8340cde85674f8d9506d602368c2fd9a6307580) luminous (dev)
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f697bce79d0]
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: 2: (void PGLog::_merge_object_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, hobject_t const&, std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)12, pg_log_entry_t> > const&, pg_info_t const&, eversion_t, pg_m
issing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x620) [0x7f697b8c1470]
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: 3: (void PGLog::_merge_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)12, pg_log_entry_t> >&, pg_info_t const&, eversion_t, pg_missing_set<true>&, PGLog::LogEn
tryHandler*, DoutPrefixProvider const*)+0x1cc) [0x7f697b8c33bc]
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: 4: (PGLog::rewind_divergent_log(eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x349) [0x7f697b8ba909]
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: 5: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xd48) [0x7f697b8bb808]
2017-06-07T18:59:40.733 INFO:tasks.ceph.osd.2.smithi037.stderr: 6: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x6c) [0x7f697b82063c]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 7: (PG::proc_master_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_missing_set<false>&, pg_shard_t)+0x1bc) [0x7f697b8694cc]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 8: (PG::RecoveryState::GetLog::react(PG::RecoveryState::GotLog const&)+0xd1) [0x7f697b8697d1]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 9: (boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_
::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x242) [0x7f697b8b48a2]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x7f697b88eebb]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x91) [0x7f697b88f011]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 12: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1fe) [0x7f697b85aeae]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22a) [0x7f697b7abdda]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x17) [0x7f697b809da7]
2017-06-07T18:59:40.734 INFO:tasks.ceph.osd.2.smithi037.stderr: 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7f697bcee6d9]
2017-06-07T18:59:40.735 INFO:tasks.ceph.osd.2.smithi037.stderr: 16: (ThreadPool::WorkThread::entry()+0x10) [0x7f697bcef6b0]
2017-06-07T18:59:40.735 INFO:tasks.ceph.osd.2.smithi037.stderr: 17: (()+0x7dc5) [0x7f697892bdc5]
2017-06-07T18:59:40.735 INFO:tasks.ceph.osd.2.smithi037.stderr: 18: (clone()+0x6d) [0x7f6977a1f73d]

/a/sage-2017-06-07_16:25:35-rados-wip-sage-testing2-distro-basic-smithi/1268235

#19 Updated by Sage Weil 6 months ago

  • Status changed from Need Review to Testing

#20 Updated by Sage Weil 5 months ago

  • Status changed from Testing to Resolved

#21 Updated by Sage Weil 5 months ago

  • Status changed from Resolved to Testing

DIdn't fix it fully... see
/a/sage-2017-06-14_19:30:57-rados:thrash-wip-19943---basic-smithi/1286357

#22 Updated by Sage Weil 5 months ago

  • Status changed from Testing to Resolved

Also available in: Atom PDF