Bug #45991 (closed)

PG merge: FAILED ceph_assert(info.history.same_interval_since != 0)

Added by xie xingguo almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
nautilus,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
35558
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/xxg-2020-06-13_00:34:59-rados:thrash-wip-nautilus-nnnn-distro-basic-smithi/5143185/

full call stack:

2020-06-13 03:43:36.682 7f035a25e700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9-805-g95223e478a1/rpm/el7/BUILD/ceph-14.2.9-805-g95223e478a1/src/osd/PG.cc: In function 'void PG::start_peering_interval(OSDMapRef, const std::vector<int>&, int, const std::vector<int>&, int, ObjectStore::Transaction*)' thread 7f035a25e700 time 2020-06-13 03:43:36.680117
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9-805-g95223e478a1/rpm/el7/BUILD/ceph-14.2.9-805-g95223e478a1/src/osd/PG.cc: 6421: FAILED ceph_assert(info.history.same_interval_since != 0)

ceph version 14.2.9-805-g95223e478a1 (95223e478a1c66b5cd59d2e3012b71a3f2c7fb3e) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55e6d43eccb1]
2: (()+0x4cee79) [0x55e6d43ece79]
3: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x178f) [0x55e6d4593f3f]
4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x390) [0x55e6d4597670]
5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x75) [0x55e6d45eaa75]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0x97) [0x55e6d45ca537]
7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3f8) [0x55e6d4596988]
8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2df) [0x55e6d44f213f]
9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa6) [0x55e6d44f3d16]
10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55e6d475c361]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55e6d44e893f]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e6d4a94fa6]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e6d4a97ac0]
14: (()+0x7ea5) [0x7f0380c55ea5]
15: (clone()+0x6d) [0x7f037fb198dd]

The root cause is that both the merge source and the merge target can be fabricated PGs (aka placeholders created by prime_merges), so the merge target's same_interval_since remains 0 after the merge (a simplified reproduction is sketched after the log excerpt below):

2020-06-13 03:43:36.676 7f0368a7b700 20 osd.1:5.prime_merges prime_merges creating empty merge participant 2.5 for merge in 169
...
2020-06-13 03:43:36.676 7f0368a7b700 20 osd.1:5.prime_merges prime_merges creating empty merge participant 2.15 for merge in 169
...
2020-06-13 03:43:36.679 7f035a25e700 10 osd.1 pg_epoch: 168 pg[2.5( DNE empty lb MIN (NIBBLEWISE) local-lis/les=0/168 n=0 ec=0/0 lis/c 0/0 les/c/f 168/168/0 0/0/0) [3,4] r=-1 lpr=168 crt=0'0 unknown mbc={}] merge_from set les/c to 168/168 from pool last_dec_*, source pg history was ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0
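
To make the failure mode concrete, here is a minimal standalone C++ sketch. This is not Ceph code: the struct and function names only loosely mirror pg_history_t, PG::merge_from() and PG::start_peering_interval(), and the merge logic is heavily simplified. It only illustrates how merging one fabricated placeholder PG into another can leave info.history.same_interval_since at 0, so the subsequent interval-change check aborts as in the backtrace above.

// Minimal standalone sketch (not Ceph code) of the failure mode described above.
#include <cassert>
#include <cstdint>
#include <iostream>

using epoch_t = std::uint32_t;

struct pg_history_t {
  // 0 means "never set" -- the value a fabricated placeholder PG starts with.
  epoch_t same_interval_since = 0;
  epoch_t last_epoch_started = 0;
};

struct pg_info_t {
  pg_history_t history;
};

struct PG {
  pg_info_t info;
  bool is_placeholder = false;  // an "empty merge participant" from prime_merges
};

// Rough analogue of the merge path: the target absorbs the source's history,
// and last_epoch_started is taken from the pool's last_dec_* (compare the
// "merge_from set les/c to 168/168" log line above).  Nothing here forces
// same_interval_since to become non-zero when *both* sides are placeholders.
void merge_from(PG& target, const PG& source, epoch_t les) {
  if (source.info.history.same_interval_since >
      target.info.history.same_interval_since) {
    target.info.history = source.info.history;
  }
  target.info.history.last_epoch_started = les;
}

// Rough analogue of the check that fires in PG::start_peering_interval().
void start_peering_interval(const PG& pg) {
  assert(pg.info.history.same_interval_since != 0 &&
         "FAILED ceph_assert(info.history.same_interval_since != 0)");
}

int main() {
  // prime_merges created two empty participants (2.5 and 2.15 in the log).
  PG target;  target.is_placeholder = true;
  PG source;  source.is_placeholder = true;

  merge_from(target, source, /*les=*/168);
  std::cout << "same_interval_since after merge: "
            << target.info.history.same_interval_since << "\n";  // prints 0

  start_peering_interval(target);  // aborts: the assert from the backtrace
  return 0;
}

Running the sketch prints 0 and then aborts on the assert, mirroring the crash on osd.1. Whatever shape the actual fix takes, it has to guarantee that the merge target ends up with a non-zero same_interval_since even when every merge participant is a placeholder.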


Related issues 3 (1 open, 2 closed)

Related to RADOS - Bug #57628: osd:PeeringState.cc: FAILED ceph_assert(info.history.same_interval_since != 0) (In Progress, Matan Breizman)

Copied to RADOS - Backport #46089: octopus: PG merge: FAILED ceph_assert(info.history.same_interval_since != 0) (Resolved, Nathan Cutler)
Copied to RADOS - Backport #46090: nautilus: PG merge: FAILED ceph_assert(info.history.same_interval_since != 0) (Resolved, Nathan Cutler)
#1

Updated by xie xingguo almost 4 years ago

  • Pull request ID set to 35558
#2

Updated by Kefu Chai almost 4 years ago

  • Status changed from New to Fix Under Review
#3

Updated by Neha Ojha almost 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus,octopus
#4

Updated by Patrick Donnelly almost 4 years ago

  • Copied to Backport #46089: octopus: PG merge: FAILED ceph_assert(info.history.same_interval_since != 0) added
#5

Updated by Patrick Donnelly almost 4 years ago

  • Copied to Backport #46090: nautilus: PG merge: FAILED ceph_assert(info.history.same_interval_since != 0) added
#6

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#7

Updated by Matan Breizman over 1 year ago

  • Related to Bug #57628: osd:PeeringState.cc: FAILED ceph_assert(info.history.same_interval_since != 0) added