Project

General

Profile

Actions

Bug #20551

closed

LOST_REVERT assert during rados bench+thrash in ReplicatedBackend::prepare_pull()

Added by Josh Durgin almost 7 years ago. Updated over 6 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
Backfill/Recovery
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From osd.0 in:

http://pulpito.ceph.com/yuriw-2017-07-06_20:01:14-rados-wip-yuri-testing3_2017_7_8-distro-basic-smithi/1368208/

=[15,103)/2 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] handle_message: 0x7fe09dab59c0
    -9> 2017-07-06 22:26:22.198839 7fe060845700 20 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] do_op: op osd_op(client.4177.0:308 2.7 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head [read 0~1048576] snapc 0=[] RETRY=1 ondis
k+retry+read+known_if_redirected e104) v8
    -8> 2017-07-06 22:26:22.199584 7fe060845700 20 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] op_has_sufficient_caps pool=2 (unique_pool_0 ) owner=0 need_read_cap=1 need_write_cap=0 classes=[] -> yes
    -7> 2017-07-06 22:26:22.200415 7fe060845700 10 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] do_op osd_op(client.4177.0:308 2.7 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head [read 0~1048576] snapc 0=[] RETRY=1 ondisk+re
try+read+known_if_redirected e104) v8 may_read -> read-ordered flags ondisk+retry+read+known_if_redirected
    -6> 2017-07-06 22:26:22.201200 7fe060845700  7 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] object 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head v 23'831, recovering.
    -5> 2017-07-06 22:26:22.202009 7fe060845700 10 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] start_recovery_op 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head
    -4> 2017-07-06 22:26:22.202777 7fe060845700 10 osd.0 105 start_recovery_op pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi=[15,103)/2 rops=1 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head (1/3 rops)
    -3> 2017-07-06 22:26:22.203545 7fe060845700 10 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 rops=1 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] recover_object: 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head
    -2> 2017-07-06 22:26:22.204314 7fe060845700  7 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi=[15,103)/2 rops=1 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] pull 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head v 23'831 on osds 1 from osd.1
    -1> 2017-07-06 22:26:22.205080 7fe060845700 10 osd.0 pg_epoch: 104 pg[2.7( v 38'1778 lc 16'4 (0'0,38'1778] local-lis/les=103/104 n=1778 ec=15/15 lis/c 103/0 les/c/f 104/0/0 102/103/103) [0,4] r=0 lpr=103 pi
=[15,103)/2 rops=1 crt=38'1778 mlcod 0'0 active+recovery_wait+degraded m=48] pulling soid 2:e042da5c:::benchmark_data_smithi145_3328_object13360:head from osd 1 at version 0'0 rather than at version 23'831
     0> 2017-07-06 22:26:22.208929 7fe060845700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3-2604-g5511846/rpm/el7/BUILD/ceph-12.0.3-2604-g5511846/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*
)' thread 7fe060845700 time 2017-07-06 22:26:22.205888
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.3-2604-g5511846/rpm/el7/BUILD/ceph-12.0.3-2604-g551184
6/src/osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( soid)->second->reverting_to == v))

 ceph version 12.0.3-2604-g5511846 (55118463108180eb1807bbfeebf439fd61e430bb) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7fe082d0fc20]
 2: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x779) [0x7fe082a61459]
 3: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0x25e) [0x7fe082a63f9e]
 4: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x5be) [0x7fe082910a0e]
 5: (PrimaryLogPG::maybe_kick_recovery(hobject_t const&)+0x30d) [0x7fe082911aed]
 6: (PrimaryLogPG::wait_for_unreadable_object(hobject_t const&, boost::intrusive_ptr<OpRequest>)+0x44) [0x7fe082911cb4]
 7: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x1d7f) [0x7fe082947cbf]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xeba) [0x7fe08290610a]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x7fe0827a7509]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x7fe082a00167]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x7fe0827d1a0e]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8e9) [0x7fe082d15769]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fe082d178f0]
 14: (()+0x7dc5) [0x7fe07f908dc5]
 15: (clone()+0x6d) [0x7fe07e9fc73d]

No patches in this test run touched anything near osd recovery logic, so I think this is a bug in master.


Related issues 1 (0 open1 closed)

Is duplicate of RADOS - Bug #20694: osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( sCan't reproduce07/19/2017

Actions
Actions #1

Updated by Kefu Chai over 6 years ago

  • Is duplicate of Bug #20694: osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( s added
Actions #2

Updated by Sage Weil over 6 years ago

  • Status changed from New to Duplicate
Actions

Also available in: Atom PDF