Bug #23031
FAILED assert(!parent->get_log().get_missing().is_missing(soid))
Description
Used vstart to start 3 OSDs with -o 'filestore debug inject read err = 1'.
Manually ran injectdataerr on all replicas of object "foo".
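For reference, a minimal sketch of these setup steps on a vstart dev cluster; the pool name ("rbd"), the input file, and the exact injectdataerr invocation are illustrative assumptions, not taken from this report:
- MON=1 OSD=3 ../src/vstart.sh -n -x -o 'filestore debug inject read err = 1'
- bin/rados -p rbd put foo /path/to/some_file
- bin/ceph osd map rbd foo                      # list the OSDs holding "foo"
- bin/ceph tell osd.$id injectdataerr rbd foo   # repeat for each replica's $id
A subsequent read or deep scrub of "foo" would then hit the injected read error on every replica, producing the unfound object shown in the PG stats below.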
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.0 1 1 1 0 1 72750205 18 18 active+recovery_unfound+degraded+inconsistent 2018-02-16 18:01:18.212552 12'18 32:289 [1,0] 1 [1,0] 1 12'18 2018-02-16 17:44:18.140880 12'18 2018-02-16 17:44:18.140880 0
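(The PG stats in this report appear to be pg dump output; as an assumption about how they were gathered, something like:
- bin/ceph pg dump pgs
)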
Marked all OSDs out and back in.
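A sketch of that step, assuming the three vstart OSD ids 0-2:
- for i in 0 1 2; do bin/ceph osd out $i; done
- for i in 0 1 2; do bin/ceph osd in $i; done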
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.0 1 1 2 0 1 72750205 18 18 active+recovery_unfound+degraded+remapped 2018-02-16 18:03:44.924607 12'18 41:301 [1] 1 [1,2] 1 12'18 2018-02-16 17:44:18.140880 12'18 2018-02-16 17:44:18.140880 0
Killed all OSDs and then restarted them in order 0, 1 and 2. osd.0 crashed as below.
- killall ceph-osd
- for i in $(seq 0 2); do bin/ceph-osd -i $i -c ceph.conf; done
-2> 2018-02-16 18:11:49.233 7f2643a9e700 10 osd.0 pg_epoch: 44 pg[1.0( v 12'18 lc 12'17 (0'0,12'18] local-lis/les=43/44 n=1 ec=10/10 lis/c 43/31 les/c/f 44/32/0 43/43/43) [1,0] r=1 lpr=43 pi=[31,43)/3 luod=0'0 crt=12'18 lcod 0'0 active m=1] _handle_message: 0x55644b334bc0
-1> 2018-02-16 18:11:49.233 7f2643a9e700 10 osd.0 pg_epoch: 44 pg[1.0( v 12'18 lc 12'17 (0'0,12'18] local-lis/les=43/44 n=1 ec=10/10 lis/c 43/31 les/c/f 44/32/0 43/43/43) [1,0] r=1 lpr=43 pi=[31,43)/3 luod=0'0 crt=12'18 lcod 0'0 active m=1] do_repop 1:602f83fe:::foo:head v 44'19 (transaction) 143
 0> 2018-02-16 18:11:49.233 7f2643a9e700 -1 /home/dzafman/ceph/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::do_repop(OpRequestRef)' thread 7f2643a9e700 time 2018-02-16 18:11:49.235446
/home/dzafman/ceph/src/osd/ReplicatedBackend.cc: 1085: FAILED assert(!parent->get_log().get_missing().is_missing(soid))

 ceph version 13.0.1-2022-gcb1cf5e (cb1cf5ef1b559f2bffe54abdf6b6cae453db6ebf) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x7f26633c3615]
 2: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0x11c) [0x5564494375ac]
 3: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x55644943a347]
 4: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55644934bee7]
 5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x675) [0x5564492f9605]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x341) [0x556449155d21]
 7: (PGOpItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5564493cdb52]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf24) [0x55644915df94]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4f2) [0x7f26633c90e2]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f26633cb440]
 11: (()+0x76ba) [0x7f2661edb6ba]
 12: (clone()+0x6d) [0x7f266116582d]
History
#1 Updated by David Zafman about 5 years ago
- Description updated (diff)
#2 Updated by David Zafman about 5 years ago
- Description updated (diff)
#3 Updated by David Zafman about 5 years ago
osd.0 was the primary before it crashed; it came back up and crashed again as originally described in this bug. This is the first crash:
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 36 do_recovery starting 1 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1]
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_primary recovering 0 in pg, missing missing(1 may_include_deletes = 1)
2018-02-16 18:03:40.359 7f9c0355e700 25 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_primary {1:602f83fe:::foo:head=12'18 flags = none}
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_primary 1:602f83fe:::foo:head 12'18 (missing) (missing head)
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] start_recovery_op 1:602f83fe:::foo:head
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 36 start_recovery_op pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] 1:602f83fe:::foo:head (0/3 rops)
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_object: 1:602f83fe:::foo:head
2018-02-16 18:03:40.359 7f9c0355e700  7 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] pull 1:602f83fe:::foo:head v 12'18 on osds 1 from osd.1
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] pulling soid 1:602f83fe:::foo:head from osd 1 at version 0'0 rather than at version 12'18
2018-02-16 18:03:40.375 7f9c0355e700 -1 /home/dzafman/ceph/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f9c0355e700 time 2018-02-16 18:03:40.362938
/home/dzafman/ceph/src/osd/ReplicatedBackend.cc: 1411: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( soid)->second->reverting_to == v))

 ceph version 13.0.1-2022-gcb1cf5e (cb1cf5ef1b559f2bffe54abdf6b6cae453db6ebf) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x7f9c22e83615]
 2: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0xf8f) [0x560014561c1f]
 3: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xed) [0x5600145660ad]
 4: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x1e9) [0x5600143ecc99]
 5: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x14bd) [0x5600144283ed]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x246) [0x56001442e866]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x3bb) [0x56001429d95b]
 8: (PGRecovery::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1c) [0x5600144fa91c]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf24) [0x56001428af94]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4f2) [0x7f9c22e890e2]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f9c22e8b440]
 12: (()+0x76ba) [0x7f9c2199b6ba]
#4 Updated by David Zafman about 5 years ago
- File osd.0.log.gz added
#5 Updated by David Zafman about 5 years ago
- File osd.1.log.1.gz added
- File osd.1.log.2.gz added
#6 Updated by Neha Ojha almost 5 years ago
I hit this issue a couple of times while trying to reproduce #23614
2018-05-18T17:01:29.790 INFO:tasks.ceph.osd.5.smithi093.stderr:/build/ceph-13.1.0-468-gbd982ed/src/osd/ReplicatedBackend.cc: 1315: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( soid)->second->reverting_to == v))
http://pulpito.ceph.com/nojha-2018-05-18_16:20:30-rados-master-distro-basic-smithi/2547989/
http://pulpito.ceph.com/nojha-2018-05-21_17:34:19-rados-master-distro-basic-smithi/2562484/
#7 Updated by Zengran Zhang over 4 years ago
#8 Updated by David Zafman about 4 years ago
- Assignee changed from David Zafman to Neha Ojha
- Pull request ID set to 25219
This should be assigned to http://tracker.ceph.com/users/3114, but that user doesn't show up in the assignee list.
#9 Updated by Nathan Cutler about 4 years ago
- Assignee changed from Neha Ojha to Zengran Zhang
#10 Updated by huang jun over 2 years ago
Hi guys, what's the status of this problem now? Did we resolve the assert in the QA tests?