Bug #23031

FAILED assert(!parent->get_log().get_missing().is_missing(soid))

Added by David Zafman about 6 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

Used vstart to start 3 OSDs with -o 'filestore debug inject read err = 1'.

Manually ran injectdataerr on all replicas of object "foo".
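
For anyone retracing this, a minimal sketch of the setup described above. The pool name "rbd" and the exact form of the injectdataerr test-ops command are my assumptions, not taken from the report; adjust as needed:

    # Start a 3-OSD vstart cluster with read-error injection enabled
    # (the -o flag appends the option to the generated ceph.conf).
    MON=1 OSD=3 MDS=0 ../src/vstart.sh -n -o 'filestore debug inject read err = 1'

    # Write the test object, then inject a data read error for it on
    # every OSD (assumed admin-socket command form: pool, object name).
    bin/rados -p rbd put foo ./some-local-file
    for osd in 0 1 2; do
        bin/ceph daemon osd.$osd injectdataerr rbd foo
    done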

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.0 1 1 1 0 1 72750205 18 18 active+recovery_unfound+degraded+inconsistent 2018-02-16 18:01:18.212552 12'18 32:289 [1,0] 1 [1,0] 1 12'18 2018-02-16 17:44:18.140880 12'18 2018-02-16 17:44:18.140880 0
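
For reference, PG state in this shape can be dumped with, e.g.:

    bin/ceph pg dump pgs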

Marked all OSDs out and back in.
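
A sketch of that step, assuming OSD ids 0-2 from the vstart cluster:

    # Mark every OSD out, let the cluster remap, then mark them back in.
    for i in 0 1 2; do bin/ceph osd out $i; done
    # ...wait for peering/remapping to settle...
    for i in 0 1 2; do bin/ceph osd in $i; done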

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.0 1 1 2 0 1 72750205 18 18 active+recovery_unfound+degraded+remapped 2018-02-16 18:03:44.924607 12'18 41:301 [1] 1 [1,2] 1 12'18 2018-02-16 17:44:18.140880 12'18 2018-02-16 17:44:18.140880 0

Killed all OSDs, then restarted them in order 0, 1, 2. osd.0 crashed as shown below.

  1. killall ceph-osd
  2. for i in $(seq 0 2); do bin/ceph-osd -i $i -c ceph.conf; done

    -2> 2018-02-16 18:11:49.233 7f2643a9e700 10 osd.0 pg_epoch: 44 pg[1.0( v 12'18 lc 12'17 (0'0,12'18] local-lis/les=43/44 n=1 ec=10/10 lis/c 43/31 les/c/f 44/32/0 43/43/43) [1,0] r=1 lpr=43 pi=[31,43)/3 luod=0'0 crt=12'18 lcod 0'0 active m=1] _handle_message: 0x55644b334bc0
    -1> 2018-02-16 18:11:49.233 7f2643a9e700 10 osd.0 pg_epoch: 44 pg[1.0( v 12'18 lc 12'17 (0'0,12'18] local-lis/les=43/44 n=1 ec=10/10 lis/c 43/31 les/c/f 44/32/0 43/43/43) [1,0] r=1 lpr=43 pi=[31,43)/3 luod=0'0 crt=12'18 lcod 0'0 active m=1] do_repop 1:602f83fe:::foo:head v 44'19 (transaction) 143
     0> 2018-02-16 18:11:49.233 7f2643a9e700 -1 /home/dzafman/ceph/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::do_repop(OpRequestRef)' thread 7f2643a9e700 time 2018-02-16 18:11:49.235446
/home/dzafman/ceph/src/osd/ReplicatedBackend.cc: 1085: FAILED assert(!parent->get_log().get_missing().is_missing(soid))

 ceph version 13.0.1-2022-gcb1cf5e (cb1cf5ef1b559f2bffe54abdf6b6cae453db6ebf) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x7f26633c3615]
 2: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0x11c) [0x5564494375ac]
 3: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x55644943a347]
 4: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55644934bee7]
 5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x675) [0x5564492f9605]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x341) [0x556449155d21]
 7: (PGOpItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5564493cdb52]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf24) [0x55644915df94]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4f2) [0x7f26633c90e2]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f26633cb440]
 11: (()+0x76ba) [0x7f2661edb6ba]
 12: (clone()+0x6d) [0x7f266116582d]


Files

osd.0.log.gz (586 KB) David Zafman, 02/21/2018 12:51 AM
osd.1.log.1.gz (675 KB) First part of osd.1.log, David Zafman, 02/21/2018 12:54 AM
osd.1.log.2.gz (534 KB) Second part of osd.1.log, David Zafman, 02/21/2018 12:54 AM
#1

Updated by David Zafman about 6 years ago

  • Description updated (diff)
#2

Updated by David Zafman about 6 years ago

  • Description updated (diff)
#3

Updated by David Zafman about 6 years ago

osd.0 was the primary before it crashed; it came back up and crashed again as originally described in this bug. This is the first crash:

2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 36 do_recovery starting 1 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1]
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_primary recovering 0 in pg, missing missing(1 may_include_deletes = 1)
2018-02-16 18:03:40.359 7f9c0355e700 25 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_primary {1:602f83fe:::foo:head=12'18 flags = none}
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_primary 1:602f83fe:::foo:head 12'18 (missing) (missing head)
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] start_recovery_op 1:602f83fe:::foo:head
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 36 start_recovery_op pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] 1:602f83fe:::foo:head (0/3 rops)
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] recover_object: 1:602f83fe:::foo:head
2018-02-16 18:03:40.359 7f9c0355e700  7 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] pull 1:602f83fe:::foo:head v 12'18 on osds 1 from osd.1
2018-02-16 18:03:40.359 7f9c0355e700 10 osd.0 pg_epoch: 36 pg[1.0( v 12'18 (0'0,12'18] local-lis/les=35/36 n=1 ec=10/10 lis/c 35/31 les/c/f 36/32/0 34/35/35) []/[0,2] r=0 lpr=35 pi=[31,35)/1 rops=1 crt=12'18 lcod 0'0 mlcod 0'0 active+recovering+degraded+remapped m=1] pulling soid 1:602f83fe:::foo:head from osd 1 at version 0'0 rather than at version 12'18
2018-02-16 18:03:40.375 7f9c0355e700 -1 /home/dzafman/ceph/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f9c0355e700 time 2018-02-16 18:03:40.362938
/home/dzafman/ceph/src/osd/ReplicatedBackend.cc: 1411: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( soid)->second->reverting_to == v))

 ceph version 13.0.1-2022-gcb1cf5e (cb1cf5ef1b559f2bffe54abdf6b6cae453db6ebf) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x7f9c22e83615]
 2: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0xf8f) [0x560014561c1f]
 3: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xed) [0x5600145660ad]
 4: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x1e9) [0x5600143ecc99]
 5: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x14bd) [0x5600144283ed]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x246) [0x56001442e866]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x3bb) [0x56001429d95b]
 8: (PGRecovery::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1c) [0x5600144fa91c]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xf24) [0x56001428af94]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4f2) [0x7f9c22e890e2]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f9c22e8b440]
 12: (()+0x76ba) [0x7f9c2199b6ba]
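
The failed check expects a pull at a version other than the newest (here 0'0 instead of 12'18) to be backed by a LOST_REVERT entry in the PG log. LOST_REVERT entries are normally written when unfound objects are reverted; a hedged sketch of that operator path follows (whether it was actually run here is not shown in the excerpt):

    # List the objects this PG considers unfound.
    bin/ceph pg 1.0 list_unfound
    # Revert unfound objects to their last known version; this is the
    # path that records LOST_REVERT entries in the PG log.
    bin/ceph pg 1.0 mark_unfound_lost revert
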
#4

Updated by David Zafman about 6 years ago

#5

Updated by David Zafman about 6 years ago

#6

Updated by Neha Ojha almost 6 years ago

I hit this issue a couple of times while trying to reproduce #23614.

2018-05-18T17:01:29.790 INFO:tasks.ceph.osd.5.smithi093.stderr:/build/ceph-13.1.0-468-gbd982ed/src/osd/ReplicatedBackend.cc: 1315: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find( soid)->second->reverting_to == v))

http://pulpito.ceph.com/nojha-2018-05-18_16:20:30-rados-master-distro-basic-smithi/2547989/
http://pulpito.ceph.com/nojha-2018-05-21_17:34:19-rados-master-distro-basic-smithi/2562484/

#8

Updated by David Zafman about 5 years ago

  • Assignee changed from David Zafman to Neha Ojha
  • Pull request ID set to 25219

This should be assigned to http://tracker.ceph.com/users/3114, but that user doesn't show up in the assignee list.

#9

Updated by Nathan Cutler about 5 years ago

  • Assignee changed from Neha Ojha to Zengran Zhang
#10

Updated by huang jun over 3 years ago

Hi, what's the status of this problem now? Has the assert been resolved in the QA tests?
