Bug #18624
closed
/home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid))
% Done: 0%
Source: Development
Tags:
Backport: kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Pull request https://github.com/ceph/ceph/pull/12888 for tracker http://tracker.ceph.com/issues/18165 added this assert.
I ran a cluster with osd_min_pg_log_entries=1 and osd_max_pg_log_entries=5 and created an EC pool with k=2, m=1. I created objects and corrupted shards 1 and 2 of one of them. Then I marked out the 2 OSDs holding shards 1 and 2. The shard 0 OSD, which was also the primary, crashed because it could not read the corrupted object while trying to backfill the 2 new OSDs.
-9> 2017-01-20 11:55:23.671069 7f08f4d5d700 5 -- op tracker -- seq: 43355, time: 2017-01-20 11:55:23.671068, event: reached_pg, op: MOSDECSubOpReadReply(4.0s0 497 ECSubReadReply(tid=2714, attrs_read=0))
-8> 2017-01-20 11:55:23.671072 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(4.0s0 497 ECSubReadReply(tid=2714, attrs_read=0)) v1
-7> 2017-01-20 11:55:23.671078 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=2714, attrs_read=0)
-6> 2017-01-20 11:55:23.671083 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply shard=3(1) error=-5
-5> 2017-01-20 11:55:23.671088 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply Complete: ReadOp(tid=2714, to_read={4:b2ec4718:::obj500:head=read_request_t(to_read=[0,8388608,0], need=0(2),3(1), want_attrs=0)}, complete={4:b2ec4718:::obj500:head=read_result_t(r=0, errors={0(2)=-5,3(1)=-5}, noattrs, returned=(0, 8388608, []))}, priority=3, obj_to_source={4:b2ec4718:::obj500:head=0(2),3(1)}, source_to_obj={0(2)=4:b2ec4718:::obj500:head,3(1)=4:b2ec4718:::obj500:head}, in_progress=)
-4> 2017-01-20 11:55:23.671097 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] _failed_push: Read error 4:b2ec4718:::obj500:head r=0 errors={0(2)=-5,3(1)=-5}
-3> 2017-01-20 11:55:23.671103 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] _failed_push: canceling recovery op for obj 4:b2ec4718:::obj500:head
-2> 2017-01-20 11:55:23.671113 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] failed_push: 4:b2ec4718:::obj500:head
-1> 2017-01-20 11:55:23.671119 7f08f4d5d700 15 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] requeue_ops
0> 2017-01-20 11:55:23.673850 7f08f4d5d700 -1 /home/dzafman/ceph/src/osd/PG.h: In function 'eversion_t PG::MissingLoc::get_version_needed(const hobject_t&) const' thread 7f08f4d5d700 time 2017-01-20 11:55:23.671136
/home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid))
ceph version 11.1.0-6744-ga6e986b (a6e986bdf8abee79277736e26df6dab85b0372b0)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x84) [0x7f090d68c9b4]
2: (PrimaryLogPG::failed_push(std::list<pg_shard_t, std::allocator<pg_shard_t> > const&, hobject_t const&)+0x72f) [0x7f0916defbdf]
3: (ECBackend::_failed_push(hobject_t const&, std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x299) [0x7f0916f45129]
4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x3e) [0x7f0916f6c40e]
5: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x6f) [0x7f0916f4466f]
6: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf8d) [0x7f0916f4b8fd]
7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x176) [0x7f0916f56586]
8: (PrimaryLogPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x7f0916e01d1d]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3cd) [0x7f0916cb9d2d]
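The read errors in the log (errors={0(2)=-5,3(1)=-5}, where -5 is EIO) mean that only one shard of obj500 is still readable in a k=2, m=1 pool, which is fewer than the k shards needed to decode the object. The following is a minimal sketch of that arithmetic only. ShardReads, readable_shards and can_reconstruct are hypothetical names made up for illustration and are not Ceph APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical model, not Ceph code: per-shard read results for one object.
struct ShardReads {
    std::size_t k;                       // data chunks required to decode
    std::map<int, int> result_by_shard;  // shard index -> 0 on success, negative errno on failure
};

// Count the shards whose read succeeded.
inline std::size_t readable_shards(const ShardReads& r) {
    std::size_t ok = 0;
    for (const auto& kv : r.result_by_shard) {
        if (kv.second == 0) {
            ++ok;
        }
    }
    return ok;
}

// An erasure-coded object can be decoded only if at least k shards are readable.
inline bool can_reconstruct(const ShardReads& r) {
    assert(r.k > 0);
    return readable_shards(r) >= r.k;
}
```

With k=2 and results {0: 0, 1: -5, 2: -5}, readable_shards returns 1 and can_reconstruct returns false, which matches the "Read error" reported by _failed_push above.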
Updated by David Zafman over 7 years ago
When I restarted the primary under gdb, the same assert occurred. I was able to look at missing_loc:
#4 PrimaryLogPG::failed_push (this=0x5555573ff000, from=std::list, soid=...) at /home/dzafman/ceph/src/osd/PrimaryLogPG.cc:9723
9723 miter->second.add(soid, missing_loc.get_version_needed(soid), eversion_t());
(gdb) print missing_loc
$2 = {needs_recovery_map = std::map with 0 elements, missing_loc = std::map with 1 elements = {[{oid = {name = "obj500"}, snap = {val = 18446744073709551614}, hash = 417478477, max = false, nibblewise_key_cache = 3564318337, hash_reverse_bits = 3001829144, static POOL_META = -1, static POOL_TEMP_START = -2, pool = 4, nspace = "", key = ""}] = std::set with 0 elements}, missing_loc_sources = std::set with 0 elements, pg = 0x5555573ff000, empty_set = std::set with 0 elements, is_readable = {px = 0x55555735cf40}, is_recoverable = {px = 0x555556c87ed0}}
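The gdb dump shows the state that trips the assert: missing_loc has an entry for obj500 but needs_recovery_map is empty, so get_version_needed(soid) fails its count() check. The sketch below models that state with plain standard containers. It is a simplified stand-in, not the real PG::MissingLoc (which is keyed on hobject_t and returns eversion_t), and find_version_needed is a hypothetical non-asserting lookup added purely for illustration; it is not claimed to be the fix that was applied.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Simplified stand-in for PG::MissingLoc. Member names follow the gdb dump;
// the key and value types are assumptions chosen to keep the sketch small.
struct MissingLocModel {
    std::map<std::string, unsigned long> needs_recovery_map;  // object -> version needed
    std::map<std::string, std::set<int>> missing_loc;          // object -> shards holding it

    // Mirrors the asserting accessor: aborts if the object was never registered.
    unsigned long get_version_needed(const std::string& hoid) const {
        assert(needs_recovery_map.count(hoid));
        return needs_recovery_map.at(hoid);
    }

    // Hypothetical variant: returns nullptr instead of asserting when the
    // object is absent, so a caller could handle the backfill case explicitly.
    const unsigned long* find_version_needed(const std::string& hoid) const {
        auto it = needs_recovery_map.find(hoid);
        return it == needs_recovery_map.end() ? nullptr : &it->second;
    }
};
```

Reproducing the dumped state (an empty set under missing_loc["obj500"], nothing in needs_recovery_map) makes find_version_needed("obj500") return nullptr, whereas get_version_needed("obj500") would abort just as frame #4 does.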
Updated by David Zafman over 7 years ago
- Copied to Backport #18659: kraken: /home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid)) added
Updated by David Zafman over 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai over 7 years ago
- Has duplicate Bug #18658: PrimaryLogPG: failed_push(): FAILED assert(miter != peer_missing.end()) added
Updated by Kefu Chai over 7 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 7 years ago
- Status changed from Pending Backport to Resolved