Bug #18624

closed

/home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid))

Added by David Zafman over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This assert was added by pull request https://github.com/ceph/ceph/pull/12888 (tracker http://tracker.ceph.com/issues/18165).

I ran a cluster with osd_min_pg_log_entries=1 and osd_max_pg_log_entries=5 and created an EC pool with k=2, m=1. I created objects and corrupted shards 1 and 2 of one of them, then marked out the 2 OSDs holding those shards. The OSD holding shard 0, which was also the primary, crashed: while attempting to backfill the 2 new OSDs, it could not read the corrupted object.
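To spell out why the object in this reproduction cannot be rebuilt, here is a minimal sketch of the usual k+m erasure-coding rule: an object is recoverable only while at least k of its k+m shards are readable. The helper name is_recoverable and its parameters are hypothetical and are not taken from the Ceph source.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper, not Ceph code: an object in a k+m erasure-coded pool
// can be rebuilt only if at least k of its k+m shards are still readable.
bool is_recoverable(std::size_t k, std::size_t m,
                    const std::set<std::size_t>& corrupted_shards) {
    std::size_t readable = 0;
    for (std::size_t shard = 0; shard < k + m; ++shard) {
        // Count every shard that is not in the corrupted set.
        if (corrupted_shards.count(shard) == 0) {
            ++readable;
        }
    }
    return readable >= k;
}
```

With k=2, m=1 and shards 1 and 2 corrupted, only shard 0 remains readable, so the reads issued for backfill are expected to fail rather than return data.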

    -9> 2017-01-20 11:55:23.671069 7f08f4d5d700  5 -- op tracker -- seq: 43355, time: 2017-01-20 11:55:23.671068, event: reached_pg, op: MOSDECSubOpReadReply(4.0s0 497 ECSubReadReply(tid=2714, attrs_read=0))
    -8> 2017-01-20 11:55:23.671072 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(4.0s0 497 ECSubReadReply(tid=2714, attrs_read=0)) v1
    -7> 2017-01-20 11:55:23.671078 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=2714, attrs_read=0)
    -6> 2017-01-20 11:55:23.671083 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply shard=3(1) error=-5
    -5> 2017-01-20 11:55:23.671088 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply Complete: ReadOp(tid=2714, to_read={4:b2ec4718:::obj500:head=read_request_t(to_read=[0,8388608,0], need=0(2),3(1), want_attrs=0)}, complete={4:b2ec4718:::obj500:head=read_result_t(r=0, errors={0(2)=-5,3(1)=-5}, noattrs, returned=(0, 8388608, []))}, priority=3, obj_to_source={4:b2ec4718:::obj500:head=0(2),3(1)}, source_to_obj={0(2)=4:b2ec4718:::obj500:head,3(1)=4:b2ec4718:::obj500:head}, in_progress=)
    -4> 2017-01-20 11:55:23.671097 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] _failed_push: Read error 4:b2ec4718:::obj500:head r=0 errors={0(2)=-5,3(1)=-5}
    -3> 2017-01-20 11:55:23.671103 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] _failed_push: canceling recovery op for obj 4:b2ec4718:::obj500:head
    -2> 2017-01-20 11:55:23.671113 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] failed_push: 4:b2ec4718:::obj500:head
    -1> 2017-01-20 11:55:23.671119 7f08f4d5d700 15 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling]  requeue_ops
     0> 2017-01-20 11:55:23.673850 7f08f4d5d700 -1 /home/dzafman/ceph/src/osd/PG.h: In function 'eversion_t PG::MissingLoc::get_version_needed(const hobject_t&) const' thread 7f08f4d5d700 time 2017-01-20 11:55:23.671136
/home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid))

 ceph version 11.1.0-6744-ga6e986b (a6e986bdf8abee79277736e26df6dab85b0372b0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x84) [0x7f090d68c9b4]
 2: (PrimaryLogPG::failed_push(std::list<pg_shard_t, std::allocator<pg_shard_t> > const&, hobject_t const&)+0x72f) [0x7f0916defbdf]
 3: (ECBackend::_failed_push(hobject_t const&, std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x299) [0x7f0916f45129]
 4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x3e) [0x7f0916f6c40e]
 5: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x6f) [0x7f0916f4466f]
 6: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf8d) [0x7f0916f4b8fd]
 7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x176) [0x7f0916f56586]
 8: (PrimaryLogPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x7f0916e01d1d]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3cd) [0x7f0916cb9d2d]

Related issues 2 (0 open, 2 closed)

Has duplicate Ceph - Bug #18658: PrimaryLogPG: failed_push(): FAILED assert(miter != peer_missing.end()) (Duplicate, 01/24/2017)
Copied to Ceph - Backport #18659: kraken: /home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid)) (Resolved, David Zafman)

#1

Updated by David Zafman over 7 years ago

After restarting the primary under gdb, the same assert occurred. I was able to inspect missing_loc:

#4  PrimaryLogPG::failed_push (this=0x5555573ff000, from=std::list, soid=...) at /home/dzafman/ceph/src/osd/PrimaryLogPG.cc:9723
9723        miter->second.add(soid, missing_loc.get_version_needed(soid), eversion_t());
(gdb) print missing_loc
$2 = {needs_recovery_map = std::map with 0 elements, missing_loc = std::map with 1 elements = {[{oid = {name = "obj500"}, snap = {val = 18446744073709551614}, hash = 417478477,
      max = false, nibblewise_key_cache = 3564318337, hash_reverse_bits = 3001829144, static POOL_META = -1, static POOL_TEMP_START = -2, pool = 4, nspace = "",
      key = ""}] = std::set with 0 elements}, missing_loc_sources = std::set with 0 elements, pg = 0x5555573ff000, empty_set = std::set with 0 elements, is_readable = {
    px = 0x55555735cf40}, is_recoverable = {px = 0x555556c87ed0}}
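The dump above shows needs_recovery_map with 0 elements, so the count(hoid) assert in get_version_needed has to fire for obj500. The sketch below models that lookup with a simplified stand-in; MissingLocSketch, the string key and the integer version are assumptions for illustration (the real PG::MissingLoc uses hobject_t and eversion_t), and try_get_version_needed is a hypothetical non-aborting variant, not the fix that was merged. It needs C++17 for std::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Simplified stand-in for PG::MissingLoc. Only the map name mirrors the gdb
// dump; the key and value types are placeholders chosen for this sketch.
struct MissingLocSketch {
    std::map<std::string, std::uint64_t> needs_recovery_map;

    // Models the asserting lookup from the trace: aborts if the object is absent.
    std::uint64_t get_version_needed(const std::string& oid) const {
        auto it = needs_recovery_map.find(oid);
        assert(it != needs_recovery_map.end());
        return it->second;
    }

    // Hypothetical variant that reports absence to the caller instead of aborting.
    std::optional<std::uint64_t> try_get_version_needed(const std::string& oid) const {
        auto it = needs_recovery_map.find(oid);
        if (it == needs_recovery_map.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};
```

With an empty map, try_get_version_needed("obj500") returns an empty optional, whereas get_version_needed("obj500") would abort in the same way as the trace in the description.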
#2

Updated by David Zafman over 7 years ago

  • Description updated (diff)
#3

Updated by David Zafman over 7 years ago

  • Backport set to kraken
#4

Updated by David Zafman over 7 years ago

  • Copied to Backport #18659: kraken: /home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid)) added
#5

Updated by David Zafman over 7 years ago

  • Status changed from 12 to Fix Under Review
#6

Updated by Kefu Chai over 7 years ago

  • Has duplicate Bug #18658: PrimaryLogPG: failed_push(): FAILED assert(miter != peer_missing.end()) added
#7

Updated by Kefu Chai over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
#8

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved