Backport #18659

Updated by Nathan Cutler about 7 years ago

https://github.com/ceph/ceph/pull/13091
Tracker: http://tracker.ceph.com/issues/18165. Pull request https://github.com/ceph/ceph/pull/12888 added this assert.

I ran a cluster with osd_min_pg_log_entries=1 and osd_max_pg_log_entries=5 and created an EC pool with k=2, m=1. I created objects and corrupted shards 1 and 2 of one of the objects, then marked out the two OSDs holding those shards. The shard 0 OSD, which was also the primary, crashed because it couldn't read the corrupted object while attempting to backfill the two new OSDs.

<pre>
-9> 2017-01-20 11:55:23.671069 7f08f4d5d700 5 -- op tracker -- seq: 43355, time: 2017-01-20 11:55:23.671068, event: reached_pg, op: MOSDECSubOpReadReply(4.0s0 497 ECSubReadReply(tid=2714, attrs_read=0))
-8> 2017-01-20 11:55:23.671072 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(4.0s0 497 ECSubReadReply(tid=2714, attrs_read=0)) v1
-7> 2017-01-20 11:55:23.671078 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=2714, attrs_read=0)
-6> 2017-01-20 11:55:23.671083 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply shard=3(1) error=-5
-5> 2017-01-20 11:55:23.671088 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply Complete: ReadOp(tid=2714, to_read={4:b2ec4718:::obj500:head=read_request_t(to_read=[0,8388608,0], need=0(2),3(1), want_attrs=0)}, complete={4:b2ec4718:::obj500:head=read_result_t(r=0, errors={0(2)=-5,3(1)=-5}, noattrs, returned=(0, 8388608, []))}, priority=3, obj_to_source={4:b2ec4718:::obj500:head=0(2),3(1)}, source_to_obj={0(2)=4:b2ec4718:::obj500:head,3(1)=4:b2ec4718:::obj500:head}, in_progress=)
-4> 2017-01-20 11:55:23.671097 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] _failed_push: Read error 4:b2ec4718:::obj500:head r=0 errors={0(2)=-5,3(1)=-5}
-3> 2017-01-20 11:55:23.671103 7f08f4d5d700 10 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] _failed_push: canceling recovery op for obj 4:b2ec4718:::obj500:head
-2> 2017-01-20 11:55:23.671113 7f08f4d5d700 20 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] failed_push: 4:b2ec4718:::obj500:head
-1> 2017-01-20 11:55:23.671119 7f08f4d5d700 15 osd.4 pg_epoch: 497 pg[4.0s0( v 494'1000 (494'900,494'1000] local-les=496 n=1000 ec=493 les/c/f 496/494/0 495/495/493) [4,2,1]/[4,3,0] r=0 lpr=495 pi=493-494/1 rops=1 bft=1(2),2(1) crt=494'1000 lcod 494'999 mlcod 0'0 active+remapped+backfilling] requeue_ops
0> 2017-01-20 11:55:23.673850 7f08f4d5d700 -1 /home/dzafman/ceph/src/osd/PG.h: In function 'eversion_t PG::MissingLoc::get_version_needed(const hobject_t&) const' thread 7f08f4d5d700 time 2017-01-20 11:55:23.671136
/home/dzafman/ceph/src/osd/PG.h: 441: FAILED assert(needs_recovery_map.count(hoid))

ceph version 11.1.0-6744-ga6e986b (a6e986bdf8abee79277736e26df6dab85b0372b0)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x84) [0x7f090d68c9b4]
2: (PrimaryLogPG::failed_push(std::list<pg_shard_t, std::allocator<pg_shard_t> > const&, hobject_t const&)+0x72f) [0x7f0916defbdf]
3: (ECBackend::_failed_push(hobject_t const&, std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x299) [0x7f0916f45129]
4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x3e) [0x7f0916f6c40e]
5: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x6f) [0x7f0916f4466f]
6: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf8d) [0x7f0916f4b8fd]
7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x176) [0x7f0916f56586]
8: (PrimaryLogPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x7f0916e01d1d]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3cd) [0x7f0916cb9d2d]
</pre>
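
For context on why the assert fires: PG::MissingLoc::get_version_needed() assumes any object it is asked about is already tracked in needs_recovery_map, but an object that only needs backfill (like obj500 above) is never entered there, so the failed-read path through _failed_push()/failed_push() reaches the lookup with an untracked hoid. Below is a minimal standalone sketch of that failure shape, with simplified stand-in types and member names (not the actual Ceph source, which uses its own assert machinery):

<pre>
// Sketch only: approximates the check in PG::MissingLoc::get_version_needed
// as implied by the trace; types and members are simplified assumptions.
#include <cassert>
#include <map>

struct hobject_t {
  int id;
  bool operator<(const hobject_t &o) const { return id < o.id; }
};
struct eversion_t { unsigned epoch = 0; unsigned version = 0; };
struct pg_missing_item { eversion_t need; };

struct MissingLoc {
  // Populated only for objects in the recovery missing set; backfill-only
  // objects (like obj500 in the log) are never inserted here.
  std::map<hobject_t, pg_missing_item> needs_recovery_map;

  eversion_t get_version_needed(const hobject_t &hoid) const {
    // Corresponds to "FAILED assert(needs_recovery_map.count(hoid))":
    // failed_push() on a backfill object reaches this lookup with an
    // hoid that was never tracked, so the assert aborts the OSD.
    assert(needs_recovery_map.count(hoid));
    return needs_recovery_map.at(hoid).need;
  }
};

int main() {
  MissingLoc ml;
  ml.get_version_needed(hobject_t{500});  // aborts: object not in the map
}
</pre>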
