Bug #41385
osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
Status: Closed
Description
-596> 2019-08-21T17:06:40.152+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3
-357> 2019-08-21T17:06:40.155+0000 7f6213ac3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4082-g5b8890b/rpm/el7/BUILD/ceph-15.0.0-4082-g5b8890b/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f6213ac3700 time 2019-08-21T17:06:40.153563+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4082-g5b8890b/rpm/el7/BUILD/ceph-15.0.0-4082-g5b8890b/src/osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
ceph version 15.0.0-4082-g5b8890b (5b8890baac34824fe9058ef255119c7ef0e78c44) octopus (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55ff74457263]
2: (()+0x4d942b) [0x55ff7445742b]
3: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x67c) [0x55ff7482db2c]
4: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xd6) [0x55ff748386a6]
5: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x393) [0x55ff74647543]
6: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x884) [0x55ff74690fa4]
7: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x3ee) [0x55ff7469a78e]
8: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x342) [0x55ff74539622]
9: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55ff7476a539]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1427) [0x55ff74555f57]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55ff74afa5c6]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ff74afc720]
13: (()+0x7dd5) [0x7f6234537dd5]
14: (clone()+0x6d) [0x7f62333fe02d]
/a/sage-2019-08-21_15:17:39-rados-wip-sage2-testing-2019-08-20-0935-distro-basic-smithi/4237063
Looks like peer_missing has no entry for osd.3, but osd.3 was still listed as a location in missing_loc.
Updated by Neha Ojha over 4 years ago
Here's the chain of events that causes this:
Two objects go missing on the primary, and we want to recover them from other copies on 3 and 4.
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean] rep_repair_primary_object 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] read got -11 / 65536 bytes from obj 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head
2019-08-21T17:06:40.141+0000 7f6213ac3700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head, will try copies on 3,4
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f620fabb700 5 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object: Read error on 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, but already seen errors
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] read got -11 / 65536 bytes from obj 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head
2019-08-21T17:06:40.141+0000 7f620fabb700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, will try copies on 3,4
We send a PGRemove to osd.3 in purge_strays.
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] purge_strays 3
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] sending PGRemove to osd.3
In this process we remove osd.3 from peer_missing, but not from missing_loc.
See PeeringState::purge_strays() https://github.com/ceph/ceph/blob/master/src/osd/PeeringState.cc#L213-L214
Then we try to recover those two objects:
2019-08-21T17:06:40.147+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=2}}] pull 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head v 19'224 on osds 3,4 from osd.4
This works fine because we try to recover it from osd.4, which is still present in peer_missing.
The next one fails:
2019-08-21T17:06:40.152+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3
This one fails because we try to pull it from osd.3, which is no longer in peer_missing, so the assert in prepare_pull() fires.
I think the fix is to remove the stray from missing_loc as well when we clear peer_missing in purge_strays().
Updated by Neha Ojha over 4 years ago
Have been able to reproduce it here: http://pulpito.ceph.com/nojha-2019-08-28_19:12:09-rados:singleton-master-distro-basic-smithi/
Updated by Neha Ojha over 4 years ago
- Status changed from In Progress to Fix Under Review
https://github.com/ceph/ceph/pull/30119 (merged September 4, 2019)
https://github.com/ceph/ceph/pull/30059 (merged September 7, 2019)
https://github.com/ceph/ceph/pull/30226 (merged September 9, 2019)
Updated by Neha Ojha over 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to luminous,mimic,nautilus
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41730: luminous: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41731: nautilus: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41732: mimic: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added
Updated by Neha Ojha over 4 years ago
Nathan Cutler wrote:
@Neha - backport all three PRs?
Yes, note that the backport of https://github.com/ceph/ceph/pull/30059 should happen after https://github.com/ceph/ceph/pull/30271
Updated by Neha Ojha almost 2 years ago
- Status changed from Pending Backport to Resolved