Bug #41385

osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))

Added by Sage Weil almost 2 years ago. Updated almost 2 years ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous,mimic,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  -596> 2019-08-21T17:06:40.152+0000 7f6213ac3700  7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3
  -357> 2019-08-21T17:06:40.155+0000 7f6213ac3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4082-g5b8890b/rpm/el7/BUILD/ceph-15.0.0-4082-g5b8890b/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f6213ac3700 time 2019-08-21T17:06:40.153563+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4082-g5b8890b/rpm/el7/BUILD/ceph-15.0.0-4082-g5b8890b/src/osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))

 ceph version 15.0.0-4082-g5b8890b (5b8890baac34824fe9058ef255119c7ef0e78c44) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55ff74457263]
 2: (()+0x4d942b) [0x55ff7445742b]
 3: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x67c) [0x55ff7482db2c]
 4: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xd6) [0x55ff748386a6]
 5: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x393) [0x55ff74647543]
 6: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x884) [0x55ff74690fa4]
 7: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x3ee) [0x55ff7469a78e]
 8: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x342) [0x55ff74539622]
 9: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55ff7476a539]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1427) [0x55ff74555f57]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55ff74afa5c6]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ff74afc720]
 13: (()+0x7dd5) [0x7f6234537dd5]
 14: (clone()+0x6d) [0x7f62333fe02d]

/a/sage-2019-08-21_15:17:39-rados-wip-sage2-testing-2019-08-20-0935-distro-basic-smithi/4237063

It looks like peer_missing has no entry for osd.3, even though osd.3 was listed as a location in missing_loc.


Related issues

Copied to RADOS - Backport #41730: luminous: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) Resolved
Copied to RADOS - Backport #41731: nautilus: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) Need More Info
Copied to RADOS - Backport #41732: mimic: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) Rejected

History

#1 Updated by Neha Ojha almost 2 years ago

  • Assignee set to Neha Ojha

#2 Updated by Neha Ojha almost 2 years ago

  • Status changed from 12 to In Progress

#3 Updated by Neha Ojha almost 2 years ago

Here's the chain of events that causes this:

Two objects hit read errors on the primary, so they are marked missing and we want to recover them from the copies on osd.3 and osd.4.

2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean] rep_repair_primary_object 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1]  read got -11 / 65536 bytes from obj 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head

2019-08-21T17:06:40.141+0000 7f6213ac3700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head, will try copies on 3,4

2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f620fabb700  5 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object: Read error on 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, but already seen errors
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1]  read got -11 / 65536 bytes from obj 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head
2019-08-21T17:06:40.141+0000 7f620fabb700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, will try copies on 3,4

We then send a PGRemove to osd.3 in purge_strays():

2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] purge_strays 3
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] sending PGRemove to osd.3


In the process, we remove osd.3 from peer_missing but not from missing_loc.

See PeeringState::purge_strays() https://github.com/ceph/ceph/blob/master/src/osd/PeeringState.cc#L213-L214

Then we try to recover the two objects:

2019-08-21T17:06:40.147+0000 7f6213ac3700  7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=2}}] pull 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head v 19'224 on osds 3,4 from osd.4

This one succeeds because the chosen source is osd.4, which is still present in peer_missing.

The next one fails:

2019-08-21T17:06:40.152+0000 7f6213ac3700  7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3

This time we try to pull from osd.3, which is no longer in peer_missing, so the ceph_assert fires.

I think the fix is to clear the corresponding missing_loc entries as well when we clear peer_missing in purge_strays().

#5 Updated by Neha Ojha almost 2 years ago

  • Status changed from In Progress to Fix Under Review

https://github.com/ceph/ceph/pull/30119 (merged September 4, 2019)
https://github.com/ceph/ceph/pull/30059 (merged September 7, 2019)
https://github.com/ceph/ceph/pull/30226 (merged September 9, 2019)

#6 Updated by Neha Ojha almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to luminous,mimic,nautilus

#7 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #41730: luminous: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added

#8 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #41731: nautilus: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added

#9 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #41732: mimic: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added

#10 Updated by Nathan Cutler almost 2 years ago

@Neha - backport all three PRs?

#11 Updated by Neha Ojha almost 2 years ago

Nathan Cutler wrote:

@Neha - backport all three PRs?

Yes. Note that the backport of https://github.com/ceph/ceph/pull/30059 needs to happen after https://github.com/ceph/ceph/pull/30271
