Bug #41385
osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
Status: Closed
Description
-596> 2019-08-21T17:06:40.152+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3
-357> 2019-08-21T17:06:40.155+0000 7f6213ac3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4082-g5b8890b/rpm/el7/BUILD/ceph-15.0.0-4082-g5b8890b/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f6213ac3700 time 2019-08-21T17:06:40.153563+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/15.0.0-4082-g5b8890b/rpm/el7/BUILD/ceph-15.0.0-4082-g5b8890b/src/osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
ceph version 15.0.0-4082-g5b8890b (5b8890baac34824fe9058ef255119c7ef0e78c44) octopus (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55ff74457263]
2: (()+0x4d942b) [0x55ff7445742b]
3: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x67c) [0x55ff7482db2c]
4: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xd6) [0x55ff748386a6]
5: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x393) [0x55ff74647543]
6: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x884) [0x55ff74690fa4]
7: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x3ee) [0x55ff7469a78e]
8: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x342) [0x55ff74539622]
9: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55ff7476a539]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1427) [0x55ff74555f57]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55ff74afa5c6]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ff74afc720]
13: (()+0x7dd5) [0x7f6234537dd5]
14: (clone()+0x6d) [0x7f62333fe02d]
/a/sage-2019-08-21_15:17:39-rados-wip-sage2-testing-2019-08-20-0935-distro-basic-smithi/4237063
Looks like peer_missing has no entry for osd.3, but osd.3 was still listed as a location in missing_loc.
Updated by Neha Ojha over 4 years ago
Here's the chain of events that causes this:
Two objects go missing on the primary, and we want to recover them from other copies on 3 and 4.
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean] rep_repair_primary_object 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] read got -11 / 65536 bytes from obj 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head
2019-08-21T17:06:40.141+0000 7f6213ac3700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head, will try copies on 3,4
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f620fabb700 5 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object: Read error on 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, but already seen errors
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] read got -11 / 65536 bytes from obj 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head
2019-08-21T17:06:40.141+0000 7f620fabb700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, will try copies on 3,4
We send a PGRemove to osd.3 in purge_strays.
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] purge_strays 3
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] sending PGRemove to osd.3
In this process we remove osd.3 from peer_missing, but not from missing_loc.
See PeeringState::purge_strays() https://github.com/ceph/ceph/blob/master/src/osd/PeeringState.cc#L213-L214
Then we try to recover those two objects:
2019-08-21T17:06:40.147+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=2}}] pull 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head v 19'224 on osds 3,4 from osd.4
This works fine because we try to recover it from osd.4, which is still present in peer_missing.
The next one fails:
2019-08-21T17:06:40.152+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3
This one fails because we try to pull it from osd.3, which is no longer in peer_missing, so the assert in prepare_pull() fires.
I think the fix is to remove the stray from missing_loc as well when we clear peer_missing in purge_strays().
Updated by Neha Ojha over 4 years ago
Have been able to reproduce it here: http://pulpito.ceph.com/nojha-2019-08-28_19:12:09-rados:singleton-master-distro-basic-smithi/
Updated by Neha Ojha over 4 years ago
- Status changed from In Progress to Fix Under Review
https://github.com/ceph/ceph/pull/30119 (merged September 4, 2019)
https://github.com/ceph/ceph/pull/30059 (merged September 7, 2019)
https://github.com/ceph/ceph/pull/30226 (merged September 9, 2019)
Updated by Neha Ojha over 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to luminous,mimic,nautilus
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41730: luminous: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41731: nautilus: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41732: mimic: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard)) added
Updated by Neha Ojha over 4 years ago
Nathan Cutler wrote:
@Neha - backport all three PRs?
Yes, note that the backport of https://github.com/ceph/ceph/pull/30059 should happen after https://github.com/ceph/ceph/pull/30271
Updated by Neha Ojha almost 2 years ago
- Status changed from Pending Backport to Resolved