Bug #22570

out of order caused by letting old msg from down peer be processed to RESETSESSION

Added by mingxin liu over 1 year ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 01/05/2018
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

1. The slave acks two ops (op1, op2) to the primary.
2. The ack for op1 is dropped by a connection reset.
3. The ack for op2 is delivered to the primary.
4. The primary uses the service's map to decide whether the slave is down (messages from a down peer need to be dropped), but the service's map lags behind the OSD's map.
5. The primary processes op2, causing out-of-order processing.
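
The steps above can be sketched as a small model. This is only an illustration under stated assumptions, not Ceph's actual code: `OsdMapView`, `Reply`, `Outcome`, and `handle_reply` are hypothetical names, and the only detail borrowed from this ticket is the in-order expectation on `repop_queue`. It shows how the same discard check gives different results depending on whether it consults a lagging map or an up-to-date one.

```cpp
#include <cassert>
#include <deque>
#include <set>

// Hypothetical, simplified model; these are not Ceph's real classes.
struct OsdMapView {
    unsigned epoch;            // map epoch this view corresponds to
    std::set<int> down_osds;   // peers marked down as of this epoch
    bool is_down(int osd) const { return down_osds.count(osd) > 0; }
};

struct Reply {
    int from_osd;  // peer that sent the ack
    int op_id;     // op being acked
};

enum class Outcome { Discarded, InOrder, OutOfOrder };

// Process one ack on the primary. Acks are expected in submission order,
// so the acked op must be at the front of repop_queue.
Outcome handle_reply(std::deque<int>& repop_queue,
                     const OsdMapView& map,
                     const Reply& reply) {
    // Guard: drop messages from a peer that the consulted map marks down.
    if (map.is_down(reply.from_osd)) {
        return Outcome::Discarded;
    }
    // Without the guard firing, a stale ack reaches the ordering check.
    if (repop_queue.empty() || repop_queue.front() != reply.op_id) {
        return Outcome::OutOfOrder;  // where the real assert would trip
    }
    repop_queue.pop_front();
    return Outcome::InOrder;
}
```

With a queue holding op1 and op2, and op1's ack lost, feeding op2's ack through a map that does not yet mark the sender down yields `Outcome::OutOfOrder`, while a newer map that marks the sender down yields `Outcome::Discarded`.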


Related issues

Related to RADOS - Bug #21143: bad RESETSESSION between OSDs? Duplicate 08/26/2017

History

#1 Updated by mingxin liu over 1 year ago

assert(repop_queue.front() == repop);

#2 Updated by Greg Farnum over 1 year ago

Do you have logs or more about how this happened? There are a bunch of guards to prevent exactly this in cases where a connection reset happens. They might be leaky, but we'll need a little more to go on in identifying what went wrong.

#3 Updated by Sage Weil over 1 year ago

  • Subject changed from out of order caused by letting old msg from down peer be processed to RESETSESSION and OSD peer connections fundamentally racy
  • Status changed from New to Verified

Last time I looked at this I came to the conclusion that (1) there was a fundamental problem, (2) the best hope for properly fixing it is moving peer connection management into the OSD and out of the messenger, and (3) the workaround in can_discard_request() (or whatever it is) is good enough for now.

https://github.com/ceph/ceph/pull/17217#issuecomment-324997960

Note that mingxin's improvement merged: https://github.com/ceph/ceph/pull/19796

Leaving this ticket open for now.

#4 Updated by Sage Weil over 1 year ago

  • Related to Bug #21143: bad RESETSESSION between OSDs? added

#5 Updated by Sage Weil over 1 year ago

  • Subject changed from RESETSESSION and OSD peer connections fundamentally racy to out of order caused by letting old msg from down peer be processed to RESETSESSION
  • Status changed from Verified to Resolved

Actually, see existing ticket #21143.

#6 Updated by mingxin liu over 1 year ago

#7 Updated by Greg Farnum 4 months ago

  • Project changed from RADOS to Messengers
