Bug #22570
out of order caused by letting old msg from down peer be processed to RESETSESSION
Description
1. The slave acks two ops (op1, op2) to the primary.
2. op1 is dropped by a connection reset.
3. op2 is sent to the primary.
4. The primary uses the service's map to determine whether the slave is down (messages from a down peer need to be dropped, but the service's map lags behind the OSD's).
5. The primary processes op2, so the acks are handled out of order.
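The race in steps 4 and 5 can be illustrated with a minimal sketch. This is not Ceph code: `PeerMap`, `ReplicaReply`, `should_discard`, and `replay_scenario` are hypothetical names, and the epochs and peer id are made-up values. It only shows how the same "drop messages from a down peer" check gives different answers depending on which map is consulted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-in for a cluster map; not Ceph's OSDMap.
struct PeerMap {
    uint32_t epoch = 0;
    std::map<int, bool> up;  // peer id -> is the peer marked up?

    bool is_up(int peer) const {
        auto it = up.find(peer);
        return it != up.end() && it->second;
    }
};

// Hypothetical reply from a replica, tagged with the sender's id.
struct ReplicaReply {
    int from;
    int op_seq;
};

// Returns true when the reply should be dropped because the given map
// marks the sender as down (or does not know the sender at all).
bool should_discard(const PeerMap& map, const ReplicaReply& reply) {
    return !map.is_up(reply.from);
}

// Replays steps 3-5 with assumed values: the lagging map still shows
// peer 1 as up, while the newer map already shows it as down.
void replay_scenario() {
    PeerMap service_map;
    service_map.epoch = 10;
    service_map.up[1] = true;   // stale view

    PeerMap osd_map;
    osd_map.epoch = 11;
    osd_map.up[1] = false;      // newer view

    ReplicaReply op2{1, 2};
    assert(!should_discard(service_map, op2));  // stale map lets op2 through
    assert(should_discard(osd_map, op2));       // newer map would drop it
}
```

Under these assumptions, checking against the lagging map lets op2 through even though op1 was lost, which is the ordering violation described above.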
Related issues
History
#1 Updated by mingxin liu about 3 years ago
assert(repop_queue.front() == repop);
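For readers unfamiliar with this kind of check, here is a minimal sketch of why an out-of-order ack trips an assert of this shape. `InOrderQueue`, `try_complete`, and `complete` are hypothetical names and the queue holds plain integers; this is not the actual `repop_queue` type or its surrounding logic.

```cpp
#include <cassert>
#include <deque>

// Hypothetical in-order completion queue; not Ceph's repop_queue.
struct InOrderQueue {
    std::deque<int> pending;  // op sequence numbers in submission order

    void submit(int op_seq) { pending.push_back(op_seq); }

    // Pops the op and returns true if it is at the head of the queue;
    // returns false if completing it now would be out of order.
    bool try_complete(int op_seq) {
        if (pending.empty() || pending.front() != op_seq) return false;
        pending.pop_front();
        return true;
    }

    // Asserting variant, analogous in spirit to the check quoted above:
    // aborts (in builds with asserts enabled) on out-of-order completion.
    void complete(int op_seq) {
        bool in_order = try_complete(op_seq);
        assert(in_order);
        (void)in_order;  // avoid an unused-variable warning under NDEBUG
    }
};
```

With op1 and op2 both pending, completing op2 first fails the head check; that is the situation the scenario in the description produces once op1's ack is lost.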
#2 Updated by Greg Farnum about 3 years ago
Do you have logs or more about how this happened? There are a bunch of guards to prevent exactly this in cases where a connection reset happens. They might be leaky, but we'll need a little more to go on in identifying what went wrong.
#3 Updated by Sage Weil about 3 years ago
- Subject changed from out of order caused by letting old msg from down peer be processed to RESETSESSION and OSD peer connections fundamentally racy
- Status changed from New to 12
Last time I looked at this I came to the conclusion that (1) there was a fundamental problem, (2) the best hope for properly fixing it is moving peer connection management into the OSD and out of the messenger, and (3) the workaround in can_discard_request() (or whatever it is) is good enough for now.
https://github.com/ceph/ceph/pull/17217#issuecomment-324997960
Note that mingxin's improvement merged: https://github.com/ceph/ceph/pull/19796
Leaving this ticket open for now.
#4 Updated by Sage Weil about 3 years ago
- Related to Bug #21143: bad RESETSESSION between OSDs? added
#5 Updated by Sage Weil about 3 years ago
- Subject changed from RESETSESSION and OSD peer connections fundamentally racy to out of order caused by letting old msg from down peer be processed to RESETSESSION
- Status changed from 12 to Resolved
Actually, see existing ticket #21143.
#6 Updated by mingxin liu about 3 years ago
I wonder if http://tracker.ceph.com/issues/21287 is related.
#7 Updated by Greg Farnum almost 2 years ago
- Project changed from RADOS to Messengers
#8 Updated by Nathan Cutler over 1 year ago
- Status changed from Resolved to Pending Backport
- Backport set to luminous, mimic
#9 Updated by Nathan Cutler over 1 year ago
- Copied to Backport #42586: luminous: out of order caused by letting old msg from down peer be processed to RESETSESSION added
#11 Updated by Nathan Cutler over 1 year ago
- Pull request ID set to 19796
#12 Updated by Nathan Cutler over 1 year ago
- Backport changed from luminous, mimic to luminous
#13 Updated by Nathan Cutler over 1 year ago
"git describe" on the https://github.com/ceph/ceph/pull/19796 merge commit:
v13.0.1-845-ga7dc224536
#14 Updated by Nathan Cutler over 1 year ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".