out of order caused by letting old msg from down peer be processed to RESETSESSION
1. The slave acks two ops (op1, op2) to the primary.
2. op1 is dropped by a connection reset.
3. op2 is sent to the primary.
4. The primary uses the service's map to determine whether the slave is down (messages from a down peer need to be dropped, but the service's map lags behind the OSD's).
5. The primary processes op2, causing out-of-order processing.
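The steps above can be sketched as a small simulation. This is only an illustrative model, not Ceph code: OSDMap, Ack, should_discard and deliver are hypothetical names, and the epochs and OSD ids are made up. It assumes the only defence against old messages is a check of the sender's up/down state against whichever map is passed in.

```python
from dataclasses import dataclass, field


@dataclass
class OSDMap:
    """Hypothetical, minimal stand-in for a cluster map at one epoch."""
    epoch: int
    down: set = field(default_factory=set)

    def is_up(self, osd_id: int) -> bool:
        return osd_id not in self.down


@dataclass
class Ack:
    """Hypothetical ack message sent from a slave to the primary."""
    sender: int
    seq: int


def should_discard(msg: Ack, osdmap: OSDMap) -> bool:
    # Drop messages whose sender is marked down in the map consulted.
    return not osdmap.is_up(msg.sender)


def deliver(acks, lost_seqs, osdmap):
    """Return the sequence numbers the primary ends up processing."""
    processed = []
    for ack in acks:
        if ack.seq in lost_seqs:
            continue  # lost in the connection reset (step 2)
        if should_discard(ack, osdmap):
            continue  # sender is known to be down, message dropped
        processed.append(ack.seq)
    return processed


# The OSD's own map already marks slave 1 down; the service's map lags.
osd_map = OSDMap(epoch=5, down={1})
service_map = OSDMap(epoch=4)

acks = [Ack(sender=1, seq=1), Ack(sender=1, seq=2)]  # op1, op2 (step 1)

stale_result = deliver(acks, lost_seqs={1}, osdmap=service_map)
fresh_result = deliver(acks, lost_seqs={1}, osdmap=osd_map)

print(stale_result)  # op2 processed without op1: out of order
print(fresh_result)  # nothing processed: op2 is discarded
```

With the lagging map the primary processes op2 even though op1 never arrived; with the newer map the same message is dropped.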
#3 Updated by Sage Weil over 1 year ago
- Subject changed from out of order caused by letting old msg from down peer be processed to RESETSESSION and OSD peer connections fundamentally racy
- Status changed from New to Verified
Last time I looked at this I came to the conclusion that (1) there is a fundamental problem, (2) the best hope for properly fixing it is moving peer connection management into the OSD and out of the messenger, and (3) the workaround in can_discard_request() (or whatever it is) is good enough for now.
Note that mingxin's improvement has been merged: https://github.com/ceph/ceph/pull/19796
Leaving this ticket open for now.