Bug #22570: out of order caused by letting old msg from down peer be processed to RESETSESSION - Messengers - Ceph


Bug #22570

closed

out of order caused by letting old msg from down peer be processed to RESETSESSION

Added by mingxin liu over 6 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Backport: luminous
Severity: 3 - minor
Pull request ID: 19796

Description

1. The slave acks two ops (op1, op2) to the primary.
2. op1 is dropped by a connection reset.
3. op2 is sent to the primary.
4. The primary uses the service's map to determine whether the slave is down (messages of this kind need to be dropped), but the service's map lags behind the OSD's map.
5. The primary processes op2, causing out-of-order processing.
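
To make step 4 concrete, here is a minimal sketch of how a discard check can give different answers depending on which map it consults. All names here (`ToyMap`, `should_discard`, `stale_msg_slips_through`) are hypothetical stand-ins for illustration and are not Ceph's actual types or functions.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical, simplified stand-in for an OSD map: an epoch plus the
// set of peers that are marked down at that epoch.
struct ToyMap {
    std::uint32_t epoch;
    std::set<int> down;
};

// Hypothetical discard check: drop a message if its sender is marked
// down in the map that is consulted.
bool should_discard(const ToyMap& map, int from_osd) {
    return map.down.count(from_osd) > 0;
}

// Returns true when the lagging map would let a message through that
// the newer map would have dropped.
bool stale_msg_slips_through(const ToyMap& laggy, const ToyMap& fresh,
                             int from_osd) {
    assert(laggy.epoch <= fresh.epoch);
    return !should_discard(laggy, from_osd) && should_discard(fresh, from_osd);
}
```

With a lagging map at epoch 10 that has no down peers and a newer map at epoch 11 that marks peer 1 down, `should_discard` keeps a message from peer 1 under the lagging map and drops it under the newer one, which is the window described in step 4.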

Related issues 2 (0 open — 2 closed)

Related to RADOS - Bug #21143: bad RESETSESSION between OSDs? (Duplicate, Haomai Wang, 08/26/2017)
Copied to Messengers - Backport #42586: luminous: out of order caused by letting old msg from down peer be processed to RESETSESSION (Resolved, Nathan Cutler)

History

Updated by mingxin liu over 6 years ago

assert(repop_queue.front() == repop);
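
One reading of the assertion quoted above is that replicated ops are expected to complete in the order they were queued. The sketch below is a toy model under that assumption. `ToyRepopQueue` and its methods are hypothetical names and do not reflect Ceph's actual data structures.

```cpp
#include <cassert>
#include <deque>

// Hypothetical toy model of an in-order completion queue.
struct ToyRepopQueue {
    std::deque<int> pending;  // op ids in submission order

    void submit(int op) { pending.push_back(op); }

    // Returns true if `op` is at the head of the queue and is removed.
    // Returns false in the case where an in-order assert would fire.
    bool complete(int op) {
        if (pending.empty() || pending.front() != op) {
            return false;
        }
        pending.pop_front();
        return true;
    }
};
```

If op1 and op2 are submitted in that order and the ack for op1 is lost, completing op2 first returns `false` in this model, because op1 is still at the head of the queue.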


Updated by Greg Farnum over 6 years ago

Do you have logs or more about how this happened? There are a bunch of guards to prevent exactly this in cases where a connection reset happens. They might be leaky, but we'll need a little more to go on in identifying what went wrong.


Updated by Sage Weil over 6 years ago

Subject changed from out of order caused by letting old msg from down peer be processed to RESETSESSION and OSD peer connections fundamentally racy
Status changed from New to 12

Last time I looked at this I came to the conclusion that (1) there was a fundamental problem, (2) the best hope for properly fixing it is moving peer connection management into the OSD and out of the messenger, and (3) the workaround in can_discard_request() (or whatever it is) is good enough for now.

https://github.com/ceph/ceph/pull/17217#issuecomment-324997960

Note that mingxin's improvement merged: https://github.com/ceph/ceph/pull/19796

Leaving this ticket open for now.


Updated by Sage Weil over 6 years ago

Related to Bug #21143: bad RESETSESSION between OSDs? added


Updated by Sage Weil over 6 years ago

Subject changed from RESETSESSION and OSD peer connections fundamentally racy to out of order caused by letting old msg from down peer be processed to RESETSESSION
Status changed from 12 to Resolved

Actually, see existing ticket #21143.


Updated by mingxin liu over 6 years ago

I wonder if http://tracker.ceph.com/issues/21287 is related.


Updated by Greg Farnum about 5 years ago

Project changed from RADOS to Messengers


Updated by Nathan Cutler over 4 years ago

Status changed from Resolved to Pending Backport
Backport set to luminous, mimic


Updated by Nathan Cutler over 4 years ago

Copied to Backport #42586: luminous: out of order caused by letting old msg from down peer be processed to RESETSESSION added


#11

Updated by Nathan Cutler over 4 years ago

Pull request ID set to 19796


#12

Updated by Nathan Cutler over 4 years ago

Backport changed from luminous, mimic to luminous


#13

Updated by Nathan Cutler over 4 years ago

"git describe" on the https://github.com/ceph/ceph/pull/19796 merge commit:

v13.0.1-845-ga7dc224536


#14

Updated by Nathan Cutler over 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
