Bug #14912: msg/async: bad RESETSESSION - Ceph - Ceph

Bug #14912

closed

msg/async: bad RESETSESSION

Added by Sage Weil about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: Haomai Wang
Source: Q/A
Severity: 3 - minor
Description

2016-02-26 19:07:53.618211 7f876efc1700 0 -- 172.21.15.37:6805/3358 >> 172.21.15.37:6801/3355 conn(0x7f8782b7c000 sd=64 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=1 l=0).handle_connect_reply connect got RESETSESSION

This shouldn't have happened. That's osd.2's log from

/a/sage-2016-02-26_07:11:10-rados-wip-sage-testing---basic-smithi/28701

Updated by Haomai Wang about 8 years ago

From the log message, we can infer the sequence:

1. osd.2 connects to osd.0.
2. osd.0 connects to osd.2.
3. osd.2 gets WAIT and goes to STANDBY.
4. A caller on osd.2 sends a message on the STANDBY connection, so the connection changes state from STANDBY to CONNECTING and sends a connect_msg.
5. osd.2 receives osd.0's connect message, but it already has a CONNECTING connection with seq 0. Since the incoming connect_seq is 0 and the existing seq is 0, osd.2 replies to osd.0 with RETRYSESSION, carrying seq+1.
6. osd.0 receives the RETRYSESSION tag and increases its connect_seq.
7. osd.0 sends the connect message again, now with connect_seq 1.
8. osd.2 receives the connect message (seq==1) and finds that the existing connect_seq is 0, so it marks down the existing connection from step 4. osd.2 now accepts this connect_msg (seq 1).
9. osd.0 receives the earlier connect msg (seq==0) from step 4. Since existing->seq > 0 and connect_seq is 0, the accept process triggers the peer reset handler, which marks down the existing connection (seq 1) and accepts this connect_msg.
10. osd.2 gets a read error because step 9 marked down the initiating side, so it enters fault.
11. osd.0 sees that the peer closed, so it closes its connection.
12. osd.2 has queued messages, so it enters the reconnect process with connect_seq++.
13. osd.0 gets the new connect_msg (seq==1) and has no existing connection (because step 11 closed it), so it sends the RESETSESSION tag.
14. osd.2 receives the RESETSESSION tag and discards all queued messages, so the message from step 4 (pg_query) is lost.
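The accept-side decisions in steps 5, 8, 9 and 13 can be sketched as a small toy model. This is only an illustration of the rules as described in the steps above, not the actual msg/async code: the `Existing` class, the `accept_decision` function and the reply strings other than RETRYSESSION and RESETSESSION are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Existing:
    """Hypothetical stand-in for the connection already registered for a peer."""
    connect_seq: int
    connecting: bool = False  # True while our own outgoing attempt is in flight


def accept_decision(existing: Optional[Existing], incoming_seq: int) -> str:
    """Return the accepting side's reaction, following the steps in this thread."""
    if existing is None:
        # Step 13: nothing to resume, but the peer claims an earlier session.
        return "RESETSESSION" if incoming_seq > 0 else "ACCEPT"
    if incoming_seq == 0 and existing.connect_seq > 0:
        # Step 9: looks like a peer reset; drop the old session and accept.
        return "PEER_RESET_THEN_ACCEPT"
    if incoming_seq == existing.connect_seq and existing.connecting:
        # Step 5: both sides are connecting with the same seq.
        return f"RETRYSESSION seq={existing.connect_seq + 1}"
    if incoming_seq > existing.connect_seq:
        # Step 8: the incoming attempt is newer; replace the existing one.
        return "REPLACE_THEN_ACCEPT"
    return "OTHER"  # cases this thread does not discuss


# Replay the four accept-side decisions from the sequence.
for step, existing, seq in [
    (5, Existing(0, connecting=True), 0),
    (8, Existing(0, connecting=True), 1),
    (9, Existing(1), 0),
    (13, None, 1),
]:
    print(step, accept_decision(existing, seq))
```

Read this way, the RESETSESSION in step 13 is a consequence of step 11 having removed the existing connection: once nothing is registered for the peer, a connect with a non-zero seq has nothing to resume.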

The process is already wrong at step 3: on receiving a WAIT tag, Pipe enters STATE_WAIT and never takes any further action!

From the commit history, I found that #12912 added the wrong state transition (WAIT->STANDBY). Judging from that bug's description, it looks like a wrong fix made for a test. So we just need to revert that commit to prevent this case.
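The difference between the two reactions to a WAIT tag can be sketched with a toy model of the connecting side. This is a simplified illustration under assumptions, not the real connection code: the `Initiator` class, its methods and the `wait_goes_to_standby` flag are hypothetical, and only the state names and the pg_query message come from this thread.

```python
from collections import deque


class Initiator:
    """Toy model of the connecting side; state names follow this thread."""

    def __init__(self, wait_goes_to_standby: bool):
        self.wait_goes_to_standby = wait_goes_to_standby
        self.state = "CONNECTING"
        self.out_q = deque()          # messages queued for the peer
        self.connect_msgs_sent = 0    # outgoing connect attempts after WAIT

    def on_wait_tag(self):
        # STANDBY models the transition added for #12912; WAIT models the revert.
        self.state = "STANDBY" if self.wait_goes_to_standby else "WAIT"

    def send(self, msg):
        self.out_q.append(msg)
        if self.state == "STANDBY":
            # Step 4: a send on a STANDBY connection restarts the connect.
            self.state = "CONNECTING"
            self.connect_msgs_sent += 1
        # In WAIT the message stays queued and no new connect is sent.

    def on_resetsession(self):
        # Step 14: the session is reset, so queued messages are discarded.
        self.out_q.clear()


# With the WAIT->STANDBY transition, the send opens the race in steps 4-13.
with_change = Initiator(wait_goes_to_standby=True)
with_change.on_wait_tag()
with_change.send("pg_query")
with_change.on_resetsession()
print(with_change.state, list(with_change.out_q))

# With the revert, the connection stays in WAIT and keeps the message queued.
reverted = Initiator(wait_goes_to_standby=False)
reverted.on_wait_tag()
reverted.send("pg_query")
print(reverted.state, list(reverted.out_q))
```

In this model the reverted behavior never sends a second connect message, so the crossing connect attempts that lead to RESETSESSION cannot start from the waiting side.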

Updated by Haomai Wang about 8 years ago

Status changed from New to Fix Under Review

https://github.com/ceph/ceph/pull/7831

Updated by Haomai Wang about 8 years ago

I think this is a safe revert. In the scenario described in #12912, the initiating side never sends a message to the peer, which indicates that the peer side never needs to connect to the initiating side. In actual OSD-to-OSD connections this won't be a problem, so #12912 should be an incorrect test case. If #12912 happens again, we can verify the actual process and fix the test itself.
