Project

General

Profile

Actions

Bug #37799

closed

msg/async: RESETSESSION due to connection reset during initial connection

Added by Sage Weil over 5 years ago. Updated about 5 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

first, osd.3 tries to connection to osd.1, due to some queued messages, and there is a connection race, ..., but we fault before we finish

2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 create_connect v2:172.21.15.23:6814/11443, creating connection and registering
2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_NONE l=0)._connect
...

2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x55cb457d1200.gseq 0 <= 7, looks ok
2019-01-06 01:48:07.516 7fb7f17fd700  1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTION_ESTABLISHED
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x55cb457d1200.cseq 0 == 0, or we are server, replacing my attempt
...
2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 r=-1
2019-01-06 01:48:07.516 7fb7f17fd700  1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 read connect msg failed
2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault with nothing to send, going to standby

then we do nothing for a long time due to #37779
2019-01-06 01:48:07.524 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:07.528 7fb7f0ffc700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:09.224 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
..
2019-01-06 01:48:25.516 7fb7f07fb700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:25.520 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:26.328 7fb7d174c700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200

then we send something else and we try to connect again,
2019-01-06 01:48:26.328 7fb7d174c700 10 osd.3 17 send_incremental_map 16 -> 17 to 0x55cb457d1200 v2:172.21.15.23:6814/11443
2019-01-06 01:48:26.328 7fb7d174c700  1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- osd_map(17..17 src has 1..17) v4 -- 0x55cb456c4d00 con 0x55cb457d1200
2019-01-06 01:48:26.328 7fb7d174c700 15 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STANDBY pgs=0 cs=0 l=0).send_message inline write is denied, reschedule m=0x55cb456c4d00
2019-01-06 01:48:26.328 7fb7d174c700  1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- pg_query(2.b epoch 17) v4 -- 0x55cb458d5200 con 0x55cb457d1200

we fault during hte connect phase,
2019-01-06 01:48:26.340 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=26
2019-01-06 01:48:26.340 7fb7f17fd700  1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 91
2019-01-06 01:48:26.340 7fb7f17fd700  1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=-1
2019-01-06 01:48:26.340 7fb7f17fd700  1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 read connect reply failed
2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).fault
2019-01-06 01:48:26.340 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=START_CONNECT pgs=0 cs=1 l=0).fault waiting 0.200000

and retry,
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=0
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 connect got reply tag 2 connect_seq 0 global_seq 0 proto 10 flags 0 features 4611087854031142911
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth
2019-01-06 01:48:26.544 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth reply.authorizer_len=36
2019-01-06 01:48:26.544 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=36
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_auth r=0
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2
2019-01-06 01:48:26.544 7fb7f17fd700  0 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION

and get RESETSESSION.

on the other end (osd.1), that's because we initially had our existing, but closed it

2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x555c846ae400 on v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept new session
2019-01-06 01:48:26.340 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).open
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept success, connect_seq = 2 in_seq=0, sending READY
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept features 4611087854031142911 authorizer_protocol 2
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 accept_conn 0x555c846ae400 v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.344 7f95b3f88700  1 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write r=-32
2019-01-06 01:48:26.344 7f95b3f88700  1 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write write ready connect message reply failed
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault
2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault with nothing to send and in the half  accept state just closed
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop
2019-01-06 01:48:26.344 7f95b3f88700  2 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop
2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).discard_out_queue started

and on the reconnect attempt there was then no existing:
2019-01-06 01:48:26.544 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0 on v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.544 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.544 7f95b3f88700  0 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION

/a/sage-2019-01-05_23:50:21-rados-wip-sage4-testing-2019-01-05-1233-distro-basic-smithi/3429089


Related issues 1 (0 open1 closed)

Related to Messengers - Bug #37779: msg/async: connection race + winner fault can leave connection in standbyResolvedSage Weil

Actions
Actions

Also available in: Atom PDF