Bug #37799

closed

msg/async: RESETSESSION due to connection reset during initial connection

Added by Sage Weil over 5 years ago. Updated about 5 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

first, osd.3 tries to connect to osd.1 due to some queued messages, and there is a connection race, ..., but we fault before we finish

2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 create_connect v2:172.21.15.23:6814/11443, creating connection and registering
2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_NONE l=0)._connect
...

2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x55cb457d1200.gseq 0 <= 7, looks ok
2019-01-06 01:48:07.516 7fb7f17fd700  1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTION_ESTABLISHED
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x55cb457d1200.cseq 0 == 0, or we are server, replacing my attempt
...
2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 r=-1
2019-01-06 01:48:07.516 7fb7f17fd700  1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 read connect msg failed
2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault with nothing to send, going to standby
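The race-resolution decision visible in the "replacing my attempt" log line above can be modeled as follows. This is a simplified sketch, not the actual `AsyncConnection` code; the function name and signature are illustrative:

```python
def accept_race_decision(existing_cseq, incoming_cseq, we_are_server):
    """Model of the msgr accept-side connection race decision.

    existing_cseq:  connect_seq of our own in-progress outgoing attempt
    incoming_cseq:  connect_seq carried by the peer's connect message
    we_are_server:  True when our side is the designated race winner
    """
    if incoming_cseq == existing_cseq or we_are_server:
        # Matches the log: "connection race, existing ... cseq 0 == 0,
        # or we are server, replacing my attempt"
        return "replace_existing"
    # Otherwise ask the peer to wait for our own attempt to complete
    return "send_wait"
```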

then we do nothing for a long time due to #37779
2019-01-06 01:48:07.524 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:07.528 7fb7f0ffc700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:09.224 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
..
2019-01-06 01:48:25.516 7fb7f07fb700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:25.520 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:26.328 7fb7d174c700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200

then we send something else and we try to connect again,
2019-01-06 01:48:26.328 7fb7d174c700 10 osd.3 17 send_incremental_map 16 -> 17 to 0x55cb457d1200 v2:172.21.15.23:6814/11443
2019-01-06 01:48:26.328 7fb7d174c700  1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- osd_map(17..17 src has 1..17) v4 -- 0x55cb456c4d00 con 0x55cb457d1200
2019-01-06 01:48:26.328 7fb7d174c700 15 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STANDBY pgs=0 cs=0 l=0).send_message inline write is denied, reschedule m=0x55cb456c4d00
2019-01-06 01:48:26.328 7fb7d174c700  1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- pg_query(2.b epoch 17) v4 -- 0x55cb458d5200 con 0x55cb457d1200

we fault during the connect phase,
2019-01-06 01:48:26.340 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=26
2019-01-06 01:48:26.340 7fb7f17fd700  1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 91
2019-01-06 01:48:26.340 7fb7f17fd700  1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=-1
2019-01-06 01:48:26.340 7fb7f17fd700  1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 read connect reply failed
2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).fault
2019-01-06 01:48:26.340 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=START_CONNECT pgs=0 cs=1 l=0).fault waiting 0.200000
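The "fault waiting 0.200000" line is the connect-side retry backoff. A sketch of the doubling-backoff schedule the messenger uses (parameter names here are illustrative defaults, not verified against the config options):

```python
def next_backoff(current, initial=0.2, maximum=15.0):
    """Reconnect delay after a connect-phase fault: start at
    `initial` (0.2s, as seen in the log above) and double on each
    successive fault, capped at `maximum`."""
    if current <= 0:
        return initial          # first fault: short initial wait
    return min(current * 2, maximum)
```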

and retry,
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=0
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 connect got reply tag 2 connect_seq 0 global_seq 0 proto 10 flags 0 features 4611087854031142911
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth
2019-01-06 01:48:26.544 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth reply.authorizer_len=36
2019-01-06 01:48:26.544 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=36
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_auth r=0
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2
2019-01-06 01:48:26.544 7fb7f17fd700  0 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION

and get RESETSESSION.

on the other end (osd.1), that's because we initially had an existing connection, but closed it:

2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x555c846ae400 on v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept new session
2019-01-06 01:48:26.340 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).open
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept success, connect_seq = 2 in_seq=0, sending READY
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept features 4611087854031142911 authorizer_protocol 2
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 accept_conn 0x555c846ae400 v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.344 7f95b3f88700  1 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write r=-32
2019-01-06 01:48:26.344 7f95b3f88700  1 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write write ready connect message reply failed
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault
2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault with nothing to send and in the half  accept state just closed
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop
2019-01-06 01:48:26.344 7f95b3f88700  2 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop
2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).discard_out_queue started
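The key step above is the fault policy on the accept side: with nothing queued and the connection only half-accepted, it closes outright instead of going to standby, discarding the session state. A simplified model of that branch (hypothetical helper, mirroring the "fault with nothing to send and in the half accept state just closed" log line):

```python
def fault_policy(out_queue_empty, half_accepted):
    """Accept-side reaction to a write fault (e.g. Broken pipe while
    sending the READY reply)."""
    if out_queue_empty and half_accepted:
        # Drop the connection and its session state entirely;
        # this is why the later reconnect finds "existing 0".
        return "close"
    # Otherwise park the connection and wait for a reconnect
    return "standby"
```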

and on the reconnect attempt there was then no existing:
2019-01-06 01:48:26.544 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0 on v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.544 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.544 7f95b3f88700  0 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
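The RESETSESSION reply follows from a simple mismatch: the reconnecting peer carries connect_seq 1 (it believes a session exists), but the acceptor discarded its state when the half-accepted connection was closed, so no existing connection is found. A minimal sketch of that decision (illustrative helper, not the Ceph function):

```python
def handle_connect(existing, peer_cseq):
    """Accept-side response when a connect message arrives.

    existing:   the previously registered connection, or None
    peer_cseq:  connect_seq sent by the peer
    """
    if existing is None and peer_cseq > 0:
        # Peer thinks it has a session; we reset ours, so tell it
        # to drop its session state and restart from cseq 0.
        return "RESETSESSION"
    # Fresh connect (cseq 0) or a matching existing session: proceed
    return "accept"
```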

/a/sage-2019-01-05_23:50:21-rados-wip-sage4-testing-2019-01-05-1233-distro-basic-smithi/3429089


Related issues 1 (0 open, 1 closed)

Related to Messengers - Bug #37779: msg/async: connection race + winner fault can leave connection in standby (Resolved, Sage Weil)

Actions #1

Updated by Sage Weil over 5 years ago

  • Related to Bug #37779: msg/async: connection race + winner fault can leave connection in standby added
Actions #2

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Can't reproduce

Closing this one. Haven't seen it since we rewrote the v2 implementation. It's possible something like this still exists in the protocol v1 implementation... if so, we can reopen if/when it occurs.

Actions #3

Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers