Actions
Bug #37799
closedmsg/async: RESETSESSION due to connection reset during initial connection
Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
first, osd.3 tries to connection to osd.1, due to some queued messages, and there is a connection race, ..., but we fault before we finish
2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 create_connect v2:172.21.15.23:6814/11443, creating connection and registering 2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_NONE l=0)._connect ... 2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x55cb457d1200.gseq 0 <= 7, looks ok 2019-01-06 01:48:07.516 7fb7f17fd700 1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTION_ESTABLISHED 2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x55cb457d1200.cseq 0 == 0, or we are server, replacing my attempt ... 2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 r=-1 2019-01-06 01:48:07.516 7fb7f17fd700 1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 read connect msg failed 2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault 2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault with nothing to send, going to standby
then we do nothing for a long time due to #37779
2019-01-06 01:48:07.524 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200 2019-01-06 01:48:07.528 7fb7f0ffc700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200 2019-01-06 01:48:09.224 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200 .. 2019-01-06 01:48:25.516 7fb7f07fb700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200 2019-01-06 01:48:25.520 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200 2019-01-06 01:48:26.328 7fb7d174c700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
then we send something else and we try to connect again,
2019-01-06 01:48:26.328 7fb7d174c700 10 osd.3 17 send_incremental_map 16 -> 17 to 0x55cb457d1200 v2:172.21.15.23:6814/11443 2019-01-06 01:48:26.328 7fb7d174c700 1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- osd_map(17..17 src has 1..17) v4 -- 0x55cb456c4d00 con 0x55cb457d1200 2019-01-06 01:48:26.328 7fb7d174c700 15 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STANDBY pgs=0 cs=0 l=0).send_message inline write is denied, reschedule m=0x55cb456c4d00 2019-01-06 01:48:26.328 7fb7d174c700 1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- pg_query(2.b epoch 17) v4 -- 0x55cb458d5200 con 0x55cb457d1200
we fault during hte connect phase,
2019-01-06 01:48:26.340 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=26 2019-01-06 01:48:26.340 7fb7f17fd700 1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 91 2019-01-06 01:48:26.340 7fb7f17fd700 1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=-1 2019-01-06 01:48:26.340 7fb7f17fd700 1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 read connect reply failed 2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).fault 2019-01-06 01:48:26.340 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=START_CONNECT pgs=0 cs=1 l=0).fault waiting 0.200000
and retry,
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=0 2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 connect got reply tag 2 connect_seq 0 global_seq 0 proto 10 flags 0 features 4611087854031142911 2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth 2019-01-06 01:48:26.544 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth reply.authorizer_len=36 2019-01-06 01:48:26.544 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=36 2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_auth r=0 2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 2019-01-06 01:48:26.544 7fb7f17fd700 0 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
and get RESETSESSION.
on the other end (osd.1), that's because we initially had our existing, but closed it
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x555c846ae400 on v2:172.21.15.23:6810/11440 2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002 2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept new session 2019-01-06 01:48:26.340 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).open 2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept success, connect_seq = 2 in_seq=0, sending READY 2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept features 4611087854031142911 authorizer_protocol 2 2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 accept_conn 0x555c846ae400 v2:172.21.15.23:6810/11440 2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002 2019-01-06 01:48:26.344 7f95b3f88700 1 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe 2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write r=-32 2019-01-06 01:48:26.344 7f95b3f88700 1 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write write ready connect message reply failed 2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault 2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault with nothing to send and in the half accept state just closed 2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop 2019-01-06 01:48:26.344 7f95b3f88700 2 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop 2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).discard_out_queue started
and on the reconnect attempt there was then no existing:
2019-01-06 01:48:26.544 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0 on v2:172.21.15.23:6810/11440 2019-01-06 01:48:26.544 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002 2019-01-06 01:48:26.544 7f95b3f88700 0 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
/a/sage-2019-01-05_23:50:21-rados-wip-sage4-testing-2019-01-05-1233-distro-basic-smithi/3429089
Actions