Bug #37799
msg/async: RESETSESSION due to connection reset during initial connection
Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
first, osd.3 tries to connect to osd.1 due to some queued messages, and there is a connection race, ..., but we fault before we finish:
2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 create_connect v2:172.21.15.23:6814/11443, creating connection and registering
2019-01-06 01:48:07.512 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_NONE l=0)._connect
...
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x55cb457d1200.gseq 0 <= 7, looks ok
2019-01-06 01:48:07.516 7fb7f17fd700 1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTION_ESTABLISHED
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d0480 msgr2 :38520 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x55cb457d1200.cseq 0 == 0, or we are server, replacing my attempt
...
2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 r=-1
2019-01-06 01:48:07.516 7fb7f17fd700 1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 read connect msg failed
2019-01-06 01:48:07.516 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault
2019-01-06 01:48:07.516 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault with nothing to send, going to standby
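The key property of the race handling above ("or we are server, replacing my attempt") is that both peers must apply the same deterministic tie-break, so they agree on which of the two crossed attempts survives. A minimal sketch of such a tie-break, with illustrative names rather than the actual messenger API (the real check also takes the connect_seq values shown in the log into account):

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the accept-side tie-break when both peers dial
// each other at the same time. Pure servers (policy.server) always yield
// to the incoming attempt; otherwise both sides evaluate the same address
// comparison, so exactly one of the two crossed attempts survives. The
// direction here matches the log above, where osd.3 at :6810 replaces its
// own attempt when accepting from :6814.
bool incoming_wins(const std::string& peer_addr,
                   const std::string& my_addr,
                   bool policy_server) {
  return policy_server || peer_addr > my_addr;
}
```

Because the comparison is symmetric, the peer evaluating it with the roles swapped reaches the complementary answer and keeps its own outgoing attempt.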
then we do nothing for a long time due to #37779
2019-01-06 01:48:07.524 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:07.528 7fb7f0ffc700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:09.224 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
..
2019-01-06 01:48:25.516 7fb7f07fb700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:25.520 7fb7f17fd700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
2019-01-06 01:48:26.328 7fb7d174c700 10 -- v2:172.21.15.23:6810/11440 connect_to v2:172.21.15.23:6814/11443 existing 0x55cb457d1200
then we send something else and we try to connect again,
2019-01-06 01:48:26.328 7fb7d174c700 10 osd.3 17 send_incremental_map 16 -> 17 to 0x55cb457d1200 v2:172.21.15.23:6814/11443
2019-01-06 01:48:26.328 7fb7d174c700 1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- osd_map(17..17 src has 1..17) v4 -- 0x55cb456c4d00 con 0x55cb457d1200
2019-01-06 01:48:26.328 7fb7d174c700 15 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STANDBY pgs=0 cs=0 l=0).send_message inline write is denied, reschedule m=0x55cb456c4d00
2019-01-06 01:48:26.328 7fb7d174c700 1 -- v2:172.21.15.23:6810/11440 --> v2:172.21.15.23:6814/11443 -- pg_query(2.b epoch 17) v4 -- 0x55cb458d5200 con 0x55cb457d1200
we fault during the connect phase,
2019-01-06 01:48:26.340 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=26
2019-01-06 01:48:26.340 7fb7f17fd700 1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 91
2019-01-06 01:48:26.340 7fb7f17fd700 1 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=-1
2019-01-06 01:48:26.340 7fb7f17fd700 1 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 read connect reply failed
2019-01-06 01:48:26.340 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).fault
2019-01-06 01:48:26.340 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=START_CONNECT pgs=0 cs=1 l=0).fault waiting 0.200000
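The "fault waiting 0.200000" line is the messenger's reconnect backoff, which starts at ms_initial_backoff (default 0.2s) and grows toward ms_max_backoff on consecutive faults. A minimal sketch of that schedule, with an illustrative function name rather than the actual messenger code:

```cpp
#include <algorithm>
#include <cassert>

// Illustrative reconnect backoff in the spirit of the msgr options
// ms_initial_backoff (0.2s, matching the "fault waiting 0.200000" log
// line) and ms_max_backoff (15s): start at the initial delay on the
// first fault, then double on each consecutive fault, capped at the
// maximum. Name and exact growth rule are assumptions for illustration.
double next_backoff(double current,
                    double initial_backoff = 0.2,
                    double max_backoff = 15.0) {
  if (current <= 0.0)
    return initial_backoff;                     // first fault
  return std::min(current * 2.0, max_backoff);  // subsequent faults
}
```

The 0.2s wait explains why the retry below lands at 01:48:26.544, roughly 200 ms after the fault at 01:48:26.340.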
and retry,
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 r=0
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_1 connect got reply tag 2 connect_seq 0 global_seq 0 proto 10 flags 0 features 4611087854031142911
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth
2019-01-06 01:48:26.544 7fb7f17fd700 10 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth reply.authorizer_len=36
2019-01-06 01:48:26.544 7fb7f17fd700 20 -- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=36
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_auth r=0
2019-01-06 01:48:26.544 7fb7f17fd700 20 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2
2019-01-06 01:48:26.544 7fb7f17fd700 0 --2- v2:172.21.15.23:6810/11440 >> v2:172.21.15.23:6814/11443 conn(0x55cb457d1200 msgr2 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
and get RESETSESSION.
on the other end (osd.1), that's because we initially had our existing connection, but closed it:
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x555c846ae400 on v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept new session
2019-01-06 01:48:26.340 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).open
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept success, connect_seq = 2 in_seq=0, sending READY
2019-01-06 01:48:26.340 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).open accept features 4611087854031142911 authorizer_protocol 2
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 accept_conn 0x555c846ae400 v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.340 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.344 7f95b3f88700 1 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write r=-32
2019-01-06 01:48:26.344 7f95b3f88700 1 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).handle_ready_connect_message_reply_write write ready connect message reply failed
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault
2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).fault with nothing to send and in the half accept state just closed
2019-01-06 01:48:26.344 7f95b3f88700 20 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop
2019-01-06 01:48:26.344 7f95b3f88700 2 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).stop
2019-01-06 01:48:26.344 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c846ae400 msgr2 :-1 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=18 cs=2 l=0).discard_out_queue started
and on the reconnect attempt there was then no existing connection:
2019-01-06 01:48:26.544 7f95b3f88700 10 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0 on v2:172.21.15.23:6810/11440
2019-01-06 01:48:26.544 7f95b3f88700 10 -- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=STATE_CONNECTION_ESTABLISHED l=0).inject_delay sleep for 0.002
2019-01-06 01:48:26.544 7f95b3f88700 0 --2- v2:172.21.15.23:6814/11443 >> v2:172.21.15.23:6810/11440 conn(0x555c847c8400 msgr2 :48996 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
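The decision behind "accept we reset (peer sent cseq 1), sending RESETSESSION" can be sketched as follows: the connecter claims an in-progress session (connect_seq > 0), but the accepter has no existing connection for that peer, so from its point of view the session was reset and the peer must start over. Names here are illustrative, not the actual messenger API:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the accept-side check that produced the
// RESETSESSION in the log above. "have_existing" stands in for the
// accepter's lookup of an existing connection to the peer.
enum class Tag { READY_OR_OPEN, RESETSESSION };

Tag accept_tag(bool have_existing, uint32_t peer_connect_seq) {
  if (!have_existing && peer_connect_seq > 0)
    return Tag::RESETSESSION;  // we reset; tell the peer to start over
  return Tag::READY_OR_OPEN;   // fresh session or normal (re)connect path
}
```

On receiving RESETSESSION, the connecter discards its session state and retries with connect_seq 0, which is the reset visible to the OSD layer that this ticket is about.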
/a/sage-2019-01-05_23:50:21-rados-wip-sage4-testing-2019-01-05-1233-distro-basic-smithi/3429089
Updated by Sage Weil over 5 years ago
- Related to Bug #37779: msg/async: connection race + winner fault can leave connection in standby added
Updated by Sage Weil about 5 years ago
- Status changed from 12 to Can't reproduce
Closing this one. Haven't seen it since we rewrote the v2 implementation. It's possible something like this still exists in the protocol v1 implementation... if so, we can reopen if/when it occurs.
Updated by Greg Farnum about 5 years ago
- Project changed from RADOS to Messengers