Project

General

Profile

Bug #37779

msg/async: connection race + winner fault can leave connection in standby

Added by Sage Weil about 1 year ago. Updated 7 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
AsyncMessenger
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

on one end,

2019-01-02 16:28:53.247 7f13ea225700 10 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5ee3800 legacy :6802 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x557db5f00000.gseq 0 <= 1, looks ok
2019-01-02 16:28:53.247 7f13ea225700  1 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5ee3800 legacy :6802 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_RE
2019-01-02 16:28:53.247 7f13ea225700 10 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5ee3800 legacy :6802 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x557db5f00000.cseq 0 == 0, sending WAIT

but then,
2019-01-02 16:28:53.247 7f13eba28700  0 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5f00000 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send injecting socket failure
2019-01-02 16:28:53.247 7f13eba28700  1 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5f00000 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2019-01-02 16:28:53.247 7f13eba28700 20 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5f00000 legacy :-1 s=CONNECTING pgs=0 cs=0 l=0).handle_client_banner_write r=-32
2019-01-02 16:28:53.247 7f13eba28700  1 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5f00000 legacy :-1 s=CONNECTING pgs=0 cs=0 l=0).handle_client_banner_write write client banner failed
2019-01-02 16:28:53.247 7f13eba28700 20 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5f00000 legacy :-1 s=CONNECTING pgs=0 cs=0 l=0).fault
2019-01-02 16:28:53.247 7f13eba28700 10 -- 172.21.15.125:6802/33738 >> 172.21.15.6:6809/33741 conn(0x557db5f00000 legacy :-1 s=CONNECTING pgs=0 cs=0 l=0).fault with nothing to send, going to standby

and then nothing.

/a/sage-2019-01-02_14:51:32-rados-master-distro-basic-smithi/3414810


Related issues

Related to Messengers - Bug #37799: msg/async: RESETSESSION due to connection reset during initial connection Can't reproduce 01/06/2019
Copied to Messengers - Backport #38241: mimic: msg/async: connection race + winner fault can leave connection in standby New
Copied to Messengers - Backport #38242: luminous: msg/async: connection race + winner fault can leave connection in standby Resolved

History

#1 Updated by Sage Weil about 1 year ago

  • Status changed from 12 to Fix Under Review

#2 Updated by Sage Weil about 1 year ago

  • Backport set to luminous,mimic

#3 Updated by Sage Weil about 1 year ago

  • Assignee set to Sage Weil

#4 Updated by Sage Weil about 1 year ago

  • Related to Bug #37799: msg/async: RESETSESSION due to connection reset during initial connection added

#5 Updated by Sage Weil about 1 year ago

SCRATCH THAT, THE BELOW IS SOMETHING DIFFERENT

a newer instance of this, after several other fixes were applied:

2019-01-09 06:18:16.957 7fa97fff5e40 10 mon.s@-1(probing) e0 monmap e0: 21 mons at {a=[v2:172.21.15.181:3300/0,v1:172.21.15.181:6789/0],b=[v2:172.21.15.125:3300/0,v1:172.21.15.125:6789/0],c=[v2:172.21.15.90:3300/0,v1:172.21.15.90:6789/0],d=[v2:172.21.15.181:3301/0,v1:172.21.15.181:6790/0],e=[v2:172.21.15.125:3301/0
,v1:172.21.15.125:6790/0],f=[v2:172.21.15.90:3301/0,v1:172.21.15.90:6790/0],g=[v2:172.21.15.181:3302/0,v1:172.21.15.181:6791/0],h=[v2:172.21.15.125:3302/0,v1:172.21.15.125:6791/0],i=[v2:172.21.15.90:3302/0,v1:172.21.15.90:6791/0],j=[v2:172.21.15.181:3303/0,v1:172.21.15.181:6792/0],k=[v2:172.21.15.125:3303/0,v1:172.
21.15.125:6792/0],l=[v2:172.21.15.90:3303/0,v1:172.21.15.90:6792/0],m=[v2:172.21.15.181:3304/0,v1:172.21.15.181:6793/0],n=[v2:172.21.15.125:3304/0,v1:172.21.15.125:6793/0],o=[v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0],p=[v2:172.21.15.181:3305/0,v1:172.21.15.181:6794/0],q=[v2:172.21.15.125:3305/0,v1:172.21.15.12
5:6794/0],r=[v2:172.21.15.90:3305/0,v1:172.21.15.90:6794/0],s=[v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0],t=[v2:172.21.15.125:3306/0,v1:172.21.15.125:6795/0],u=[v2:172.21.15.90:3306/0,v1:172.21.15.90:6795/0]}
2019-01-09 06:18:16.958 7fa97fff5e40  1 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] _send_to--> mon [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] -- mon_probe(probe e45a7d54-1462-4a6a-b138-5e03b8543640 name s new) v6 -- ?+0 0x55e3ab9d9b80
2019-01-09 06:18:16.958 7fa97fff5e40 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] submit_message mon_probe(probe e45a7d54-1462-4a6a-b138-5e03b8543640 name s new) v6 remote, [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0], new connection.
2019-01-09 06:18:16.958 7fa97fff5e40 10 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] create_connect [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0], creating connection and registering
2019-01-09 06:18:16.958 7fa97fff5e40 10 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_NONE l=0)._connect
2019-01-09 06:18:16.958 7fa97fff5e40 10 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] create_connect 0x55e3aba0f000 [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0]
2019-01-09 06:18:16.958 7fa97fff5e40  1 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] --> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] -- mon_probe(probe e45a7d54-1462-4a6a-b138-5e03b8543640 name s new) v6 -- 0x55e3ab9d9b80 con 0x55e3aba0f000
2019-01-09 06:18:16.958 7fa97fff5e40 15 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=START_CONNECT pgs=0 cs=0 l=0).send_message inline write is denied, reschedule m=0x55e3ab9d9b80
2019-01-09 06:18:16.958 7fa965919700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_CONNECTING l=0).process
2019-01-09 06:18:16.958 7fa965919700 10 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_CONNECTING_RE l=0).process nonblock connect inprogress
2019-01-09 06:18:16.958 7fa965919700  4 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_CONNECTING_RE l=0).handle_write
2019-01-09 06:18:16.958 7fa965919700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=START_CONNECT pgs=0 cs=0 l=0).write_event
2019-01-09 06:18:16.959 7fa965919700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_CONNECTING_RE l=0).process
2019-01-09 06:18:16.959 7fa965919700  1 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed 
2019-01-09 06:18:16.959 7fa965919700  2 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=STATE_CONNECTING_RE l=0).process connection refused!
2019-01-09 06:18:16.959 7fa965919700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=START_CONNECT pgs=0 cs=0 l=0).fault
2019-01-09 06:18:16.959 7fa965919700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3aba0f000 msgr2 :-1 s=START_CONNECT pgs=0 cs=0 l=0).fault waiting 0.200000
...
2019-01-09 06:18:17.063 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=33
2019-01-09 06:18:17.063 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).process
2019-01-09 06:18:17.063 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=33
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 r=0
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).wait_connect_message_auth
2019-01-09 06:18:17.063 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=158
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_auth r=0
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept got peer connect_seq 0 global_seq 21
2019-01-09 06:18:17.063 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
2019-01-09 06:18:17.063 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 13, their proto 13
2019-01-09 06:18:17.063 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 158
2019-01-09 06:18:17.063 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: challenging authorizer
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).send_connect_message_reply
2019-01-09 06:18:17.063 7fa965118700 10 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 58 remaining bytes 0
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_reply_write r=0
2019-01-09 06:18:17.063 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).wait_connect_message
2019-01-09 06:18:17.063 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=33
2019-01-09 06:18:17.068 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).process
2019-01-09 06:18:17.068 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=33
2019-01-09 06:18:17.068 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_1 r=0
2019-01-09 06:18:17.068 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).wait_connect_message_auth
2019-01-09 06:18:17.068 7fa965118700 20 -- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=158
2019-01-09 06:18:17.068 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_auth r=0
2019-01-09 06:18:17.068 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2
2019-01-09 06:18:17.068 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept got peer connect_seq 0 global_seq 21
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 13, their proto 13
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 158
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept setting up session_security.
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x55e3aba0f000 on [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0]
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x55e3aba0f000.gseq 0 <= 21, looks ok
2019-01-09 06:18:17.068 7fa965118700  1 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x55e3aba0f000.cseq 0 == 0, or we are server, replacing my attempt
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).replace accept replacing 0x55e3aba0f000
2019-01-09 06:18:17.068 7fa965118700 20 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-09 06:18:17.068 7fa965118700  2 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-09 06:18:17.068 7fa965118700 10 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).discard_out_queue started
2019-01-09 06:18:17.068 7fa965118700  1 --2- [v2:172.21.15.181:3306/0,v1:172.21.15.181:6795/0] >> [v2:172.21.15.90:3304/0,v1:172.21.15.90:6793/0] conn(0x55e3abb4e000 msgr2 :45792 s=CLOSED pgs=0 cs=0 l=0).replace stop myself to swap existing

/a/sage-2019-01-08_22:26:18-rados:multimon-wip-sage2-testing-2019-01-08-1310-distro-basic-smithi/3437300

#6 Updated by Greg Farnum about 1 year ago

  • Status changed from Fix Under Review to 7

#7 Updated by Sage Weil about 1 year ago

again,

2019-01-12 04:33:12.615 7fc167fa4700 10 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce9336900 msgr2 :49106 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept existing 0x55bce92e7600.gseq 0 <= 12, looks ok
2019-01-12 04:33:12.615 7fc167fa4700  1 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce9336900 msgr2 :49106 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTION_ESTABLISHED
2019-01-12 04:33:12.615 7fc167fa4700 10 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce9336900 msgr2 :49106 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0x55bce92e7600.cseq 0 == 0, or we are server, replacing my attempt
2019-01-12 04:33:12.615 7fc167fa4700 10 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce9336900 msgr2 :49106 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).replace accept replacing 0x55bce92e7600

but then
2019-01-12 04:33:12.623 7fc167fa4700 20 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce92e7600 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 r=-1
2019-01-12 04:33:12.623 7fc167fa4700  1 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce92e7600 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_connect_message_1 read connect msg failed
2019-01-12 04:33:12.623 7fc167fa4700 20 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce92e7600 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault
2019-01-12 04:33:12.623 7fc167fa4700 10 --2- v2:172.21.15.197:6806/11137 >> v2:172.21.15.12:6810/11067 conn(0x55bce92e7600 msgr2 :-1 s=ACCEPTING pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2019-01-12 04:33:12.631 7fc168fa6700 10 -- v2:172.21.15.197:6806/11137 connect_to v2:172.21.15.12:6810/11067 existing 0x55bce92e7600
2019-01-12 04:33:14.291 7fc168fa6700 10 -- v2:172.21.15.197:6806/11137 connect_to v2:172.21.15.12:6810/11067 existing 0x55bce92e7600
2019-01-12 04:33:14.291 7fc167fa4700 10 -- v2:172.21.15.197:6806/11137 connect_to v2:172.21.15.12:6810/11067 existing 0x55bce92e7600

/a/sage-2019-01-11_13:08:30-rados-wip-sage-testing-2019-01-10-1950-distro-basic-smithi/3446989

#8 Updated by Sage Weil about 1 year ago

  • Status changed from 7 to Pending Backport

#9 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #38241: mimic: msg/async: connection race + winner fault can leave connection in standby added

#10 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #38242: luminous: msg/async: connection race + winner fault can leave connection in standby added

#11 Updated by Greg Farnum about 1 year ago

  • Project changed from RADOS to Messengers

#12 Updated by Greg Farnum about 1 year ago

  • Category set to AsyncMessenger

#13 Updated by Nathan Cutler 7 months ago

  • Pull request ID set to 25754

Also available in: Atom PDF