Bug #37778
msg/async: mark_down vs accept race leaves connection registered
Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
The symptom is:
Command failed on smithi069 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status'
The issue is that some mons don't talk to each other due to a messenger race:
2019-01-02 16:09:21.214 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x368ed80 172.21.15.66:6792/0
2019-01-02 16:09:21.214 7f127b5e8700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x368ed80 legacy :6791 s=CLOSED pgs=11 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-02 16:09:21.214 7f128df67f00 1 -- 172.21.15.66:6791/0 _send_to--> mon 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- ?+0 0x3180940
2019-01-02 16:09:21.214 7f128df67f00 1 -- 172.21.15.66:6791/0 --> 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- 0x3180940 con 0
2019-01-02 16:09:21.418 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.418 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.418 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:21.822 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.822 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.822 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:22.626 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:22.626 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:22.626 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371ed80 172.21.15.66:6792/0
...
But that existing connection (0x368ed80) is already closed and never tries to connect, so each new incoming connection is failed in its favor.
/a/sage-2019-01-02_14:51:32-rados-master-distro-basic-smithi/3414672
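For anyone trying to follow the ordering, here is a toy model of the race as I read it from the log above. It is only a sketch: Registry, Conn, accept, mark_down and the check_state flag are hypothetical names for illustration, not the actual msg/async code, and the guard shown is just one possible way to avoid the stale entry, not necessarily the fix that was applied.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    # Hypothetical stand-in for a messenger connection.
    name: str
    state: str = "ACCEPTING"


class Registry:
    """Toy per-peer connection table (not Ceph's real data structure)."""

    def __init__(self):
        self.conns = {}  # peer address -> registered Conn

    def mark_down(self, peer, conn):
        # Close the connection and unregister it if it is registered right now.
        conn.state = "CLOSED"
        if self.conns.get(peer) is conn:
            del self.conns[peer]

    def accept(self, peer, conn, check_state):
        # Returns True if conn ends up registered for peer.
        existing = self.conns.get(peer)
        if existing is not None and existing is not conn:
            return False  # the later connection loses to the existing entry
        if check_state and conn.state == "CLOSED":
            return False  # assumed guard: never register a closed connection
        self.conns[peer] = conn
        return True


def simulate(check_state):
    reg = Registry()
    peer = "172.21.15.66:6792/0"
    first = Conn("0x368ed80")
    # mark_down runs before the accept path registers the connection,
    # so there is nothing to unregister yet.
    reg.mark_down(peer, first)
    first_registered = reg.accept(peer, first, check_state)
    # A fresh incoming connection from the same peer arrives afterwards.
    later = Conn("0x371e480")
    later_accepted = reg.accept(peer, later, check_state)
    return first_registered, later_accepted


if __name__ == "__main__":
    print(simulate(check_state=False))  # (True, False): closed conn stays registered
    print(simulate(check_state=True))   # (False, True): new conn can register
```

Without the guard, the closed connection is left registered and every later accept from that peer is rejected, which matches the repeating "existing ... already closed" / "just fail later one(this)" pattern in the log.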