Bug #37778

closed

msg/async: mark_down vs accept race leaves connection registered

Added by Sage Weil over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

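The symptom is a quorum_status command that times out (status 124 is the exit code timeout returns when the time limit is hit):

Command failed on smithi069 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status'

The underlying issue is that some mons can't talk to each other because of a messenger problem: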

2019-01-02 16:09:21.214 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x368ed80 172.21.15.66:6792/0
2019-01-02 16:09:21.214 7f127b5e8700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x368ed80 legacy :6791 s=CLOSED pgs=11 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-02 16:09:21.214 7f128df67f00  1 -- 172.21.15.66:6791/0 _send_to--> mon 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- ?+0 0x3180940
2019-01-02 16:09:21.214 7f128df67f00  1 -- 172.21.15.66:6791/0 --> 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- 0x3180940 con 0
2019-01-02 16:09:21.418 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.418 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.418 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:21.822 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.822 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.822 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:22.626 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:22.626 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:22.626 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371ed80 172.21.15.66:6792/0
...

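Every new incoming connection is failed in favor of the existing connection 0x368ed80, but that existing connection is already closed and never tries to connect.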
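The interleaving in the log can be reproduced with a toy model. This is only a sketch for illustration and is not Ceph's AsyncMessenger code: the Conn and Registry classes, the replay function, and the check_state guard are all hypothetical names, and the guard is not claimed to be the fix that was merged. It shows how a mark_down that runs before the accept path registers the connection can leave a CLOSED connection in the table, so that every later accept for the same peer address loses the "existing" check.

```python
from dataclasses import dataclass, field


@dataclass
class Conn:
    name: str
    peer: str
    state: str = "ACCEPTING"


@dataclass
class Registry:
    # Hypothetical stand-in for the messenger's table: peer address -> Conn.
    conns: dict = field(default_factory=dict)

    def mark_down(self, conn):
        # Close the connection and drop it from the table if it is registered.
        conn.state = "CLOSED"
        if self.conns.get(conn.peer) is conn:
            del self.conns[conn.peer]

    def accept_conn(self, conn, check_state=False):
        # Try to register an accepted connection; True means it was registered.
        if self.conns.get(conn.peer) is not None:
            return False  # an existing entry wins, the later connection fails
        if check_state and conn.state == "CLOSED":
            return False  # hypothetical guard: never register a closed conn
        self.conns[conn.peer] = conn
        return True


def replay(check_state):
    reg = Registry()
    first = Conn("0x368ed80", "172.21.15.66:6792/0")
    # Racy order: mark_down runs before the accept path registers the conn,
    # so the unregister step finds nothing to remove.
    reg.mark_down(first)
    reg.accept_conn(first, check_state)
    # The peer retries with fresh connections until one is registered.
    for name in ("0x371e480", "0x371ed80"):
        if reg.accept_conn(Conn(name, first.peer), check_state):
            return name
    return None
```

Without the guard, replay(False) returns None: the closed connection stays registered and both retries fail. With the guard, replay(True) returns the name of the first retry, because the closed connection never enters the table.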

/a/sage-2019-01-02_14:51:32-rados-master-distro-basic-smithi/3414672
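To check whether another mon log shows the same pattern, one could count how often accept attempts report the same existing connection as already closed. A minimal sketch, assuming the log wording matches the excerpt above; the function name count_closed_existing is made up for illustration.

```python
import re
from collections import Counter

# Matches the "existing <ptr> already closed" message from the excerpt above.
CLOSED_EXISTING = re.compile(
    r"handle_connect_message_2 existing (0x[0-9a-f]+) already closed"
)


def count_closed_existing(lines):
    # Count, per connection pointer, how many accept attempts found it closed.
    hits = Counter()
    for line in lines:
        match = CLOSED_EXISTING.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits
```

A pointer that keeps showing up across retries, as 0x368ed80 does here, points at a closed connection that is still registered.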


Related issues 3 (0 open, 3 closed)

Has duplicate Messengers - Bug #36175: msg/async: heartbeat timed out caused by connection register failure (Duplicate, 09/25/2018)
Copied to Messengers - Backport #37896: mimic: msg/async: mark_down vs accept race leaves connection registered (Resolved, xie xingguo)
Copied to Messengers - Backport #37897: luminous: msg/async: mark_down vs accept race leaves connection registered (Resolved, xie xingguo)