Bug #37778
msg/async: mark_down vs accept race leaves connection registered
Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Command failed on smithi069 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status'
is the symptom.
The underlying issue is that some mons don't talk to each other due to a messenger weirdness:
2019-01-02 16:09:21.214 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x368ed80 172.21.15.66:6792/0
2019-01-02 16:09:21.214 7f127b5e8700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x368ed80 legacy :6791 s=CLOSED pgs=11 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-02 16:09:21.214 7f128df67f00 1 -- 172.21.15.66:6791/0 _send_to--> mon 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- ?+0 0x3180940
2019-01-02 16:09:21.214 7f128df67f00 1 -- 172.21.15.66:6791/0 --> 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- 0x3180940 con 0
2019-01-02 16:09:21.418 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.418 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.418 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:21.822 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.822 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.822 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:22.626 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:22.626 7f127ade7700 1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:22.626 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371ed80 172.21.15.66:6792/0
...
However, that existing connection never tries to reconnect.
/a/sage-2019-01-02_14:51:32-rados-master-distro-basic-smithi/3414672
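For readers unfamiliar with the messenger internals, the sequence above boils down to: an accept registers a connection, mark_down_all() closes it but the entry stays registered, and every later incoming attempt from that peer fails against the stale entry while the closed connection itself never reconnects. The following is a minimal, self-contained toy model of that interleaving (all names such as FakeMessenger, Conn and handle_incoming are invented for illustration; this is not the actual AsyncMessenger code):

#include <iostream>
#include <map>
#include <memory>
#include <string>

enum class State { ACCEPTING, OPEN, CLOSED };

struct Conn {
  std::string peer;
  State state = State::ACCEPTING;
};

struct FakeMessenger {
  // peer address -> registered connection, standing in for the messenger's conns map
  std::map<std::string, std::shared_ptr<Conn>> conns;

  // accept path: register the accepting connection for its peer address
  std::shared_ptr<Conn> accept_conn(const std::string& peer) {
    auto c = std::make_shared<Conn>();
    c->peer = peer;
    conns[peer] = c;   // registered, but the accept handshake is still in flight
    return c;
  }

  // bootstrap()-style mark_down_all(): close every connection, but in this buggy
  // model the entry stays in the map -- the "leaves connection registered" part
  void mark_down_all() {
    for (auto& kv : conns)
      kv.second->state = State::CLOSED;
  }

  // a later incoming connection from the same peer finds the "existing" closed
  // connection and gives up, mirroring "existing ... already closed" /
  // "just fail later one(this)" in the log above
  bool handle_incoming(const std::string& peer) {
    auto it = conns.find(peer);
    if (it != conns.end() && it->second->state == State::CLOSED) {
      std::cout << "existing conn to " << peer << " already closed; failing this attempt\n";
      return false;   // the new attempt dies; the existing conn never reconnects either
    }
    conns[peer] = std::make_shared<Conn>();
    conns[peer]->state = State::OPEN;
    return true;
  }
};

int main() {
  FakeMessenger msgr;
  auto c = msgr.accept_conn("172.21.15.66:6792");   // accept starts, conn gets registered
  msgr.mark_down_all();                             // bootstrap() races with the accept
  // the accept path would now log "state changed while accept_conn, it must be mark_down"
  std::cout << "accepted conn closed but still registered: "
            << (c->state == State::CLOSED && msgr.conns.count(c->peer)) << "\n";

  // every retry from the peer now fails against the stale registered entry
  msgr.handle_incoming("172.21.15.66:6792");
  msgr.handle_incoming("172.21.15.66:6792");
}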
Related issues
History
#1 Updated by Sage Weil about 5 years ago
The mark_down comes from mark_down_all() in bootstrap(), which is possibly no longer necessary, since the msgr tries reasonably hard not to identify itself beyond the type:
if (newrank != rank) {
  dout(0) << " my rank is now " << newrank << " (was " << rank << ")" << dendl;
  messenger->set_myname(entity_name_t::MON(newrank));
  rank = newrank;

  // reset all connections, or else our peers will think we are someone else.
  messenger->mark_down_all();
}
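Either direction would presumably close the window: dropping the mark_down_all() from bootstrap() as suggested here, or having the accept path clean up after itself when it detects the mark_down. Below is only a hypothetical sketch of the latter, continuing the toy model in the description above (finish_accept is an invented name; this is not the change that was actually merged for this ticket):

// hypothetical follow-up to the toy model above: when the accept path sees that
// the connection was closed underneath it ("state changed while accept_conn, it
// must be mark_down"), drop the stale registration instead of leaving the CLOSED
// connection in the map
bool finish_accept(FakeMessenger& msgr, const std::shared_ptr<Conn>& c) {
  if (c->state == State::CLOSED) {            // raced with mark_down_all()
    auto it = msgr.conns.find(c->peer);
    if (it != msgr.conns.end() && it->second == c)
      msgr.conns.erase(it);                   // do not leave it registered
    return false;                             // fail this accept; the peer can retry cleanly
  }
  c->state = State::OPEN;
  return true;
}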
#2 Updated by Sage Weil about 5 years ago
- Status changed from 12 to Fix Under Review
#3 Updated by Greg Farnum about 5 years ago
So the quorum forms correctly, but there is just one monitor that doesn't get in because it's not connecting to anybody?
#4 Updated by Sage Weil about 5 years ago
- Priority changed from High to Urgent
#5 Updated by Sage Weil about 5 years ago
Another, more recent instance:
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_1 r=0
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).wait_connect_message_auth
2019-01-10 02:29:57.299 7f62ea13e700 20 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=158
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_auth r=0
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept got peer connect_seq 0 global_seq 14
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 13, their proto 13
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 158
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept setting up session_security.
2019-01-10 02:29:57.299 7f6302e9df00 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-10 02:29:57.299 7f6302e9df00 2 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-10 02:29:57.299 7f6302e9df00 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).discard_out_queue started
2019-01-10 02:29:57.299 7f6302e9df00 5 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] shutdown_connections delete 0x4b6f600
2019-01-10 02:29:57.299 7f6302e9df00 5 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] shutdown_connections delete 0x4b6fa80
2019-01-10 02:29:57.299 7f62e993d700 10 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] accept_conn 0x4b6f600 [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0]
2019-01-10 02:29:57.299 7f62e993d700 1 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-10 02:29:57.299 7f62e993d700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).accept fault after register
2019-01-10 02:29:57.299 7f62e993d700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).fault
2019-01-10 02:29:57.299 7f62e993d700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).fault connection is already closed
/a/sage-2019-01-10_00:51:23-rados:multimon-wip-sage2-testing-2019-01-09-1610-distro-basic-smithi/3440889
mon.l, connection with mon.o
#6 Updated by Sage Weil about 5 years ago
- Subject changed from ceph quorum_status fail from multimon 21.yaml to msg/async: mark_down vs accept race leaves connection registered
#7 Updated by Sage Weil about 5 years ago
Just a note: I was able to reliably reproduce this with rados/multimon (no subsets), filtering to only run 21.yaml, mon_recovery.yaml, and the async messenger; usually 2-3 failures per run (of 92 tests), with the symptom that the quorum_status command timed out.
#8 Updated by Sage Weil about 5 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to mimic,luminous
#9 Updated by Nathan Cutler about 5 years ago
- Copied to Backport #37896: mimic: msg/async: mark_down vs accept race leaves connection registered added
#10 Updated by Nathan Cutler about 5 years ago
- Copied to Backport #37897: luminous: msg/async: mark_down vs accept race leaves connection registered added
#11 Updated by Nathan Cutler about 5 years ago
- Duplicated by Bug #36175: msg/async: heartbeat timed out caused by connection register failure added
#12 Updated by Nathan Cutler about 5 years ago
- Status changed from Pending Backport to Resolved
#13 Updated by Greg Farnum about 5 years ago
- Project changed from RADOS to Messengers