Project

General

Profile

Bug #37778

msg/async: mark_down vs accept race leaves connection registered

Added by Sage Weil 10 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
Start date:
01/02/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Command failed on smithi069 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status'

is the symptom.

issue is that some mons don't talk due to a messenger weirdness:

2019-01-02 16:09:21.214 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x368ed80 172.21.15.66:6792/0
2019-01-02 16:09:21.214 7f127b5e8700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x368ed80 legacy :6791 s=CLOSED pgs=11 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-02 16:09:21.214 7f128df67f00  1 -- 172.21.15.66:6791/0 _send_to--> mon 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- ?+0 0x3180940
2019-01-02 16:09:21.214 7f128df67f00  1 -- 172.21.15.66:6791/0 --> 172.21.15.66:6792/0 -- mon_probe(probe 7f30b969-bf2d-48db-81a8-df6b01bd60fb name i) v6 -- 0x3180940 con 0
2019-01-02 16:09:21.418 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.418 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.418 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:21.822 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:21.822 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371e480 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:21.822 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371e480 172.21.15.66:6792/0
2019-01-02 16:09:22.626 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 existing 0x368ed80 already closed.
2019-01-02 16:09:22.626 7f127ade7700  1 -- 172.21.15.66:6791/0 >> 172.21.15.66:6792/0 conn(0x371ed80 legacy :6791 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=11 cs=1 l=0).open existing race replacing process for addr = 172.21.15.66:6792/0 just fail later one(this)
2019-01-02 16:09:22.626 7f1276ddf700 10 mon.i@7(probing) e1 ms_handle_reset 0x371ed80 172.21.15.66:6792/0
...

but that existing connection doesn't try to connect.

/a/sage-2019-01-02_14:51:32-rados-master-distro-basic-smithi/3414672


Related issues

Duplicated by Messengers - Bug #36175: msg/async: heartbeat timed out caused by connection register failure Duplicate 09/25/2018
Copied to Messengers - Backport #37896: mimic: msg/async: mark_down vs accept race leaves connection registered Resolved
Copied to Messengers - Backport #37897: luminous: msg/async: mark_down vs accept race leaves connection registered Resolved

History

#1 Updated by Sage Weil 10 months ago

the mark_down comes from mark_down_all() in bootstrap(), which is possibly no longer necessary, since the msgr tries reasonably hard not to identify itself beyond the type.

  if (newrank != rank) {
    dout(0) << " my rank is now " << newrank << " (was " << rank << ")" << dendl;
    messenger->set_myname(entity_name_t::MON(newrank));
    rank = newrank;

    // reset all connections, or else our peers will think we are someone else.
    messenger->mark_down_all();
  }

#2 Updated by Sage Weil 10 months ago

  • Status changed from Verified to Need Review

#3 Updated by Greg Farnum 10 months ago

So the quorum forms correctly, but just there is a monitor that doesn't get in because it's not connecting to anybody?

#4 Updated by Sage Weil 9 months ago

  • Priority changed from High to Urgent

#5 Updated by Sage Weil 9 months ago

another more recent instance

2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_1 r=0
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).wait_connect_message_auth
2019-01-10 02:29:57.299 7f62ea13e700 20 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=158
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_auth r=0
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept got peer connect_seq 0 global_seq 14
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 13, their proto 13
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 158
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept setting up session_security.
2019-01-10 02:29:57.299 7f6302e9df00 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-10 02:29:57.299 7f6302e9df00  2 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-10 02:29:57.299 7f6302e9df00 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).discard_out_queue started
2019-01-10 02:29:57.299 7f6302e9df00  5 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] shutdown_connections delete 0x4b6f600
2019-01-10 02:29:57.299 7f6302e9df00  5 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] shutdown_connections delete 0x4b6fa80
2019-01-10 02:29:57.299 7f62e993d700 10 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] accept_conn 0x4b6f600 [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0]
2019-01-10 02:29:57.299 7f62e993d700  1 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-10 02:29:57.299 7f62e993d700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).accept fault after register
2019-01-10 02:29:57.299 7f62e993d700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).fault
2019-01-10 02:29:57.299 7f62e993d700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).fault connection is already closed

//a/sage-2019-01-10_00:51:23-rados:multimon-wip-sage2-testing-2019-01-09-1610-distro-basic-smithi/3440889
mon.l, connection with mon.o

#6 Updated by Sage Weil 9 months ago

  • Subject changed from ceph quorum_status fail from multimon 21.yaml to msg/async: mark_down vs accept race leaves connection registered

#7 Updated by Sage Weil 9 months ago

just a note, i was able to reliably reproduce this with rados/multimon (no subsets), filtering to only do the 21.yaml, mon_recovery.yaml, and async messenger.. usually 2-3 failures per run (of 92 tests), with the symptom that the quorum_status command timed out.

#8 Updated by Sage Weil 9 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to mimic,luminous

#9 Updated by Nathan Cutler 9 months ago

  • Copied to Backport #37896: mimic: msg/async: mark_down vs accept race leaves connection registered added

#10 Updated by Nathan Cutler 9 months ago

  • Copied to Backport #37897: luminous: msg/async: mark_down vs accept race leaves connection registered added

#11 Updated by Nathan Cutler 9 months ago

  • Duplicated by Bug #36175: msg/async: heartbeat timed out caused by connection register failure added

#12 Updated by Nathan Cutler 8 months ago

  • Status changed from Pending Backport to Resolved

#13 Updated by Greg Farnum 7 months ago

  • Project changed from RADOS to Messengers

Also available in: Atom PDF