https://tracker.ceph.com/
2019-01-02T18:56:50Z
Ceph
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=126686
2019-01-02T18:56:50Z
Sage Weil
sage@newdream.net
<ul></ul><p>The mark_down comes from mark_down_all() in bootstrap(), which is possibly no longer necessary, since the messenger tries reasonably hard not to identify itself beyond the entity type.<br /><pre>
if (newrank != rank) {
  dout(0) << " my rank is now " << newrank << " (was " << rank << ")" << dendl;
  messenger->set_myname(entity_name_t::MON(newrank));
  rank = newrank;
  // reset all connections, or else our peers will think we are someone else.
  messenger->mark_down_all();
}
</pre></p>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=126696
2019-01-02T22:12:48Z
Sage Weil
sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/25755">https://github.com/ceph/ceph/pull/25755</a></p>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=126698
2019-01-02T22:13:30Z
Greg Farnum
gfarnum@redhat.com
<ul></ul><p>So the quorum forms correctly, but there is just one monitor that doesn't get in because it's not connecting to anybody?</p>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127125
2019-01-09T13:38:36Z
Sage Weil
sage@newdream.net
<ul><li><strong>Priority</strong> changed from <i>High</i> to <i>Urgent</i></li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127192
2019-01-10T13:31:50Z
Sage Weil
sage@newdream.net
<ul></ul><p>Another, more recent instance:<br /><pre>
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_1 r=0
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).wait_connect_message_auth
2019-01-10 02:29:57.299 7f62ea13e700 20 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=158
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_auth r=0
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2
2019-01-10 02:29:57.299 7f62ea13e700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept got peer connect_seq 0 global_seq 14
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 13, their proto 13
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 158
2019-01-10 02:29:57.299 7f62ea13e700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept setting up session_security.
2019-01-10 02:29:57.299 7f6302e9df00 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-10 02:29:57.299 7f6302e9df00 2 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).stop
2019-01-10 02:29:57.299 7f6302e9df00 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3306/0,v1:172.21.15.150:6795/0] 0x4a7ac00 conn(0x4b6fa80 msgr2=0x4a7ac00 :36622 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).discard_out_queue started
2019-01-10 02:29:57.299 7f6302e9df00 5 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] shutdown_connections delete 0x4b6f600
2019-01-10 02:29:57.299 7f6302e9df00 5 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] shutdown_connections delete 0x4b6fa80
2019-01-10 02:29:57.299 7f62e993d700 10 -- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] accept_conn 0x4b6f600 [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0]
2019-01-10 02:29:57.299 7f62e993d700 1 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).open state changed while accept_conn, it must be mark_down
2019-01-10 02:29:57.299 7f62e993d700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).accept fault after register
2019-01-10 02:29:57.299 7f62e993d700 20 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).fault
2019-01-10 02:29:57.299 7f62e993d700 10 --2- [v2:172.21.15.150:3303/0,v1:172.21.15.150:6792/0] >> [v2:172.21.15.150:3304/0,v1:172.21.15.150:6793/0] 0x4a7a600 conn(0x4b6f600 msgr2=0x4a7a600 :36618 s=CLOSED pgs=12 cs=1 l=0).fault connection is already closed
</pre><br />//a/sage-2019-01-10_00:51:23-rados:multimon-wip-sage2-testing-2019-01-09-1610-distro-basic-smithi/3440889<br />mon.l, connection with mon.o</p>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127212
2019-01-10T16:45:05Z
Sage Weil
sage@newdream.net
<ul><li><strong>Subject</strong> changed from <i>ceph quorum_status fail from multimon 21.yaml</i> to <i>msg/async: mark_down vs accept race leaves connection registered</i></li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127213
2019-01-10T16:52:08Z
Sage Weil
sage@newdream.net
<ul></ul><p>Just a note: I was able to reliably reproduce this with rados/multimon (no subsets), filtering to only the 21.yaml, mon_recovery.yaml, and the async messenger. I usually saw 2-3 failures per run (of 92 tests), with the symptom that the quorum_status command timed out.</p>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127247
2019-01-11T00:13:07Z
Sage Weil
sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>mimic,luminous</i></li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127356
2019-01-14T10:41:30Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/37896">Backport #37896</a>: mimic: msg/async: mark_down vs accept race leaves connection registered</i> added</li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127358
2019-01-14T10:41:38Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/37897">Backport #37897</a>: luminous: msg/async: mark_down vs accept race leaves connection registered</i> added</li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=127653
2019-01-17T16:45:17Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/36175">Bug #36175</a>: msg/async: heartbeat timed out caused by connection register failure</i> added</li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=130450
2019-03-01T15:26:13Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>
Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered
https://tracker.ceph.com/issues/37778?journal_id=131837
2019-03-12T23:16:43Z
Greg Farnum
gfarnum@redhat.com
<ul><li><strong>Project</strong> changed from <i>RADOS</i> to <i>Messengers</i></li></ul>