Bug #38493

closed

msg/async: connection race + winner fault can leave connection stuck at replacing forever

Added by xie xingguo about 5 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: AsyncMessenger
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: luminous,mimic,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-02-02 09:31:03.402291 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept got peer connect_seq 55 global_seq 29026144

2019-02-02 09:31:03.402085 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept my proto 13, their proto 13
2019-02-02 09:31:03.402090 7f5f4935e700 10 mon.host-192-168-7-118@0(electing) e1 ms_verify_authorizer 100.100.7.122:6789/0 mon protocol 2
2019-02-02 09:31:03.402124 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security.
2019-02-02 09:31:03.402131 7f5f4935e700 1 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402144 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._try_send sent bytes 62 remaining bytes 0
2019-02-02 09:31:03.402154 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
2019-02-02 09:31:03.402278 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402286 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402291 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept got peer connect_seq 55 global_seq 29026144
2019-02-02 09:31:03.402298 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
2019-02-02 09:31:03.402303 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept my proto 13, their proto 13
2019-02-02 09:31:03.402308 7f5f4935e700 10 mon.host-192-168-7-118@0(electing) e1 ms_verify_authorizer 100.100.7.122:6789/0 mon protocol 2
2019-02-02 09:31:03.402342 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security.
2019-02-02 09:31:03.402349 7f5f4935e700 1 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402363 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._try_send sent bytes 62 remaining bytes 0
2019-02-02 09:31:03.402378 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
2019-02-02 09:31:03.402503 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402515 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402522 7f5f4935e700 20 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept got peer connect_seq 55 global_seq 29026145
2019-02-02 09:31:03.402527 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept of host_type 1, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=1
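
When triaging a log like the one above, it can help to count how often the "racing replace" message fires per connection, since a connection that keeps hitting it without leaving the accepting states matches the stuck behaviour reported here. The following is a minimal sketch, assuming only the log-line layout visible in this report. The function name `count_replace_races`, the regular expression and the sample excerpt are illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Start of a log entry: a timestamp such as "2019-02-02 09:31:03.402131".
TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Prefix of a messenger line as it appears in this report (assumed layout):
# timestamp, thread id, debug level, "-- local >> peer conn(ptr ... s=STATE".
ENTRY = re.compile(
    r"(?P<ts>" + TIMESTAMP + r") \S+ \d+ -- \S+ >> (?P<peer>\S+) "
    r"conn\((?P<conn>0x[0-9a-f]+) .*?s=(?P<state>\w+)"
)

RACE = "existing racing replace happened while replacing"


def count_replace_races(log_text):
    """Count 'racing replace' messages per (connection pointer, peer)."""
    # Collapse all whitespace so entries wrapped across lines are rejoined,
    # then split the text again in front of every timestamp.
    flat = " ".join(log_text.split())
    entries = re.split(r"(?=" + TIMESTAMP + r" )", flat)
    counts = Counter()
    for entry in entries:
        match = ENTRY.search(entry)
        if match and RACE in entry:
            counts[(match["conn"], match["peer"])] += 1
    return counts


# Excerpt copied from the description, including the original line wrapping.
SAMPLE = """\
2019-02-02 09:31:03.402131 7f5f4935e700 1 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg existing racing replace happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2019-02-02 09:31:03.402144 7f5f4935e700 10 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0)._try_send sent bytes 62 remaining bytes 0
2019-02-02 09:31:03.402349 7f5f4935e700 1 -- 100.100.7.118:6789/0 >> 100.100.7.122:6789/0 conn(0x55ec02821000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg existing racing replace happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
"""

for (conn, peer), hits in count_replace_races(SAMPLE).items():
    print(f"conn {conn} peer {peer}: {hits} racing-replace messages")
```

On the excerpt this reports two hits for conn 0x55ec02821000, both within about 0.2 ms of each other. A high count alone does not prove the connection is stuck; it only points at connections worth inspecting.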


Related issues 3 (0 open, 3 closed)

Copied to Messengers - Backport #39241: nautilus: msg/async: connection race + winner fault can leave connection stuck at replacing forever (Resolved, Prashant D)
Copied to Messengers - Backport #39242: mimic: msg/async: connection race + winner fault can leave connection stuck at replacing forever (Rejected, xie xingguo)
Copied to Messengers - Backport #39243: luminous: msg/async: connection race + winner fault can leave connection stuck at replacing forever (Resolved, xie xingguo)

#1

Updated by Greg Farnum about 5 years ago

Hmm, I thought Sage just fixed this bug, what's the exact sha1?

#2

Updated by xie xingguo about 5 years ago

Greg Farnum wrote:

Hmm, I thought Sage just fixed this bug, what's the exact sha1?

You mean http://tracker.ceph.com/issues/37779, right? Sadly it is not the same issue :-(

#3

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)

#4

Updated by Greg Farnum about 5 years ago

  • Category set to AsyncMessenger

#5

Updated by xie xingguo about 5 years ago

  • Status changed from New to Pending Backport
  • Backport set to luminous,mimic,nautilus

#6

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39241: nautilus: msg/async: connection race + winner fault can leave connection stuck at replacing forever added

#7

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39242: mimic: msg/async: connection race + winner fault can leave connection stuck at replacing forever added

#8

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39243: luminous: msg/async: connection race + winner fault can leave connection stuck at replacing forever added

#9

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
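
The rule stated in that note can be sketched as a small predicate. This is an illustration of the rule as worded above, not the actual code of "backport-create-issue". The name `parent_can_be_resolved` is hypothetical, and treating an issue with no backports as not resolvable is an assumption.

```python
# Statuses that count as "done" for a backport, per the note above.
DONE_STATUSES = {"Resolved", "Rejected"}


def parent_can_be_resolved(backport_statuses):
    """Return True when every backport is Resolved or Rejected."""
    statuses = list(backport_statuses)
    # Assumption: with no backports at all there is nothing to conclude.
    return bool(statuses) and all(s in DONE_STATUSES for s in statuses)


# Statuses of the three backports listed for this issue:
# nautilus (#39241), mimic (#39242), luminous (#39243).
print(parent_can_be_resolved(["Resolved", "Rejected", "Resolved"]))
```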
