Bug #36602

closed

osd: race condition opening heartbeat connection

Added by Sage Weil over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-10-25 15:38:56.698 7f2701c6d700 10 osd.3 13  adding random peer osd.7
2018-10-25 15:38:56.698 7f27124f9700  0 --  >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2018-10-25 15:38:56.698 7f27124f9700  1 --  >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 53
2018-10-25 15:38:56.698 7f27124f9700  1 --  >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2018-10-25 15:38:56.698 7f27124f9700  1 -- - >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_server_banner_and_identify read banner and identify addresses failed
2018-10-25 15:38:56.698 7f27124f9700  1 -- - >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).fault on lossy channel, failing
2018-10-25 15:38:56.698 7f27124f9700 10 osd.3 13 OSD::ms_get_authorizer type=osd
2018-10-25 15:38:56.698 7f2701c6d700 10 osd.3 13 _add_heartbeat_peer: new peer osd.7 172.21.15.173:6811/10887 172.21.15.173:6810/10887

/a/sage-2018-10-24_19:59:09-rados-wip-sage-testing-2018-10-24-1258-distro-basic-smithi/3179821

The problem is that the code adding the heartbeat peer is:

    pair<ConnectionRef,ConnectionRef> cons = service.get_con_osd_hb(p, osdmap->get_epoch());
    if (!cons.first)
      return;
    hi = &heartbeat_peers[p];
    hi->peer = p;
    RefCountedPtr s{new HeartbeatSession{p}, false};
    hi->con_back = cons.first.get();
    hi->con_back->set_priv(s);

i.e., the connection is opened first, and the session is attached only afterward. But the heartbeat reset handler is:

bool OSD::heartbeat_reset(Connection *con)
{
  auto s = con->get_priv();
  if (s) {
    heartbeat_lock.Lock();

So if the connection fails before set_priv() runs, get_priv() returns null and we may ignore the reset.
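The window can be reproduced outside of Ceph with a small model. The sketch below is only an illustration under stated assumptions: `Session`, `Conn`, `reset_unlocked`, `reset_locked`, and `add_peer_with_fault` are hypothetical stand-ins, not the real OSD or messenger classes, and it assumes the code that adds a peer holds `heartbeat_lock` while it opens the connection and attaches the session. `reset_locked` shows one possible way to close the window (take the lock before reading the session); it is not necessarily the fix that was merged.

```cpp
#include <chrono>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-ins for illustration; not Ceph's actual classes.
struct Session { int peer; };

struct Conn {
  std::mutex m;
  std::shared_ptr<Session> priv;
  void set_priv(std::shared_ptr<Session> s) {
    std::lock_guard<std::mutex> l(m);
    priv = std::move(s);
  }
  std::shared_ptr<Session> get_priv() {
    std::lock_guard<std::mutex> l(m);
    return priv;
  }
};

// Assumption: the code that adds a peer holds this lock while it works.
std::mutex heartbeat_lock;

// Same shape as the handler quoted above: the session is read before
// heartbeat_lock is taken, so a missing session means the reset is dropped.
bool reset_unlocked(Conn& con) {
  auto s = con.get_priv();
  if (!s) return false;
  std::lock_guard<std::mutex> l(heartbeat_lock);
  return true;
}

// Variant that takes heartbeat_lock first, so it cannot observe the
// connection in its half-initialized state.
bool reset_locked(Conn& con) {
  std::lock_guard<std::mutex> l(heartbeat_lock);
  return con.get_priv() != nullptr;
}

// "Opens" a connection, lets a fault fire inside the window, then attaches
// the session. Returns whether the handler acted on the fault.
bool add_peer_with_fault(bool (*handler)(Conn&)) {
  Conn con;
  std::promise<void> started;
  std::future<bool> handled;
  {
    std::lock_guard<std::mutex> l(heartbeat_lock);
    // The connection exists here but has no session attached yet.
    handled = std::async(std::launch::async, [&] {
      started.set_value();
      return handler(con);
    });
    started.get_future().wait();
    // Keep the window open long enough for the unlocked handler to run.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    con.set_priv(std::make_shared<Session>(Session{7}));
  }
  return handled.get();
}
```

With this model, `add_peer_with_fault(reset_unlocked)` should report the fault as ignored, because the handler runs while the session is still unset. `add_peer_with_fault(reset_locked)` should report it as handled, because the handler blocks on `heartbeat_lock` until the session has been attached. The 100 ms sleep only widens the window for the demonstration; the real race needs no such delay.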

Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #36636: luminous: osd: race condition opening heartbeat connection (Resolved, Nathan Cutler)
Copied to RADOS - Backport #36637: mimic: osd: race condition opening heartbeat connection (Resolved, Nathan Cutler)
#1

Updated by Sage Weil over 5 years ago

  • Status changed from In Progress to Fix Under Review
#2

Updated by Greg Farnum over 5 years ago

  • Status changed from Fix Under Review to 7
#3

Updated by Sage Weil over 5 years ago

  • Status changed from 7 to Pending Backport
#4

Updated by Patrick Donnelly over 5 years ago

  • Copied to Backport #36636: luminous: osd: race condition opening heartbeat connection added
#5

Updated by Patrick Donnelly over 5 years ago

  • Copied to Backport #36637: mimic: osd: race condition opening heartbeat connection added
#6

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved