Bug #36602

osd: race condition opening heartbeat connection

Added by Sage Weil 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
10/26/2018
Due date:
% Done:
0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2018-10-25 15:38:56.698 7f2701c6d700 10 osd.3 13  adding random peer osd.7
2018-10-25 15:38:56.698 7f27124f9700  0 --  >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2018-10-25 15:38:56.698 7f27124f9700  1 --  >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 53
2018-10-25 15:38:56.698 7f27124f9700  1 --  >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2018-10-25 15:38:56.698 7f27124f9700  1 -- - >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_server_banner_and_identify read banner and identify addresses failed
2018-10-25 15:38:56.698 7f27124f9700  1 -- - >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).fault on lossy channel, failing
2018-10-25 15:38:56.698 7f27124f9700 10 osd.3 13 OSD::ms_get_authorizer type=osd
2018-10-25 15:38:56.698 7f2701c6d700 10 osd.3 13 _add_heartbeat_peer: new peer osd.7 172.21.15.173:6811/10887 172.21.15.173:6810/10887

/a/sage-2018-10-24_19:59:09-rados-wip-sage-testing-2018-10-24-1258-distro-basic-smithi/3179821

The problem is that the code in _add_heartbeat_peer is

    pair<ConnectionRef,ConnectionRef> cons = service.get_con_osd_hb(p, osdmap->get_epoch());
    if (!cons.first)
      return;
    hi = &heartbeat_peers[p];
    hi->peer = p;
    RefCountedPtr s{new HeartbeatSession{p}, false};
    hi->con_back = cons.first.get();
    hi->con_back->set_priv(s);

i.e., the connection is opened first, and the session is attached only afterwards. But the heartbeat reset handler is

bool OSD::heartbeat_reset(Connection *con)
{
  auto s = con->get_priv();
  if (s) {
    heartbeat_lock.Lock();

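so if the connection faults before set_priv() has been called (as in the log above, where the fault is logged before the _add_heartbeat_peer line), get_priv() returns null and we may ignore the reset.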
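To make the ordering problem concrete, here is a minimal single-threaded sketch of the two interleavings. It is not the Ceph code: Connection and HeartbeatSession below are hypothetical stand-ins that only model set_priv()/get_priv(), and resets_handled_when_fault_arrives is a helper invented for this illustration. It assumes the only thing that matters is whether the reset handler runs before or after the session is attached.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for the Ceph types; they model only the priv pointer.
struct HeartbeatSession {
  int peer;
};

struct Connection {
  std::shared_ptr<HeartbeatSession> priv;

  void set_priv(std::shared_ptr<HeartbeatSession> s) {
    assert(s);  // a session must be supplied
    priv = std::move(s);
  }
  std::shared_ptr<HeartbeatSession> get_priv() const { return priv; }
};

// Same shape as the handler quoted above: without an attached session,
// the reset is silently dropped.
bool heartbeat_reset(Connection& con, int& resets_handled) {
  auto s = con.get_priv();
  if (s) {
    ++resets_handled;  // the real handler would reopen the connection here
  }
  return true;
}

// Replays "open connection, then attach session" with the fault delivered
// either before or after set_priv(). Returns how many resets were acted on.
int resets_handled_when_fault_arrives(bool before_session_attached) {
  Connection con;  // connection is "opened" with no session yet
  int handled = 0;

  if (before_session_attached) {
    heartbeat_reset(con, handled);  // fault wins the race: no session yet
  }
  con.set_priv(std::make_shared<HeartbeatSession>(HeartbeatSession{7}));
  if (!before_session_attached) {
    heartbeat_reset(con, handled);  // fault arrives after the session exists
  }
  return handled;
}
```

In this model, a fault delivered before set_priv() is never acted on, which leaves the peer entry holding a dead connection; a fault delivered afterwards is handled normally.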

Related issues

Copied to RADOS - Backport #36636: luminous: osd: race condition opening heartbeat connection Resolved
Copied to RADOS - Backport #36637: mimic: osd: race condition opening heartbeat connection Resolved

History

#1 Updated by Sage Weil 3 months ago

  • Status changed from In Progress to Need Review

#2 Updated by Greg Farnum 3 months ago

  • Status changed from Need Review to Testing

#3 Updated by Sage Weil 3 months ago

  • Status changed from Testing to Pending Backport

#4 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #36636: luminous: osd: race condition opening heartbeat connection added

#5 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #36637: mimic: osd: race condition opening heartbeat connection added

#6 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved
