Bug #36602
osd: race condition opening heartbeat connection
Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2018-10-25 15:38:56.698 7f2701c6d700 10 osd.3 13 adding random peer osd.7
2018-10-25 15:38:56.698 7f27124f9700 0 -- >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2018-10-25 15:38:56.698 7f27124f9700 1 -- >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 53
2018-10-25 15:38:56.698 7f27124f9700 1 -- >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2018-10-25 15:38:56.698 7f27124f9700 1 -- - >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_server_banner_and_identify read banner and identify addresses failed
2018-10-25 15:38:56.698 7f27124f9700 1 -- - >> 172.21.15.173:6811/10887 conn(0x55f7f3490d80 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).fault on lossy channel, failing
2018-10-25 15:38:56.698 7f27124f9700 10 osd.3 13 OSD::ms_get_authorizer type=osd
2018-10-25 15:38:56.698 7f2701c6d700 10 osd.3 13 _add_heartbeat_peer: new peer osd.7 172.21.15.173:6811/10887 172.21.15.173:6810/10887
/a/sage-2018-10-24_19:59:09-rados-wip-sage-testing-2018-10-24-1258-distro-basic-smithi/3179821
The problem is that the code is
    pair<ConnectionRef,ConnectionRef> cons = service.get_con_osd_hb(p, osdmap->get_epoch());
    if (!cons.first)
      return;
    hi = &heartbeat_peers[p];
    hi->peer = p;
    RefCountedPtr s{new HeartbeatSession{p}, false};
    hi->con_back = cons.first.get();
    hi->con_back->set_priv(s);
That is, the connection is opened first and the session is attached only afterwards. But the heartbeat reset handler is
    bool OSD::heartbeat_reset(Connection *con)
    {
      auto s = con->get_priv();
      if (s) {
        heartbeat_lock.Lock();
So if the connection faults before set_priv() has attached the session, get_priv() returns null and we may ignore the reset.
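The ordering problem can be reproduced in a small, single-threaded model. This is only a sketch: `Connection`, `HeartbeatSession`, `fault()` and the three scenario functions are hypothetical stand-ins, not the real Ceph classes, and the "recheck after attach" variant is one possible mitigation assumed for illustration, not necessarily the fix that was merged.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for the Ceph types; not the real classes.
struct HeartbeatSession { int peer; };

struct Connection {
  std::shared_ptr<HeartbeatSession> priv;  // session attached via set_priv()
  bool connected = true;
  void set_priv(std::shared_ptr<HeartbeatSession> s) { priv = std::move(s); }
  std::shared_ptr<HeartbeatSession> get_priv() const { return priv; }
};

// Same shape as the handler in the report: without a session it does nothing.
// Returns true if the reset was actually handled.
bool heartbeat_reset(Connection& con) {
  auto s = con.get_priv();
  if (!s)
    return false;  // reset silently ignored
  // The real handler would take heartbeat_lock and reopen the connection here.
  return true;
}

// Models the messenger thread faulting the connection and notifying the OSD.
bool fault(Connection& con) {
  con.connected = false;
  return heartbeat_reset(con);
}

// Racy order from the log: the fault arrives before set_priv().
bool fault_before_attach_is_handled() {
  Connection con;
  bool handled = fault(con);  // no session yet, so the reset is dropped
  con.set_priv(std::make_shared<HeartbeatSession>(HeartbeatSession{7}));
  return handled;
}

// Safe order: the session is attached before any fault can be delivered.
bool attach_before_fault_is_handled() {
  Connection con;
  con.set_priv(std::make_shared<HeartbeatSession>(HeartbeatSession{7}));
  return fault(con);
}

// Assumed mitigation: after attaching, recheck the connection and run the
// reset path by hand if a fault was missed in the window.
bool recheck_after_attach_is_handled() {
  Connection con;
  fault(con);  // missed, as in the racy case
  con.set_priv(std::make_shared<HeartbeatSession>(HeartbeatSession{7}));
  if (!con.connected)
    return heartbeat_reset(con);
  return true;
}
```

In this model only the first scenario loses the reset: the peer ends up holding a dead connection with a session attached, and no further reset will arrive for it.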
Updated by Sage Weil over 5 years ago
- Status changed from In Progress to Fix Under Review
Updated by Greg Farnum over 5 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Patrick Donnelly over 5 years ago
- Copied to Backport #36636: luminous: osd: race condition opening heartbeat connection added
Updated by Patrick Donnelly over 5 years ago
- Copied to Backport #36637: mimic: osd: race condition opening heartbeat connection added
Updated by Nathan Cutler over 5 years ago
- Status changed from Pending Backport to Resolved