Bug #36175

msg/async: heartbeat timed out caused by connection register failure

Added by Yan Jun almost 2 years ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

We see some OSDs' heartbeats time out for a very long period (5 days or longer, until the OSD is restarted), as follows:

Sep 16 15:29:56 NFV-D-HEBAD-ZX-A-51B-S-DBS-ZX-00007 ceph-osd[26807]: 2018-09-16 15:29:56.002296 7ff9b6271700 -1 osd.62 2562 heartbeat_check: no reply from [fd00:0:1::16]:6837 osd.37 ever on either front or back, first ping sent 2018-09-11 23:01:50.773623 (oldest deadline 2018-09-11 23:02:10.773623)
Sep 16 15:29:57 NFV-D-HEBAD-ZX-A-51B-S-DBS-ZX-00007 ceph-osd[26807]: 2018-09-16 15:29:57.002549 7ff9b6271700 -1 osd.62 2562 heartbeat_check: no reply from [fd00:0:1::16]:6837 osd.37 ever on either front or back, first ping sent 2018-09-11 23:01:50.773623 (oldest deadline 2018-09-11 23:02:10.773623)
Sep 16 15:29:58 NFV-D-HEBAD-ZX-A-51B-S-DBS-ZX-00007 ceph-osd[26807]: 2018-09-16 15:29:58.002751 7ff9b6271700 -1 osd.62 2562 heartbeat_check: no reply from [fd00:0:1::16]:6837 osd.37 ever on either front or back, first ping sent 2018-09-11 23:01:50.773623 (oldest deadline 2018-09-11 23:02:10.773623)

and
2018-09-18 13:13:25.370087 7fd7fc5eb700 -1 osd.24 2678 heartbeat_check: no reply from [fd00:0:1::11]:6849 osd.34 ever on either front or back, first ping sent 2018-09-17 17:39:38.495959 (oldest deadline 2018-09-17 17:39:58.495959)
2018-09-18 13:13:26.370306 7fd7fc5eb700 -1 osd.24 2678 heartbeat_check: no reply from [fd00:0:1::11]:6849 osd.34 ever on either front or back, first ping sent 2018-09-17 17:39:38.495959 (oldest deadline 2018-09-17 17:39:58.495959)
2018-09-18 13:13:27.370603 7fd7fc5eb700 -1 osd.24 2678 heartbeat_check: no reply from [fd00:0:1::11]:6849 osd.34 ever on either front or back, first ping sent 2018-09-17 17:39:38.495959 (oldest deadline 2018-09-17 17:39:58.495959)

The cluster status is Health OK.

Taking a close look at the heartbeat peer's messenger logs, we found that the closed connection may still be in the registered conns map but not in the
deleted_conns set, which causes reconnection to fail every time. However, the root cause is still unknown.

    2018-09-18 12:42:53.202048 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> - conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is [fd00:0:1:48::15]:0/26815
    2018-09-18 12:42:53.202064 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_BANNER_ADDR
    2018-09-18 12:42:53.202092 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
    2018-09-18 12:42:53.202113 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
    2018-09-18 12:42:53.202126 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept got peer connect_seq 0 global_seq 4968
    2018-09-18 12:42:53.202142 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1)._process_connection accept of host_type 4, policy.lossy=1 policy.server=1 policy.standby=0 policy.resetcheck=0
    2018-09-18 12:42:53.202157 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept my proto 10, their proto 10
    2018-09-18 12:42:53.202172 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept setting up session_security.
    2018-09-18 12:42:53.202188 7f16fcf59700  1 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg existing already closed.
    2018-09-18 12:42:53.202200 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg accept success, connect_seq = 1 in_seq=0, sending READY
    2018-09-18 12:42:53.202214 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg accept features 2305244844532236283
    2018-09-18 12:42:53.202231 7f16fcf59700  1 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg existing race replacing process for addr=[fd00:0:1:48::15]:0/26815 just fail later one(this)
    2018-09-18 12:42:53.202246 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg accept fault after register
    2018-09-18 12:42:53.202258 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg failed to accept.
    2018-09-18 12:42:53.202273 7f16fcf59700  1 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).fault on lossy channel, failing
    2018-09-18 12:42:53.202283 7f16fcf59700  2 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1)._stop
    2018-09-18 12:42:53.202295 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).discard_out_queue started
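The failure mode the logs show (a closed connection left in the registered conns map but never queued in deleted_conns, so every later accept from that peer fails) can be sketched with a toy registry. This is a minimal model, not the real Ceph AsyncMessenger code; `Registry`, `Conn`, `accept_conn`, and `buggy_mark_down` are all illustrative names:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>

// Toy connection object: only tracks whether it has been stopped.
struct Conn {
    bool closed = false;
    void stop() { closed = true; }
};

using ConnRef = std::shared_ptr<Conn>;

// Simplified model of the messenger's connection bookkeeping.
struct Registry {
    std::map<std::string, ConnRef> conns;   // addr -> registered connection
    std::set<ConnRef> deleted_conns;        // stopped conns pending cleanup

    // Accept a new connection from `addr`. A stale registered entry can
    // only be replaced if it was also queued in deleted_conns; otherwise
    // the incoming accept loses the race and fails (cf. "existing race
    // replacing process ... just fail later one(this)" in the log).
    bool accept_conn(const std::string& addr, const ConnRef& nc) {
        auto it = conns.find(addr);
        if (it != conns.end()) {
            if (deleted_conns.erase(it->second)) {
                conns.erase(it);   // stale entry cleaned up, proceed
            } else {
                return false;      // existing conn still registered: fail
            }
        }
        conns[addr] = nc;
        return true;
    }

    // The suspected bad interleaving: the old conn gets stopped, but it
    // stays in `conns` and is never inserted into `deleted_conns`.
    void buggy_mark_down(const std::string& addr) {
        auto it = conns.find(addr);
        if (it != conns.end())
            it->second->stop();    // BUG: no deleted_conns insert, no erase
    }
};
```

Under this model, once `buggy_mark_down` runs, every subsequent `accept_conn` for the same address finds the closed-but-still-registered entry, misses it in `deleted_conns`, and fails, which matches the heartbeat connection never recovering until the OSD is restarted.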


Related issues

Duplicates Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered Resolved 01/02/2019

History

#1 Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Need More Info

Do you have the cluster status and logs for this failure?

#2 Updated by Yan Jun almost 2 years ago

The cluster status is Health OK.

It's not easy to reproduce and I have no more useful logs now.

#5 Updated by Yan Jun over 1 year ago

This issue has been resolved.

#6 Updated by Nathan Cutler over 1 year ago

  • Status changed from Need More Info to Duplicate

Marking duplicate of #37778 - correct me if I'm wrong!

#7 Updated by Nathan Cutler over 1 year ago

  • Duplicates Bug #37778: msg/async: mark_down vs accept race leaves connection registered added

#8 Updated by Greg Farnum over 1 year ago

  • Project changed from RADOS to Messengers
