Bug #36175

msg/async: heartbeat timed out caused by connection register failure

Added by Yan Jun almost 2 years ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

We see some OSDs' heartbeats time out for a very long period (5 days or longer, until the OSD is restarted), as follows:

Sep 16 15:29:56 NFV-D-HEBAD-ZX-A-51B-S-DBS-ZX-00007 ceph-osd[26807]: 2018-09-16 15:29:56.002296 7ff9b6271700 -1 osd.62 2562 heartbeat_check: no reply from [fd00:0:1::16]:6837 osd.37 ever on either front or back, first ping sent 2018-09-11 23:01:50.773623 (oldest deadline 2018-09-11 23:02:10.773623)
Sep 16 15:29:57 NFV-D-HEBAD-ZX-A-51B-S-DBS-ZX-00007 ceph-osd[26807]: 2018-09-16 15:29:57.002549 7ff9b6271700 -1 osd.62 2562 heartbeat_check: no reply from [fd00:0:1::16]:6837 osd.37 ever on either front or back, first ping sent 2018-09-11 23:01:50.773623 (oldest deadline 2018-09-11 23:02:10.773623)
Sep 16 15:29:58 NFV-D-HEBAD-ZX-A-51B-S-DBS-ZX-00007 ceph-osd[26807]: 2018-09-16 15:29:58.002751 7ff9b6271700 -1 osd.62 2562 heartbeat_check: no reply from [fd00:0:1::16]:6837 osd.37 ever on either front or back, first ping sent 2018-09-11 23:01:50.773623 (oldest deadline 2018-09-11 23:02:10.773623)

and
2018-09-18 13:13:25.370087 7fd7fc5eb700 -1 osd.24 2678 heartbeat_check: no reply from [fd00:0:1::11]:6849 osd.34 ever on either front or back, first ping sent 2018-09-17 17:39:38.495959 (oldest deadline 2018-09-17 17:39:58.495959)
2018-09-18 13:13:26.370306 7fd7fc5eb700 -1 osd.24 2678 heartbeat_check: no reply from [fd00:0:1::11]:6849 osd.34 ever on either front or back, first ping sent 2018-09-17 17:39:38.495959 (oldest deadline 2018-09-17 17:39:58.495959)
2018-09-18 13:13:27.370603 7fd7fc5eb700 -1 osd.24 2678 heartbeat_check: no reply from [fd00:0:1::11]:6849 osd.34 ever on either front or back, first ping sent 2018-09-17 17:39:38.495959 (oldest deadline 2018-09-17 17:39:58.495959)

The cluster status is Health OK.

Taking a close look at the heartbeat peer's messenger logs, we found that the closed connection may still be in the registered conns map but not in the
deleted_conns set, which causes reconnection to fail every time. However, the root cause is still unknown.

    2018-09-18 12:42:53.202048 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> - conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is [fd00:0:1:48::15]:0/26815
    2018-09-18 12:42:53.202064 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_BANNER_ADDR
    2018-09-18 12:42:53.202092 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
    2018-09-18 12:42:53.202113 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG
    2018-09-18 12:42:53.202126 7f16fcf59700 20 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept got peer connect_seq 0 global_seq 4968
    2018-09-18 12:42:53.202142 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1)._process_connection accept of host_type 4, policy.lossy=1 policy.server=1 policy.standby=0 policy.resetcheck=0
    2018-09-18 12:42:53.202157 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept my proto 10, their proto 10
    2018-09-18 12:42:53.202172 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept setting up session_security.
    2018-09-18 12:42:53.202188 7f16fcf59700  1 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg existing already closed.
    2018-09-18 12:42:53.202200 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg accept success, connect_seq = 1 in_seq=0, sending READY
    2018-09-18 12:42:53.202214 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg accept features 2305244844532236283
    2018-09-18 12:42:53.202231 7f16fcf59700  1 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg existing race replacing process for addr=[fd00:0:1:48::15]:0/26815 just fail later one(this)
    2018-09-18 12:42:53.202246 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg accept fault after register
    2018-09-18 12:42:53.202258 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).handle_connect_msg failed to accept.
    2018-09-18 12:42:53.202273 7f16fcf59700  1 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).fault on lossy channel, failing
    2018-09-18 12:42:53.202283 7f16fcf59700  2 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1)._stop
    2018-09-18 12:42:53.202295 7f16fcf59700 10 -- [fd00:0:1:48::11]:6847/2031400 >> [fd00:0:1:48::15]:0/26815 conn(0x7f1743347000 :6847 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=4968 cs=1 l=1).discard_out_queue started
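The failure mode the logs show (a closed connection left in the registered conns map but never queued in deleted_conns, so every later accept from that peer fails) can be sketched with a toy registry. This is a minimal model, not the real Ceph AsyncMessenger code; `Registry`, `Conn`, `accept_conn`, and `buggy_mark_down` are all illustrative names:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>

// Toy connection object: only tracks whether it has been stopped.
struct Conn {
    bool closed = false;
    void stop() { closed = true; }
};

using ConnRef = std::shared_ptr<Conn>;

// Simplified model of the messenger's connection bookkeeping.
struct Registry {
    std::map<std::string, ConnRef> conns;   // addr -> registered connection
    std::set<ConnRef> deleted_conns;        // stopped conns pending cleanup

    // Accept a new connection from `addr`. A stale registered entry can
    // only be replaced if it was also queued in deleted_conns; otherwise
    // the incoming accept loses the race and fails (cf. "existing race
    // replacing process ... just fail later one(this)" in the log).
    bool accept_conn(const std::string& addr, const ConnRef& nc) {
        auto it = conns.find(addr);
        if (it != conns.end()) {
            if (deleted_conns.erase(it->second)) {
                conns.erase(it);   // stale entry cleaned up, proceed
            } else {
                return false;      // existing conn still registered: fail
            }
        }
        conns[addr] = nc;
        return true;
    }

    // The suspected bad interleaving: the old conn gets stopped, but it
    // stays in `conns` and is never inserted into `deleted_conns`.
    void buggy_mark_down(const std::string& addr) {
        auto it = conns.find(addr);
        if (it != conns.end())
            it->second->stop();    // BUG: no deleted_conns insert, no erase
    }
};
```

Under this model, once `buggy_mark_down` runs, every subsequent `accept_conn` for the same address finds the closed-but-still-registered entry, misses it in `deleted_conns`, and fails, which matches the heartbeat connection never recovering until the OSD is restarted.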


Related issues

Duplicates Messengers - Bug #37778: msg/async: mark_down vs accept race leaves connection registered Resolved 01/02/2019

History

#1 Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Need More Info

Do you have the cluster status and logs for this failure?

#2 Updated by Yan Jun almost 2 years ago

The cluster status is Health OK.

It's not easy to reproduce and I have no more useful logs now.

#5 Updated by Yan Jun over 1 year ago

This issue has been resolved.

#6 Updated by Nathan Cutler over 1 year ago

  • Status changed from Need More Info to Duplicate

Marking duplicate of #37778 - correct me if I'm wrong!

#7 Updated by Nathan Cutler over 1 year ago

  • Duplicates Bug #37778: msg/async: mark_down vs accept race leaves connection registered added

#8 Updated by Greg Farnum over 1 year ago

  • Project changed from RADOS to Messengers
