Bug #44901

closed

luminous: OSDs keep going down because of heartbeat timeouts

Added by jack ma about 4 years ago. Updated about 4 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all, thanks for reading this report.

I have a Ceph cluster running v12.2.12. It ran well for about half a year.

Last week we added another two machines to the cluster, and then all the OSDs became unstable.

The OSD async messenger complains that the OSDs cannot heartbeat each other, yet pinging across the network shows no dropped and no errored packets.

I use bond0 for both the Ceph front (public) and back (cluster) networks. After I set the nodown and noout flags the cluster became stable, but the logs still show many async messenger errors. I also tried the simple messenger, and it produced the same errors.

All the OSDs log errors like the ones below:

2020-04-02 09:59:17.989469 7f42794da700 0 -- 10.255.255.54:6814/1000006 >> 10.255.255.56:0/7 conn(0x55721e0e5800 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.989557 7f42784d8700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.52:0/7 conn(0x55721e0e8800 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.989728 7f4278cd9700 0 -- 10.255.255.54:6814/1000006 >> 10.255.255.55:0/7 conn(0x55722973b000 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.989872 7f42794da700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.55:0/7 conn(0x557225b15000 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.990111 7f42794da700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.55:0/7 conn(0x557228506000 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.990161 7f42784d8700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.56:0/7 conn(0x557226320000 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.990196 7f42794da700 0 -- 10.255.255.54:6814/1000006 >> 10.255.255.56:0/7 conn(0x55722650b000 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.991450 7f4278cd9700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.55:0/7 conn(0x5572298d7800 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.991458 7f42784d8700 0 -- 10.255.255.54:6814/1000006 >> 10.255.255.52:0/7 conn(0x557226f19000 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.991639 7f4278cd9700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.52:0/7 conn(0x557226867800 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.991798 7f42794da700 0 -- 10.255.255.54:6814/1000006 >> 10.255.255.56:0/7 conn(0x55722a20b000 :6814 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2020-04-02 09:59:17.991842 7f42784d8700 0 -- 10.255.255.54:6819/1000006 >> 10.255.255.56:0/7 conn(0x557226869000 :6819 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)

The network config:
bond0 Link encap:Ethernet HWaddr 6c:92:bf:c2:8e:e5
inet6 addr: fe80::6e92:bfff:fec2:8ee5/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:126155520073 errors:0 dropped:3217298 overruns:0 frame:0
TX packets:133297822313 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:57485747361080 (57.4 TB) TX bytes:71267041966300 (71.2 TB)

bond0.38 Link encap:Ethernet HWaddr 6c:92:bf:c2:8e:e5
inet addr:192.168.38.54 Bcast:192.168.38.255 Mask:255.255.255.0
inet6 addr: fe80::6e92:bfff:fec2:8ee5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:60802363 errors:0 dropped:0 overruns:0 frame:0
TX packets:53614452 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:34857574617 (34.8 GB) TX bytes:23829455266 (23.8 GB)

bond0.4000 Link encap:Ethernet HWaddr 6c:92:bf:c2:8e:e5
inet addr:10.255.255.54 Bcast:10.255.255.63 Mask:255.255.255.192
inet6 addr: fe80::6e92:bfff:fec2:8ee5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:107628762285 errors:0 dropped:0 overruns:0 frame:0
TX packets:96091921746 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:54705054842566 (54.7 TB) TX bytes:68763270985565 (68.7 TB)

brq86d8e0ef-fa Link encap:Ethernet HWaddr 26:30:9e:96:7a:71
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:2512246 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:338012546 (338.0 MB) TX bytes:0 (0.0 B)

docker0 Link encap:Ethernet HWaddr 02:42:67:87:8d:a7
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

eno1 Link encap:Ethernet HWaddr 6c:92:bf:c2:8e:e5
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:62792632732 errors:0 dropped:614221 overruns:0 frame:0
TX packets:66647497482 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:28698305745018 (28.6 TB) TX bytes:35631476375125 (35.6 TB)

eno2 Link encap:Ethernet HWaddr 6c:92:bf:c2:8e:e5
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:63362887347 errors:0 dropped:633400 overruns:0 frame:0
TX packets:66650324833 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:28787441616499 (28.7 TB) TX bytes:35635565591656 (35.6 TB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:3272039707 errors:0 dropped:0 overruns:0 frame:0
TX packets:3272039707 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:753581347467 (753.5 GB) TX bytes:753581347467 (753.5 GB)

tap579ce88c-9e Link encap:Ethernet HWaddr fe:16:3e:32:3b:0d
inet6 addr: fe80::fc16:3eff:fe32:3b0d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:3322983 errors:0 dropped:0 overruns:0 frame:0
TX packets:3480283 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2221592585 (2.2 GB) TX bytes:1380408263 (1.3 GB)

tapa32c35b1-87 Link encap:Ethernet HWaddr fe:16:3e:79:65:9f
inet6 addr: fe80::fc16:3eff:fe79:659f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:22360161 errors:0 dropped:0 overruns:0 frame:0
TX packets:25585406 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3953751232 (3.9 GB) TX bytes:6985195478 (6.9 GB)

vxlan-100 Link encap:Ethernet HWaddr 26:30:9e:96:7a:71
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:7671113091 errors:0 dropped:0 overruns:0 frame:0
TX packets:6732694121 errors:0 dropped:17713 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2959527236402 (2.9 TB) TX bytes:973862583509 (973.8 GB)
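Incidentally, the ifconfig output above does show non-zero drop counters (3,217,298 RX drops on bond0, and more on both slaves), even though ping reported none. The raw kernel counters can be cross-checked directly from sysfs. A minimal sketch, assuming a Linux host; the interface names are the ones from this host and will differ elsewhere:

```python
from pathlib import Path

def rx_dropped(dev: str, base: str = "/sys/class/net") -> int:
    """Read the kernel's cumulative RX drop counter for one interface (Linux sysfs)."""
    return int(Path(base, dev, "statistics", "rx_dropped").read_text())

# Interface names taken from this host's config above; substitute your own.
for dev in ("bond0", "eno1", "eno2"):
    path = Path("/sys/class/net", dev, "statistics", "rx_dropped")
    if path.exists():
        print(f"{dev}: rx_dropped={int(path.read_text())}")
```

Sampling these counters twice a few seconds apart shows whether drops are still accumulating, which ping alone will not reveal.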

Actions #1

Updated by Brad Hubbard about 4 years ago

  • Status changed from New to Rejected

There is clearly an issue with your network, which is not a Ceph issue.

Actions #2

Updated by jack ma about 4 years ago

Solved it! It was because we deploy Ceph in Docker using kolla-ansible.

We started some containers by hand and missed some parameters.

As described here:

https://github.com/ceph/ceph-container/issues/436
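The linked issue concerns Ceph daemons run in containers without host networking: when an OSD container does not share the host's network namespace, the addresses the OSD registers with the monitors do not match what it actually listens on, and heartbeats fail. The following is only an illustrative config sketch of the kind of parameters a hand-started OSD container typically needs; the image name, tag, and bind mounts are assumptions, not taken from this deployment:

```shell
# Illustrative sketch only: image name and mounts are hypothetical.
# --net=host shares the host network namespace, so the ports the OSD binds
# to match the addresses it registers with the monitors; omitting it is a
# classic cause of heartbeat failures for containerized Ceph.
docker run -d \
  --net=host \
  --pid=host \
  --privileged \
  -v /etc/ceph:/etc/ceph \
  -v /var/lib/ceph:/var/lib/ceph \
  example/ceph-osd:example-tag
```

Comparing `docker inspect` output between a hand-started container and one started by kolla-ansible is a quick way to spot which parameters were missed.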
