Bug #2116

Repeated messages of "heartbeat_check: no heartbeat from"

Added by Wido den Hollander over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
Start date: 02/28/2012
Due date: -
% Done: 0%
Spent time: -
Source: Community (dev)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -

Description

As discussed on the mailing list, I gathered some logs.

Today I upgraded my whole cluster to 0.42.2 from 0.41.

Due to the on-disk format change, I reformatted my cluster and started it again.

Immediately after the OSDs started, the "no heartbeat from" messages began:

2012-02-28 16:30:31.951132    pg v708: 7920 pgs: 7920 active+clean; 8730 bytes data, 164 MB used, 74439 GB / 74520 GB avail
2012-02-28 16:30:31.980965   mds e4: 1/1/1 up {0=alpha=up:active}
2012-02-28 16:30:31.981010   osd e65: 40 osds: 40 up, 40 in
2012-02-28 16:30:31.981192   log 2012-02-28 16:30:30.344847 mon.0 [2a00:f10:11a:408::1]:6789/0 10728 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)
2012-02-28 16:30:31.981314   mon e1: 3 mons at {pri=[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0,sec=[2a00:f10:11a:408::1]:6789/0,third=[2a00:f10:11a:409::1]:6789/0}
2012-02-28 16:30:33.680286   log 2012-02-28 16:30:33.533593 mon.0 [2a00:f10:11a:408::1]:6789/0 10729 : [INF] osd.0 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6800/12003 failed (by osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/11917)
2012-02-28 16:30:34.705162   log 2012-02-28 16:30:33.709207 mon.0 [2a00:f10:11a:408::1]:6789/0 10730 : [INF] osd.6 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6806/14486 failed (by osd.9 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6809/19176)
2012-02-28 16:30:34.705162   log 2012-02-28 16:30:33.968388 mon.0 [2a00:f10:11a:408::1]:6789/0 10731 : [INF] osd.30 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6806/29719 failed (by osd.37 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6803/26068)
2012-02-28 16:30:35.737775   log 2012-02-28 16:30:35.346912 mon.0 [2a00:f10:11a:408::1]:6789/0 10732 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)

As you can see, the cluster is completely fresh. No data at all and no I/O load.

One of the things that came to mind was a clock issue, the clocks not being synchronized. I verified all the clocks and they are synchronized.

Information on all the OSDs:

Attached are my ceph.conf and a couple of hours of logging for osd.5 and osd.15.

ceph.conf - My ceph configuration (6.67 KB) Wido den Hollander, 02/28/2012 07:35 AM

osd.5.log_heartbeat.gz (28.8 MB) Wido den Hollander, 02/28/2012 07:35 AM

osd.15.log_heartbeat.gz (25.5 MB) Wido den Hollander, 02/28/2012 07:35 AM

osd.3.heartbeat.log.gz (32.2 MB) Wido den Hollander, 03/01/2012 02:48 AM

osd.8.heartbeat.log.gz (26.2 MB) Wido den Hollander, 03/01/2012 02:48 AM

History

#1 Updated by Sage Weil over 7 years ago

  • Target version set to v0.44

#2 Updated by Sage Weil over 7 years ago

  • Status changed from New to In Progress

looks like a msgr issue?

2012-02-28 12:47:45.167980 7fd09f94e700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 --> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 -- osd_ping(start_heartbeat e0 as_of 20) v1 -- ?+0 0x237cc40 con 0x2181500
2012-02-28 12:47:45.839282 7fd09f94e700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 --> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 -- osd_ping(start_heartbeat e0 as_of 25) v1 -- ?+0 0x237cc40 con 0x21cea00
2012-02-28 12:47:45.855370 7fd097b1d700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 pipe(0x2228500 sd=41 pgs=24 cs=1 l=0).fault initiating reconnect
2012-02-28 12:47:45.871679 7fd098729700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 pipe(0x2228500 sd=41 pgs=24 cs=2 l=0).connect got RESETSESSION
2012-02-28 13:02:45.941694 7fd095e00700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 pipe(0x2228500 sd=41 pgs=26 cs=1 l=0).fault with nothing to send, going to standby

#3 Updated by Sage Weil over 7 years ago

the other side of this conversation is


2012-02-28 12:47:45.200092 7f7097ca1700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 <== osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 1 ==== osd_ping(start_heartbeat e0 as_of 20) v1 ==== 39+0+0 (502699749 0 0) 0x2ce78c0 con 0x26a2780
2012-02-28 12:47:45.228604 7f7090064700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x274ea00 sd=44 pgs=11 cs=1 l=0).fault with nothing to send, going to standby
2012-02-28 12:47:45.630629 7f7093b98700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 --> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 -- osd_ping(heartbeat e0 as_of 20) v1 -- ?+0 0x2d60e00 con 0x26a2780
2012-02-28 12:47:45.847321 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b000 sd=60 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 2 state 1
2012-02-28 12:47:45.847411 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b000 sd=60 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
2012-02-28 12:47:45.848900 7f7097ca1700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 <== osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 1 ==== osd_ping(start_heartbeat e0 as_of 25) v1 ==== 39+0+0 (2424295025 0 0) 0x2d60e00 con 0x2e8ec80
2012-02-28 12:47:45.870002 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b280 sd=56 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2012-02-28 13:02:45.940175 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b280 sd=56 pgs=24 cs=1 l=0).fault with nothing to send, going to standby
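
Read together, the two logs show osd.15 reconnecting while osd.5 has already dropped its side of the session, so osd.5 answers with RESETSESSION and osd.15 starts over. The sketch below is only a rough, self-contained illustration of that accept-side decision, not the actual SimpleMessenger code; the names (ConnectMsg, ReplyTag, handle_accept) are invented for the example.

    // Illustrative only: the peer claims an ongoing session (non-zero
    // connect_seq) that this side no longer has, so we answer RESETSESSION
    // ("accept we reset (peer sent cseq 2), sending RESETSESSION") and the
    // peer restarts from connect_seq 0 ("connect got RESETSESSION").
    #include <cstdint>
    #include <cstdio>

    enum class ReplyTag { Ready, ResetSession, RetrySession };

    struct ConnectMsg {
      uint32_t connect_seq;  // the peer's idea of how far the session got
    };

    ReplyTag handle_accept(const ConnectMsg& connect,
                           bool have_existing_session,
                           uint32_t existing_seq) {
      if (!have_existing_session && connect.connect_seq > 0)
        return ReplyTag::ResetSession;  // we lost or reset the session
      if (have_existing_session && connect.connect_seq < existing_seq)
        return ReplyTag::RetrySession;  // peer is behind the live session
      return ReplyTag::Ready;
    }

    int main() {
      // Mirrors the logged case: peer sent cseq 2, we have no session for it.
      std::printf("%d\n",
                  static_cast<int>(handle_accept(ConnectMsg{2}, false, 0)));
    }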

#4 Updated by Greg Farnum over 7 years ago

This may be a messenger issue, but it's not losing that initial message; notice how osd.5 tries to send a ping back to osd.15? For some reason that apparently doesn't trigger a reconnect, which is odd...
Also the bit where osd.5 is apparently getting out-of-order connects...

#5 Updated by Sage Weil over 7 years ago

Wido, are you able to reproduce this reliably? I have an idea what the problem is, but have never reproduced this. Hopefully you can verify it goes away?

#6 Updated by Sage Weil over 7 years ago

  • Status changed from In Progress to Feedback

i'm hoping wip-2116 fixes it...

#7 Updated by Wido den Hollander over 7 years ago

I can almost always reproduce it.

I just upgraded my cluster to:

ceph version 0.42.2-206-gd77c579 (commit:d77c5796fc898c8fdf6b38fe887c5c62e366023e)

It all worked fine for about 20 minutes and then I started seeing this again:

2012-03-01 11:44:10.083448   log 2012-03-01 11:44:09.957804 mon.0 [2a00:f10:11a:408::1]:6789/0 132126 : [INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6818/4327 failed (by osd.3 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/20053)
2012-03-01 11:44:10.083448   log 2012-03-01 11:44:09.958113 mon.0 [2a00:f10:11a:408::1]:6789/0 132127 : [INF] osd.39 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6820/11945 failed (by osd.3 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/20053)

I've attached the logs of osd.3 and osd.8 again; hope that helps.

2012-03-01 11:46:09.729570 7f14027af700 osd.3 360 heartbeat_check: no heartbeat from osd.8 since 2012-03-01 10:51:46.452920 (cutoff 2012-03-01 11:45:49.729539)
2012-03-01 11:46:10.014949 7f140f8ca700 osd.3 360 heartbeat_check: no heartbeat from osd.8 since 2012-03-01 10:51:46.452920 (cutoff 2012-03-01 11:45:50.014920)
2012-03-01 11:46:10.015179 7f140f8ca700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/20053 --> [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6818/4327 e360 v360) v2 -- ?+0 0x269d000 con 0x1894780
2012-03-01 11:46:10.432683 7f14027af700 osd.3 360 heartbeat_check: no heartbeat from osd.8 since 2012-03-01 10:51:46.452920 (cutoff 2012-03-01 11:45:50.432651)
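
For anyone reading along, the heartbeat_check lines can be decoded as follows: the cutoff is simply "now" minus the heartbeat grace period (the timestamps above imply the 20-second default of osd heartbeat grace), and any peer whose last received ping is older than the cutoff gets reported to the monitor as failed. A minimal, illustrative sketch of that check, not the actual OSD code:

    #include <cstdio>
    #include <ctime>
    #include <map>
    #include <vector>

    struct HeartbeatCheck {
      std::map<int, std::time_t> last_rx;  // peer osd id -> last ping received
      double grace = 20.0;                 // seconds ("osd heartbeat grace")

      // Peers not heard from since the cutoff (now - grace).
      std::vector<int> failed_peers(std::time_t now) const {
        std::vector<int> failed;
        const std::time_t cutoff = now - static_cast<std::time_t>(grace);
        for (const auto& [peer, last] : last_rx) {
          if (last < cutoff)
            failed.push_back(peer);  // would be reported via an osd_failure message
        }
        return failed;
      }
    };

    int main() {
      HeartbeatCheck hb;
      const std::time_t now = std::time(nullptr);
      hb.last_rx[8] = now - 3300;  // roughly the ~55-minute gap seen in the log
      for (int osd : hb.failed_peers(now))
        std::printf("no heartbeat from osd.%d\n", osd);
    }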

I'll leave the cluster running like this; that will probably keep it in this state. If any further logging is needed, I can turn it on.

#8 Updated by Sage Weil over 7 years ago

OK, I have a theory about what's going on. Can you try the new wip-2116 and run with debug ms = 20?

thanks!
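
For reference, one way to get the logging Sage asks for is through ceph.conf; this is just a sketch, assuming the setting is added to the [osd] section of the attached configuration and the OSDs are restarted afterwards:

    [osd]
        debug ms = 20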

#9 Updated by Sage Weil over 7 years ago

  • Priority changed from Normal to High

#10 Updated by Wido den Hollander over 7 years ago

I tried to compile the wip-2116 and master branches today, but that failed, so I haven't been able to test the new code yet.

I'll try again tomorrow and see if I get it working.

#11 Updated by Wido den Hollander over 7 years ago

I ran multiple I/O loads on the cluster last night and performed multiple tests, but this problem hasn't come back yet.

I wouldn't close this issue yet; let's wait a couple of days to see if it comes back.

#12 Updated by Wido den Hollander over 7 years ago

I'm not willing to say it's completely gone, but it is starting to seem like it.

In the past few days the cluster has been running fine with a moderate I/O load, no issues at all.

For now we could close this issue; I'll re-open it when I see it again.

#13 Updated by Sage Weil over 7 years ago

Hrm... well, if that's the case then we should merge wip-2116 as is... putting that on my list

#14 Updated by Sage Weil over 7 years ago

see new wip-osd-hb branch

#15 Updated by Sage Weil over 7 years ago

  • Status changed from Feedback to Resolved
