Bug #2116


Repeated messages of "heartbeat_check: no heartbeat from"

Added by Wido den Hollander about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
% Done:
0%

Source:
Community (dev)

Description

As discussed on the mailing list, I gathered some logs.

Today I upgraded my whole cluster from 0.41 to 0.42.2.

Because of the on-disk format change I reformatted my cluster and started it again.

Immediately after the OSDs started, the "no heartbeat from" messages began:

2012-02-28 16:30:31.951132    pg v708: 7920 pgs: 7920 active+clean; 8730 bytes data, 164 MB used, 74439 GB / 74520 GB avail
2012-02-28 16:30:31.980965   mds e4: 1/1/1 up {0=alpha=up:active}
2012-02-28 16:30:31.981010   osd e65: 40 osds: 40 up, 40 in
2012-02-28 16:30:31.981192   log 2012-02-28 16:30:30.344847 mon.0 [2a00:f10:11a:408::1]:6789/0 10728 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)
2012-02-28 16:30:31.981314   mon e1: 3 mons at {pri=[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0,sec=[2a00:f10:11a:408::1]:6789/0,third=[2a00:f10:11a:409::1]:6789/0}
2012-02-28 16:30:33.680286   log 2012-02-28 16:30:33.533593 mon.0 [2a00:f10:11a:408::1]:6789/0 10729 : [INF] osd.0 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6800/12003 failed (by osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/11917)
2012-02-28 16:30:34.705162   log 2012-02-28 16:30:33.709207 mon.0 [2a00:f10:11a:408::1]:6789/0 10730 : [INF] osd.6 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6806/14486 failed (by osd.9 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6809/19176)
2012-02-28 16:30:34.705162   log 2012-02-28 16:30:33.968388 mon.0 [2a00:f10:11a:408::1]:6789/0 10731 : [INF] osd.30 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6806/29719 failed (by osd.37 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6803/26068)
2012-02-28 16:30:35.737775   log 2012-02-28 16:30:35.346912 mon.0 [2a00:f10:11a:408::1]:6789/0 10732 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)

As you can see, the cluster is completely fresh: no data at all and no I/O load.

One of the things that came to mind was a clock issue, i.e. the clocks not being synchronized, but I verified all the clocks and they are synchronized.

Information on all the OSDs:

Attached are my ceph.conf and a couple of hours of logging for osd.5 and osd.15.
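
For context on how quickly such failure reports can appear: the cutoff timestamps in the heartbeat_check lines further down in this ticket sit exactly 20 seconds behind the check time, which matches the default osd heartbeat grace. A minimal ceph.conf sketch of the relevant tunables follows; the option names are the stock ones, and the values are illustrative defaults rather than settings taken from the attached ceph.conf:

[osd]
    ; how often each OSD pings its heartbeat peers (illustrative value)
    osd heartbeat interval = 6
    ; how long to wait for a reply before reporting a peer to the monitors
    osd heartbeat grace = 20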


Files

ceph.conf (6.67 KB) ceph.conf My ceph configuration Wido den Hollander, 02/28/2012 07:35 AM
osd.5.log_heartbeat.gz (28.8 MB) osd.5.log_heartbeat.gz Wido den Hollander, 02/28/2012 07:35 AM
osd.15.log_heartbeat.gz (25.5 MB) osd.15.log_heartbeat.gz Wido den Hollander, 02/28/2012 07:35 AM
osd.3.heartbeat.log.gz (32.2 MB) osd.3.heartbeat.log.gz Wido den Hollander, 03/01/2012 02:48 AM
osd.8.heartbeat.log.gz (26.2 MB) osd.8.heartbeat.log.gz Wido den Hollander, 03/01/2012 02:48 AM
Actions #1

Updated by Sage Weil about 12 years ago

  • Target version set to v0.44
Actions #2

Updated by Sage Weil about 12 years ago

  • Status changed from New to In Progress

Looks like a msgr issue?

2012-02-28 12:47:45.167980 7fd09f94e700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 --> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 -- osd_ping(start_heartbeat e0 as_of 20) v1 -- ?+0 0x237cc40 con 0x2181500
2012-02-28 12:47:45.839282 7fd09f94e700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 --> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 -- osd_ping(start_heartbeat e0 as_of 25) v1 -- ?+0 0x237cc40 con 0x21cea00
2012-02-28 12:47:45.855370 7fd097b1d700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 pipe(0x2228500 sd=41 pgs=24 cs=1 l=0).fault initiating reconnect
2012-02-28 12:47:45.871679 7fd098729700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 pipe(0x2228500 sd=41 pgs=24 cs=2 l=0).connect got RESETSESSION
2012-02-28 13:02:45.941694 7fd095e00700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 pipe(0x2228500 sd=41 pgs=26 cs=1 l=0).fault with nothing to send, going to standby

Actions #3

Updated by Sage Weil about 12 years ago

The other side of this conversation is:


2012-02-28 12:47:45.200092 7f7097ca1700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 <== osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 1 ==== osd_ping(start_heartbeat e0 as_of 20) v1 ==== 39+0+0 (502699749 0 0) 0x2ce78c0 con 0x26a2780
2012-02-28 12:47:45.228604 7f7090064700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x274ea00 sd=44 pgs=11 cs=1 l=0).fault with nothing to send, going to standby
2012-02-28 12:47:45.630629 7f7093b98700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 --> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 -- osd_ping(heartbeat e0 as_of 20) v1 -- ?+0 0x2d60e00 con 0x26a2780
2012-02-28 12:47:45.847321 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b000 sd=60 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 2 state 1
2012-02-28 12:47:45.847411 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b000 sd=60 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
2012-02-28 12:47:45.848900 7f7097ca1700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 <== osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 1 ==== osd_ping(start_heartbeat e0 as_of 25) v1 ==== 39+0+0 (2424295025 0 0) 0x2d60e00 con 0x2e8ec80
2012-02-28 12:47:45.870002 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b280 sd=56 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2012-02-28 13:02:45.940175 7f708eb4f700 -- [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/14395 >> [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:0/12130 pipe(0x282b280 sd=56 pgs=24 cs=1 l=0).fault with nothing to send, going to standby

Actions #4

Updated by Greg Farnum about 12 years ago

This may be a messenger issue, but it's not losing that initial message; notice how osd.5 tries to send a ping back to osd.15? For some reason that apparently doesn't trigger a reconnect, though, which is odd...
Also note the bit where osd.5 is apparently getting out-of-order connects...
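
To make that race concrete, here is a minimal sketch, not the actual SimpleMessenger code, of the accept-side connect_seq decision that produces the "replacing" and "sending RESETSESSION" lines above. The enum and function names (AcceptAction, handle_accept) are illustrative assumptions:

#include <cstdint>
#include <iostream>

// Illustrative outcomes of accepting an incoming connection for a peer
// we already have (or had) a pipe to.
enum class AcceptAction { Replace, ResetSession, RetrySession };

// existing_cseq: connect_seq of the pipe we already hold for this peer
// peer_cseq:     connect_seq carried by the incoming connection attempt
AcceptAction handle_accept(uint32_t existing_cseq, uint32_t peer_cseq) {
  if (peer_cseq == 0 && existing_cseq > 0) {
    // Peer lost its session state and is starting over:
    // "accept peer reset, then tried to connect to us, replacing"
    return AcceptAction::Replace;
  }
  if (peer_cseq > 0 && existing_cseq == 0) {
    // We dropped our state but the peer still thinks the session exists:
    // "accept we reset (peer sent cseq 2), sending RESETSESSION"
    return AcceptAction::ResetSession;
  }
  // Otherwise let the side that is further ahead win.
  return peer_cseq >= existing_cseq ? AcceptAction::Replace
                                    : AcceptAction::RetrySession;
}

int main() {
  // The two situations visible in the osd.5 log above.
  std::cout << int(handle_accept(2, 0)) << "\n";  // 0 = Replace
  std::cout << int(handle_accept(0, 2)) << "\n";  // 1 = ResetSession
}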

Actions #5

Updated by Sage Weil about 12 years ago

Wido, are you able to reproduce this reliably? I have an idea of what the problem is, but have never reproduced it myself. Hopefully you can verify whether it goes away?

Actions #6

Updated by Sage Weil about 12 years ago

  • Status changed from In Progress to 4

I'm hoping wip-2116 fixes it...

Actions #7

Updated by Wido den Hollander about 12 years ago

I can almost always reproduce it.

I just upgraded my cluster to:

ceph version 0.42.2-206-gd77c579 (commit:d77c5796fc898c8fdf6b38fe887c5c62e366023e)

It all worked fine for about 20 minutes and then I started seeing this again:

2012-03-01 11:44:10.083448   log 2012-03-01 11:44:09.957804 mon.0 [2a00:f10:11a:408::1]:6789/0 132126 : [INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6818/4327 failed (by osd.3 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/20053)
2012-03-01 11:44:10.083448   log 2012-03-01 11:44:09.958113 mon.0 [2a00:f10:11a:408::1]:6789/0 132127 : [INF] osd.39 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6820/11945 failed (by osd.3 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/20053)

I've attached the logs of osd.3 and osd.8 again; I hope that helps.

2012-03-01 11:46:09.729570 7f14027af700 osd.3 360 heartbeat_check: no heartbeat from osd.8 since 2012-03-01 10:51:46.452920 (cutoff 2012-03-01 11:45:49.729539)
2012-03-01 11:46:10.014949 7f140f8ca700 osd.3 360 heartbeat_check: no heartbeat from osd.8 since 2012-03-01 10:51:46.452920 (cutoff 2012-03-01 11:45:50.014920)
2012-03-01 11:46:10.015179 7f140f8ca700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/20053 --> [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6818/4327 e360 v360) v2 -- ?+0 0x269d000 con 0x1894780
2012-03-01 11:46:10.432683 7f14027af700 osd.3 360 heartbeat_check: no heartbeat from osd.8 since 2012-03-01 10:51:46.452920 (cutoff 2012-03-01 11:45:50.432651)

I'll leave the cluster running like this; that will probably keep it in this state. If any further logging is needed, I can turn it on.
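
For reference, the cutoff in the heartbeat_check lines above is simply the current time minus the heartbeat grace (20 seconds here). A minimal sketch of that check, using illustrative names (HeartbeatPeer, last_rx, report_failure) rather than the actual OSD code:

#include <chrono>
#include <iostream>
#include <map>

using Clock = std::chrono::system_clock;

struct HeartbeatPeer {
  Clock::time_point last_rx;  // last time a ping reply arrived from this peer
};

// Hypothetical stand-in for sending an osd_failure report to the monitors.
void report_failure(int osd) {
  std::cout << "reporting osd." << osd << " failed to mon\n";
}

void heartbeat_check(const std::map<int, HeartbeatPeer>& peers,
                     std::chrono::seconds grace) {
  const auto cutoff = Clock::now() - grace;  // anything older than this is stale
  for (const auto& [osd, peer] : peers) {
    if (peer.last_rx < cutoff) {
      // Corresponds to: "heartbeat_check: no heartbeat from osd.N since ..."
      std::cout << "heartbeat_check: no heartbeat from osd." << osd << "\n";
      report_failure(osd);
    }
  }
}

int main() {
  std::map<int, HeartbeatPeer> peers;
  peers[8] = {Clock::now() - std::chrono::minutes(55)};  // stale, like osd.8 above
  peers[3] = {Clock::now()};                             // healthy peer
  heartbeat_check(peers, std::chrono::seconds(20));      // 20-second grace
}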

Actions #8

Updated by Sage Weil about 12 years ago

OK, I have a theory about what's going on. Can you try the new wip-2116 and run with debug ms = 20?

thanks!
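
For anyone following along, a minimal sketch of how that could be set in the attached ceph.conf; the placement under [osd] is an assumption, and the daemons need to pick the change up (for example by restarting them):

[osd]
    ; verbose messenger logging as requested above
    debug ms = 20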

Actions #9

Updated by Sage Weil about 12 years ago

  • Priority changed from Normal to High
Actions #10

Updated by Wido den Hollander about 12 years ago

I tried to compile the wip-2116 and master branches today, but that failed, so I haven't been able to test the new code yet.

I'll try again tomorrow and see if I can get it working.

Actions #11

Updated by Wido den Hollander about 12 years ago

I ran multiple I/O loads on the cluster last night and performed multiple tests, but the problem hasn't come back yet.

I wouldn't close this issue yet; let's wait a couple of days to see if it comes back.

Actions #12

Updated by Wido den Hollander about 12 years ago

I'm not willing to say it's completely gone, but it is starting to seem like it.

In the past few days the cluster has been running fine with a moderate I/O load, no issues at all.

For now we can close this issue; I'll re-open it if I see it again.

Actions #13

Updated by Sage Weil about 12 years ago

Hrm... well, if that's the case then we should merge wip-2116 as is... putting that on my list.

Actions #14

Updated by Sage Weil about 12 years ago

See the new wip-osd-hb branch.

Actions #15

Updated by Sage Weil about 12 years ago

  • Status changed from 4 to Resolved