Bug #2116

Repeated messages of "heartbeat_check: no heartbeat from"

Added by Wido den Hollander about 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
% Done: 0%
Source: Community (dev)

Description

As discussed on the mailing list, I gathered some logs.

Today I upgraded my whole cluster from 0.41 to 0.42.2.

Due to the on-disk format change, I reformatted my cluster and started it again.

Immediately after the OSDs started, the "no heartbeat from" messages began:

2012-02-28 16:30:31.951132    pg v708: 7920 pgs: 7920 active+clean; 8730 bytes data, 164 MB used, 74439 GB / 74520 GB avail
2012-02-28 16:30:31.980965   mds e4: 1/1/1 up {0=alpha=up:active}
2012-02-28 16:30:31.981010   osd e65: 40 osds: 40 up, 40 in
2012-02-28 16:30:31.981192   log 2012-02-28 16:30:30.344847 mon.0 [2a00:f10:11a:408::1]:6789/0 10728 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)
2012-02-28 16:30:31.981314   mon e1: 3 mons at {pri=[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0,sec=[2a00:f10:11a:408::1]:6789/0,third=[2a00:f10:11a:409::1]:6789/0}
2012-02-28 16:30:33.680286   log 2012-02-28 16:30:33.533593 mon.0 [2a00:f10:11a:408::1]:6789/0 10729 : [INF] osd.0 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6800/12003 failed (by osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/11917)
2012-02-28 16:30:34.705162   log 2012-02-28 16:30:33.709207 mon.0 [2a00:f10:11a:408::1]:6789/0 10730 : [INF] osd.6 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6806/14486 failed (by osd.9 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6809/19176)
2012-02-28 16:30:34.705162   log 2012-02-28 16:30:33.968388 mon.0 [2a00:f10:11a:408::1]:6789/0 10731 : [INF] osd.30 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6806/29719 failed (by osd.37 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6803/26068)
2012-02-28 16:30:35.737775   log 2012-02-28 16:30:35.346912 mon.0 [2a00:f10:11a:408::1]:6789/0 10732 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)

As you can see, the cluster is completely fresh: no data at all and no I/O load.
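For context on where this message comes from: each OSD periodically pings its heartbeat peers and reports any peer whose last reply is older than a grace period. The sketch below is a minimal illustration of that check, not Ceph's actual implementation; the peer map, the 20-second grace (cf. the "osd heartbeat grace" option), and the sample values in main() are assumptions for illustration.

// Minimal sketch (NOT Ceph's actual code) of the kind of check that
// produces "heartbeat_check: no heartbeat from": each OSD remembers
// when it last heard from each peer and flags peers that have been
// silent longer than a grace period.
#include <chrono>
#include <iostream>
#include <map>

using Clock = std::chrono::steady_clock;

struct HeartbeatState {
  std::map<int, Clock::time_point> last_rx; // peer osd id -> last reply time
  std::chrono::seconds grace{20};           // cf. "osd heartbeat grace"

  // Called periodically; reports peers that have gone silent. In Ceph
  // the OSD would then report the peer as failed to the monitor, which
  // yields the "[INF] osd.N ... failed (by osd.M ...)" lines above.
  void heartbeat_check(Clock::time_point now) const {
    for (const auto& [osd, t] : last_rx) {
      if (now - t > grace)
        std::cout << "heartbeat_check: no heartbeat from osd." << osd << "\n";
    }
  }
};

int main() {
  HeartbeatState hb;
  const auto now = Clock::now();
  hb.last_rx[5]  = now - std::chrono::seconds(30); // silent too long -> reported
  hb.last_rx[15] = now;                            // healthy -> stays quiet
  hb.heartbeat_check(now);
}

On an idle, freshly formatted cluster a check like this should never fire, so the failure reports presumably point at the heartbeat traffic itself rather than at load.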

One thing that came to mind was a clock issue, i.e. clocks not being synchronized, but I verified the clocks on all machines and they are synchronized.

Information on all the OSDs:

Attached are my ceph.conf and a couple of hours of logs for osd.5 and osd.15.
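For completeness, the heartbeat-related knobs live in the [osd] section of ceph.conf. The snippet below shows the shape of such a section; the values are illustrative assumptions, not the ones from my attached file.

[osd]
        ; how often each OSD pings its heartbeat peers (illustrative value)
        osd heartbeat interval = 1
        ; seconds without a reply before a peer is reported as failed
        osd heartbeat grace = 20
        ; verbose logging of the kind used to capture the attached heartbeat logs
        debug osd = 20
        debug ms = 1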


Files

ceph.conf (6.67 KB) - My ceph configuration - Wido den Hollander, 02/28/2012 07:35 AM
osd.5.log_heartbeat.gz (28.8 MB) - Wido den Hollander, 02/28/2012 07:35 AM
osd.15.log_heartbeat.gz (25.5 MB) - Wido den Hollander, 02/28/2012 07:35 AM
osd.3.heartbeat.log.gz (32.2 MB) - Wido den Hollander, 03/01/2012 02:48 AM
osd.8.heartbeat.log.gz (26.2 MB) - Wido den Hollander, 03/01/2012 02:48 AM