Bug #2116 (Closed)
Repeated messages of "heartbeat_check: no heartbeat from"
Source:
Community (dev)
Description
As discussed on the mailing list, I gathered some logs.
Today I upgraded my whole cluster from 0.41 to 0.42.2.
Due to the on-disk format change, I reformatted my cluster and started it again.
Immediately after the OSDs started, the "no heartbeat from" messages began:
2012-02-28 16:30:31.951132 pg v708: 7920 pgs: 7920 active+clean; 8730 bytes data, 164 MB used, 74439 GB / 74520 GB avail
2012-02-28 16:30:31.980965 mds e4: 1/1/1 up {0=alpha=up:active}
2012-02-28 16:30:31.981010 osd e65: 40 osds: 40 up, 40 in
2012-02-28 16:30:31.981192 log 2012-02-28 16:30:30.344847 mon.0 [2a00:f10:11a:408::1]:6789/0 10728 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)
2012-02-28 16:30:31.981314 mon e1: 3 mons at {pri=[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0,sec=[2a00:f10:11a:408::1]:6789/0,third=[2a00:f10:11a:409::1]:6789/0}
2012-02-28 16:30:33.680286 log 2012-02-28 16:30:33.533593 mon.0 [2a00:f10:11a:408::1]:6789/0 10729 : [INF] osd.0 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6800/12003 failed (by osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/11917)
2012-02-28 16:30:34.705162 log 2012-02-28 16:30:33.709207 mon.0 [2a00:f10:11a:408::1]:6789/0 10730 : [INF] osd.6 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6806/14486 failed (by osd.9 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6809/19176)
2012-02-28 16:30:34.705162 log 2012-02-28 16:30:33.968388 mon.0 [2a00:f10:11a:408::1]:6789/0 10731 : [INF] osd.30 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6806/29719 failed (by osd.37 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6803/26068)
2012-02-28 16:30:35.737775 log 2012-02-28 16:30:35.346912 mon.0 [2a00:f10:11a:408::1]:6789/0 10732 : [INF] osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6803/14395 failed (by osd.15 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6809/12129)
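The monitor log entries above all follow the fixed pattern `[INF] osd.X <addr> failed (by osd.Y <addr>)`, so it is easy to tally which OSDs are reporting failures against which. A minimal sketch (the function and variable names here are my own, not part of any Ceph tooling):

```python
import re
from collections import Counter

# Matches monitor log entries such as:
#   [INF] osd.5 [2a00:...]:6803/14395 failed (by osd.15 [2a00:...]:6809/12129)
FAIL_RE = re.compile(r"\[INF\] (osd\.\d+) \S+ failed \(by (osd\.\d+) ")

def tally_failures(log_text: str) -> Counter:
    """Count (failed_osd, reporting_osd) pairs seen in the monitor log."""
    return Counter(FAIL_RE.findall(log_text))

sample = ('[INF] osd.5 [2a00:f10::1]:6803/14395 failed '
          '(by osd.15 [2a00:f10::2]:6809/12129)')
counts = tally_failures(sample)  # Counter({('osd.5', 'osd.15'): 1})
```

If a small set of pairs dominates the counts, the problem may be localized to particular hosts or links rather than cluster-wide.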
As you can see, the cluster is completely fresh. No data at all and no I/O load.
One of the things that came to mind was a clock issue, with the nodes not being synchronized, but I verified all the clocks and they are in sync.
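For context, the check behind these messages is essentially a staleness test: each OSD tracks when it last heard a heartbeat from each peer and flags peers that have been silent longer than a grace period, which is why a clock jump can trigger it. A much-simplified Python sketch of that idea (not Ceph's actual C++ implementation; the 20-second value is an assumption standing in for the `osd heartbeat grace` setting):

```python
HEARTBEAT_GRACE = 20.0  # seconds; assumed stand-in for `osd heartbeat grace`

def heartbeat_check(last_rx: dict, now: float,
                    grace: float = HEARTBEAT_GRACE) -> list:
    """Return the peer OSDs whose last heartbeat is older than `grace`.

    last_rx maps a peer name to the timestamp of its last heartbeat.
    """
    return [osd for osd, t in last_rx.items() if now - t > grace]

# osd.5 last replied 25 s ago (stale), osd.15 replied 3 s ago (fine)
stale = heartbeat_check({"osd.5": 975.0, "osd.15": 997.0}, now=1000.0)
```

In this model, a clock that jumps forward makes `now - t` spike for every peer at once, which would produce exactly this flood of "no heartbeat from" reports; with the clocks verified, that cause is ruled out here.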
Information on all the OSDs:
- ceph version 0.42.2 (732f3ec94e39d458230b7728b2a936d431e19322)
- Kernel: 3.2.0
- Memory: 4GB
- CPU: Atom D525
- Disks: 2TB Seagate/WD
Attached are my ceph.conf and a couple of hours of logs for osd.5 and osd.15.