Bug #5172

closed

wrongly marked down heartbeat issues

Added by Samuel Just almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Category: OSD
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

ubuntu@teuthology:/a/samuelj-2013-05-23_10:20:47-rados-wip_osd_throttle-master-basic/20593/remote

Despite the branch name, this run tested master (the sha1s in the ceph logs are from master).

OSD logging was turned on.

Actions #1

Updated by Samuel Just almost 11 years ago

2013-05-23 10:35:22.200882 7fe1668a1700 1 -- 10.214.131.15:0/14951 <== osd.4 10.214.131.14:6803/14730 11 ==== osd_ping(ping_reply e6 stamp 2013-05-23 10:35:22.200152) v2 ==== 47+0+0 (3711961763 0 0) 0x34608c0 con 0x3690c60
2013-05-23 10:35:22.200924 7fe1668a1700 20 osd.0 6 _share_map_outgoing 0x357a840 already has epoch 6
2013-05-23 10:35:22.200932 7fe1668a1700 1 -- 10.214.131.15:0/14951 <== osd.5 10.214.131.14:6806/14732 12 ==== osd_ping(ping_reply e8 stamp 2013-05-23 10:35:22.200152) v2 ==== 47+0+0 (7380999 0 0) 0x337c380 con 0x32c8160
2013-05-23 10:35:22.200946 7fe1668a1700 20 osd.0 6 _share_map_outgoing 0x32ea160 already has epoch 8
2013-05-23 10:35:22.200950 7fe1668a1700 1 -- 10.214.131.15:0/14951 <== osd.4 10.214.131.14:6802/14730 11 ==== osd_ping(ping_reply e6 stamp 2013-05-23 10:35:22.200152) v2 ==== 47+0+0 (3711961763 0 0) 0x3418700 con 0x36909a0

...

2013-05-23 10:35:28.100567 7fe160895700 1 -- 10.214.131.15:0/14951 --> 10.214.131.14:6802/14730 -- osd_ping(ping e6 stamp 2013-05-23 10:35:28.100529) v2 -- ?+0 0x3478380 con 0x3353420
2013-05-23 10:35:28.100590 7fe160895700 1 -- 10.214.131.15:0/14951 --> 10.214.131.14:6803/14730 -- osd_ping(ping e6 stamp 2013-05-23 10:35:28.100529) v2 -- ?+0 0x3478700 con 0x3690c60

...

About 21s after the last reply, still with no response:

2013-05-23 10:35:43.603424 7fe160895700 -1 osd.0 6 heartbeat_check: no reply from osd.4 since back 2013-05-23 10:35:22.200152 front 2013-05-23 10:35:22.200152 (cutoff 2013-05-23 10:35:23.603423)
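The arithmetic behind that heartbeat_check line can be reproduced with a short Python sketch. This is not the OSD code: the function name is_stale is hypothetical, and the only inputs are the timestamps copied from the log line above. The roughly 20s grace period is inferred from the difference between the log timestamp and the printed cutoff, not read from any configuration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def parse(ts):
    """Parse a timestamp in the format used by the OSD log lines."""
    return datetime.strptime(ts, FMT)

def is_stale(last_reply, cutoff):
    """Hypothetical helper: a peer counts as stale if its last reply predates the cutoff."""
    return last_reply < cutoff

# Values copied from the heartbeat_check line in the osd.0 log.
logged_at = parse("2013-05-23 10:35:43.603424")
last_reply = parse("2013-05-23 10:35:22.200152")  # same stamp for back and front
cutoff = parse("2013-05-23 10:35:23.603423")

implied_grace = (logged_at - cutoff).total_seconds()        # about 20s
since_last_reply = (logged_at - last_reply).total_seconds() # about 21.4s

print("implied grace:", implied_grace)
print("time since last reply:", since_last_reply)
print("stale:", is_stale(last_reply, cutoff))
```

The last reply is about 1.4s older than the cutoff, which is why osd.0 reports no reply from osd.4 even though osd.4 did answer the 10:35:22 ping.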

On the osd.4 side:

2013-05-23 10:35:22.200650 7f2de4787700 1 -- 10.214.131.14:6803/14730 <== osd.0 10.214.131.15:0/14951 11 ==== osd_ping(ping e6 stamp 2013-05-23 10:35:22.200152) v2 ==== 47+0+0 (2828270235 0 0) 0x24e8540 con 0x23f52c0
2013-05-23 10:35:22.200694 7f2de4787700 1 -- 10.214.131.14:6803/14730 --> 10.214.131.15:0/14951 -- osd_ping(ping_reply e6 stamp 2013-05-23 10:35:22.200152) v2 -- ?+0 0x23aae00 con 0x23f52c0
2013-05-23 10:35:22.200695 7f2de3785700 1 -- 10.214.131.14:6802/14730 <== osd.0 10.214.131.15:0/14951 11 ==== osd_ping(ping e6 stamp 2013-05-23 10:35:22.200152) v2 ==== 47+0+0 (2828270235 0 0) 0x24c2700 con 0x23f5580

About 27s later, with no other pings received in between:

2013-05-23 10:35:49.171781 7f2de4f88700 1 -- 10.214.131.14:0/14730 <== osd.0 10.214.131.15:6815/14951 1 ==== osd_ping(ping_reply e14 stamp 2013-05-23 10:35:49.170962) v2 ==== 47+0+0 (1932066329 0 0) 0x269f380 con 0x2339160
2013-05-23 10:35:49.171827 7f2de4f88700 20 osd.4 14 _share_map_outgoing 0x2482dc0 already has epoch 14
2013-05-23 10:35:49.171837 7f2de4f88700 1 -- 10.214.131.14:0/14730 <== osd.0 10.214.131.15:6816/14951 1 ==== osd_ping(ping_reply e14 stamp 2013-05-23 10:35:49.170962) v2 ==== 47+0+0 (1932066329 0 0) 0x24c2a80 con 0x2675160
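To measure gaps like this one without reading timestamps by eye, the osd_ping lines can be parsed with a small script. The sketch below is only an illustration: the regular expression and the parse_ping helper are assumptions based on the line layout visible in the excerpts above, not an official log format description. It uses two of the osd.4 lines quoted above as input.

```python
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Assumed layout: "<log timestamp> ... osd_ping(<kind> e<epoch> stamp <timestamp>) ..."
PING_RE = re.compile(rf"^({TS}) .*osd_ping\((\w+) e(\d+) stamp ({TS})\)")

def parse_ping(line):
    """Return (logged_at, kind, epoch, stamp) for an osd_ping line, or None if it does not match."""
    m = PING_RE.match(line)
    if m is None:
        return None
    return (
        datetime.strptime(m.group(1), FMT),
        m.group(2),
        int(m.group(3)),
        datetime.strptime(m.group(4), FMT),
    )

# Two lines copied from the osd.4 log excerpts.
last_ping = parse_ping(
    "2013-05-23 10:35:22.200695 7f2de3785700 1 -- 10.214.131.14:6802/14730 <== osd.0 "
    "10.214.131.15:0/14951 11 ==== osd_ping(ping e6 stamp 2013-05-23 10:35:22.200152) v2 "
    "==== 47+0+0 (2828270235 0 0) 0x24c2700 con 0x23f5580"
)
next_msg = parse_ping(
    "2013-05-23 10:35:49.171781 7f2de4f88700 1 -- 10.214.131.14:0/14730 <== osd.0 "
    "10.214.131.15:6815/14951 1 ==== osd_ping(ping_reply e14 stamp 2013-05-23 10:35:49.170962) v2 "
    "==== 47+0+0 (1932066329 0 0) 0x269f380 con 0x2339160"
)

gap = (next_msg[0] - last_ping[0]).total_seconds()
print(last_ping[1], "at epoch", last_ping[2])
print(next_msg[1], "at epoch", next_msg[2])
print("gap in seconds:", gap)
```

Running the same helper over a full log and printing the gap between consecutive matches would show whether any pings from osd.0 arrived in the elided interval.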

Actions #2

Updated by Samuel Just almost 11 years ago

  • Status changed from New to In Progress

wip_5172, going to test later

Actions #3

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review

or wip-5172, don't see wip_5172 :)

Actions #4

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved