Bug #16092

closed

"heartbeat_check: no reply from osd" ~7h loop in upgrade:hammer-x-jewel-distro-basic-vps

Added by Yuri Weinstein almost 8 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/hammer-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2016-05-30_18:15:02-upgrade:hammer-x-jewel-distro-basic-vps/
Job: 223974
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-05-30_18:15:02-upgrade:hammer-x-jewel-distro-basic-vps/223974/teuthology.log

...

2016-05-30T19:59:38.975 INFO:tasks.rados.rados.0.vpm026.stdout:update_object_version oid 9 v 1365 (ObjNum 1215 snap 384 seq_num 1215) dirty exists
2016-05-30T19:59:38.976 INFO:tasks.rados.rados.0.vpm026.stdout:3862:  left oid 9 (ObjNum 1215 snap 384 seq_num 1215)
2016-05-30T19:59:38.976 INFO:tasks.rados.rados.0.vpm026.stdout:3862: done (15 left)
2016-05-30T19:59:38.976 INFO:tasks.rados.rados.0.vpm026.stdout:3871: snap_remove snap 307
2016-05-30T20:01:18.061 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 03:01:18.058119 7f398196e700 -1 osd.5 1923 heartbeat_check: no reply from osd.0 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 02:59:38.058097)

Job reported dead

Actions #2

Updated by Samuel Just almost 8 years ago

2016-05-31T06:17:43.090 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 13:17:43.085743 7f398196e700 -1 osd.5 1923 heartbeat_check: no reply from osd.1 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 13:16:03.085722)
2016-05-31T06:17:43.090 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 13:17:43.085752 7f398196e700 -1 osd.5 1923 heartbeat_check: no reply from osd.2 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 13:16:03.085722)
2016-05-31T06:17:43.215 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 13:17:43.209633 7f39683cd700 -1 osd.5 1923 heartbeat_check: no reply from osd.0 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 13:16:03.209626)

OK, so OSDs 3-5 can't contact 0-2. Sounds like a networking or hardware problem to me.
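To tally which OSD is complaining about which peer, the `heartbeat_check` lines quoted above can be parsed mechanically. The following is a minimal sketch, not part of the original report: the regular expression is inferred only from the log lines shown in this ticket, and the function name `parse_heartbeat_line` and the field `seconds_past_cutoff` are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the heartbeat_check lines quoted in this ticket
# (an assumption; other Ceph versions may format the message differently).
HEARTBEAT_RE = re.compile(
    r"osd\.(?P<reporter>\d+) \d+ heartbeat_check: no reply from osd\.(?P<peer>\d+) "
    r"since back (?P<back>\S+ \S+) front (?P<front>\S+ \S+) "
    r"\(cutoff (?P<cutoff>\S+ \S+)\)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_heartbeat_line(line):
    """Return reporter, peer, and how far the last 'back' reply lags the cutoff.

    Returns None when the line is not a heartbeat_check message.
    """
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    back = datetime.strptime(match["back"], TS_FORMAT)
    cutoff = datetime.strptime(match["cutoff"], TS_FORMAT)
    return {
        "reporter": int(match["reporter"]),
        "peer": int(match["peer"]),
        "seconds_past_cutoff": (cutoff - back).total_seconds(),
    }
```

Running this over every `heartbeat_check` line in teuthology.log and collecting the `(reporter, peer)` pairs would show whether only osd.5 is complaining or whether all of 3-5 report 0-2 as silent.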

Actions #3

Updated by Samuel Just almost 8 years ago

Also, at some point last week, they killed all of the odd-numbered VMs. Might that have happened at the same time as this?

Actions #4

Updated by Yuri Weinstein almost 8 years ago

My hunch is that it's unrelated.

Actions #5

Updated by Dan Mick almost 8 years ago

The VMs were empty when killed.

Actions #7

Updated by Samuel Just almost 8 years ago

Same thing, OSDs 0-2 can't talk to 3-5. It could be an odd bug in Ceph, or it could be an actual network partition. The latter is much more likely given that 0-2 can apparently talk to each other and 3-5 can apparently talk to each other.
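The partition argument above can be checked against a set of failed heartbeat pairs. This is a rough sketch under stated assumptions: the helper names `partition_groups` and `looks_like_clean_partition` are hypothetical, and any pair without a failure report is treated as reachable, which is weaker evidence than a confirmed reply.

```python
def partition_groups(osds, failed_pairs):
    """Group OSDs into sets that can still reach each other.

    Assumption: a pair with no reported heartbeat failure is reachable.
    """
    failed = {frozenset(pair) for pair in failed_pairs}
    unseen = set(osds)
    groups = []
    while unseen:
        start = unseen.pop()
        group, stack = {start}, [start]
        while stack:
            current = stack.pop()
            # Iterate over a copy because unseen shrinks inside the loop.
            for other in list(unseen):
                if frozenset((current, other)) not in failed:
                    unseen.remove(other)
                    group.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return sorted(groups)


def looks_like_clean_partition(osds, failed_pairs):
    """True when failures split the OSDs into groups with no failures inside a group."""
    failed = {frozenset(pair) for pair in failed_pairs}
    groups = partition_groups(osds, failed_pairs)
    if len(groups) < 2:
        return False
    for group in groups:
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if frozenset((a, b)) in failed:
                    return False
    return True
```

With failures reported for every pair between 3-5 and 0-2 and none inside either set, `partition_groups(range(6), ...)` yields the two groups described in the comment, which is consistent with a network partition. If only one OSD reports failures, the function returns a single group and the split explanation is not supported by that data alone.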

Actions #11

Updated by Samuel Just over 7 years ago

  • Status changed from New to Can't reproduce