Bug #16092
Status: Closed
"heartbeat_check: no reply from osd" ~7h loop in upgrade:hammer-x-jewel-distro-basic-vps
Description
Run: http://pulpito.ceph.com/teuthology-2016-05-30_18:15:02-upgrade:hammer-x-jewel-distro-basic-vps/
Job: 223974
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-05-30_18:15:02-upgrade:hammer-x-jewel-distro-basic-vps/223974/teuthology.log
...
2016-05-30T19:59:38.975 INFO:tasks.rados.rados.0.vpm026.stdout:update_object_version oid 9 v 1365 (ObjNum 1215 snap 384 seq_num 1215) dirty exists
2016-05-30T19:59:38.976 INFO:tasks.rados.rados.0.vpm026.stdout:3862: left oid 9 (ObjNum 1215 snap 384 seq_num 1215)
2016-05-30T19:59:38.976 INFO:tasks.rados.rados.0.vpm026.stdout:3862: done (15 left)
2016-05-30T19:59:38.976 INFO:tasks.rados.rados.0.vpm026.stdout:3871: snap_remove snap 307
2016-05-30T20:01:18.061 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 03:01:18.058119 7f398196e700 -1 osd.5 1923 heartbeat_check: no reply from osd.0 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 02:59:38.058097)
Job reported dead
Updated by Yuri Weinstein almost 8 years ago
Updated by Samuel Just almost 8 years ago
2016-05-31T06:17:43.090 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 13:17:43.085743 7f398196e700 -1 osd.5 1923 heartbeat_check: no reply from osd.1 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 13:16:03.085722)
2016-05-31T06:17:43.090 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 13:17:43.085752 7f398196e700 -1 osd.5 1923 heartbeat_check: no reply from osd.2 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 13:16:03.085722)
2016-05-31T06:17:43.215 INFO:tasks.ceph.osd.5.vpm162.stderr:2016-05-31 13:17:43.209633 7f39683cd700 -1 osd.5 1923 heartbeat_check: no reply from osd.0 since back 2016-05-31 02:59:37.858151 front 2016-05-31 02:59:37.858151 (cutoff 2016-05-31 13:16:03.209626)
OK, so osds 3-5 can't contact 0-2. Sounds like a networking or hardware problem to me.
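For context on how to read the log lines above, here is a minimal sketch of the heartbeat cutoff check they reflect. This is an illustration, not Ceph's actual implementation: a peer is flagged ("no reply from osd.N since back ... front ... (cutoff ...)") when its last reply on the back or front network is older than the cutoff, i.e. now minus the heartbeat grace period (assumed here to be the 20-second `osd_heartbeat_grace` default).

```python
from datetime import datetime, timedelta

# Assumed default for osd_heartbeat_grace (illustrative only).
HEARTBEAT_GRACE = timedelta(seconds=20)

def heartbeat_check(now, last_back_reply, last_front_reply,
                    grace=HEARTBEAT_GRACE):
    """Return True if the peer would be reported as 'no reply from osd.N'.

    The cutoff is 'now' minus the grace period; a peer whose most recent
    back- or front-network reply predates the cutoff is considered down.
    """
    cutoff = now - grace
    return last_back_reply < cutoff or last_front_reply < cutoff

# Timestamps taken from the osd.5 -> osd.1 line in the log excerpt.
fmt = "%Y-%m-%d %H:%M:%S.%f"
now = datetime.strptime("2016-05-31 13:17:43.085743", fmt)
last = datetime.strptime("2016-05-31 02:59:37.858151", fmt)
print(heartbeat_check(now, last, last))  # True: last reply is hours stale
```

Note how stale the "since" timestamps are here: the last reply is from 02:59:37 while the cutoff has advanced to 13:16:03, so osd.5 keeps emitting this line on every check for the rest of the ~7h run.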
Updated by Samuel Just almost 8 years ago
Also, at some point last week, they killed all of the odd-numbered VMs; might that have happened at the same time as this?
Updated by Yuri Weinstein almost 8 years ago
Updated by Samuel Just almost 8 years ago
Same thing: osds 0-2 can't talk to 3-5. It could be an odd bug in Ceph... or it could be an actual network partition. The latter is much more likely given that 0-2 can apparently talk to each other and 3-5 can apparently talk to each other.
Updated by Samuel Just over 7 years ago
- Status changed from New to Can't reproduce