Bug #10675 (closed): failures related to clock skew

Added by Andrew Schoen about 9 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Category: Infrastructure Service
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

I suspect NTP is acting up in sepia. There are quite a few job failures related to clock skew. Clock skew could also explain why teuthology decided to use stale .pyc files in issue #10651.

Look at the failures in this run for an example:

http://pulpito.front.sepia.ceph.com/sage-2015-01-28_07:07:27-rados-wip-sage-testing-distro-basic-multi/

"2015-01-28 08:04:30.278152 mon.0 10.214.132.17:6789/0 3 : cluster [WRN] message from mon.1 was stamped 1.988812s in the future, clocks not synchronized" in cluster log

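The mon warning quoted above has a fixed shape, so it is easy to pick out when scanning a job's cluster log. A minimal Python sketch, assuming the message wording stays exactly as quoted here; `FUTURE_STAMP_RE` and `parse_future_stamp` are hypothetical names, not teuthology's own log scraper:

```python
import re
from typing import Optional, Tuple

# Matches the mon warning quoted in this issue, e.g.
# "... [WRN] message from mon.1 was stamped 1.988812s in the future, clocks not synchronized"
# Assumption: the wording is exactly as it appears in the quoted logs.
FUTURE_STAMP_RE = re.compile(
    r"message from (?P<peer>mon\.\S+) was stamped (?P<skew>\d+(?:\.\d+)?)s in the future"
)


def parse_future_stamp(line: str) -> Optional[Tuple[str, float]]:
    """Return (peer_mon, skew_seconds) for a future-stamp warning, else None."""
    match = FUTURE_STAMP_RE.search(line)
    if match is None:
        return None
    return match.group("peer"), float(match.group("skew"))


if __name__ == "__main__":
    line = ("2015-01-28 08:04:30.278152 mon.0 10.214.132.17:6789/0 3 : cluster [WRN] "
            "message from mon.1 was stamped 1.988812s in the future, clocks not synchronized")
    print(parse_future_stamp(line))  # ('mon.1', 1.988812)
```

Running it over each line of a cluster log and keeping the non-None results gives the peer monitor and how far its clock appeared to run ahead.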
Related issues: 3 (0 open, 3 closed)

Related to sepia - Bug #11514: Typica nodes are not reliably clock synced (Won't Fix, 04/30/2015)
Related to sepia - Bug #11777: machines must be on the same time zone (Won't Fix, 05/26/2015)
Related to sepia - Bug #11782: lab clocks not reliably synced (Can't reproduce, 05/27/2015)
#1

Updated by Zack Cerza about 9 years ago

  • Project changed from teuthology to sepia
#2

Updated by Zack Cerza about 9 years ago

I guess DreamHost's NTP server went down. Sage's fix is here:
https://github.com/ceph/ceph-qa-chef/pull/7

I'm still nervous that we're using a single NTP server, though...
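The worry about a single upstream server is that every node follows it, right or wrong. NTP derives a clock offset and round-trip delay from four timestamps per exchange, and with several servers a client can outvote one that has gone bad. A minimal sketch, assuming a plain median across servers as a stand-in for ntpd's real selection and clustering algorithms (which are more involved); the function names and the timestamps in the example are made up for illustration:

```python
from statistics import median
from typing import Iterable, Tuple

# Four timestamps from one NTP exchange, in seconds:
# t1 = client transmit, t2 = server receive, t3 = server transmit, t4 = client receive.
Exchange = Tuple[float, float, float, float]


def offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Clock offset and round-trip delay using the standard NTP on-wire formulas."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def combined_offset(exchanges: Iterable[Exchange]) -> float:
    """Median offset across servers (a simplification, not ntpd's actual algorithm)."""
    offsets = [offset_and_delay(*ex)[0] for ex in exchanges]
    if not offsets:
        raise ValueError("need at least one server exchange")
    return median(offsets)


if __name__ == "__main__":
    exchanges = [
        (100.0, 100.3125, 100.3125, 100.125),  # server A: +0.25 s
        (200.0, 200.3125, 200.3125, 200.125),  # server B: +0.25 s
        (100.0, 102.0625, 102.0625, 100.125),  # server C: misbehaving, +2.0 s
    ]
    print(combined_offset(exchanges))  # 0.25
```

With only server C configured, every node would step its clock by 2 s; with three servers, the median ignores the outlier.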

#3

Updated by Yuri Weinstein about 9 years ago

Still seeing this in runs:
http://pulpito.front.sepia.ceph.com/teuthology-2015-02-06_11:06:02-upgrade:giant-x:parallel-hammer-distro-basic-vps/742324/

"2015-02-06 19:26:01.125925 mon.0 10.214.130.124:6789/0 8 : cluster [WRN] mon.1 10.214.130.171:6789/0 clock skew 0.817504s > max 0.5s" in cluster log

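This second warning form carries both the measured skew and the configured maximum (0.5s in these runs), so the two can be extracted together and compared. A sketch under the same assumption that the wording matches the quoted line exactly; `parse_clock_skew` and `SkewWarning` are hypothetical names:

```python
import re
from typing import NamedTuple, Optional

# Matches warnings of the form quoted in this issue, e.g.
# "... [WRN] mon.1 10.214.130.171:6789/0 clock skew 0.817504s > max 0.5s"
# Assumption: the wording is exactly as in the quoted cluster log line.
CLOCK_SKEW_RE = re.compile(
    r"(?P<mon>mon\.\S+) (?P<addr>\S+) clock skew "
    r"(?P<skew>\d+(?:\.\d+)?)s > max (?P<max>\d+(?:\.\d+)?)s"
)


class SkewWarning(NamedTuple):
    mon: str        # monitor whose clock is skewed
    addr: str       # its address as printed in the log
    skew: float     # measured skew in seconds
    allowed: float  # configured maximum in seconds


def parse_clock_skew(line: str) -> Optional[SkewWarning]:
    """Return the parsed warning, or None if the line is not a clock-skew warning."""
    match = CLOCK_SKEW_RE.search(line)
    if match is None:
        return None
    return SkewWarning(match["mon"], match["addr"], float(match["skew"]), float(match["max"]))


if __name__ == "__main__":
    line = ("2015-02-06 19:26:01.125925 mon.0 10.214.130.124:6789/0 8 : cluster [WRN] "
            "mon.1 10.214.130.171:6789/0 clock skew 0.817504s > max 0.5s")
    warning = parse_clock_skew(line)
    if warning and warning.skew > warning.allowed:
        print(f"{warning.mon} is {warning.skew - warning.allowed:.3f}s over the limit")
        # prints "mon.1 is 0.318s over the limit"
```

The reporting monitor (mon.0 here) is skipped because its address is followed by the log sequence number, not by "clock skew".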
#4

Updated by Yuri Weinstein about 9 years ago

One more:
Run: http://pulpito.ceph.com/teuthology-2015-02-06_11:06:02-upgrade:giant-x:parallel-hammer-distro-basic-vps/
Job: 742324

"2015-02-06 19:26:01.125925 mon.0 10.214.130.124:6789/0 8 : cluster [WRN] mon.1 10.214.130.171:6789/0 clock skew 0.817504s > max 0.5s" in cluster log
#5

Updated by David Zafman about 9 years ago

Seen again:

dzafman-2015-02-19_14:55:37-rados:thrash-wip-10883---basic-multi/771074
On burnupi19 and burnupi25:
2015-02-20 12:52:52.636017 mon.1 10.214.134.14:6789/0 177 : cluster [WRN] message from mon.0 was stamped 0.501458s in the future, clocks not synchronized

dzafman-2015-02-19_14:55:37-rados:thrash-wip-10883---basic-multi/770916
On plana62 and plana64:
2015-02-20 10:00:56.842533 mon.0 10.214.132.14:6789/0 3 : cluster [WRN] message from mon.1 was stamped 0.855106s in the future, clocks not synchronized

#6

Updated by Yuri Weinstein about 9 years ago

  • Priority changed from Normal to Urgent

This issue comes up in many places, so I'm increasing the priority.

#8

Updated by Sage Weil almost 9 years ago

  • Priority changed from Urgent to Normal
#9

Updated by Zack Cerza about 8 years ago

  • Status changed from New to Resolved

We are using more reliable NTP servers now and believe this to be resolved.

#11

Updated by David Galloway over 7 years ago

  • Category set to Infrastructure Service
  • Status changed from Resolved to In Progress
  • Assignee set to David Galloway

Found this in the latest example:

2016-11-19T16:57:44.391 INFO:teuthology.orchestra.run.smithi116.stderr:19 Nov 16:57:44 ntpdate[7837]: 45.79.10.228 rate limit response from server.

I know we're due for an NTP server in the community cage. It may be a service I need to set up before then, though.
