Project

General

Profile

Actions

Bug #18370

closed

cluster [WRN] message from mon.2 was stamped 9.194390s in the future, clocks not synchronized" in cluster log

Added by Sage Weil over 7 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Urgent
Category:
Infrastructure Service
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Actions #1

Updated by Dan Mick over 7 years ago

I think we should add 'clock' to internal tasks by default rather than 'clock.check'. This will make it stop ntpd, run ntpdate, and restart ntp, which gives a better chance of handling wrong times (ntpd takes minutes to settle). Testing that idea now.

Actions #3

Updated by Sage Weil about 7 years ago

  • Status changed from New to 12

the clock goes backward in time by 1s, right at midnight:

2017-02-28 23:59:59.968819 7f6382e6e700 20 mon.b@1(peon).paxos(paxos active c 1256..1825) reset_lease_timeout - setting timeout event
2017-02-28 23:59:59.968826 7f6382e6e700  5 mon.b@1(peon).monmap v2 apply_mon_features wait for service to be writeable
2017-02-28 23:59:59.968829 7f6382e6e700  5 mon.b@1(peon).paxos(paxos active c 1256..1825) is_readable = 1 - now=2017-02-28 23:59:59.968830 lease_expire=2017-03-01 00:00:04.911130 has v0 lc 1825
2017-02-28 23:59:59.968864 7f6386675700 10 _calc_signature seq 4139 front_crc_ = 3915049267 middle_crc = 0 data_crc = 0 sig = 6082498340259657420
2017-02-28 23:59:59.968878 7f6386675700 20 Putting signature in client message(seq # 4139): sig = 6082498340259657420
2017-02-28 23:59:59.985308 7f6380e6a700  1 leveldb: Generated table #54: 275 keys, 2127252 bytes
2017-02-28 23:59:59.035553 7f6380e6a700  1 leveldb: Generated table #55: 1075 keys, 2162697 bytes
2017-02-28 23:59:59.060977 7f6380e6a700  1 leveldb: Generated table #56: 7 keys, 23398 bytes
2017-02-28 23:59:59.060990 7f6380e6a700  1 leveldb: Compacted 1@1 + 2@2 files => 6417432 bytes
2017-02-28 23:59:59.088225 7f6380e6a700  1 leveldb: compacted to: files[ 0 4 4 0 0 0 0 ]
2017-02-28 23:59:59.088344 7f6380e6a700  1 leveldb: Delete type=2 #42

that was smithi181, xenial.

Actions #4

Updated by Sage Weil about 7 years ago

Hrm, it was the fucking leap second.

Mar  1 00:12:50 smithi181 ntpd[1814]: kernel reports leap second has occurred
Mar  1 00:12:50 smithi181 ntpd[1814]: kernel reports leap second has occurred

Actions #5

Updated by Sage Weil about 7 years ago

okay, here is run /a/sage-2017-02-24_06:15:05-rados-wip-sage-testing---basic-smithi/855210

2017-02-24T07:32:31.781 INFO:tasks.ceph.mon.b.smithi191.stdout:starting mon.b rank 1 at 172.21.15.191:6789/0 mon_data /var/lib/ceph/mon/ceph-b fsid b0586099-a577-49e3-a180-bc6dcfce0c8c
2017-02-24T07:32:31.913 INFO:tasks.ceph.mon.a.smithi006.stdout:starting mon.a rank 0 at 172.21.15.6:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid b0586099-a577-49e3-a180-bc6dcfce0c8c
2017-02-24T07:32:31.937 INFO:tasks.ceph.mon.c.smithi006.stdout:starting mon.c rank 2 at 172.21.15.6:6790/0 mon_data /var/lib/ceph/mon/ceph-c fsid b0586099-a577-49e3-a180-bc6dcfce0c8c

teuthology log reports these all happen within 200ms, but the mon logs show
remote/smithi191/log/ceph-mon.b.log:2017-02-24 07:32:31.785947 7f54ca7de880  0 starting mon.b rank 1 at 172.21.15.191:6789/0 mon_data /var/lib/ceph/mon/ceph-b fsid b0586099-a577-49e3-a180-bc6dcfce0c8c
remote/smithi006/log/ceph-mon.a.log:2017-02-24 07:32:41.327674 7f7a5eb09800  0 starting mon.a rank 0 at 172.21.15.6:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid b0586099-a577-49e3-a180-bc6dcfce0c8c
remote/smithi006/log/ceph-mon.c.log:2017-02-24 07:32:41.352288 7f4a114f3800  0 starting mon.c rank 2 at 172.21.15.6:6790/0 mon_data /var/lib/ceph/mon/ceph-c fsid b0586099-a577-49e3-a180-bc6dcfce0c8c

in other words, the smithi006 clock really is 10 seconds slow.

syslog only shows

Feb 24 07:30:48 smithi006 python: ansible-systemd Invoked with name=ntpd enabled=True daemon_reload=False state=started user=False masked=None

which we see here in teuthology.log:
2017-02-24T07:30:38.956 INFO:teuthology.task.ansible.out:ok: [smithi006.front.sepia.ceph.com] => {"changed": false, "enabled": true, "name": "ntpd", "state": "started", "status": {"ActiveEnterTimestamp": "Fri 2017-02-24 04:41:52 UTC", "ActiveEnterTimestampMonotonic": "253965470", "ActiveExitTimestampMonotonic": "0", "ActiveState": "active", "After": "tmp.mount ntpdate.service syslog.target systemd-journald.socket sntp.service system.slice -.mount basic.target", "AllowIsolate": "no", "AssertResult": "yes", "AssertTimestamp": "Fri 2017-02-24 04:41:52 UTC", "AssertTimestampMonotonic": "253900537", "Before": "shutdown.target multi-user.target chronyd.service", "BlockIOAccounting": "no", "BlockIOWeight": "18446744073709551615", "CPUAccounting": "no", "CPUQuotaPerSecUSec": "infinity", "CPUSchedulingPolicy": "0", "CPUSchedulingPriority": "0", "CPUSchedulingResetOnFork": "no", "CPUShares": "18446744073709551615", "CanIsolate": "no", "CanReload": "no", "CanStart": "yes", "CanStop": "yes", "CapabilityBoundingSet": "18446744073709551615", "ConditionResult": "yes", "ConditionTimestamp": "Fri 2017-02-24 04:41:52 UTC", "ConditionTimestampMonotonic": "253900537", "ConflictedBy": "chronyd.service", "Conflicts": "shutdown.target", "ControlGroup": "/system.slice/ntpd.service", "ControlPID": "0", "DefaultDependencies": "yes", "Delegate": "no", "Description": "Network Time Service", "DevicePolicy": "auto", "EnvironmentFile": "/etc/sysconfig/ntpd (ignore_errors=yes)", "ExecMainCode": "0", "ExecMainExitTimestampMonotonic": "0", "ExecMainPID": "18730", "ExecMainStartTimestamp": "Fri 2017-02-24 04:41:52 UTC", "ExecMainStartTimestampMonotonic": "253965414", "ExecMainStatus": "0", "ExecStart": "{ path=/usr/sbin/ntpd ; argv[]=/usr/sbin/ntpd -u ntp:ntp $OPTIONS ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }", "FailureAction": "none", "FileDescriptorStoreMax": "0", "FragmentPath": "/usr/lib/systemd/system/ntpd.service", "GuessMainPID": "yes", "IOScheduling": "0", "Id": "ntpd.service", "IgnoreOnIsolate": " 
2017-02-24T07:30:38.958 INFO:teuthology.task.ansible.out:no", "IgnoreOnSnapshot": "no", "IgnoreSIGPIPE": "yes", "InactiveEnterTimestampMonotonic": "0", "InactiveExitTimestamp": "Fri 2017-02-24 04:41:52 UTC", "InactiveExitTimestampMonotonic": "253913420", "JobTimeoutAction": "none", "JobTimeoutUSec": "0", "KillMode": "control-group", "KillSignal": "15", "LimitAS": "18446744073709551615", "LimitCORE": "18446744073709551615", "LimitCPU": "18446744073709551615", "LimitDATA": "18446744073709551615", "LimitFSIZE": "18446744073709551615", "LimitLOCKS": "18446744073709551615", "LimitMEMLOCK": "65536", "LimitMSGQUEUE": "819200", "LimitNICE": "0", "LimitNOFILE": "4096", "LimitNPROC": "127553", "LimitRSS": "18446744073709551615", "LimitRTPRIO": "0", "LimitRTTIME": "18446744073709551615", "LimitSIGPENDING": "127553", "LimitSTACK": "18446744073709551615", "LoadState": "loaded", "MainPID": "18730", "MemoryAccounting": "no", "MemoryCurrent": "18446744073709551615", "MemoryLimit": "18446744073709551615", "MountFlags": "0", "Names": "ntpd.service", "NeedDaemonReload": "no", "Nice": "0", "NoNewPrivileges": "no", "NonBlocking": "no", "NotifyAccess": "none", "OOMScoreAdjust": "0", "OnFailureJobMode": "replace", "PermissionsStartOnly": "no", "PrivateDevices": "no", "PrivateNetwork": "no", "PrivateTmp": "yes", "ProtectHome": "no", "ProtectSystem": "no", "RefuseManualStart": "no", "RefuseManualStop": "no", "RemainAfterExit": "no", "Requires": "basic.target -.mount", "RequiresMountsFor": "/var/tmp", "Restart": "no", "RestartUSec": "100ms", "Result": "success", "RootDirectoryStartOnly": "no", "RuntimeDirectoryMode": "0755", "SameProcessGroup": "no", "SecureBits": "0", "SendSIGHUP": "no", "SendSIGKILL": "yes", "Slice": "system.slice", "StandardError": "inherit", "StandardInput": "null", "StandardOutput": "journal", "StartLimitAction": "none", "StartLimitBurst": "5", "StartLimitInterval": "10000000", "StartupBlockIOWeight": "18446744073709551615", "StartupCPUShares": "18446744073709551615", "StatusErrno": "0", "StopWhenUnneeded": "no", "SubState": 
2017-02-24T07:30:38.960 INFO:teuthology.task.ansible.out:"running", "SyslogLevelPrefix": "yes", "SyslogPriority": "30", "SystemCallErrorNumber": "0", "TTYReset": "no", "TTYVHangup": "no", "TTYVTDisallocate": "no", "TimeoutStartUSec": "1min 30s", "TimeoutStopUSec": "1min 30s", "TimerSlackNSec": "50000", "Transient": "no", "Type": "forking", "UMask": "0022", "UnitFilePreset": "disabled", "UnitFileState": "enabled", "WantedBy": "multi-user.target", "Wants": "system.slice", "WatchdogTimestamp": "Fri 2017-02-24 04:41:52 UTC", "WatchdogTimestampMonotonic": "253965428", "WatchdogUSec": "0"}, "warnings": []}

but a few seconds later,

2017-02-24T07:30:43.072 INFO:teuthology.orchestra.run.smithi006:Running: 'sudo service ntp stop ; sudo ntpdate 0.us.pool.ntp.org 1.us.pool.ntp.org 2.us.pool.ntp.org 3.us.pool.ntp.org ; sudo service ntp start ; PATH=/usr/bin:/usr/sbin ntpdc -p'
2017-02-24T07:30:43.165 INFO:teuthology.orchestra.run.smithi006.stderr:Redirecting to /bin/systemctl stop  ntp.service
2017-02-24T07:30:43.174 INFO:teuthology.orchestra.run.smithi006.stderr:Failed to stop ntp.service: Unit ntp.service not loaded.
2017-02-24T07:30:43.187 INFO:teuthology.orchestra.run.smithi006.stderr:24 Feb 07:30:52 ntpdate[5001]: the NTP socket is in use, exiting
2017-02-24T07:30:43.206 INFO:teuthology.orchestra.run.smithi006.stderr:Redirecting to /bin/systemctl start  ntp.service
2017-02-24T07:30:43.209 INFO:teuthology.orchestra.run.smithi006.stderr:Failed to start ntp.service: Unit not found.
2017-02-24T07:30:43.214 INFO:teuthology.orchestra.run.smithi006.stdout:     remote           local      st poll reach  delay   offset    disp
2017-02-24T07:30:43.215 INFO:teuthology.orchestra.run.smithi006.stdout:=======================================================================
2017-02-24T07:30:43.215 INFO:teuthology.orchestra.run.smithi006.stdout:*level1g.cs.unc. 172.21.15.6      1 1024  377 0.01782  0.000345 0.10847
2017-02-24T07:30:43.215 INFO:teuthology.orchestra.run.smithi006.stdout:=clock.trit.net  172.21.15.6      2 1024  377 0.06496 -9.418840 0.08762
2017-02-24T07:30:43.215 INFO:teuthology.orchestra.run.smithi006.stdout:=74.120.81.219   172.21.15.6      2  512  374 0.04167 -0.000045 0.12901
2017-02-24T07:30:43.217 INFO:teuthology.orchestra.run.smithi006.stdout:=services.quadra 172.21.15.6      3  512  277 0.06560 -0.004937 0.11520
2017-02-24T07:30:43.226 DEBUG:teuthology.orchestra.remote:smithi191:/etc/ntp.conf is 3KB
2017-02-24T07:30:43.276 INFO:teuthology.orchestra.run.smithi191:Running: 'sudo service ntp stop ; sudo ntpdate 0.us.pool.ntp.org 1.us.pool.ntp.org 2.us.pool.ntp.org 3.us.pool.ntp.org ; sudo service ntp start ; PATH=/usr/bin:/usr/sbin ntpdc -p'
2017-02-24T07:30:45.989 INFO:teuthology.orchestra.run.smithi191.stderr:24 Feb 07:30:46 ntpdate[158412]: 108.61.56.35 rate limit response from server.
2017-02-24T07:30:46.045 INFO:teuthology.orchestra.run.smithi191.stderr:24 Feb 07:30:46 ntpdate[158412]: 138.68.46.177 rate limit response from server.
2017-02-24T07:30:46.637 INFO:teuthology.orchestra.run.smithi191.stderr:24 Feb 07:30:46 ntpdate[158412]: 198.55.111.50 rate limit response from server.
2017-02-24T07:30:54.569 INFO:teuthology.orchestra.run.smithi191.stdout:24 Feb 07:30:54 ntpdate[158412]: adjust time server 132.163.4.102 offset -0.008426 sec
2017-02-24T07:30:54.739 INFO:teuthology.orchestra.run.smithi191.stderr:ntpdc: read: Connection refused

so... maybe the problem is that we're getting rate limited? in which case, maybe we need our own ntp pool?

Actions #6

Updated by Sage Weil about 7 years ago

  • Priority changed from Immediate to Urgent
Actions #7

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to sepia
Actions #8

Updated by David Galloway almost 7 years ago

  • Category set to Infrastructure Service
  • Assignee set to David Galloway

http://tracker.ceph.com/issues/10675

According to community-cage list, they're planning on setting defaults and cabling up the console for an NTP server in the cage this week.

Actions

Also available in: Atom PDF