Bug #14175

closed

clock skew report is incorrect by "ceph health detail" command

Added by wei qiaomiao over 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assignee: Joao Eduardo Luis
Category: Monitor
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I used the "ceph health detail" command to check my cluster's health and found the following warning:

mon.c7 addr 10.118.202.97:6789/0 clock skew 239.478s > max 0.05s (latency 0.0355416s)

So I changed the system time on mon.c7 to match the leader monitor, but the warning still exists:

mon.c7 addr 10.118.202.97:6789/0 clock skew 191.582s > max 0.05s (latency 0.0286543s)


Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #15024: hammer: clock skew report is incorrect by "ceph health detail" command (Resolved, Xiaoxi Chen)
#1

Updated by Nathan Cutler over 8 years ago

  • Assignee set to Joao Eduardo Luis
#2

Updated by Nathan Cutler over 8 years ago

Hopefully Joao will chime in with a deeper explanation, but until then I can say that I have run into a similar issue (you don't mention which Ceph version you are using - I was using 0.94.3).

Here is what I remember of Joao's explanation:

As of Hammer there is new clock-skew handling logic in the monitor code that is designed to make the cluster more tolerant of time discrepancies (situations where the clock on one monitor node is slightly ahead of, or behind, the other nodes). This, however, comes with an unintended side effect: it now takes longer for clusters to recover from large time differences.
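A minimal sketch of the kind of smoothing described here, assuming the monitor blends each new skew sample into its previous report as a weighted moving average. The 0.8/0.2 weights and the function name `update_reported_skew` are assumptions for illustration, not taken from the Ceph source. They are, however, consistent with the two readings in the description (239.478 × 0.8 = 191.5824), if the first sample after the clock fix is taken as 0 s.

```python
# Sketch of weighted clock-skew smoothing. The 0.8/0.2 weights are an
# assumption for illustration, not quoted from the Ceph monitor source.
OLD_WEIGHT = 0.8
NEW_WEIGHT = 0.2


def update_reported_skew(previous, sample):
    """Blend a new skew sample (seconds) into the previously reported value."""
    if previous is None:  # first sample for this monitor: report it directly
        return sample
    return OLD_WEIGHT * previous + NEW_WEIGHT * sample


reported = update_reported_skew(None, 239.478)  # first warning
reported = update_reported_skew(reported, 0.0)  # clock fixed, sample ~0 s
print(f"{reported:.3f}")  # 191.582
```

With these weights a single large reading keeps dominating the report: each round removes only about 20% of the stale value. That damps brief network jitter, but it also delays clearing a skew that has already been fixed.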

Whether or not this is a bug is still an open question.

It would be interesting to know how long it takes the cluster to recover from clock skew reported here. In my case, the time discrepancy was arising at boot time and the cluster took 15-60 minutes to recover (i.e. for the "clock skew" warning to disappear).

#3

Updated by wei qiaomiao over 8 years ago

I was using version 0.94.5.
How long the cluster takes to recover from the reported clock skew depends on how large the clock drift was. In my
environment, the cluster took 3-4 hours to recover when the clock drift was 2-3 minutes. That is too long for users.
Maybe we can improve the clock-skew handling mechanism for cases where the cluster's clock drift is large. For example,
when the absolute difference between the current clock skew value and the last value is larger than 5s (or another value we can discuss), we could drop the last value and report only the current value.
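A sketch of the proposed rule next to a plain weighted average, to compare how many timecheck rounds each needs before the warning clears. Everything here is an assumption for illustration: the 0.8/0.2 weights, a 300 s interval between rounds, every sample after the fix being 0 s, and "the last value" being read as the last reported (smoothed) value. The helper names are hypothetical.

```python
# Assumptions (not taken from the Ceph source): each timecheck round blends
# the new sample into the previous report with weights 0.8/0.2, rounds are
# 300 s apart, and the warning threshold is 0.05 s.
OLD_WEIGHT = 0.8
NEW_WEIGHT = 0.2
ROUND_SECONDS = 300.0
MAX_SKEW = 0.05


def smoothed_update(previous, sample):
    """Weighted moving average: the old report dominates each new sample."""
    if previous is None:
        return sample
    return OLD_WEIGHT * previous + NEW_WEIGHT * sample


def proposed_update(previous, sample, reset_threshold=5.0):
    """Proposed rule: if the sample jumps more than reset_threshold seconds
    away from the previous report, discard the history and report the
    sample as-is; otherwise smooth as before."""
    if previous is None or abs(sample - previous) > reset_threshold:
        return sample
    return smoothed_update(previous, sample)


def rounds_until_clear(update, initial_skew, true_skew=0.0):
    """Count rounds until the reported skew drops to MAX_SKEW or below,
    assuming every new sample equals true_skew."""
    reported = initial_skew
    rounds = 0
    while abs(reported) > MAX_SKEW:
        reported = update(reported, true_skew)
        rounds += 1
    return rounds


if __name__ == "__main__":
    initial = 239.478  # skew from the first warning in the description
    for name, update in (("smoothed", smoothed_update), ("proposed", proposed_update)):
        n = rounds_until_clear(update, initial)
        print(f"{name}: {n} rounds, ~{n * ROUND_SECONDS / 3600:.1f} h")
```

Under these assumptions the smoothed rule needs 38 rounds (about 3.2 hours) to fall below 0.05 s after a 239.478 s reading, which would be in the same range as the 3-4 hours reported above. The proposed reset clears on the next round. Small corrections (within the threshold) are still smoothed, so ordinary jitter stays damped.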

#4

Updated by Joao Eduardo Luis over 8 years ago

  • Category set to Monitor
  • Status changed from New to Fix Under Review
#5

Updated by Kefu Chai about 8 years ago

  • Status changed from Fix Under Review to Pending Backport

Joao, shall we backport this change to hammer?

#6

Updated by Nathan Cutler about 8 years ago

  • Backport set to hammer
#7

Updated by Nathan Cutler about 8 years ago

  • Copied to Backport #15024: hammer: clock skew report is incorrect by "ceph health detail" command added
#8

Updated by Loïc Dachary almost 8 years ago

  • Status changed from Pending Backport to Resolved