Bug #14175

clock skew report is incorrect by "ceph health detail" command

Added by wei qiaomiao over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Category:
Monitor
Target version:
-
Start date:
12/24/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I used the "ceph health detail" command to check my cluster health and found the warning below:

mon.c7 addr 10.118.202.97:6789/0 clock skew 239.478s > max 0.05s (latency 0.0355416s)

So I changed the system time on mon.c7 to match the leader monitor, but the warning still exists:

mon.c7 addr 10.118.202.97:6789/0 clock skew 191.582s > max 0.05s (latency 0.0286543s)


Related issues

Copied to Ceph - Backport #15024: hammer: clock skew report is incorrect by "ceph health detail" command Resolved

Associated revisions

Revision 17d8ff42 (diff)
Added by Joao Eduardo Luis about 3 years ago

mon: Monitor: get rid of weighted clock skew reports

By weighting the reports we were making it really hard to get rid of a
clock skew warning once the cause had been fixed.

Instead, as soon as we get a clean bill of health, let's run a new round
as soon as possible and ascertain whether that was a transient fix or
for realsies. That should be better than the alternative of waiting for
an hour or something (for a large enough skew) for the warning to go
away - and with it, the admin's sanity ("WHAT AM I DOING WRONG???").

Fixes: #14175

Signed-off-by: Joao Eduardo Luis <>
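
One possible reading of the commit message above, as a minimal sketch rather than the actual monitor code: each round's measured skew replaces the stored value outright (no weighting), and the first clean round after a warning triggers an immediate re-check. The class name `SkewTracker`, its methods, and the 300-second normal interval are all hypothetical and chosen for illustration only.

```python
class SkewTracker:
    """Hypothetical model of unweighted clock skew reporting."""

    def __init__(self, max_skew=0.05, normal_interval_s=300.0):
        self.max_skew = max_skew                    # warning threshold in seconds
        self.normal_interval_s = normal_interval_s  # assumed gap between rounds
        self.skews = {}                             # peer name -> latest skew (s)

    def has_warning(self):
        """True if any peer's latest skew exceeds the threshold."""
        return any(abs(s) > self.max_skew for s in self.skews.values())

    def record_round(self, measured):
        """Store this round's skews and return the delay before the next round."""
        had_warning = self.has_warning()
        # Unweighted: the newest measurement replaces the old one entirely.
        self.skews = dict(measured)
        if had_warning and not self.has_warning():
            # Clean bill of health right after a warning: re-check as soon
            # as possible to confirm the fix was not transient.
            return 0.0
        return self.normal_interval_s


tracker = SkewTracker()
print(tracker.record_round({"mon.c7": 239.478}), tracker.has_warning())
print(tracker.record_round({"mon.c7": 0.001}), tracker.has_warning())
```

In this model the warning clears on the first round measured after the clock is corrected, instead of decaying over many rounds.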

Revision 01672b4a (diff)
Added by Joao Eduardo Luis about 3 years ago

mon: Monitor: get rid of weighted clock skew reports

By weighting the reports we were making it really hard to get rid of a
clock skew warning once the cause had been fixed.

Instead, as soon as we get a clean bill of health, let's run a new round
as soon as possible and ascertain whether that was a transient fix or
for realsies. That should be better than the alternative of waiting for
an hour or something (for a large enough skew) for the warning to go
away - and with it, the admin's sanity ("WHAT AM I DOING WRONG???").

Fixes: #14175

Signed-off-by: Joao Eduardo Luis <>

(cherry pick from commit 17d8ff429c7dca8fc1ada6e7cc8a7c4924a22e28)

History

#1 Updated by Nathan Cutler over 3 years ago

  • Assignee set to Joao Eduardo Luis

#2 Updated by Nathan Cutler over 3 years ago

Hopefully Joao will chime in with a deeper explanation, but until then I can say that I have run into a similar issue (you don't mention which Ceph version you are using - I was using 0.94.3).

Here is what I remember of Joao's explanation:

As of Hammer there is new clock-skew handling logic in the monitor code that is designed to make the cluster more tolerant of time discrepancies (situations where the clock on one monitor node is slightly ahead of, or behind, the other nodes). This, however, comes with an unintended side effect: it now takes longer for clusters to recover from large time differences.

Whether or not this is a bug is still an open question.

It would be interesting to know how long it takes the cluster to recover from the clock skew reported here. In my case, the time discrepancy was arising at boot time and the cluster took 15-60 minutes to recover (i.e. for the "clock skew" warning to disappear).
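
A rough way to see why recovery can be slow under a weighted scheme. The sketch below assumes the monitor keeps an exponentially weighted average of the skew, with a hypothetical weight of 0.8 on the old value and 0.2 on the new sample. These weights are not stated anywhere in this ticket; they are an assumption, suggested by the fact that the two values in the description (239.478s, then 191.582s) differ by a factor of about 0.8. The 300-second gap between rounds is likewise an assumption (it may correspond to the `mon timecheck interval` setting on a given cluster).

```python
def rounds_to_clear(initial_skew, max_skew=0.05, old_weight=0.8):
    """Count rounds until a weighted skew report is no longer above max_skew.

    Assumes the real skew is 0 from the first round on (the clock has been
    fixed). old_weight is a hypothetical smoothing factor, not a value
    confirmed by the ticket.
    """
    if not 0 < old_weight < 1:
        raise ValueError("old_weight must be strictly between 0 and 1")
    skew = initial_skew
    rounds = 0
    while skew > max_skew:
        # Weighted update with a new sample of 0.0 seconds of skew.
        skew = old_weight * skew + (1 - old_weight) * 0.0
        rounds += 1
    return rounds


ROUND_INTERVAL_S = 300  # assumed seconds between timecheck rounds

rounds = rounds_to_clear(239.478)
hours = rounds * ROUND_INTERVAL_S / 3600
print(f"{rounds} rounds, about {hours:.1f} hours")
```

Under these assumptions a 239s skew needs 38 rounds, a little over three hours, before the reported value falls to the 0.05s limit.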

#3 Updated by wei qiaomiao over 3 years ago

I was using version 0.94.5.
How long the cluster takes to recover from the reported clock skew depends on how large the clock drift is. In my
environment, the cluster took 3-4 hours to recover when the clock drift was 2-3 minutes. That is too long for users.
Maybe we can improve the clock-skew handling mechanism for cases where the cluster's clock drift is large. For example,
when the absolute difference between the current clock skew value and the last value is larger than 5s (or some other value we can discuss), we drop the last value and report only the current value.
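
The rule proposed above could be sketched as follows. This is only an illustration of the suggestion, not code from Ceph: the function name `update_skew` is hypothetical, and the fallback branch assumes a 0.8/0.2 weighted average, which this ticket does not confirm.

```python
def update_skew(last_skew, current_skew, jump_threshold=5.0, old_weight=0.8):
    """Return the skew to report after a new measurement (all values in seconds).

    If the new measurement differs from the stored value by more than
    jump_threshold, drop the stored value and report the new one directly.
    Otherwise fall back to a weighted average (old_weight is an assumed value).
    """
    if last_skew is None or abs(current_skew - last_skew) > jump_threshold:
        return current_skew
    return old_weight * last_skew + (1 - old_weight) * current_skew


# Large jump: the clock was just corrected, so the old value is discarded.
print(update_skew(239.478, 0.001))
# Small change: the report is still smoothed.
print(update_skew(0.04, 0.02))
```

With a 5s threshold, a correction of several minutes shows up in the very next report, while small fluctuations are still smoothed.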

#4 Updated by Joao Eduardo Luis over 3 years ago

  • Category set to Monitor
  • Status changed from New to Need Review

#5 Updated by Kefu Chai about 3 years ago

  • Status changed from Need Review to Pending Backport

Joao, shall we backport this change to hammer?

#6 Updated by Nathan Cutler about 3 years ago

  • Backport set to hammer

#7 Updated by Nathan Cutler about 3 years ago

  • Copied to Backport #15024: hammer: clock skew report is incorrect by "ceph health detail" command added

#8 Updated by Loic Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved
