Bug #14175: clock skew report is incorrect by "ceph health detail" command - Ceph - Ceph

closed

Added by wei qiaomiao over 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assignee: Joao Eduardo Luis
Category: Monitor
Target version:
% Done:
Source: other
Tags:
Backport: hammer
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I used the "ceph health detail" command to check my cluster health and found the warning below:

mon.c7 addr 10.118.202.97:6789/0 clock skew 239.478s > max 0.05s (latency 0.0355416s)

So I changed the system time on mon.c7 to match the leader monitor, but the warning still persists:

mon.c7 addr 10.118.202.97:6789/0 clock skew 191.582s > max 0.05s (latency 0.0286543s)

Related issues 1 (0 open — 1 closed)

Updated by Nathan Cutler over 8 years ago

Assignee set to Joao Eduardo Luis

Updated by Nathan Cutler over 8 years ago

Hopefully Joao will chime in with a deeper explanation, but until then I can say that I have run into a similar issue (you don't mention which Ceph version you are using - I was using 0.94.3).

Here is what I remember of Joao's explanation:

As of Hammer there is new clock-skew handling logic in the monitor code that is designed to make the cluster more tolerant of time discrepancies (situations where the clock on one monitor node is slightly ahead of, or behind, the other nodes). This, however, comes with an unintended side effect: it now takes longer for clusters to recover from large time differences.

Whether or not this is a bug is still an open question.

It would be interesting to know how long it takes the cluster to recover from the clock skew reported here. In my case, the time discrepancy arose at boot time and the cluster took 15-60 minutes to recover (i.e. for the "clock skew" warning to disappear).

Updated by wei qiaomiao over 8 years ago

I was using version 0.94.5.
How long the cluster takes to recover from the reported clock skew depends on how large the clock drift is. In my environment, the cluster took 3-4 hours to recover when the clock drift was 2-3 minutes. That is too long for users.
Maybe we can improve the clock-skew handling mechanism for the case where the cluster's clock drift is large. For example, when the absolute value of the difference between the current clock skew and the last value is larger than 5s (or another value we can discuss), we drop the last value and report only the current value.
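
To illustrate the proposal, here is a minimal Python sketch. It is not the monitor's actual code. It assumes the reported skew is a weighted blend of the previous value and the newest sample; the 0.8 weight and the function names are hypothetical, chosen only for illustration. The 5s reset threshold is the value suggested above and the 0.05s limit is the "max" from the warning.

```python
def smoothed_skew(prev, current, weight=0.8):
    """Blend the previous reported skew with the newest sample.

    ASSUMPTION: a simple weighted average stands in for the monitor's
    real smoothing logic; the 0.8 weight is hypothetical.
    """
    return weight * prev + (1.0 - weight) * current


def update_skew(prev, current, reset_threshold=5.0, weight=0.8):
    """Proposed rule: on a large jump, drop the history and keep only
    the current sample; otherwise smooth as before."""
    if abs(current - prev) > reset_threshold:
        return current
    return smoothed_skew(prev, current, weight)


def rounds_to_recover(update, start, max_skew=0.05, limit=10_000):
    """Count time-check rounds until the reported skew is within
    max_skew, assuming the clock was fixed (every new sample is 0)."""
    reported = start
    rounds = 0
    while abs(reported) > max_skew and rounds < limit:
        reported = update(reported, 0.0)
        rounds += 1
    return rounds


if __name__ == "__main__":
    start = 239.478  # skew from the first warning, in seconds
    print("smoothing only:", rounds_to_recover(smoothed_skew, start))
    print("with reset rule:", rounds_to_recover(update_skew, start))
```

Under these assumptions, smoothing alone needs many time-check rounds before the warning clears, while the reset rule clears it on the first round after the clock is corrected. The trade-off is that a single bad sample larger than the threshold would also be reported immediately.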
