Bug #3633: mon: clock drift errors not reported by ceph status - Ceph - Ceph

Bug #3633

closed

mon: clock drift errors not reported by ceph status

Added by Corin Langosch over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Assignee: Joao Eduardo Luis
Category: Monitor
Source: Community (user)
Backport: Bobtail

Description

Using argonaut 0.48.2. Today all ceph commands were randomly slow. I checked all hosts: all monitors (3) and OSDs (17) were up and running, and ceph status and ceph -w were not reporting any error. Digging a little deeper, I found in the logs of one monitor that it complained about excessive clock drift, which was caused by a crashed NTP server. After fixing this, everything worked fine again. I'd like to suggest emitting a warning when running ceph status or ceph -w in case of any clock drift errors. Returning HEALTH_OK is a bit misleading when in fact the cluster is not 100% working and randomly hangs for several seconds.

Related issues 1 (0 open — 1 closed)

Updated by Ian Colle over 11 years ago

Assignee set to Joao Eduardo Luis
Priority changed from Normal to High
Backport set to Bobtail

Updated by Joao Eduardo Luis over 11 years ago

Subject changed from clock drift errors not reported by ceph status to mon: clock drift errors not reported by ceph status
Category set to Monitor

Updated by Joao Eduardo Luis over 11 years ago

Source changed from Development to Community (user)

Updated by Joao Eduardo Luis over 11 years ago

Status changed from New to In Progress

I'm looking into an adequate way to make 'ceph -s' return a warning when the clocks have drifted.

However, 'ceph -w' should have shown clock drifting warnings. Have you disabled 'clog_to_monitors'?

Updated by Corin Langosch over 11 years ago

Here's my config: http://pastie.org/5554031

I'm pretty sure there was no warning when I did 'ceph -w', because I was really puzzled at first why the cluster randomly hangs and checked quite a lot of things. But I cannot say for sure now, as I didn't save the output :-(.

To monitor ceph's status I have a cronjob which runs 'ceph health details | grep HEALTH_OK > /dev/null || ceph health details' every few minutes. So when ceph is not healthy I get an email alert. The result should not be HEALTH_OK if there's any warning/error (clock drift included).
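
For illustration, the decision made by that cronjob pipeline can be written as a small function. This is only a sketch: the name `needs_alert` is hypothetical, and it assumes the health text contains the token HEALTH_OK exactly when the cluster reports itself healthy, which is the same assumption the grep makes.

```python
def needs_alert(health_output: str) -> bool:
    """Return True when the health text does not report HEALTH_OK.

    Mirrors `grep HEALTH_OK > /dev/null || ...`: the alert branch runs
    only when the token is absent from the command output.
    """
    return "HEALTH_OK" not in health_output
```

With a check like this, a clock drift problem only triggers an alert if the health command itself stops returning HEALTH_OK, which is the point of this report.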

Updated by Joao Eduardo Luis over 11 years ago

'HEALTH_OK' and 'HEALTH_WARN' are assessed in a way that makes it non-trivial to extend the existing mechanism to take the clock drift messages into account. Still looking into a couple of options though.

Regarding "ceph -w" not showing the warning messages, that can easily be explained by the fact that an exponential backoff is applied to the drift warnings. So you'd see those warnings for (say) the first couple of occurrences, and then the frequency of the warnings would decrease. This would make noticing the warnings really difficult, no doubt, which makes the need to warn the user in some other way (e.g., on "ceph -s") really important.
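
To show why such warnings become easy to miss, here is a minimal sketch of an exponential backoff schedule. The function name, the 5 second starting interval and the doubling factor are assumptions chosen for illustration; they are not the monitor's actual parameters.

```python
def warning_times(first_interval: float, factor: float, horizon: float) -> list[float]:
    """Times (seconds since the drift began) at which a warning is emitted.

    The gap between consecutive warnings is multiplied by `factor`
    after every warning, so warnings thin out quickly.
    """
    times = []
    t = 0.0
    interval = first_interval
    while t <= horizon:
        times.append(t)
        t += interval
        interval *= factor  # back off: wait longer before the next warning
    return times


# Hypothetical parameters: start at 5 s, double each time, watch for one hour.
schedule = warning_times(5.0, 2.0, 3600.0)
print(len(schedule), schedule[-1])
```

Under these assumed parameters only ten warnings are emitted in the first hour, and the last one appears more than 40 minutes in, so someone who starts watching 'ceph -w' late could wait a long time before seeing one.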

Updated by Joao Eduardo Luis over 11 years ago

Status changed from In Progress to 4

wip-3633 now has a couple of patches that introduce a mechanism to keep track of clock skews on the monitors.

If severe, the clock skews will be reported on 'ceph health' and 'ceph status' with a HEALTH_WARN. 'ceph health detail' will also report which nodes are suffering from clock skews. With the latest patches, which have yet to be reviewed and assessed for going upstream, one will also be able to pass '--format json' to both commands and obtain detailed information on skews, whether they are severe or not.

These patches also allow us to keep track of the latency between the monitors.
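
The reporting behaviour described above can be sketched as follows. This is not the code from wip-3633: the function name `assess_skews`, the detail message wording and the 0.05 second default threshold are assumptions made for illustration.

```python
def assess_skews(skews: dict[str, float], allowed: float = 0.05) -> tuple[str, list[str]]:
    """Classify per-monitor clock skews (in seconds).

    Returns the overall health status plus one detail line for every
    monitor whose absolute skew exceeds `allowed` (assumed threshold).
    """
    details = [
        f"mon.{name} clock skew {skew:.3f}s exceeds {allowed:.3f}s"
        for name, skew in sorted(skews.items())
        if abs(skew) > allowed
    ]
    # Any severe skew downgrades the overall status to a warning.
    status = "HEALTH_WARN" if details else "HEALTH_OK"
    return status, details


# Hypothetical skews for three monitors named a, b and c.
status, details = assess_skews({"a": 0.0, "b": 0.2, "c": -0.01})
print(status, details)
```

Here the summary plays the role of 'ceph health', and the detail lines play the role of 'ceph health detail' naming the affected nodes.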

Updated by Corin Langosch over 11 years ago

Reading the patch, it looks like only the clocks of the mons are checked. So the clocks of the osds are not important to ceph?

Updated by Joao Eduardo Luis over 11 years ago

The objective here was to make sure that clock skews on the monitors were detected and reported, as said skews might affect the monitor's behavior.

Clocks are important for the osds as well. OSDs rely on clocks to, for instance, check if other osds failed. But that was fairly outside what we aimed for with these patches.

I'll look into whether having the monitors report clock skews on other ceph components besides the monitors themselves would be something we want, and open a different issue if it turns out to be the case.

Updated by Greg Farnum over 11 years ago

The OSD clocks are actually fairly unimportant. Everything they use that requires precise timing should be based entirely on local clocks (if there's evidence that is not the case, we have a bug!). If using authentication they do need to be sort of close to the monitors, as the auth keys rotate on an hourly basis (with a bit of overlap for previous, current, next keys).

Updated by Ian Colle over 11 years ago

Priority changed from High to Normal

Updated by Sage Weil over 11 years ago

Status changed from 4 to Resolved

310112f702d14294e6ba48f8af41a306288cba65
