Bug #3633

closed

mon: clock drift errors not reported by ceph status

Added by Corin Langosch over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: Bobtail
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Using argonaut 0.48.2. Today all ceph commands were randomly slow, so I checked all hosts: all monitors (3) and OSDs (17) were up and running, and ceph status and ceph -w were not reporting any errors. Digging a little deeper, I found in the logs of one monitor that it complained about excessive clock drift, which was caused by a crashed NTP server. After fixing this, everything worked fine again. I'd like to suggest emitting a warning from ceph status or ceph -w whenever there are clock drift errors. Returning HEALTH_OK is misleading when the cluster is in fact not fully working and randomly hangs for several seconds.


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #3695: monitor crashed after an upgrade in Monitor::timecheck (Resolved, Joao Eduardo Luis, 12/28/2012)

Actions #1

Updated by Ian Colle over 11 years ago

  • Assignee set to Joao Eduardo Luis
  • Priority changed from Normal to High
  • Backport set to Bobtail
Actions #2

Updated by Joao Eduardo Luis over 11 years ago

  • Subject changed from clock drift errors not reported by ceph status to mon: clock drift errors not reported by ceph status
  • Category set to Monitor
Actions #3

Updated by Joao Eduardo Luis over 11 years ago

  • Source changed from Development to Community (user)
Actions #4

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from New to In Progress

I'm looking into an adequate way to make 'ceph -s' return a warning when the clocks have drifted.

However, 'ceph -w' should have shown clock drifting warnings. Have you disabled 'clog_to_monitors'?

Actions #5

Updated by Corin Langosch over 11 years ago

Here's my config: http://pastie.org/5554031

I'm pretty sure there was no warning when I ran 'ceph -w', because I was really puzzled at first about why the cluster randomly hung, and I checked quite a lot of things. But I cannot say for sure now, as I didn't save the output :(.

To monitor ceph's status I have a cron job that runs 'ceph health details | grep HEALTH_OK > /dev/null || ceph health details' every few minutes, so I get an email alert whenever ceph is not healthy. The result should not be HEALTH_OK if there's any warning or error (clock drift included).
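
The cron check described above could be sketched in Python roughly as follows. This is only an illustration, not the reporter's actual script: the function names `needs_alert` and `check_cluster` are hypothetical, and it assumes the `ceph` CLI is on the PATH and that `ceph health detail` prints a first line beginning with `HEALTH_OK` when the cluster is healthy. Unlike the grep above, it inspects only the first non-empty line.

```python
import subprocess


def needs_alert(health_output: str) -> bool:
    """Return True unless the first non-empty line starts with HEALTH_OK."""
    for line in health_output.splitlines():
        if line.strip():
            return not line.strip().startswith("HEALTH_OK")
    # Empty output is treated as a failure to obtain any status.
    return True


def check_cluster() -> int:
    """Run `ceph health detail`; print its output if the cluster is not OK."""
    result = subprocess.run(
        ["ceph", "health", "detail"],
        capture_output=True, text=True, check=False,
    )
    output = result.stdout + result.stderr
    if result.returncode != 0 or needs_alert(output):
        print(output)  # cron mails any output to the job owner
        return 1
    return 0


if __name__ == "__main__":
    raise SystemExit(check_cluster())
```

A check like this can only alert on clock drift once the health status itself reflects it, which is the point of this report.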

Actions #6

Updated by Joao Eduardo Luis over 11 years ago

'HEALTH_OK' and 'HEALTH_WARN' are assessed in a way that makes it non-trivial to fold the clock drift messages into the existing health checks. Still looking into a couple of options though.

Regarding "ceph -w" not showing the warning messages, that is easily explained by the fact that an exponential backoff is applied to the drift warnings. So you'd see those warnings for (say) the first couple of occurrences, and then the frequency of the warnings would decrease. This would make noticing them really difficult, no doubt, which makes warning the user in some other way (e.g., on "ceph -s") really important.
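
To illustrate why backed-off warnings are easy to miss, here is a small Python sketch. It is not the monitor's actual code: the function `warning_times`, the 30-second base interval, and the doubling factor are assumptions chosen only for illustration.

```python
def warning_times(base_interval: float, factor: float, horizon: float) -> list:
    """Times (seconds) at which a warning is logged under exponential backoff.

    The first warning is emitted at t=0; the wait before each later
    warning is multiplied by `factor`.
    """
    times = []
    t = 0.0
    interval = base_interval
    while t <= horizon:
        times.append(t)
        t += interval
        interval *= factor
    return times


if __name__ == "__main__":
    # Assumed values: 30 s base interval, doubling each time, one-hour window.
    print(warning_times(30.0, 2.0, 3600.0))
```

With these assumed values only seven warnings appear in the first hour, and the gap before the next one exceeds 30 minutes, so someone who starts watching "ceph -w" late may see nothing for a long time.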

Actions #7

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from In Progress to 4

wip-3633 now has a couple of patches that introduce a mechanism to keep track of clock skews on the monitors.

If severe, the clock skews will be reported on 'ceph health' and 'ceph status' with a HEALTH_WARN. 'ceph health detail' will also report which nodes are suffering from clock skews. With the latest patches, which have yet to be reviewed and assessed for going upstream, one will also be able to pass '--format json' to both commands and obtain detailed information on skews, whether or not they are severe.
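
The reporting behaviour described above can be sketched as follows. This is a hypothetical model, not the code in wip-3633: the function names, the message wording, the JSON field names, and the threshold passed as `allowed` are all assumptions made for illustration.

```python
import json


def summarize_skews(skews: dict, allowed: float) -> tuple:
    """Classify per-monitor clock skews (in seconds) against a threshold.

    Only skews whose magnitude exceeds `allowed` count as severe and
    turn the overall status into HEALTH_WARN.
    """
    severe = {name: s for name, s in skews.items() if abs(s) > allowed}
    status = "HEALTH_WARN" if severe else "HEALTH_OK"
    detail = [
        f"mon.{name} clock skew {abs(s):.3f}s > max {allowed:.3f}s"
        for name, s in sorted(severe.items())
    ]
    return status, detail


def skews_as_json(skews: dict, allowed: float) -> str:
    """Report every skew, severe or not, in a machine-readable form."""
    status, _ = summarize_skews(skews, allowed)
    report = {
        "status": status,
        "mons": [
            {"name": name, "skew": s, "severe": abs(s) > allowed}
            for name, s in sorted(skews.items())
        ],
    }
    return json.dumps(report)
```

For example, `summarize_skews({"a": 0.0, "b": 0.2, "c": -0.01}, 0.05)` flags only `mon.b`, while `skews_as_json` with the same arguments lists all three monitors.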

These patches also allow us to keep track of the latency between the monitors.

Actions #8

Updated by Corin Langosch over 11 years ago

Reading the patch, it looks like only the clocks of the mons are checked. So the clocks of the osds are not important to ceph?

Actions #9

Updated by Joao Eduardo Luis over 11 years ago

The objective here was to make sure that clock skews on the monitors were detected and reported, as said skews might affect the monitor's behavior.

Clocks are important for the osds as well. OSDs rely on clocks to, for instance, check whether other osds have failed. But that was fairly far outside what we aimed for with these patches.

I'll look into whether having the monitors report clock skews on ceph components other than the monitors themselves is something we want, and open a different issue if it turns out to be the case.

Actions #10

Updated by Greg Farnum over 11 years ago

The OSD clocks are actually fairly unimportant. Everything they use that requires precise timing should be based entirely on local clocks (if there's evidence that is not the case, we have a bug!). If using authentication they do need to be sort of close to the monitors, as the auth keys rotate on an hourly basis (with a bit of overlap for previous, current, next keys).
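
The tolerance implied by keeping previous, current, and next keys can be pictured with a deliberately simplified Python model. This is not how the authentication code is implemented: the idea of numbering keys by whole hours, the names `key_generation` and `osd_key_accepted`, and the acceptance rule are assumptions for illustration only.

```python
ROTATION_PERIOD = 3600  # seconds; hourly rotation as described above


def key_generation(timestamp: float) -> int:
    """Index of the key that is current at `timestamp` (assumed numbering)."""
    return int(timestamp // ROTATION_PERIOD)


def osd_key_accepted(mon_time: float, osd_time: float) -> bool:
    """True if the key generation the OSD would pick by its own clock is
    one the monitor still holds (previous, current, or next)."""
    mon_gen = key_generation(mon_time)
    osd_gen = key_generation(osd_time)
    return abs(osd_gen - mon_gen) <= 1
```

In this model an OSD whose clock is a few minutes off still picks an acceptable key, while one that is hours off does not, which matches the "sort of close" requirement.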

Actions #11

Updated by Ian Colle over 11 years ago

  • Priority changed from High to Normal
Actions #12

Updated by Sage Weil over 11 years ago

  • Status changed from 4 to Resolved