Project

General

Profile

Actions

Bug #5652

closed

mon: occasional clock_skew test failure

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
High
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

{"summary":[{"severity":"HEALTH_WARN","summary":"24 pgs peering"}],"timechecks":{"epoch":8,"round":2,"round_status":"finished","mons":[{"name":"a","skew":"0.000000","latency":"0.000000","health":"HEALTH_OK"},{"name":"c","skew":"0.000000","latency":"0.051548","health":"HEALTH_OK"},{"name":"b","skew":"9.747621","late
ncy":"0.057928","health":"HEALTH_WARN","details":"clock skew 9.74762s > max 0.25s"},{"name":"d","skew":"0.000000","latency":"0.065936","health":"HEALTH_OK"},{"name":"f","skew":"0.000000","latency":"0.051489","health":"HEALTH_OK"},{"name":"e","skew":"-0.136700","latency":"0.057758","health":"HEALTH_OK"},{"name":"g",
"skew":"0.000000","latency":"0.088119","health":"HEALTH_OK"},{"name":"i","skew":"0.000000","latency":"0.068781","health":"HEALTH_OK"},{"name":"h","skew":"-0.119438","latency":"0.075049","health":"HEALTH_OK"},{"name":"j","skew":"0.000000","latency":"0.082959","health":"HEALTH_OK"},{"name":"l","skew":"0.000000","late
ncy":"0.068527","health":"HEALTH_OK"},{"name":"k","skew":"-0.120045","latency":"0.074799","health":"HEALTH_OK"},{"name":"m","skew":"0.000000","latency":"0.082559","health":"HEALTH_OK"},{"name":"o","skew":"0.000000","latency":"0.068464","health":"HEALTH_OK"},{"name":"n","skew":"-0.120137","latency":"0.074731","healt
h":"HEALTH_OK"},{"name":"p","skew":"0.000000","latency":"0.087807","health":"HEALTH_OK"},{"name":"r","skew":"0.000000","latency":"0.068401","health":"HEALTH_OK"},{"name":"q","skew":"-0.120376","latency":"0.074521","health":"HEALTH_OK"},{"name":"s","skew":"0.000000","latency":"0.082275","health":"HEALTH_OK"},{"name" 
:"u","skew":"0.000000","latency":"0.067723","health":"HEALTH_OK"},{"name":"t","skew":"-0.120706","latency":"0.073991","health":"HEALTH_OK"}]},"health":{"health_services":[{"mons":[]}]},"overall_status":"HEALTH_WARN","detail":["mon.b addr 10.214.133.37:6789\/0 clock skew 9.74762s > max 0.25s (latency 0.057928s)"]}

2013-07-17T09:16:36.201 INFO:teuthology.task.mon_clock_skew_check.mon_clock_skew_check:expected skew: mon.b with skew 9.747621 > max 0.05

job was
ubuntu@teuthology:/a/teuthology-2013-07-17_01:00:18-rados-next-testing-basic/70286$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 77c8bf2f972a9d6ff446c49a41678bf931bbee44
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: 884fa2fcb6d707b23317bab1da909586ddc27608
  ceph-deploy:
    conf:
      client:
        debug monc: 20
        debug ms: 1
        debug objecter: 20
        debug rados: 20
        log file: /var/log/ceph/ceph-..log
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 884fa2fcb6d707b23317bab1da909586ddc27608
  s3tests:
    branch: next
  workunit:
    sha1: 884fa2fcb6d707b23317bab1da909586ddc27608
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true
Actions #1

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from 12 to In Progress

The cause for the crash has been fixed on teuthology's wip-5652. It was caused by a missing key on a 'str'.format() on an assert.

However, this fix was only noticed as being needed because the test was going to fail on an assert. The assert in question is the one that checks if, for any given clock skew, the monitor with the skew is flagged with an appropriate health status (HEALTH_WARN). This is not the case for most of the monitors on that run.

First of all, none of the monitors in this case appear to be worried about their skews, as it appears that their mon_clock_drift_allowed is set to 0.25, instead of what I perceive as the default 0.05. The teuthology job's yaml isn't setting it, and afaict there's no place in the source setting it to 0.25 instead of 0.05. Nowhere is the value of mon_clock_drift_allowed changed, and all places that matter for assessing clock skews always use this config option (afaict still keeps the 0.05 default); thought we could have erroneously been relying on some twisted value resulting from the backoff config option, but that's not the case.

So, considering that the monitors are assuming their clocks will only be skewed over 0.25 instead of 0.05, the test will always fail unless the test's 'max-skew' option is set accordingly to 0.25 (default: 0.05).

Now, any idea on why or how we would end up with mon_clock_drift_allowed = 0.25 ? This is the biggest wtf and what is slightly scaring me about this bug. All my runs have been successfully finishing; I'm now hoping it will fail at some point so I can try figuring something with that (maybe check the cluster's config options, just to make sure it is magically setting 0.25 instead).

Actions #2

Updated by Joao Eduardo Luis almost 11 years ago

Okay, I accessed the cluster while one of my teuth jobs were running, and

ubuntu@mira051:~$ sudo ceph daemon mon.a config get mon_clock_drift_allowed
{ "mon_clock_drift_allowed": "0.25"}

So it is indeed set to 0.25. I checked out yet again the job's sha, and the option is however set to 0.05, and I don't think we're doing it via teuthology either.

Actions #3

Updated by Joao Eduardo Luis almost 11 years ago

Okay, there's a simple way to make the test independent from its 'max-skew' option, and that's to read the 'mon_clock_allowed_drift' option from the cluster in the beginning and act accordingly. This doesn't solve my current hunt for the 0.25 thingy, but should prevent the test from failing should these parameters be changed in the future.

Actions #4

Updated by Joao Eduardo Luis almost 11 years ago

Greg pointed me in the right direction and it looks like the 0.25 is coming from the base ceph.conf generate by teuthology:

joao@tardis:~/inktank/src/teuthology$ git grep -nFI 'mon clock'
teuthology/ceph.conf.template:8:        mon clock drift allowed = .250
Actions #5

Updated by Joao Eduardo Luis almost 11 years ago

pushed new patch to wip-5652 on teuthology

Actions #6

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from In Progress to Fix Under Review
Actions #7

Updated by Sage Weil almost 11 years ago

this still hard codes the defaults in teuthology.. a more robust way is to run

ceph-mon -i mon.$whatever --show-config-value config_option_name

and see what the mon is actually getting from (config + built-in defaults)

Actions #8

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF