Bug #5652
closedmon: occasional clock_skew test failure
0%
Description
{"summary":[{"severity":"HEALTH_WARN","summary":"24 pgs peering"}],"timechecks":{"epoch":8,"round":2,"round_status":"finished","mons":[{"name":"a","skew":"0.000000","latency":"0.000000","health":"HEALTH_OK"},{"name":"c","skew":"0.000000","latency":"0.051548","health":"HEALTH_OK"},{"name":"b","skew":"9.747621","late ncy":"0.057928","health":"HEALTH_WARN","details":"clock skew 9.74762s > max 0.25s"},{"name":"d","skew":"0.000000","latency":"0.065936","health":"HEALTH_OK"},{"name":"f","skew":"0.000000","latency":"0.051489","health":"HEALTH_OK"},{"name":"e","skew":"-0.136700","latency":"0.057758","health":"HEALTH_OK"},{"name":"g", "skew":"0.000000","latency":"0.088119","health":"HEALTH_OK"},{"name":"i","skew":"0.000000","latency":"0.068781","health":"HEALTH_OK"},{"name":"h","skew":"-0.119438","latency":"0.075049","health":"HEALTH_OK"},{"name":"j","skew":"0.000000","latency":"0.082959","health":"HEALTH_OK"},{"name":"l","skew":"0.000000","late ncy":"0.068527","health":"HEALTH_OK"},{"name":"k","skew":"-0.120045","latency":"0.074799","health":"HEALTH_OK"},{"name":"m","skew":"0.000000","latency":"0.082559","health":"HEALTH_OK"},{"name":"o","skew":"0.000000","latency":"0.068464","health":"HEALTH_OK"},{"name":"n","skew":"-0.120137","latency":"0.074731","healt h":"HEALTH_OK"},{"name":"p","skew":"0.000000","latency":"0.087807","health":"HEALTH_OK"},{"name":"r","skew":"0.000000","latency":"0.068401","health":"HEALTH_OK"},{"name":"q","skew":"-0.120376","latency":"0.074521","health":"HEALTH_OK"},{"name":"s","skew":"0.000000","latency":"0.082275","health":"HEALTH_OK"},{"name" :"u","skew":"0.000000","latency":"0.067723","health":"HEALTH_OK"},{"name":"t","skew":"-0.120706","latency":"0.073991","health":"HEALTH_OK"}]},"health":{"health_services":[{"mons":[]}]},"overall_status":"HEALTH_WARN","detail":["mon.b addr 10.214.133.37:6789\/0 clock skew 9.74762s > max 0.25s (latency 0.057928s)"]} 2013-07-17T09:16:36.201 INFO:teuthology.task.mon_clock_skew_check.mon_clock_skew_check:expected skew: mon.b with skew 9.747621 > max 0.05
job was
ubuntu@teuthology:/a/teuthology-2013-07-17_01:00:18-rados-next-testing-basic/70286$ cat orig.config.yaml kernel: kdb: true sha1: 77c8bf2f972a9d6ff446c49a41678bf931bbee44 machine_type: plana nuke-on-error: true overrides: admin_socket: branch: next ceph: conf: global: ms inject socket failures: 5000 mon: debug mon: 20 debug ms: 20 debug paxos: 20 mon.b: clock offset: 10 log-whitelist: - slow request sha1: 884fa2fcb6d707b23317bab1da909586ddc27608 ceph-deploy: conf: client: debug monc: 20 debug ms: 1 debug objecter: 20 debug rados: 20 log file: /var/log/ceph/ceph-..log mon: debug mon: 20 debug ms: 20 debug paxos: 20 install: ceph: sha1: 884fa2fcb6d707b23317bab1da909586ddc27608 s3tests: branch: next workunit: sha1: 884fa2fcb6d707b23317bab1da909586ddc27608 roles: - - mon.a - mon.d - mon.g - mon.j - mon.m - mon.p - mon.s - osd.0 - - mon.b - mon.e - mon.h - mon.k - mon.n - mon.q - mon.t - mds.a - - mon.c - mon.f - mon.i - mon.l - mon.o - mon.r - mon.u - osd.1 tasks: - chef: null - clock.check: null - install: null - ceph: log-whitelist: - slow request - .*clock.*skew.* - clocks not synchronized wait-for-healthy: false - mon_clock_skew_check: expect-skew: true
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from 12 to In Progress
The cause for the crash has been fixed on teuthology's wip-5652. It was caused by a missing key on a 'str'.format() on an assert.
However, this fix was only noticed as being needed because the test was going to fail on an assert. The assert in question is the one that checks if, for any given clock skew, the monitor with the skew is flagged with an appropriate health status (HEALTH_WARN). This is not the case for most of the monitors on that run.
First of all, none of the monitors in this case appear to be worried about their skews, as it appears that their mon_clock_drift_allowed is set to 0.25, instead of what I perceive as the default 0.05. The teuthology job's yaml isn't setting it, and afaict there's no place in the source setting it to 0.25 instead of 0.05. Nowhere is the value of mon_clock_drift_allowed changed, and all places that matter for assessing clock skews always use this config option (afaict still keeps the 0.05 default); thought we could have erroneously been relying on some twisted value resulting from the backoff config option, but that's not the case.
So, considering that the monitors are assuming their clocks will only be skewed over 0.25 instead of 0.05, the test will always fail unless the test's 'max-skew' option is set accordingly to 0.25 (default: 0.05).
Now, any idea on why or how we would end up with mon_clock_drift_allowed = 0.25 ? This is the biggest wtf and what is slightly scaring me about this bug. All my runs have been successfully finishing; I'm now hoping it will fail at some point so I can try figuring something with that (maybe check the cluster's config options, just to make sure it is magically setting 0.25 instead).
Updated by Joao Eduardo Luis almost 11 years ago
Okay, I accessed the cluster while one of my teuth jobs were running, and
ubuntu@mira051:~$ sudo ceph daemon mon.a config get mon_clock_drift_allowed { "mon_clock_drift_allowed": "0.25"}
So it is indeed set to 0.25. I checked out yet again the job's sha, and the option is however set to 0.05, and I don't think we're doing it via teuthology either.
Updated by Joao Eduardo Luis almost 11 years ago
Okay, there's a simple way to make the test independent from its 'max-skew' option, and that's to read the 'mon_clock_allowed_drift' option from the cluster in the beginning and act accordingly. This doesn't solve my current hunt for the 0.25 thingy, but should prevent the test from failing should these parameters be changed in the future.
Updated by Joao Eduardo Luis almost 11 years ago
Greg pointed me in the right direction and it looks like the 0.25 is coming from the base ceph.conf generate by teuthology:
joao@tardis:~/inktank/src/teuthology$ git grep -nFI 'mon clock' teuthology/ceph.conf.template:8: mon clock drift allowed = .250
Updated by Joao Eduardo Luis almost 11 years ago
pushed new patch to wip-5652 on teuthology
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil almost 11 years ago
this still hard codes the defaults in teuthology.. a more robust way is to run
ceph-mon -i mon.$whatever --show-config-value config_option_name
and see what the mon is actually getting from (config + built-in defaults)
Updated by Sage Weil almost 11 years ago
- Status changed from Fix Under Review to Resolved