Bug #854 (closed): unsynchronized clocks between kernel-client/cmds cause PJD fstest failures

Added by Brian Chrisman about 13 years ago. Updated about 10 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

I'm seeing a varying number (generally 5-8) of POSIX tests in the PJD fstest suite fail when the tests are run on a node (atop the ceph kernel client) whose clock is not synchronized with the node hosting the active MDS.
Synchronizing the clocks across the cluster with ntpdate/xntpd returns the PJD fstests to a full pass.
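
For illustration, here's a minimal Python sketch (hypothetical mount path; the real PJD tests are shell/TAP scripts, not this code) of the kind of ctime-window check that breaks when the node stamping the metadata operation runs on a different clock than the test node:

import os
import time

path = "/mnt/ceph/pjd-testfile"        # hypothetical cephfs mount path
link = path + ".symlink"
os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o644))
os.symlink(path, link)

before = time.time()                   # local clock on the test node
os.lchown(link, os.getuid(), os.getgid())
after = time.time()

ctime = os.lstat(link).st_ctime        # stamped by whichever node handled the op
# With unsynchronized clocks this window check fails even though lchown worked.
assert before - 1 <= ctime <= after + 1, (
    "ctime %.2f outside [%.2f, %.2f]; clocks likely skewed" % (ctime, before, after))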

This is not likely critical because:
- the failures are small corner cases (e.g. an unexpected ctime after an operation like lchown of a symlink)
- the workaround of keeping the clocks synchronized is reasonable
- it may be a known design issue (the MDS generating its own timestamp instead of using the client's?)

Here's a histogram of the test failures (out of 21 repeated runs), in the format:
(count) (test number):(status):[(line number) (test filename)]

19 102:fail:[218 /opt/scale/lib/pjdfstests/tests/chown/00.t]
21 112:fail:[236 /opt/scale/lib/pjdfstests/tests/chown/00.t]
13 141:fail:[287 /opt/scale/lib/pjdfstests/tests/chown/00.t]
17 145:fail:[302 /opt/scale/lib/pjdfstests/tests/chown/00.t]
21 153:fail:[332 /opt/scale/lib/pjdfstests/tests/chown/00.t]
13 27:fail:[70 /opt/scale/lib/pjdfstests/tests/chmod/00.t]
18 31:fail:[78 /opt/scale/lib/pjdfstests/tests/chmod/00.t]
11 97:fail:[209 /opt/scale/lib/pjdfstests/tests/chown/00.t]

Different tests fail in different runs; a few (21 out of 21) fail consistently.

Actions #1

Updated by Greg Farnum about 13 years ago

Ah, that makes sense. This is something we're unlikely to fix -- currently a lot of operations occur "on" the MDS (renames, creates, etc.), so sending a client time along for those wouldn't make much sense. But we need to refer to both the kernel client's time and the MDS's time, since other operations (like inode data changes) occur "on" the kernel client and are reported to the MDS in batches.
But maybe some brilliant idea will come up in the future!
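
To make that split concrete, here is a hedged sketch (illustrative names only, not Ceph internals) of how a lagging MDS clock can make an MDS-stamped metadata operation appear to pre-date a client-stamped change that actually happened earlier:

import time

MDS_SKEW = -90.0                 # hypothetical: the MDS clock lags the client by 90s

def client_time():
    return time.time()

def mds_time():
    return time.time() + MDS_SKEW

# A buffered change made "on" the client is stamped with the client's clock...
write_mtime = client_time()      # e.g. an inode data change, flushed to the MDS later

time.sleep(1)

# ...while an operation handled "on" the MDS (rename, create, ...) is stamped
# with the MDS clock, so it can appear to pre-date the earlier client change.
rename_ctime = mds_time()

print("write at %.3f, later rename at %.3f" % (write_mtime, rename_ctime))
print("rename appears %s the write" % ("after" if rename_ctime > write_mtime else "before"))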

Actions #2

Updated by Sage Weil about 13 years ago

The only reasonably sane idea I have here is for the client/mds to compare clocks to estimate skew and have some sort of auto-adjustment going on. It's hard to say what that adjustment should be, though. Maybe just periodically spamming the console when the skew is significant is the thing to do.
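
As a rough illustration of that idea (a sketch only, not Ceph code; send_request and the threshold are made up), an NTP-style four-timestamp exchange would let the client estimate its offset from the MDS and complain when it gets large:

import logging
import time

WARN_SKEW_SECS = 5.0             # hypothetical threshold for console warnings

def estimate_mds_skew(send_request, clock=time.time):
    """send_request() performs one round trip to the MDS and returns (t1, t2):
    the MDS's receive and reply timestamps, taken on the MDS's clock."""
    t0 = clock()                 # client transmit time
    t1, t2 = send_request()      # MDS receive / MDS reply (remote clock)
    t3 = clock()                 # client receive time
    # Standard NTP offset estimate; positive means the MDS clock runs ahead.
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    if abs(offset) > WARN_SKEW_SECS:
        logging.warning("clock skew vs MDS ~%.1fs; check ntp/chrony", offset)
    return offset

Piggybacking a periodic check like this on existing client/MDS traffic would be enough for the "spam the console when the skew is significant" variant.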

Actions #3

Updated by Sage Weil over 12 years ago

  • Target version set to 52
Actions #4

Updated by Sage Weil over 12 years ago

  • Target version deleted (52)
Actions #5

Updated by Ian Colle about 11 years ago

  • Project changed from Ceph to CephFS
Actions #6

Updated by Greg Farnum about 10 years ago

  • Status changed from New to Duplicate

I'm closing this in favor of ticket #7564.
