Project

General

Profile

Actions

Bug #9514

closed

ceph-fuse pjd test is failing in giant nightlies

Added by Greg Farnum over 9 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
firefly, dumpling
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

commit:0ea20a668cf859881c49b33d1b6db4e636eda18a

http://qa-proxy.ceph.com/teuthology/sage-2014-09-14_18:23:49-smoke-master-distro-basic-multi/483776/

2014-09-14T18:28:10.240 INFO:tasks.workunit.client.0.plana72.stdout:Test Summary Report
2014-09-14T18:28:10.240 INFO:tasks.workunit.client.0.plana72.stdout:-------------------
2014-09-14T18:28:10.240 INFO:tasks.workunit.client.0.plana72.stdout:../pjd-fstest-20090130-RC/tests/chmod/00.t   (Wstat: 0 Tests: 58 Failed: 1)
2014-09-14T18:28:10.240 INFO:tasks.workunit.client.0.plana72.stdout:  Failed test:  27
2014-09-14T18:28:10.240 INFO:tasks.workunit.client.0.plana72.stdout:../pjd-fstest-20090130-RC/tests/chown/00.t   (Wstat: 0 Tests: 171 Failed: 2)
2014-09-14T18:28:10.241 INFO:tasks.workunit.client.0.plana72.stdout:  Failed tests:  97, 141
2014-09-14T18:28:10.241 INFO:tasks.workunit.client.0.plana72.stdout:Files=191, Tests=1964, 105 wallclock secs ( 4.39 usr  1.45 sys + 14.38 cusr 13.39 csys = 33.61 CPU)
2014-09-14T18:28:10.241 INFO:tasks.workunit.client.0.plana72.stdout:Result: FAIL
2014-09-14T18:28:10.241 INFO:tasks.workunit:Stopping ['suites/pjd.sh'] on client.0...

This is new to me, but it's happening in most of the smoke tests. There is presently no logging to examine, but I've turned it up for future runs.

Actions #1

Updated by Greg Farnum over 9 years ago

  • Assignee deleted (Greg Farnum)

This hasn't reproduced since we turned on debug logging. :(
But I did see it on a run without any logging: /a/gregf-2014-09-25_14:26:35-smoke-wip-9008-distro-basic-multi/511561

I spent some time today looking through the code to see if I could find anything just by review. The client-side all looks good to me. It looks like when it flushes time changes to the MDS via cap updates that it's doing so correctly. I'm a little less certain of the direct MDS requests; Server::handle_client_setattr() is unconditionally setting the ctime but isn't grabbing the versionlock it takes when CEPH_SETATTR_CTIME is set. But that code path shouldn't even be in use for these (single-client) tests, so it's definitely not the problem here. :/

Dropping this back in the pool at least until we see this again with some debug logs.

Actions #2

Updated by Greg Farnum over 9 years ago

  • Status changed from New to In Progress

Hah, we got the failure with logs in /a/sage-2014-09-26_17:51:11-smoke-giant-distro-basic-multi/513914

All of the failures are indicating that the ctime has not changed, but the test thinks it should have.

The MDS has four client requests on the inode in question:

2014-09-26 19:22:05.862197 7f80a8f31700  1 -- 10.214.131.19:6810/1042 <== client.4117 10.214.131.31:0/18597 719 ==== client_request(client.4117:447 create #100000000d9/fstest_f5514f5c3274f2d50a688ce2d8ffac4a 2014-09-26 19:22:05.861867) v2 ==== 205+0+0 (201089997 0 0) 0x4bab780 con 0x454c000
2014-09-26 19:22:06.889802 7f80a8f31700  1 -- 10.214.131.19:6810/1042 <== client.4117 10.214.131.31:0/18597 796 ==== client_request(client.4117:448 rmxattr #100000000df security.ima 2014-09-26 19:22:06.889460) v2 ==== 134+0+0 (2238072425 0 0) 0x4babc80 con 0x454c000
2014-09-26 19:22:06.906343 7f80a8f31700  1 -- 10.214.131.19:6810/1042 <== client.4117 10.214.131.31:0/18597 798 ==== client_request(client.4117:449 getattr pAsLsXsFs #100000000df 2014-09-26 19:22:06.906069) v2 ==== 122+0+0 (1147094891 0 0) 0x4b9cc80 con 0x454c000
2014-09-26 19:22:07.257900 7f80a8f31700  1 -- 10.214.131.19:6810/1042 <== client.4117 10.214.131.31:0/18597 861 ==== client_request(client.4117:450 unlink #100000000d9/fstest_f5514f5c3274f2d50a688ce2d8ffac4a 2014-09-26 19:22:07.219382) v2 ==== 249+0+0 (3681911531 0 0) 0x4c0a000 con 0x454c000

2014-09-26 19:22:05.861936 7eff62422780  1 -- 10.214.131.31:0/18597 --> 10.214.131.19:6810/1042 -- client_request(client.4117:447 create #100000000d9/fstest_f5514f5c3274f2d50a688ce2d8ffac4a 2014-09-26 19:22:05.861867) v2 -- ?+0 0x3db7930 con 0x3c96f40
2014-09-26 19:22:06.889525 7eff62422780  1 -- 10.214.131.31:0/18597 --> 10.214.131.19:6810/1042 -- client_request(client.4117:448 rmxattr #100000000df security.ima 2014-09-26 19:22:06.889460) v2 -- ?+0 0x3ef4e60 con 0x3c96f40
2014-09-26 19:22:06.906133 7eff62422780  1 -- 10.214.131.31:0/18597 --> 10.214.131.19:6810/1042 -- client_request(client.4117:449 getattr pAsLsXsFs #100000000df 2014-09-26 19:22:06.906069) v2 -- ?+0 0x3d76ea0 con 0x3c96f40
2014-09-26 19:22:07.219473 7eff62422780  1 -- 10.214.131.31:0/18597 --> 10.214.131.19:6810/1042 -- client_request(client.4117:450 unlink #100000000d9/fstest_f5514f5c3274f2d50a688ce2d8ffac4a 2014-09-26 19:22:07.219382) v2 -- ?+0 0x3f3ae60 con 0x3c96f40

I think that getattr is the one that pjd expects to be returning a different ctime and doesn't.

The only cap update between the client and MDS is when the MDS revokes caps following the rmattr and then re-grants them.

But notice these outputs from the client:

2014-09-26 19:22:06.889336 7eff62422780 10 client.4117 fill_stat on 100000000df snap/devhead mode 0100644 mtime 2014-09-26 19:22:05.861867 ctime 2014-09-26 19:22:05.861867
2014-09-26 19:22:06.889416 7eff62422780 10 client.4117 fill_stat on 100000000df snap/devhead mode 0100111 mtime 2014-09-26 19:22:05.861867 ctime 2014-09-26 19:22:06.889397
2014-09-26 19:22:06.906018 7eff62422780 10 client.4117 fill_stat on 100000000df snap/devhead mode 0100111 mtime 2014-09-26 19:22:05.861867 ctime 2014-09-26 19:22:06.889397
2014-09-26 19:22:07.206693 7eff62422780 10 client.4117 fill_stat on 100000000df snap/devhead mode 0100111 mtime 2014-09-26 19:22:05.861867 ctime 2014-09-26 19:22:05.861867

I think that must be because we're blindly taking a ctime from the MDS, and either failed to flush it to the MDS correctly or are having the MDS incorrectly ignore the update. Trying to figure that out now.

Actions #3

Updated by Greg Farnum over 9 years ago

So here's a question: why does the client (temporarily) remember its ctime as being 2014-09-26 19:22:06.889397, but nobody ever uses (the later) 2014-09-26 19:22:06.889460 that is specified by the rmxattr op_stamp?
Answer: ENODATA means we don't update the inode. Good.

But that means the rmattr cap revoke is a bit weird...and as I look at it a bit more, it looks like it's not updating the ctime for some reason

2014-09-26 19:22:06.891702 7eff567fc700  1 -- 10.214.131.31:0/18597 --> 10.214.131.19:6810/1042 -- client_caps(update ino 100000000df 225 seq 2 tid 1 caps=pAsLsFs dirty=Ax wanted=- follows 1 size 0/0 ts 1 mtime 2014-09-26 19:22:05.861867) v4 -- ?+0 0x7eff440d41d0 con 0x3c96f40

follows the change in ctime to 6.88....
But when it arrives on the MDS and is processed
2014-09-26 19:22:06.891963 7f80a8f31700  1 -- 10.214.131.19:6810/1042 <== client.4117 10.214.131.31:0/18597 797 ==== client_caps(update ino 100000000df 225 seq 2 tid 1 caps=pAsLsFs dirty=Ax wanted=- follows 1 size 0/0 ts 1 mtime 2014-09-26 19:22:05.861867) v4 ==== 192+0+0 (1163796950 0 0) 0x4e35800 con 0x454c000
2014-09-26 19:22:06.892003 7f80a8f31700  7 mds.0.locker handle_client_caps on 100000000df follows 1 op update

there is definitely not a ctime update (which would produce output). And that is because the only dirty cap is Ax (that is, the exclusive auth cap), but _update_cap_fields() only modifies the ctime (and a bunch of other stuff) if either the FILE_WR or FILE_EXCL caps are dirtied...which they here are obviously not.
So the client goes backwards in time because it loses enough caps that it can't trust its ctime, and the MDS never updated its view.
The client will dirty the FILE_EXCL cap (assuming it has it) when changin the mtime or atime, but not the ctime. (see _setattr())

Actions #4

Updated by Greg Farnum over 9 years ago

  • Category set to 47
  • Status changed from In Progress to 7
  • Assignee set to Greg Farnum
Actions #5

Updated by Greg Farnum over 9 years ago

  • Status changed from 7 to Pending Backport

In giant as commit:0ea20a668cf859881c49b33d1b6db4e636eda18a.

Needs to go to firefly as well.

Actions #6

Updated by Greg Farnum over 9 years ago

  • Backport set to firefly, dumpling
Actions #7

Updated by Greg Farnum over 9 years ago

  • Status changed from Pending Backport to Resolved

Dumpling commit:5f601f099be98c2b061cc94fb06917e7543f3efe
Firefly commit:9fee8de25ab5c155cd6a3d32a71e45630a5ded15

Actions #8

Updated by Loïc Dachary about 9 years ago

  • Description updated (diff)
Actions #9

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added
Actions

Also available in: Atom PDF