Bug #1063

closed

dbench breaks if MDS and client times aren't synced

Added by Anonymous almost 13 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

http://autotest.ceph.newdream.net/afe/#tab_id=view_job&object_id=554

one mds, one osd, cfuse

dbench never completes; it just keeps saying "cleanup NNN sec" with increasing numbers. Manually poking around on the node hangs as soon as you touch the mountpoint.

"ceph health" and "ceph -s" looked normal.

Feel free to clone the above autotest job to get a running instance of this; it seems to fail reliably.
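
Since the eventual diagnosis points at unsynced clocks (see the subject and comment #12 below), a quick check of the MDS-vs-client clock offset before cloning the job can save a run. A minimal sketch, assuming passwordless ssh to both nodes; the hostnames are placeholders, not the actual sepia machines:

    # Rough MDS-vs-client clock-offset check. Hostnames are placeholders and
    # ssh round-trip latency is ignored, so treat the result as approximate
    # (still good enough to catch multi-second skew).
    import subprocess

    def remote_epoch(host):
        """Return the remote host's clock as epoch seconds via ssh + date."""
        out = subprocess.check_output(["ssh", host, "date", "+%s"])
        return int(out.strip())

    mds_host, client_host = "mds.example.com", "client.example.com"
    offset = remote_epoch(mds_host) - remote_epoch(client_host)
    print("MDS - client clock offset: %d sec" % offset)
    if abs(offset) > 1:
        print("warning: clocks are skewed; dbench cleanup may hang (this bug)")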

Actions #1

Updated by Sage Weil almost 13 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to High

This is probably a kclient thing; testing against the latest for-linus branch.

Actions #2

Updated by Anonymous almost 13 years ago

Note that the test I ran was on cfuse (most likely because I had kclient trouble, and wanted to isolate that out).

I just cloned the job to http://autotest.ceph.newdream.net/afe/#tab_id=view_job&object_id=555 and that one uses kclient. Results in ~20 minutes.

Actions #3

Updated by Anonymous almost 13 years ago

Job 555 broke, here's a re-run: http://autotest.ceph.newdream.net/afe/#tab_id=view_job&object_id=556

And that confirms that kclient works just fine.

Though it's interesting to see that the "cleanup" phase of dbench took 606 seconds, about 10 minutes. On cfuse, I aborted after 744 seconds of "cleanup", after seeing an ls hang. So maybe both clients are problematic and cfuse is just way slower. I'll rerun cfuse and let it sit for a long time.

kclient dbench log:
http://autotest.ceph.newdream.net/results/556-tv/group0/sepia60.ceph.dreamhost.com/debug/client.0.DEBUG

cfuse dbench log:
http://autotest.ceph.newdream.net/results/554-tv/group0/sepia63.ceph.dreamhost.com/ceph_dbench.cluster0/debug/ceph_dbench.cluster0.DEBUG

Actions #4

Updated by Anonymous almost 13 years ago

My bad, the cleanup phase starts at 600 seconds, so kclient only had a few seconds of cleanup.

The cfuse re-run is at http://autotest.ceph.newdream.net/afe/#tab_id=view_job&object_id=560

Actions #5

Updated by Sage Weil almost 13 years ago

  • Assignee changed from Sage Weil to Greg Farnum
  • Target version set to v0.27.1
Actions #6

Updated by Anonymous almost 13 years ago

Job 560 has spent 1.5 hours in cleanup now, aborting.

14:40:26 DEBUG| [stdout] 2 19 0.00 MB/sec cleanup 6288 sec

http://autotest.ceph.newdream.net/results/560-tv/group0/sepia63.ceph.dreamhost.com/debug/client.0.log

Actions #7

Updated by Greg Farnum almost 13 years ago

I'm unable to reproduce this on my own machine, and after looking through the MDS logs from autotest everything looks good. I'll need a debug log from the client to diagnose this, but it's not clear to me where or how to turn that on with our autotest system.
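
For reference, one way to get a client-side debug log is to raise the client and messenger debug levels in ceph.conf before the daemons and cfuse start. A minimal sketch, assuming the conf lives at /etc/ceph/ceph.conf and that the autotest harness doesn't regenerate it:

    # Append client debug settings to ceph.conf ahead of the run.
    # "debug client = 20" and "debug ms = 1" are standard Ceph debug levels;
    # the conf path is an assumption about this particular setup.
    CEPH_CONF = "/etc/ceph/ceph.conf"

    snippet = """
    [client]
        debug client = 20
        debug ms = 1
    """

    with open(CEPH_CONF, "a") as conf:
        conf.write(snippet)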

Actions #8

Updated by Greg Farnum almost 13 years ago

  • Status changed from New to In Progress

Ran this with client debugging enabled (job 573). I'm not certain this is the problem, but it looks like the client is marking a few inodes' caps dirty and then never sending them to the MDS. I certainly can't find anything else wrong! (No hanging requests, no waiting for max_size to get updated...)

Though I don't know why keeping some caps dirty would break anything either. :/
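
To make the observation concrete: a local metadata change (mtime, size, ...) marks the inode's caps dirty, and the dirty state should only be cleared once a cap update actually reaches the MDS. The toy model below is a sketch of that contract with made-up names, not the real uclient code; if the flush step never runs for an inode, the MDS never hears about the change.

    # Toy model of the dirty-caps contract (hypothetical names, not Ceph code).
    dirty_inodes = {}                       # ino -> set of dirty cap bits

    def mark_caps_dirty(ino, what):
        # A local metadata change: remember what changed for this inode.
        dirty_inodes.setdefault(ino, set()).add(what)

    def flush_dirty_caps(send_cap_update):
        # The step that apparently never happened for a few inodes: tell the
        # MDS about each dirty inode, then clear its dirty state.
        for ino, caps in list(dirty_inodes.items()):
            send_cap_update(ino, caps)
            del dirty_inodes[ino]

    mark_caps_dirty(0x1000, "mtime")
    # If flush_dirty_caps() is never called, the MDS never sees the new mtime:
    flush_dirty_caps(lambda ino, caps: print("inode %#x -> MDS: %s" % (ino, sorted(caps))))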

Actions #9

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.27.1 to v0.29
  • Position set to 4
Actions #10

Updated by Sage Weil almost 13 years ago

  • Story points set to 2
  • Position deleted (4)
  • Position set to 4
Actions #11

Updated by Greg Farnum almost 13 years ago

Okay, hopefully we can rerun this with a time-synced cluster soon and see if that's what is causing the breakage.

I'm seeing a few problems here as well, but not a root cause that would explain what's starting the client down the wrong path in the first place. Most specifically (see the sketch after this list):
  • It's possible to go into check_caps with an inode popped off the delayed_caps list and then not send the cap update to the MDS. This effectively "loses" the dirtied cap.
  • There are other things in those checks that scare me. Nothing looks at dirty_caps until you're already in the process of sending, and while the checks may be valid if all the invariants hold, they're obviously fragile.
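
A sketch of the first problem and the shape of a fix, in simplified Python rather than the real C++ client (all names here are illustrative):

    # Simplified illustration of the "lost dirty cap" path described above.
    delayed_caps = []                # inodes queued for a later check_caps pass

    class Inode(object):
        def __init__(self, ino, dirty):
            self.ino, self.dirty_caps = ino, set(dirty)

    def checks_say_send(inode):
        # Stand-in for the fragile checks discussed above (wanted caps,
        # flushing state, ...); the details don't matter for the control flow.
        return False

    def send_cap_update(ino, caps):
        print("MDS <- cap update for inode %#x: %s" % (ino, sorted(caps)))

    def check_caps(inode):
        # The caller has already popped `inode` off delayed_caps.
        if not inode.dirty_caps:
            return
        if not checks_say_send(inode):
            # Bug shape: a bare `return` here strands the dirty caps forever.
            # Fix shape: re-queue the inode so a later pass retries the flush.
            delayed_caps.append(inode)
            return
        send_cap_update(inode.ino, inode.dirty_caps)
        inode.dirty_caps.clear()

    delayed_caps.append(Inode(0x1000, {"mtime"}))
    check_caps(delayed_caps.pop())   # with the fix, the inode gets re-queued
    print("still queued:", [hex(i.ino) for i in delayed_caps])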

Actions #12

Updated by Greg Farnum almost 13 years ago

  • Subject changed from ceph_dbench autotest never completes to dbench breaks if MDS and client times aren't synced

Well, job 576 completed successfully after TV time-synced the cluster. It looks like bad mtimes are somehow causing the problem and the uclient is sensitive to them.
I've got that possible fix in my tree, but I've discovered a regression in fsstress whose cause I'd like to track down before I push it, in case they're related.
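
As an illustration of why a future-skewed MDS clock could bite the userspace client (this is a guess at the failure class, not the actual uclient logic): anything that compares an MDS-stamped time against the client's local clock can misbehave when the MDS is ahead, e.g. a "has this inode settled?" style check that never becomes true until the skew has elapsed.

    import time

    # Hypothetical staleness check, illustrating the failure class only.
    def inode_is_settled(mds_mtime, quiet_period=5.0):
        # "Settled" means no metadata change for quiet_period seconds. If the
        # MDS clock is minutes ahead of the client, mds_mtime sits in the
        # client's future and this stays False for the length of the skew.
        return time.time() - mds_mtime >= quiet_period

    skewed_mtime = time.time() + 600         # mtime stamped by an MDS 10 min ahead
    print(inode_is_settled(skewed_mtime))    # False, and stays False for ~10 min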

Actions #13

Updated by Greg Farnum almost 13 years ago

  • Category set to 11
Actions #14

Updated by Greg Farnum almost 13 years ago

On the other hand, adding a clock skew option and setting the MDS into the future doesn't let me reproduce the brokenness when running locally.

Actions #15

Updated by Greg Farnum almost 13 years ago

Scratch that, I did manage to reproduce locally. It just took a bit longer.

Actions #16

Updated by Greg Farnum almost 13 years ago

  • Status changed from In Progress to Can't reproduce

I won't be surprised if this comes back again, but I can't reproduce it, and there have been several fixes for client caps, etc. in the meantime.

Actions #17

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (11)
  • Target version deleted (v0.29)

Bulk updating project=ceph category=ceph-fuse issues to move to fs project so that we can remove the ceph-fuse category from the ceph project
