Bug #3637

client: not issuing caps for clients doing shared writes

Added by Sam Lang over 11 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With 3 clients running ceph-fuse, run the ior command:

/tmp/cephtest/binary/usr/local/bin/ior -e -w -r -W -b 10m -a POSIX -o /tmp/cephtest/gmnt/ior.testfile

This is shared file access. All three clients hang, but it looks like two clients successfully stat the file and then wait on a barrier, while the third is hung on the stat.
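
For context, the access pattern being exercised is roughly the following (a minimal stand-alone sketch of the shared-write pattern, not ior's actual code; the path and block size just mirror the command line above):

// Minimal sketch of the shared-file pattern ior exercises here (illustrative only).
// Each of the 3 clients runs this with its own rank (0..2) against the same file.
#include <cstdlib>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

int main(int argc, char **argv) {
    const size_t block = 10 * 1024 * 1024;                 // -b 10m
    const int rank = argc > 1 ? std::atoi(argv[1]) : 0;    // this client's index
    std::vector<char> buf(block, 'x');

    // All clients open the *same* file and write disjoint 10m blocks (-w, -e).
    int fd = open("/tmp/cephtest/gmnt/ior.testfile", O_CREAT | O_RDWR, 0644);
    pwrite(fd, buf.data(), block, (off_t)rank * block);
    fsync(fd);

    // ior stats the file to check the aggregate size before the read-back
    // phase (-r, -W); this is the stat that hangs for one of the clients.
    struct stat st;
    fstat(fd, &st);

    pread(fd, buf.data(), block, (off_t)rank * block);
    close(fd);
    return 0;
}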

full3.logs (1.58 MB) Sam Lang, 12/18/2012 08:59 PM

Associated revisions

Revision 9c18fd67 (diff)
Added by Greg Farnum almost 11 years ago

mds: Locker needs to remember requested max_size changes from clients

Previously, if we received an MClientCaps request containing a change
in the inode's max size, and _do_cap_update() was unable to process
the request immediately (due to a locking issue), we would wait-list
the request by adding a call to check_inode_max_size() once the lock
became stable. However, we then tossed out the message without in any
way propagating the new max size which had been requested!

Handle this by extending check_inode_max_size to also accept parameters
for increasing the max size, and by storing all the parameters explicitly
in the C_MDL_CheckMaxSize Context instead of relying on defaults. That
gets us to the point where we can notice we need to increase the max. To
actually do so, we now pass calc_new_client_ranges() the requested max
size instead of the actual size if we're doing an update.

Notice that as a side effect of this, all clients get to see the max size
increase instead of just the requester. This should be okay, but it is
chattier than in the optimal case (where we don't get stuck on a lock).

Fixes #3637

Signed-off-by: Greg Farnum <>
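
As a rough, stand-alone illustration of the shape of this fix (simplified stand-in types, not the actual Locker / C_MDL_CheckMaxSize code): the waiter queued while the lock is unstable has to carry the client's requested max size, otherwise the request is lost.

// Stand-alone sketch; InodeStub and CheckMaxSizeWaiter are stand-ins for the real classes.
#include <cstdint>
#include <iostream>
#include <vector>

struct InodeStub { uint64_t max_size = 0; };

// The waiter stores the requested size explicitly instead of relying on
// defaults, so check_inode_max_size() can grow max_size once the lock is stable.
struct CheckMaxSizeWaiter {
    InodeStub *in;
    uint64_t new_max_size;
    void finish() {
        if (new_max_size > in->max_size)
            in->max_size = new_max_size;      // the increase all clients will see
    }
};

int main() {
    InodeStub ino;
    std::vector<CheckMaxSizeWaiter> waiters;

    // _do_cap_update() can't process the MClientCaps request right now (lock
    // issue), so it wait-lists it, keeping the requested size this time.
    uint64_t requested = 20u * 1024 * 1024;
    waiters.push_back({&ino, requested});

    // Later, the lock becomes stable and the wait-listed check runs.
    for (auto &w : waiters)
        w.finish();
    std::cout << "max_size is now " << ino.max_size << "\n";
    return 0;
}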

History

#1 Updated by Sam Lang over 11 years ago

The hang occurs because a client requests a max size increase but doesn't have write caps, so the MDS puts it on the waiting list for the lock to reach a stable state, and the waiter never gets notified.
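
For illustration, the client-side half of that looks roughly like this (a compilable sketch with made-up names, not the real Client code): the writer blocks waiting for a cap grant that raises max_size, and if the MDS drops the request, that grant never comes.

// Illustrative only; InodeState and wait_for_max_size are made-up names.
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct InodeState {
    std::mutex lock;
    std::condition_variable waitfor_caps;
    uint64_t max_size = 4u << 20;   // what the MDS has granted so far
};

// A writer that wants to extend the file to new_end first asks the MDS for a
// bigger max_size (via an MClientCaps message, elided here) and then waits.
void wait_for_max_size(InodeState &in, uint64_t new_end) {
    std::unique_lock<std::mutex> l(in.lock);
    while (in.max_size < new_end)
        in.waitfor_caps.wait(l);    // never woken if the MDS drops the request
}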

#2 Updated by Sam Lang over 11 years ago

  • Status changed from In Progress to 7

Proposed fix in wip-3637. The client's max size request in MClientCaps gets dropped if the file lock is in a non-stable state (sync->mix). The fix pushes the max size request through to the check_inode_max_size call that is made once the lock moves to a stable state (mix).

#3 Updated by Greg Farnum about 11 years ago

I don't remember where this ended up. Was the proposed fix problematic, or did it never get looked at?

#4 Updated by Sam Lang about 11 years ago

Sage has a different proposed fix than what's in the branch. Still needs to be tested.

#5 Updated by Sage Weil about 11 years ago

  • Status changed from 7 to Fix Under Review
  • Assignee changed from Sam Lang to Sage Weil

#6 Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Sage Weil to Sam Lang

#7 Updated by Greg Farnum about 11 years ago

  • Priority changed from Normal to High

#8 Updated by Greg Farnum about 11 years ago

  • Assignee changed from Sam Lang to Sage Weil

#9 Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to 12

#10 Updated by Sage Weil almost 11 years ago

  • Assignee deleted (Sage Weil)

#11 Updated by Greg Farnum almost 11 years ago

  • Priority changed from High to Urgent

#12 Updated by Greg Farnum almost 11 years ago

#4489 is probably a duplicate of this and has steps to reproduce, if we need alternate angles of attack. (And we should verify that assessment once we fix this bug.)

#13 Updated by Greg Farnum almost 11 years ago

  • Assignee set to Greg Farnum

Starting to look at this now.

#14 Updated by Greg Farnum almost 11 years ago

  • Status changed from 12 to In Progress

Since I apparently forgot to mention it here, this has nothing to do with #4489; I just pattern-matched a little too quickly.

However, I do have a first pass at a couple of patches in wip-max_size-3637. I need to do some testing with it to make sure, first, that I can reproduce the issue and, second, that this branch fixes it without causing other issues (it is passing a single-client fsx.sh smoke test, at least).
If anybody can suggest a more elegant solution that doesn't involve reworking the data structures involved I am all ears; the parameter explosion is making me a bit sad, although I'm glad to say it doesn't add any extra branches. Especially if it doesn't involve broadcasting the new max_size to all clients — speaking of which, I may be over-increasing the size due to the different levels of interpretation happening; I need to check that...

#15 Updated by Greg Farnum almost 11 years ago

I'm having difficulty reproducing this at all on current next, but am leaving it churning in the background... :/

I am seeing some other interesting warnings that may need their own bug, though:

INFO:teuthology.orchestra.run.out:access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
INFO:teuthology.orchestra.run.out:------    ---------  ---------- ---------  --------   --------   --------   --------   ----
INFO:teuthology.orchestra.run.out:ior WARNING: inconsistent file size by different tasks.
INFO:teuthology.orchestra.run.out:WARNING: Expected aggregate file size       = 125829120.
INFO:teuthology.orchestra.run.out:WARNING: Stat() of aggregate file size      = 41943040.
INFO:teuthology.orchestra.run.out:WARNING: Using actual aggregate bytes moved = 125829120.
INFO:teuthology.orchestra.run.out:write     200.38     40960      256.00     0.000183   0.598818   0.097357   0.598851   0
INFO:teuthology.orchestra.run.out:ior WARNING: inconsistent file size by different tasks.
INFO:teuthology.orchestra.run.out:WARNING: Expected aggregate file size       = 125829120.
INFO:teuthology.orchestra.run.out:WARNING: Stat() of aggregate file size      = 41943040.

But that doesn't induce ior to fail.

#16 Updated by Zheng Yan almost 11 years ago

If Locker::_do_cap_update can't get a wrlock for a given client, the client should have no Fw cap. I think we can make clients request a new max size only when they have the Fw cap, so Locker::_do_cap_update can always take the wrlock when handling a new max size request.
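
Purely as an illustration of that suggestion (made-up flag value and function, not actual client code): only include a max-size request while Fw is held.

#include <cstdint>

// Stand-in for the Fw bit; not Ceph's real cap flag value.
constexpr uint32_t CAP_FILE_WR = 1u << 1;

// Only ask the MDS for a bigger max_size while Fw is issued, so the MDS side
// can always take the wrlock when it processes the request.
bool should_request_max_size(uint32_t issued, uint64_t wanted, uint64_t current_max) {
    return (issued & CAP_FILE_WR) && wanted > current_max;
}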

#17 Updated by Greg Farnum almost 11 years ago

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions are correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?
2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides being more logic in the client, this would require more message traffic before a client can do writes. Eg:
  • client is writing, and then needs to extend the max size so adds it to caps for flushing
  • client receives cap revoke, removing the Fw perm
  • client returns cap flush (with the max size request stripped out because it no longer has Fw)
  • (time passes and) MDS gives client Fw back
  • (if the client had flushed its max size request, it could start writing now, but it hasn't) client requests max size increase
  • MDS gives back client max size increase after a round-trip to OSD

#18 Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to Fix Under Review

Regarding the testing (which I'm doing now), what those warnings turned out to mean is that each instance had its own file, so it wasn't even being run on Ceph, which explains why I couldn't reproduce. I'm attempting to deduce wtf ior is doing now.

#19 Updated by Greg Farnum almost 11 years ago

And that wasn't working because teuthology was creating working dirs like /tmp/cephtest/gregf@kai-2013-04-16_12-59-21/, and ior wasn't parsing past the '@' sign — no idea why and Sam didn't have one either. However, he tells me newer teuthology won't create dirs that look like that, so I'll just update and we should remember this if we have issues in the future. Hopefully this'll let me reproduce properly!

#20 Updated by Greg Farnum almost 11 years ago

Reproduced at last. There continues to be a problem with the fix branch too :( but it's not a max_size issue; one of the nodes is failing to get Fr. I'm looking at it now.

#21 Updated by Greg Farnum almost 11 years ago

Okay, it's not quite that simple. This (all following the data writeout; I think this is the data check — anyway, this) node has previously held pAsLsXsFscr.
Then, the MDS revokes Fsc and the client appears to handle that — _release(), _invalidate_inode_cache(), check_caps() and send_cap(). However, it's currently got Fcr marked as "used", so it...doesn't actually drop Fc. I admit I don't quite remember how this part is supposed to work, but the MDS sure seems to think that at this point the client no longer has those caps (but there might just not be any output of that).
Then later on the MDS does give the client Fr, which is enough for the client to do what it wants, but the client still says that it's got Fsc implemented (so it hasn't cleaned that up yet?), and so when it ends up in get_caps() it notices that it has what it needs, but (revoking & butnot) is non-zero (revoking still containing Fc since it has yet to drop the damn cap, as does butnot) so it continues to wait instead of breaking out.
I'll discuss this with Sage and move forward.
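
Paraphrasing the check being described (a simplified sketch, not the exact Client::get_caps() code):

#include <cstdint>

// 'butnot' is the set of caps we hold but don't actually need right now; if any
// of those are also being revoked (here, Fc), the client keeps waiting even
// though it already has everything it needs.
bool can_proceed(uint32_t have, uint32_t need, uint32_t revoking, uint32_t butnot) {
    if ((have & need) != need)
        return false;               // still missing something we need
    return (revoking & butnot) == 0;
}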

#22 Updated by Zheng Yan almost 11 years ago

Greg Farnum wrote:

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions are correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?
2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides being more logic in the client, this would require more message traffic before a client can do writes. Eg:
  • client is writing, and then needs to extend the max size so adds it to caps for flushing
  • client receives cap revoke, removing the Fw perm
  • client returns cap flush (with the max size request stripped out because it no longer has Fw)
  • (time passes and) MDS gives client Fw back
  • (if the client had flushed its max size request, it could start writing now, but it hasn't) client requests max size increase
  • MDS gives back client max size increase after a round-trip to OSD

Thanks for the clarification.

Greg Farnum wrote:

Okay, it's not quite that simple. This (all following the data writeout; I think this is the data check — anyway, this) node has previously held pAsLsXsFscr.
Then, the MDS revokes Fsc and the client appears to handle that — _release(), _invalidate_inode_cache(), check_caps() and send_cap(). However, it's currently got Fcr marked as "used", so it...doesn't actually drop Fc. I admit I don't quite remember how this part is supposed to work, but the MDS sure seems to think that at this point the client no longer has those caps (but there might just not be any output of that).

This sounds like a bug I recently fixed, see [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued.

#23 Updated by Zheng Yan almost 11 years ago

Greg Farnum wrote:

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions are correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?

There are only 4 states that allow Fw caps: MIX, MIX_EXCL, EXCL and EXCL_MIX. They all allow the Fw cap.

Greg Farnum wrote:

2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides being more logic in the client, this would require more message traffic before a client can do writes. Eg:

I thought about this again. I think we can allow a forced wrlock when Fw caps are being revoked (allow force wrlock for EXCL_XSYN, EXCL_SYNC, EXCL_LOCK, MIX_SYNC, MIX_LOCK).

#24 Updated by Zheng Yan almost 11 years ago

Zheng Yan wrote:

There are only 4 states that allow Fw caps: MIX, MIX_EXCL, EXCL and EXCL_MIX. They all allow the Fw cap.

Sorry, I meant they all allow wrlock.

#25 Updated by Greg Farnum almost 11 years ago

That latest issue was #4746. Turning off the callback and testing again...

#26 Updated by Greg Farnum almost 11 years ago

  • Assignee changed from Greg Farnum to Sage Weil

This now appears to be passing (I've got it continuing to loop in the background), but it needs review and merging. Sage said he could review!

#27 Updated by Greg Farnum almost 11 years ago

  • Status changed from Fix Under Review to 7

And he gave me a reviewed-by tag. Will merge this tomorrow morning after some more testing.

#28 Updated by Greg Farnum almost 11 years ago

  • Assignee changed from Sage Weil to Greg Farnum

#29 Updated by Greg Farnum almost 11 years ago

  • Status changed from 7 to Resolved

Merged into next in commit:efbe2e8b55ba735673a3fdb925a6304915f333d8

#30 Updated by Greg Farnum over 7 years ago

  • Component(FS) Client added
