Bug #3637 (closed)

client: not issuing caps for clients doing shared writes

Added by Sam Lang over 11 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Component(FS):
Client

Description

With 3 clients running ceph-fuse, run the ior command:

/tmp/cephtest/binary/usr/local/bin/ior -e -w -r -W -b 10m -a POSIX -o /tmp/cephtest/gmnt/ior.testfile

This is shared file access. All three clients hang, but it looks like two clients successfully stat the file and then wait on a barrier, while the third is hung on the stat.


Files

full3.logs (1.58 MB) - Sam Lang, 12/18/2012 08:59 PM
Actions #1

Updated by Sam Lang over 11 years ago

The hang occurs because a client requests a max size increase but doesn't have write caps, so the MDS puts the request on the waiting list for the lock to reach a stable state, and that waiter never gets notified.
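
To make that failure mode concrete, here is a minimal toy model of the flow just described. This is not Ceph code: FileLock, handle_cap_update, and the waiter plumbing are invented stand-ins for the real Locker/SimpleLock machinery, used only to illustrate how a max size request parked on an unstable lock can be left waiting forever.

#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

struct FileLock {
  enum State { SYNC, SYNC_TO_MIX, MIX };   // sync->mix is the unstable transition
  State state = SYNC_TO_MIX;
  std::vector<std::function<void()>> stable_waiters;

  bool is_stable() const { return state == SYNC || state == MIX; }
  void add_stable_waiter(std::function<void()> fn) { stable_waiters.push_back(std::move(fn)); }

  void finish_transition() {
    state = MIX;
    // Modelled bug: the stable_waiters queue is never drained, so the parked
    // max size request below is never re-examined.
  }
};

struct Mds {
  FileLock filelock;

  // A client sent an MClientCaps-style update asking for a larger max size.
  void handle_cap_update(int client, uint64_t wanted_max_size) {
    if (!filelock.is_stable()) {
      std::cout << "lock unstable; parking max size request from client " << client << "\n";
      filelock.add_stable_waiter([client, wanted_max_size] {
        std::cout << "re-check max size " << wanted_max_size << " for client " << client << "\n";
      });
      return;  // parked -- and, per the bug, never revisited
    }
    std::cout << "granting max size " << wanted_max_size << "\n";
  }
};

int main() {
  Mds mds;
  mds.handle_cap_update(4117, 10 << 20);  // arrives while the lock is sync->mix
  mds.filelock.finish_transition();       // lock becomes stable (mix)...
  // ...but the waiter is never run, so the client keeps waiting for its cap grant.
}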

Actions #2

Updated by Sam Lang over 11 years ago

  • Status changed from In Progress to 7

Proposed fix in wip-3637. The client's max size request in MClientCaps gets dropped if the file lock is in a non-stable state (sync->mix). The fix pushes the max size request through to the check_inode_max_size call that is made once the lock moves to a stable state (mix).
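
A rough sketch of the fix direction, again as a toy model rather than the actual wip-3637 branch: the requested max size is remembered while the lock is unstable and fed into a check_inode_max_size-style pass once the lock reaches a stable state. All type and function names here are illustrative.

#include <cstdint>
#include <iostream>
#include <map>

struct Inode {
  uint64_t max_size = 4 << 20;
  std::map<int, uint64_t> pending_max_size;  // client -> requested size, parked while unstable
};

// Stand-in for the MDS-side check that actually grows max_size.
void check_inode_max_size(Inode& in) {
  for (const auto& [client, wanted] : in.pending_max_size) {
    if (wanted > in.max_size) {
      in.max_size = wanted;
      std::cout << "client " << client << ": max_size grown to " << in.max_size << "\n";
    }
  }
  in.pending_max_size.clear();
}

struct Mds {
  Inode in;
  bool filelock_stable = false;

  void handle_cap_update(int client, uint64_t wanted_max_size) {
    // Fix idea: never drop the request; record it so it survives the
    // unstable sync->mix window.
    in.pending_max_size[client] = wanted_max_size;
    if (filelock_stable)
      check_inode_max_size(in);
  }

  void filelock_became_stable() {
    filelock_stable = true;
    check_inode_max_size(in);  // parked requests are finally honoured
  }
};

int main() {
  Mds mds;
  mds.handle_cap_update(4117, 10 << 20);  // arrives mid-transition, gets recorded
  mds.filelock_became_stable();           // now the client gets its larger max_size
}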

Actions #3

Updated by Greg Farnum over 11 years ago

I don't remember where this ended up. Was the proposed fix problematic, or did it never get looked at?

Actions #4

Updated by Sam Lang over 11 years ago

Sage has a different proposed fix than what's in the branch. Still needs to be tested.

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Fix Under Review
  • Assignee changed from Sam Lang to Sage Weil
Actions #6

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Sage Weil to Sam Lang
Actions #7

Updated by Greg Farnum over 11 years ago

  • Priority changed from Normal to High
Actions #8

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Sam Lang to Sage Weil
Actions #9

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to 12
Actions #10

Updated by Sage Weil about 11 years ago

  • Assignee deleted (Sage Weil)
Actions #11

Updated by Greg Farnum about 11 years ago

  • Priority changed from High to Urgent
Actions #12

Updated by Greg Farnum about 11 years ago

#4489 is probably a duplicate of this and has steps to reproduce, if we need alternate angles of attack. (And we should verify that assessment once we fix this bug.)

Actions #13

Updated by Greg Farnum about 11 years ago

  • Assignee set to Greg Farnum

Starting to look at this now.

Actions #14

Updated by Greg Farnum about 11 years ago

  • Status changed from 12 to In Progress

Since I apparently forgot to mention it here, this has nothing to do with #4489; I just pattern-matched a little too quickly.

However, I do have a first-pass couple of patches in wip-max_size-3637. I need to do some testing with them to make sure, first, that I can reproduce the issue and, second, that this branch fixes it without causing other issues (it is passing a single-client fsx.sh smoke test at least).
If anybody can suggest a more elegant solution that doesn't involve reworking the data structures involved I am all ears; the parameter explosion is making me a bit sad, although I'm glad to say it doesn't add any extra branches. Especially if it doesn't involve broadcasting the new max_size to all clients — speaking of which, I may be over-increasing the size due to the different levels of interpretation happening; I need to check that...

Actions #15

Updated by Greg Farnum about 11 years ago

I'm having difficulty reproducing this at all on current next, but am leaving it churning in the background... :/

I am seeing some other interesting warnings that may need their own bug, though:

INFO:teuthology.orchestra.run.out:access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
INFO:teuthology.orchestra.run.out:------    ---------  ---------- ---------  --------   --------   --------   --------   ----
INFO:teuthology.orchestra.run.out:ior WARNING: inconsistent file size by different tasks.
INFO:teuthology.orchestra.run.out:WARNING: Expected aggregate file size       = 125829120.
INFO:teuthology.orchestra.run.out:WARNING: Stat() of aggregate file size      = 41943040.
INFO:teuthology.orchestra.run.out:WARNING: Using actual aggregate bytes moved = 125829120.
INFO:teuthology.orchestra.run.out:write     200.38     40960      256.00     0.000183   0.598818   0.097357   0.598851   0
INFO:teuthology.orchestra.run.out:ior WARNING: inconsistent file size by different tasks.
INFO:teuthology.orchestra.run.out:WARNING: Expected aggregate file size       = 125829120.
INFO:teuthology.orchestra.run.out:WARNING: Stat() of aggregate file size      = 41943040.

But that doesn't induce ior to fail.

Actions #16

Updated by Zheng Yan about 11 years ago

If Locker::_do_cap_update can't get the wrlock for a given client, the client should have no Fw cap. I think we can make the clients request a new max size only when they have the Fw cap, so Locker::_do_cap_update can always take the wrlock when handling a new max size request.
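
A minimal sketch of the client-side rule being proposed (only asking for a larger max size while Fw is held). This is not ceph-fuse code; CEPH_CAP_FILE_WR is the real name of the Fw bit, but its value and the surrounding structure here are made up for illustration.

#include <cstdint>
#include <iostream>

constexpr unsigned CEPH_CAP_FILE_WR = 1 << 2;  // placeholder bit value for Fw

struct InodeState {
  unsigned issued_caps = 0;      // caps the MDS has issued to this client
  uint64_t max_size = 0;
  uint64_t wanted_max_size = 0;
};

// Only ask the MDS for a larger max_size while we actually hold Fw, so the
// MDS can always take the wrlock when it handles the request.
bool should_request_max_size(const InodeState& in) {
  return (in.issued_caps & CEPH_CAP_FILE_WR) && in.wanted_max_size > in.max_size;
}

int main() {
  InodeState in;
  in.max_size = 4 << 20;
  in.wanted_max_size = 10 << 20;
  std::cout << "without Fw: request? " << should_request_max_size(in) << "\n";  // 0
  in.issued_caps |= CEPH_CAP_FILE_WR;
  std::cout << "with Fw:    request? " << should_request_max_size(in) << "\n";  // 1
}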

Actions #17

Updated by Greg Farnum about 11 years ago

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions is correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?
2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides requiring more logic in the client, this would require more message traffic before a client can do writes. E.g.:
  • client is writing, and then needs to extend the max size so adds it to caps for flushing
  • client receives cap revoke, removing the Fw perm
  • client returns cap flush (with the max size request stripped out because it no longer has Fw)
  • (time passes and) MDS gives client Fw back
  • (if the client had flushed its max size request, it could start writing now, but it hasn't) client requests max size increase
  • MDS gives back client max size increase after a round-trip to OSD
Actions #18

Updated by Greg Farnum about 11 years ago

  • Status changed from In Progress to Fix Under Review

Regarding the testing (which I'm doing now), what those warnings turned out to mean is that each instance had its own file, so it wasn't even being run on Ceph. That explains why I couldn't reproduce. I'm attempting to deduce wtf ior is doing now.

Actions #19

Updated by Greg Farnum about 11 years ago

And that wasn't working because teuthology was creating working dirs like /tmp/cephtest/gregf@kai-2013-04-16_12-59-21/, and ior wasn't parsing past the '@' sign — no idea why and Sam didn't have one either. However, he tells me newer teuthology won't create dirs that look like that, so I'll just update and we should remember this if we have issues in the future. Hopefully this'll let me reproduce properly!

Actions #20

Updated by Greg Farnum about 11 years ago

Reproduced at last. There continues to be a problem with the fix branch too :( but it's not a max_size issue; one of the nodes is failing to get Fr. I'm looking at it now.

Actions #21

Updated by Greg Farnum about 11 years ago

Okay, it's not quite that simple. This (all following the data writeout; I think this is the data check — anyway, this) node has previously held pAsLsXsFscr.
Then, the MDS revokes Fsc and the client appears to handle that — _release(), _invalidate_inode_cache(), check_caps() and send_cap(). However, it's currently got Fcr marked as "used", so it...doesn't actually drop Fc. I admit I don't quite remember how this part is supposed to work, but the MDS sure seems to think that at this point the client no longer has those caps (but there might just not be any output of that).
Then later on the MDS does give the client Fr, which is enough for the client to do what it wants, but the client still says that it's got Fsc implemented (so it hasn't cleaned that up yet?), and so when it ends up in get_caps() it notices that it has what it needs, but (revoking & butnot) is non-zero (revoking still containing Fc since it has yet to drop the damn cap, as does butnot) so it continues to wait instead of breaking out.
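
For concreteness, a toy illustration of the wait condition just described. This is not the real Client::get_caps(); the bit values are placeholders and the derivation of butnot is an assumption, with only the names revoking and butnot taken from the description above.

#include <iostream>

// Placeholder cap bits (not the real Ceph values).
enum : unsigned {
  Fs = 1 << 0,  // shared
  Fc = 1 << 1,  // cache
  Fr = 1 << 2,  // read
};

int main() {
  unsigned have     = Fs | Fc | Fr;  // client still has Fsc implemented, plus Fr
  unsigned need     = Fr;            // all this operation actually needs
  unsigned revoking = Fs | Fc;       // MDS asked for Fsc back; Fc never got dropped
  unsigned butnot   = have & ~need;  // caps held beyond what is needed (assumed definition)

  bool have_needed   = (have & need) == need;
  bool still_blocked = (revoking & butnot) != 0;

  std::cout << "have what we need: " << have_needed << "\n";    // 1
  std::cout << "still blocked:     " << still_blocked << "\n";  // 1 -> get_caps keeps waiting
}
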
I'll discuss this with Sage and move forward.

Actions #22

Updated by Zheng Yan about 11 years ago

Greg Farnum wrote:

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions is correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?
2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides requiring more logic in the client, this would require more message traffic before a client can do writes. E.g.:
  • client is writing, and then needs to extend the max size so adds it to caps for flushing
  • client receives cap revoke, removing the Fw perm
  • client returns cap flush (with the max size request stripped out because it no longer has Fw)
  • (time passes and) MDS gives client Fw back
  • (if the client had flushed its max size request, it could start writing now, but it hasn't) client requests max size increase
  • MDS gives back client max size increase after a round-trip to OSD

Thanks for the clarification.

Greg Farnum wrote:

Okay, it's not quite that simple. This (all following the data writeout; I think this is the data check — anyway, this) node has previously held pAsLsXsFscr.
Then, the MDS revokes Fsc and the client appears to handle that — _release(), _invalidate_inode_cache(), check_caps() and send_cap(). However, it's currently got Fcr marked as "used", so it...doesn't actually drop Fc. I admit I don't quite remember how this part is supposed to work, but the MDS sure seems to think that at this point the client no longer has those caps (but there might just not be any output of that).

This sounds like a bug I recently fixed, see [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued.

Actions #23

Updated by Zheng Yan about 11 years ago

Greg Farnum wrote:

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions is correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?

There are only 4 states that allow Fw caps: MIX, MIX_EXCL, EXCL, and EXCL_MIX. They all allow Fw cap.

Greg Farnum wrote:

2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides requiring more logic in the client, this would require more message traffic before a client can do writes. E.g.:

I thought about this again. I think we can allow forcing the wrlock while Fw caps are being revoked (allow force wrlock for EXCL_XSYN, EXCL_SYNC, EXCL_LOCK, MIX_SYNC, MIX_LOCK).

Actions #24

Updated by Zheng Yan about 11 years ago

Zheng Yan wrote:

There are only 4 states that allow Fw caps: MIX, MIX_EXCL, EXCL, and EXCL_MIX. They all allow Fw cap.

Sorry, I meant they all allow wrlock.

Actions #25

Updated by Greg Farnum about 11 years ago

That latest issue was #4746. Turning off the callback and testing again...

Actions #26

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Greg Farnum to Sage Weil

This now appears to be passing (I've got it continuing to loop in the background), but it needs review and merging. Sage said he could review!

Actions #27

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to 7

And he gave me a reviewed-by tag. Will merge this tomorrow morning after some more testing.

Actions #28

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Sage Weil to Greg Farnum
Actions #29

Updated by Greg Farnum about 11 years ago

  • Status changed from 7 to Resolved

Merged into next in commit:efbe2e8b55ba735673a3fdb925a6304915f333d8

Actions #30

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Client added