Bug #3637 (closed)

client: not issuing caps for clients doing shared writes

Added by Sam Lang over 11 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Component(FS):
Client

Description

With 3 clients running ceph-fuse, run the ior command:

/tmp/cephtest/binary/usr/local/bin/ior -e -w -r -W -b 10m -a POSIX -o /tmp/cephtest/gmnt/ior.testfile

This is shared file access. All three clients hang, but it looks like two clients successfully stat the file and then wait on a barrier, while the third is hung on the stat.


Files

full3.logs (1.58 MB) - Sam Lang, 12/18/2012 08:59 PM
Actions #1

Updated by Sam Lang over 11 years ago

The hang occurs because a client requests a max size increase but doesn't have write caps, so the MDS puts the request on the waiting list for the lock to reach a stable state, and that waiter never gets notified.
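
To make that failure mode concrete, here is a minimal toy model of the flow just described. This is not Ceph code: FileLock, handle_cap_update, and the waiter plumbing are invented stand-ins for the real Locker/SimpleLock machinery, used only to illustrate how a max size request parked on an unstable lock can be left waiting forever.

#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

struct FileLock {
  enum State { SYNC, SYNC_TO_MIX, MIX };   // sync->mix is the unstable transition
  State state = SYNC_TO_MIX;
  std::vector<std::function<void()>> stable_waiters;

  bool is_stable() const { return state == SYNC || state == MIX; }
  void add_stable_waiter(std::function<void()> fn) { stable_waiters.push_back(std::move(fn)); }

  void finish_transition() {
    state = MIX;
    // Modelled bug: the stable_waiters queue is never drained, so the parked
    // max size request below is never re-examined.
  }
};

struct Mds {
  FileLock filelock;

  // A client sent an MClientCaps-style update asking for a larger max size.
  void handle_cap_update(int client, uint64_t wanted_max_size) {
    if (!filelock.is_stable()) {
      std::cout << "lock unstable; parking max size request from client " << client << "\n";
      filelock.add_stable_waiter([client, wanted_max_size] {
        std::cout << "re-check max size " << wanted_max_size << " for client " << client << "\n";
      });
      return;  // parked -- and, per the bug, never revisited
    }
    std::cout << "granting max size " << wanted_max_size << "\n";
  }
};

int main() {
  Mds mds;
  mds.handle_cap_update(4117, 10 << 20);  // arrives while the lock is sync->mix
  mds.filelock.finish_transition();       // lock becomes stable (mix)...
  // ...but the waiter is never run, so the client keeps waiting for its cap grant.
}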

Actions #2

Updated by Sam Lang over 11 years ago

  • Status changed from In Progress to 7

Proposed fix in wip-3637. The client's max size request in MClientCaps gets dropped if the file lock is in a non-stable state (sync->mix). The fix pushes the max size request through to the check_inode_max_size call that is made once the lock moves to a stable state (mix).
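
A rough sketch of the fix direction, again as a toy model rather than the actual wip-3637 branch: the requested max size is remembered while the lock is unstable and fed into a check_inode_max_size-style pass once the lock reaches a stable state. All type and function names here are illustrative.

#include <cstdint>
#include <iostream>
#include <map>

struct Inode {
  uint64_t max_size = 4 << 20;
  std::map<int, uint64_t> pending_max_size;  // client -> requested size, parked while unstable
};

// Stand-in for the MDS-side check that actually grows max_size.
void check_inode_max_size(Inode& in) {
  for (const auto& [client, wanted] : in.pending_max_size) {
    if (wanted > in.max_size) {
      in.max_size = wanted;
      std::cout << "client " << client << ": max_size grown to " << in.max_size << "\n";
    }
  }
  in.pending_max_size.clear();
}

struct Mds {
  Inode in;
  bool filelock_stable = false;

  void handle_cap_update(int client, uint64_t wanted_max_size) {
    // Fix idea: never drop the request; record it so it survives the
    // unstable sync->mix window.
    in.pending_max_size[client] = wanted_max_size;
    if (filelock_stable)
      check_inode_max_size(in);
  }

  void filelock_became_stable() {
    filelock_stable = true;
    check_inode_max_size(in);  // parked requests are finally honoured
  }
};

int main() {
  Mds mds;
  mds.handle_cap_update(4117, 10 << 20);  // arrives mid-transition, gets recorded
  mds.filelock_became_stable();           // now the client gets its larger max_size
}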

Actions #3

Updated by Greg Farnum over 11 years ago

I don't remember where this ended up. Was the proposed fix problematic, or did it never get looked at?

Actions #4

Updated by Sam Lang over 11 years ago

Sage has a different proposed fix than what's in the branch. Still needs to be tested.

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Fix Under Review
  • Assignee changed from Sam Lang to Sage Weil
Actions #6

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Sage Weil to Sam Lang
Actions #7

Updated by Greg Farnum over 11 years ago

  • Priority changed from Normal to High
Actions #8

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Sam Lang to Sage Weil
Actions #9

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to 12
Actions #10

Updated by Sage Weil about 11 years ago

  • Assignee deleted (Sage Weil)
Actions #11

Updated by Greg Farnum about 11 years ago

  • Priority changed from High to Urgent
Actions #12

Updated by Greg Farnum about 11 years ago

#4489 is probably a duplicate of this and has steps to reproduce, if we need alternate angles of attack. (And we should verify that assessment once we fix this bug.)

Actions #13

Updated by Greg Farnum about 11 years ago

  • Assignee set to Greg Farnum

Starting to look at this now.

Actions #14

Updated by Greg Farnum about 11 years ago

  • Status changed from 12 to In Progress

Since I apparently forgot to mention it here, this has nothing to do with #4489; I just pattern-matched a little too quickly.

However, I do have a first-pass couple of patches in wip-max_size-3637. I need to do some testing with them to make sure, first, that I can reproduce the issue and, second, that this branch fixes it without causing other issues (it is passing a single-client fsx.sh smoke test at least).
If anybody can suggest a more elegant solution that doesn't involve reworking the data structures involved I am all ears; the parameter explosion is making me a bit sad, although I'm glad to say it doesn't add any extra branches. Especially if it doesn't involve broadcasting the new max_size to all clients — speaking of which, I may be over-increasing the size due to the different levels of interpretation happening; I need to check that...

Actions #15

Updated by Greg Farnum about 11 years ago

I'm having difficulty reproducing this at all on current next, but am leaving it churning in the background... :/

I am seeing some other interesting warnings that may need their own bug, though:

INFO:teuthology.orchestra.run.out:access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
INFO:teuthology.orchestra.run.out:------    ---------  ---------- ---------  --------   --------   --------   --------   ----
INFO:teuthology.orchestra.run.out:ior WARNING: inconsistent file size by different tasks.
INFO:teuthology.orchestra.run.out:WARNING: Expected aggregate file size       = 125829120.
INFO:teuthology.orchestra.run.out:WARNING: Stat() of aggregate file size      = 41943040.
INFO:teuthology.orchestra.run.out:WARNING: Using actual aggregate bytes moved = 125829120.
INFO:teuthology.orchestra.run.out:write     200.38     40960      256.00     0.000183   0.598818   0.097357   0.598851   0
INFO:teuthology.orchestra.run.out:ior WARNING: inconsistent file size by different tasks.
INFO:teuthology.orchestra.run.out:WARNING: Expected aggregate file size       = 125829120.
INFO:teuthology.orchestra.run.out:WARNING: Stat() of aggregate file size      = 41943040.

But that doesn't induce ior to fail.

Actions #16

Updated by Zheng Yan about 11 years ago

If Locker::_do_cap_update can't get the wrlock for a given client, the client should have no Fw cap. I think we can make the clients request a new max size only when they have the Fw cap, so Locker::_do_cap_update can always take the wrlock when handling a new max size request.
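
A minimal sketch of the client-side rule being proposed (only asking for a larger max size while Fw is held). This is not ceph-fuse code; CEPH_CAP_FILE_WR is the real name of the Fw bit, but its value and the surrounding structure here are made up for illustration.

#include <cstdint>
#include <iostream>

constexpr unsigned CEPH_CAP_FILE_WR = 1 << 2;  // placeholder bit value for Fw

struct InodeState {
  unsigned issued_caps = 0;      // caps the MDS has issued to this client
  uint64_t max_size = 0;
  uint64_t wanted_max_size = 0;
};

// Only ask the MDS for a larger max_size while we actually hold Fw, so the
// MDS can always take the wrlock when it handles the request.
bool should_request_max_size(const InodeState& in) {
  return (in.issued_caps & CEPH_CAP_FILE_WR) && in.wanted_max_size > in.max_size;
}

int main() {
  InodeState in;
  in.max_size = 4 << 20;
  in.wanted_max_size = 10 << 20;
  std::cout << "without Fw: request? " << should_request_max_size(in) << "\n";  // 0
  in.issued_caps |= CEPH_CAP_FILE_WR;
  std::cout << "with Fw:    request? " << should_request_max_size(in) << "\n";  // 1
}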

Actions #17

Updated by Greg Farnum about 11 years ago

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions is correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?
2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides requiring more logic in the client, this would require more message traffic before a client can do writes. E.g.:
  • client is writing, and then needs to extend the max size so adds it to caps for flushing
  • client receives cap revoke, removing the Fw perm
  • client returns cap flush (with the max size request stripped out because it no longer has Fw)
  • (time passes and) MDS gives client Fw back
  • (if the client had flushed its max size request, it could start writing now, but it hasn't) client requests max size increase
  • MDS gives back client max size increase after a round-trip to OSD
Actions #18

Updated by Greg Farnum about 11 years ago

  • Status changed from In Progress to Fix Under Review

Regarding the testing (which I'm doing now), what those warnings turned out to mean is that each instance had its own file, so it wasn't even being run on Ceph. That explains why I couldn't reproduce. I'm attempting to deduce wtf ior is doing now.

Actions #19

Updated by Greg Farnum about 11 years ago

And that wasn't working because teuthology was creating working dirs like /tmp/cephtest/gregf@kai-2013-04-16_12-59-21/, and ior wasn't parsing past the '@' sign — no idea why and Sam didn't have one either. However, he tells me newer teuthology won't create dirs that look like that, so I'll just update and we should remember this if we have issues in the future. Hopefully this'll let me reproduce properly!

Actions #20

Updated by Greg Farnum about 11 years ago

Reproduced at last. There continues to be a problem with the fix branch too :( but it's not a max_size issue; one of the nodes is failing to get Fr. I'm looking at it now.

Actions #21

Updated by Greg Farnum about 11 years ago

Okay, it's not quite that simple. This (all following the data writeout; I think this is the data check — anyway, this) node has previously held pAsLsXsFscr.
Then, the MDS revokes Fsc and the client appears to handle that — _release(), _invalidate_inode_cache(), check_caps() and send_cap(). However, it's currently got Fcr marked as "used", so it...doesn't actually drop Fc. I admit I don't quite remember how this part is supposed to work, but the MDS sure seems to think that at this point the client no longer has those caps (but there might just not be any output of that).
Then later on the MDS does give the client Fr, which is enough for the client to do what it wants, but the client still says that it's got Fsc implemented (so it hasn't cleaned that up yet?), and so when it ends up in get_caps() it notices that it has what it needs, but (revoking & butnot) is non-zero (revoking still containing Fc since it has yet to drop the damn cap, as does butnot) so it continues to wait instead of breaking out.
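
For concreteness, a toy illustration of the wait condition just described. This is not the real Client::get_caps(); the bit values are placeholders and the derivation of butnot is an assumption, with only the names revoking and butnot taken from the description above.

#include <iostream>

// Placeholder cap bits (not the real Ceph values).
enum : unsigned {
  Fs = 1 << 0,  // shared
  Fc = 1 << 1,  // cache
  Fr = 1 << 2,  // read
};

int main() {
  unsigned have     = Fs | Fc | Fr;  // client still has Fsc implemented, plus Fr
  unsigned need     = Fr;            // all this operation actually needs
  unsigned revoking = Fs | Fc;       // MDS asked for Fsc back; Fc never got dropped
  unsigned butnot   = have & ~need;  // caps held beyond what is needed (assumed definition)

  bool have_needed   = (have & need) == need;
  bool still_blocked = (revoking & butnot) != 0;

  std::cout << "have what we need: " << have_needed << "\n";    // 1
  std::cout << "still blocked:     " << still_blocked << "\n";  // 1 -> get_caps keeps waiting
}
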
I'll discuss this with Sage and move forward.

Actions #22

Updated by Zheng Yan about 11 years ago

Greg Farnum wrote:

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions is correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?
2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides requiring more logic in the client, this would require more message traffic before a client can do writes. E.g.:
  • client is writing, and then needs to extend the max size so adds it to caps for flushing
  • client receives cap revoke, removing the Fw perm
  • client returns cap flush (with the max size request stripped out because it no longer has Fw)
  • (time passes and) MDS gives client Fw back
  • (if the client had flushed its max size request, it could start writing now, but it hasn't) client requests max size increase
  • MDS gives back client max size increase after a round-trip to OSD

Thanks for the clarification.

Greg Farnum wrote:

Okay, it's not quite that simple. This (all following the data writeout; I think this is the data check — anyway, this) node has previously held pAsLsXsFscr.
Then, the MDS revokes Fsc and the client appears to handle that — _release(), _invalidate_inode_cache(), check_caps() and send_cap(). However, it's currently got Fcr marked as "used", so it...doesn't actually drop Fc. I admit I don't quite remember how this part is supposed to work, but the MDS sure seems to think that at this point the client no longer has those caps (but there might just not be any output of that).

This sounds like a bug I recently fixed, see [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued.

Actions #23

Updated by Zheng Yan about 11 years ago

Greg Farnum wrote:

I don't remember how all the locking works when you have multiple writers, but I don't believe either of those suppositions is correct:
1) Multiple clients can have Fw cap at a time (without buffering) during sync writes, but that doesn't mean they can get wrlock immediately. That's part of what the distinction is there for, isn't it?

There are only 4 states that allow Fw caps: MIX, MIX_EXCL, EXCL, and EXCL_MIX. They all allow Fw cap.

Greg Farnum wrote:

2) If we do have clients gaining and losing Fw cap while they've got a file opened for write, we'd have to be able to intelligently add and remove their max size change requests on their cap acknowledgements. Besides requiring more logic in the client, this would require more message traffic before a client can do writes. E.g.:

I thought about this again. I think we can allow forcing the wrlock while Fw caps are being revoked (allow force wrlock for EXCL_XSYN, EXCL_SYNC, EXCL_LOCK, MIX_SYNC, MIX_LOCK).

Actions #24

Updated by Zheng Yan about 11 years ago

Zheng Yan wrote:

There are only 4 states that allow Fw caps: MIX, MIX_EXCL, EXCL, and EXCL_MIX. They all allow Fw cap.

Sorry, I meant they all allow wrlock.

Actions #25

Updated by Greg Farnum about 11 years ago

That latest issue was #4746. Turning off the callback and testing again...

Actions #26

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Greg Farnum to Sage Weil

This now appears to be passing (I've got it continuing to loop in the background), but it needs review and merging. Sage said he could review!

Actions #27

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to 7

And he gave me a reviewed-by tag. Will merge this tomorrow morning after some more testing.

Actions #28

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Sage Weil to Greg Farnum
Actions #29

Updated by Greg Farnum about 11 years ago

  • Status changed from 7 to Resolved

Merged into next in commit:efbe2e8b55ba735673a3fdb925a6304915f333d8

Actions #30

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Client added