Bug #4582

mds: Client hang on fsstress with mds_thrasher

Added by Sam Lang about 8 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While trying to reproduce #4565, fsstress eventually hangs: the client waits for a max size update that the mds never sends. This is similar to the bug addressed by Zheng Yan's fix (c08ccf350bb726fd9c4b7ce1316e14111ed31b6e), except that here the mds never journals the new max size.
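To make the hang concrete, here is a minimal, hypothetical sketch (simplified names, not the actual Client code) of the kind of wait involved: the writer blocks on the waitfor_caps list until the mds grants a max_size large enough for the write, so if the grant never arrives, the wait never returns.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical simplification of the state behind Client::get_caps.
struct Inode {
  uint64_t max_size = 0;         // size limit granted by the mds
  uint64_t wanted_max_size = 0;  // outstanding request to the mds
  std::mutex lock;
  std::condition_variable waitfor_caps;
};

// Block until the mds grows max_size past endoff. If the mds dies
// before journaling the new max size, no grant ever arrives and this
// wait never returns -- the hang reported in this ticket.
void wait_for_max_size(Inode& in, uint64_t endoff) {
  std::unique_lock<std::mutex> l(in.lock);
  if (endoff > in.max_size && endoff > in.wanted_max_size) {
    in.wanted_max_size = endoff;  // would send a cap update to the mds here
  }
  in.waitfor_caps.wait(l, [&] { return in.max_size >= endoff; });
}
```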

Associated revisions

Revision 0ce09fad (diff)
Added by Sam Lang about 8 years ago

client: Kick waiters for max size

If the mds restarts without successfully logging a max size
cap update, the client waits indefinitely in Client::get_caps
on the waitfor_caps list. So when the client gets an mds map
indicating a new active mds has replaced a down mds, we need to
kick the caps update request. This patch mimics the behavior
in the kernel by setting the wanted_max_size
and requested_max_size to 0 and wakes up the waiters.

Fixes #4582.
Signed-off-by: Sam Lang <>
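A minimal sketch of the approach the commit describes, using hypothetical names (Inode, kick_maxsize_waiters) rather than the real client types: when a new active mds replaces a down one, reset the max-size request state and wake the waiters so the request gets re-issued instead of hanging forever.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical simplification of the client-side cap state.
struct Inode {
  uint64_t wanted_max_size = 0;
  uint64_t requested_max_size = 0;
  std::condition_variable waitfor_caps;
};

// Called when an mds map shows a new active mds replacing a down one.
// Mirrors the kclient behaviour: forget the lost request and wake the
// waiters so they re-issue the max-size cap update to the new mds.
void kick_maxsize_waiters(std::vector<Inode*>& session_inodes,
                          std::mutex& client_lock) {
  std::lock_guard<std::mutex> l(client_lock);
  for (Inode* in : session_inodes) {
    in->wanted_max_size = 0;
    in->requested_max_size = 0;
    in->waitfor_caps.notify_all();
  }
}
```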

Revision 3c0debf9
Added by Greg Farnum about 8 years ago

Merge pull request #191 from ceph/wip-4582b

Fixes #4582.

Reviewed-by: Greg Farnum <>

History

#1 Updated by Sam Lang about 8 years ago

  • Category set to 46
  • Status changed from New to In Progress

#2 Updated by Sam Lang about 8 years ago

  • Status changed from In Progress to 7

I just pushed wip-4582. Testing it on the fsstress test with mds_thrasher now. I'm not positive this is the right approach, so if someone wants to look it over...

#3 Updated by Greg Farnum about 8 years ago

I'm not sure this is wrong, but it's confusing me a bit. I thought the Client sent all the capabilities it holds back to the MDS during the reconnect/replay sequence, without needing a kick of any kind.
*goes and looks*
Ah, but it doesn't send back a full Capability struct, so we do in fact need to re-send the cap update. I'd have thought a size change would already be on the re-send list, but I guess it's not, since it's a request and not a tell (i.e., it doesn't count as flushing the caps, which is the only list we have). It has different invariants than the existing list, so this approach looks good to me.
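To illustrate the point above, here is a hypothetical, heavily simplified sketch (these are not the real Ceph structs, and the field lists are invented for illustration) of why a pending max-size "ask" does not survive reconnect: the reconnect record carries only a subset of the local cap state.

```cpp
#include <cstdint>

// Invented, simplified field lists -- NOT the actual ceph_mds_caps or
// cap_reconnect_t layouts. The point is only that the reconnect record
// is a strict subset of the state the client tracks locally.
struct FullCapState {            // what the client tracks per inode
  uint32_t wanted = 0;
  uint32_t issued = 0;
  uint64_t wanted_max_size = 0;  // outstanding "ask", not dirty state
};

struct ReconnectRecord {         // what gets re-sent after an mds restart
  uint32_t wanted = 0;
  uint32_t issued = 0;
  // no max-size field: the pending ask is simply not carried over
};

ReconnectRecord make_reconnect(const FullCapState& c) {
  return ReconnectRecord{c.wanted, c.issued};
}
```

Since the new mds only learns wanted/issued, nothing will ever answer the max-size waiter, which is why the explicit kick is needed.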

#4 Updated by Sam Lang about 8 years ago

I spent most of this morning figuring out whether it made sense to send the full cap (ceph_mds_caps) and get rid of the cap_reconnect_t. I think the behavior at the mds is too different, though, or at least, doing so wouldn't qualify as a 'fix'.

There might be other cap fields that don't get passed along in the cap_reconnect_t (truncate_size, truncate_seq?) that will get dropped till the mds sends a cap update again...

#5 Updated by Greg Farnum about 8 years ago

I believe those are okay as truncate size changes should end up actually journaled (as setattrs) so they'll be replayed, and I think the max size change is the only thing that qualifies as a non-dirty "ask" instead of a dirty and flushing "tell" (that's new terminology I just made up).
It's always wise to check and see if there are related fields that have the same bugs as something you just fixed, though, so you should check and see if there are other things happening via the caps that we're missing here.

#6 Updated by Zheng Yan about 8 years ago

FYI:
The kclient deals with this case by calling wake_up_session_caps(). It just clears i_wanted_max_size/i_requested_max_size and wakes up the writer.

#7 Updated by Sam Lang about 8 years ago

Oh, yeah, we can do the same in the userspace client. I'll do that and re-push. Thanks Yan!

#8 Updated by Ian Colle about 8 years ago

  • Priority changed from Normal to Urgent

#9 Updated by Ian Colle about 8 years ago

  • Target version set to v0.61 - Cuttlefish

#10 Updated by Sam Lang about 8 years ago

  • Status changed from 7 to Fix Under Review

With the latest changes to the mds merged to master, and the fix from #4637, I was able to get a successful run of fsstress and mds_thrasher (sans an issue with client unmount hanging). I retooled the fixes into wip-4582b and submitted a pull request.

#11 Updated by Sam Lang about 8 years ago

  • Status changed from Fix Under Review to Resolved

#12 Updated by Greg Farnum over 4 years ago

  • Component(FS) Client added
