Bug #4582

closed

mds: Client hang on fsstress with mds_thrasher

Added by Sam Lang about 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While trying to reproduce #4565, fsstress eventually hangs: the client is waiting for a max size update that the MDS never sends. This is similar to the bug described in Zheng Yan's fix (c08ccf350bb726fd9c4b7ce1316e14111ed31b6e), except that here the MDS never journals the new max size.

Actions #1

Updated by Sam Lang about 11 years ago

  • Category set to 46
  • Status changed from New to In Progress
Actions #2

Updated by Sam Lang about 11 years ago

  • Status changed from In Progress to 7

I just pushed wip-4582. Testing it on the fsstress test with mds_thrasher now. I'm not positive this is the right approach, so if someone wants to look it over...

Actions #3

Updated by Greg Farnum about 11 years ago

I'm not sure this is wrong, but it's confusing me a bit. I thought that the Client sent all capabilities it holds back to the MDS during the reconnect/replay sequence without needing a kick of any kind.
(goes and looks)
Ah, but it doesn't send back a full Capability struct, so we do in fact need to re-send the cap update. I'd have thought a size change would already be on the re-send list, but I guess it's not, since it's a request and not a tell (i.e., it doesn't count as flushing the caps, which is the only list we have). It therefore has different invariants than the existing list, so this approach looks good to me.

Actions #4

Updated by Sam Lang about 11 years ago

I spent most of this morning figuring out if it made sense to send the full cap (ceph_mds_caps -- and get rid of the cap_reconnect_t). I think the behavior at the mds is too different though, or at least, doing so wouldn't qualify as a 'fix'.

There might be other cap fields that don't get passed along in the cap_reconnect_t (truncate_size, truncate_seq?) that will get dropped till the mds sends a cap update again...

Actions #5

Updated by Greg Farnum about 11 years ago

I believe those are okay as truncate size changes should end up actually journaled (as setattrs) so they'll be replayed, and I think the max size change is the only thing that qualifies as a non-dirty "ask" instead of a dirty and flushing "tell" (that's new terminology I just made up).
It's always wise to check and see if there are related fields that have the same bugs as something you just fixed, though, so you should check and see if there are other things happening via the caps that we're missing here.
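To make the "ask" versus "tell" distinction concrete, here is a minimal toy model in Python. It is not Ceph code: `ToyClient`, `flush_caps`, `request_max_size`, and `messages_on_reconnect` are hypothetical names made up for illustration. It only assumes what the thread states: flushing caps sit on a list that is re-sent on reconnect, while a max size request is not on any such list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyClient:
    # "Tells": dirty cap state being flushed. Tracked, so it can be re-sent.
    flushing: list = field(default_factory=list)
    # "Asks": outstanding max size requests. Tracking these at all is the
    # hypothetical addition; in the buggy case nothing re-sends them.
    pending_max_size: dict = field(default_factory=dict)

    def flush_caps(self, ino, dirty):
        self.flushing.append((ino, dirty))

    def request_max_size(self, ino, wanted):
        self.pending_max_size[ino] = wanted

    def messages_on_reconnect(self, resend_asks):
        # Tells are always replayed to the recovering MDS.
        msgs = [("flush", ino, dirty) for ino, dirty in self.flushing]
        if resend_asks:
            # Asks only reach the new MDS if the client re-sends them itself.
            msgs += [("max_size", ino, wanted)
                     for ino, wanted in sorted(self.pending_max_size.items())]
        return msgs
```

With `resend_asks=False` the max size request silently disappears across an MDS restart, which matches the hang described in this bug; with `resend_asks=True` it is replayed alongside the flushes.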

Actions #6

Updated by Zheng Yan about 11 years ago

FYI:
The kclient deals with this case by calling wake_up_session_caps(). It just clears i_wanted_max_size/i_requested_max_size and wakes up the writer.
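A rough sketch of why clearing those two fields is enough, written as a toy Python model rather than the actual kernel code. The field names echo i_wanted_max_size and i_requested_max_size, but `ToyInode`, `maybe_request_max_size`, and `wake_session_caps` are hypothetical, and the request logic is a simplifying assumption: the writer only sends a new request when it wants more than it has already requested.

```python
from dataclasses import dataclass


@dataclass
class ToyInode:
    max_size: int                 # limit last granted by the MDS
    wanted_max_size: int = 0      # largest size any writer has wanted
    requested_max_size: int = 0   # largest size already requested


def maybe_request_max_size(inode, endoff, sent):
    """Writer path: ask for more room unless an equal or larger ask is outstanding."""
    if endoff <= inode.max_size:
        return  # the write fits, nothing to ask for
    if endoff > inode.wanted_max_size:
        inode.wanted_max_size = endoff
    if inode.wanted_max_size > inode.requested_max_size:
        inode.requested_max_size = inode.wanted_max_size
        sent.append(inode.requested_max_size)  # stands in for a cap message


def wake_session_caps(inodes):
    """After reconnect: forget outstanding asks so woken writers re-request."""
    for inode in inodes:
        inode.wanted_max_size = 0
        inode.requested_max_size = 0
```

In this model, a writer that retries after its request was lost sends nothing, because the stale requested value still covers what it wants. Once the fields are reset, the same retry issues a fresh request.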

Actions #7

Updated by Sam Lang about 11 years ago

Oh, yeah, we can do the same in the userspace client. I'll do that and re-push. Thanks Yan!

Actions #8

Updated by Ian Colle about 11 years ago

  • Priority changed from Normal to Urgent
Actions #9

Updated by Ian Colle about 11 years ago

  • Target version set to v0.61 - Cuttlefish
Actions #10

Updated by Sam Lang about 11 years ago

  • Status changed from 7 to Fix Under Review

With the latest changes to the mds merged to master, and the fix from #4637, I was able to get a successful run of fsstress and mds_thrasher (sans an issue with client unmount hanging). I retooled the fixes into wip-4582b and submitted a pull request.

Actions #11

Updated by Sam Lang about 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #12

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Client added