Bug #4582: mds: Client hang on fsstress with mds_thrasher - CephFS - Ceph

Bug #4582

closed

mds: Client hang on fsstress with mds_thrasher

Added by Sam Lang about 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sam Lang
Target version: v0.61 - Cuttlefish
Source: Development
Severity: 3 - minor
Component(FS): Client

Description

While trying to reproduce #4565, fsstress eventually hangs with the client waiting for a max size update that the mds never sends. This is similar to the bug addressed by Zheng Yan's fix, c08ccf350bb726fd9c4b7ce1316e14111ed31b6e, except that here the mds never journals the new max size.

Updated by Sam Lang about 11 years ago

Category set to 46
Status changed from New to In Progress

Updated by Sam Lang about 11 years ago

Status changed from In Progress to 7

I just pushed wip-4582. Testing it on the fsstress test with mds_thrasher now. I'm not positive this is the right approach, so if someone wants to look it over...

Updated by Greg Farnum about 11 years ago

I'm not sure this is wrong, but it's confusing me a bit. I thought that the Client sent all capabilities it holds back to the MDS during the reconnect/replay sequence without needing a kick of any kind.
(goes and looks)
Ah, but it doesn't send back a full Capability struct, so we do in fact need to re-send the cap update. I'd have thought a size change would already be on the re-send list, but I guess it's not: it's a request and not a tell (i.e., it doesn't count as flushing the caps, which is the only list we have), so it has different invariants than the existing list. This approach looks good to me.
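
Editorial illustration: the request-versus-tell distinction above can be sketched with a toy model. This is not Ceph code. `ToyInode`, `messages_on_reconnect`, and the `resend_max_size_requests` flag are hypothetical names invented for the sketch, and the message tuples are made up. It only shows why a pending max size request is lost on reconnect when flushing caps are the only state the client tracks for re-sending.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInode:
    """Hypothetical stand-in for the client's per-inode cap state."""
    ino: int
    max_size: int = 0            # largest size the MDS has granted so far
    requested_max_size: int = 0  # outstanding "ask" (0 means nothing pending)
    flushing_caps: set = field(default_factory=set)  # dirty state being flushed ("tell")


def messages_on_reconnect(inodes, resend_max_size_requests):
    """Return the cap messages a toy client re-sends after an MDS restart."""
    out = []
    for inode in inodes:
        # A pending max size request is not dirty state, so it is on no
        # re-send list; it only goes out again if reconnect handles it
        # explicitly (the assumed shape of the fix discussed here).
        if resend_max_size_requests and inode.requested_max_size > inode.max_size:
            out.append(("request_max_size", inode.ino, inode.requested_max_size))
        # Flushing caps are tracked, so they are always re-sent.
        if inode.flushing_caps:
            out.append(("flush", inode.ino, frozenset(inode.flushing_caps)))
    return out


inodes = [
    ToyInode(ino=1, max_size=4096, requested_max_size=8192),  # writer waiting
    ToyInode(ino=2, max_size=4096, flushing_caps={"Fw"}),     # dirty caps
]
before = messages_on_reconnect(inodes, resend_max_size_requests=False)
after = messages_on_reconnect(inodes, resend_max_size_requests=True)
```

In this model, `before` contains nothing for inode 1, so its writer would wait forever; `after` includes the re-sent request.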

Updated by Sam Lang about 11 years ago

I spent most of this morning figuring out whether it made sense to send the full cap (ceph_mds_caps) and get rid of the cap_reconnect_t. I think the behavior at the mds is too different though, or at least, doing so wouldn't qualify as a 'fix'.

There might be other cap fields that don't get passed along in the cap_reconnect_t (truncate_size, truncate_seq?) that will get dropped till the mds sends a cap update again...

Updated by Greg Farnum about 11 years ago

I believe those are okay as truncate size changes should end up actually journaled (as setattrs) so they'll be replayed, and I think the max size change is the only thing that qualifies as a non-dirty "ask" instead of a dirty and flushing "tell" (that's new terminology I just made up).
It's always wise to check and see if there are related fields that have the same bugs as something you just fixed, though, so you should check and see if there are other things happening via the caps that we're missing here.

Updated by Zheng Yan about 11 years ago

FYI:
The kclient deals with this case by calling wake_up_session_caps(). It just clears i_wanted_max_size/i_requested_max_size and wakes up the writer.
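
Editorial illustration: a toy model of why clearing the two fields and waking the writer is enough. This is not the kernel code. `ToyWriterInode`, `try_write`, and `wake_after_reconnect` are hypothetical names, and the rule for deciding when to send a request is an assumption made for the sketch. The point is that a woken writer re-checks its condition and, with the bookkeeping cleared, issues a fresh max size request instead of assuming one is still in flight.

```python
class ToyWriterInode:
    """Hypothetical model of the max size bookkeeping on the client side."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.wanted_max_size = 0
        self.requested_max_size = 0
        self.sent = []  # max size requests "sent" to the MDS, in order

    def try_write(self, endoff):
        """Return True if the write fits; otherwise ask for more and return False.

        Returning False stands in for the writer going to sleep.
        """
        if endoff <= self.max_size:
            return True
        self.wanted_max_size = max(self.wanted_max_size, endoff)
        # Assumed rule: only send if no request already covers the wanted size.
        if self.wanted_max_size > self.requested_max_size:
            self.requested_max_size = self.wanted_max_size
            self.sent.append(self.requested_max_size)
        return False

    def wake_after_reconnect(self):
        """Forget in-flight requests so the woken writer asks again."""
        self.wanted_max_size = 0
        self.requested_max_size = 0


inode = ToyWriterInode(max_size=4096)
first = inode.try_write(8192)        # blocks, request sent
# The MDS restarts here and the request is lost.
retry = inode.try_write(8192)        # still blocks, but nothing new is sent
sent_before_wake = list(inode.sent)
inode.wake_after_reconnect()
woken = inode.try_write(8192)        # blocks again, but the request is re-sent
inode.max_size = 16384               # the MDS finally grants a larger max size
done = inode.try_write(8192)
```

Without the reset, the retry sends nothing because the stale `requested_max_size` already covers the write, which matches the hang described in this bug.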

Updated by Sam Lang about 11 years ago

Oh, yeah, we can do the same in the userspace client. I'll do that and re-push. Thanks Yan!

Updated by Ian Colle about 11 years ago

Priority changed from Normal to Urgent

Updated by Ian Colle about 11 years ago

Target version set to v0.61 - Cuttlefish

#10

Updated by Sam Lang about 11 years ago

Status changed from 7 to Fix Under Review

With the latest changes to the mds merged to master, and the fix from #4637, I was able to get a successful run of fsstress and mds_thrasher (sans an issue with client unmount hanging). I retooled the fixes into wip-4582b and submitted a pull request.

#11

Updated by Sam Lang about 11 years ago

Status changed from Fix Under Review to Resolved

#12

Updated by Greg Farnum almost 8 years ago

Component(FS) Client added
