mds: Client hang on fsstress with mds_thrasher
While trying to reproduce #4565, fsstress eventually hangs: the client waits for a max size update that the mds never sends. This is similar to the bug addressed by Zheng Yan's fix (c08ccf350bb726fd9c4b7ce1316e14111ed31b6e), except that here the mds never journals the new max size.
client: Kick waiters for max size
If the mds restarts without successfully logging a max size
cap update, the client waits indefinitely in Client::get_caps
on the waitfor_caps list. So when the client gets an mds map
indicating a new active mds has replaced a down mds, we need to
kick the caps update request. This patch mimics the behavior
in the kernel by setting wanted_max_size and
requested_max_size to 0 and waking up the waiters.
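The kick described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `InodeSketch` and `kick_max_size_waiters` are hypothetical names, and the waiter list is modeled as plain callbacks rather than the client's real condition objects. Only `wanted_max_size`, `requested_max_size` and `waitfor_caps` come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Simplified stand-in for the client's per-inode state (not the real Ceph Inode).
struct InodeSketch {
  uint64_t wanted_max_size = 0;     // max size the client wants
  uint64_t requested_max_size = 0;  // max size already asked of the mds
  std::vector<std::function<void()>> waitfor_caps;  // callers blocked in get_caps
};

// Hypothetical helper: call once the mds map shows that a new active mds
// has replaced a down one.
inline void kick_max_size_waiters(InodeSketch& in) {
  // Forget the outstanding request so the next get_caps pass asks again.
  in.wanted_max_size = 0;
  in.requested_max_size = 0;

  // Move the list out first so a woken waiter can safely re-queue itself.
  std::vector<std::function<void()>> waiters;
  waiters.swap(in.waitfor_caps);
  assert(in.waitfor_caps.empty());

  for (auto& wake : waiters) {
    wake();
  }
}
```

Resetting both fields to 0 is what makes the woken waiter send a fresh max size request instead of assuming one is still in flight.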
#3 Updated by Greg Farnum about 8 years ago
I'm not sure this is wrong, but it's confusing me a bit. I thought that the Client sent all capabilities it holds back to the MDS during the reconnect/replay sequence without needing a kick of any kind.
goes and looks
Ah, but it doesn't send back a full Capability struct, so we do in fact need to re-send the cap update. I'd have thought a size change would already be on the re-send list, but I guess it's not, since it's a request and not a tell (i.e., it doesn't count as flushing the caps, which is the only list we have). It therefore has different invariants than the existing list, so this approach looks good to me.
#4 Updated by Sam Lang about 8 years ago
I spent most of this morning figuring out whether it made sense to send the full cap (ceph_mds_caps) and get rid of the cap_reconnect_t. I think the behavior at the mds is too different though, or at least, doing so wouldn't qualify as a 'fix'.
There might be other cap fields that don't get passed along in the cap_reconnect_t (truncate_size, truncate_seq?) that will get dropped till the mds sends a cap update again...
#5 Updated by Greg Farnum about 8 years ago
I believe those are okay as truncate size changes should end up actually journaled (as setattrs) so they'll be replayed, and I think the max size change is the only thing that qualifies as a non-dirty "ask" instead of a dirty and flushing "tell" (that's new terminology I just made up).
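To make the "ask" versus "tell" distinction concrete, here is a small sketch under stated assumptions. `CapSketch` and all three helpers are hypothetical and do not mirror the real client structures. It only illustrates why a pending max size request is missed by a re-send list that tracks flushing caps alone.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of one cap as seen by the client.
struct CapSketch {
  bool flushing = false;            // dirty state being flushed: a "tell"
  uint64_t max_size = 0;            // max size last granted by the mds
  uint64_t requested_max_size = 0;  // outstanding request: an "ask"
};

// Record an "ask". Assumption for this sketch: the client only asks
// for more than it has already been granted.
inline void request_max_size(CapSketch& c, uint64_t want) {
  assert(want > c.max_size);
  c.requested_max_size = want;
}

// A "tell" is covered by the existing flushing list on reconnect.
inline bool resent_with_flush_list(const CapSketch& c) {
  return c.flushing;
}

// An "ask" is not dirty, so the flushing list never covers it;
// it has to be re-issued separately after an mds restart.
inline bool needs_max_size_reask(const CapSketch& c) {
  return c.requested_max_size > c.max_size;
}
```

A cap with an unanswered max size request and nothing dirty falls through the first check and is caught only by the second, which is the case this bug hits.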
It's always wise to check and see if there are related fields that have the same bugs as something you just fixed, though, so you should check and see if there are other things happening via the caps that we're missing here.
#10 Updated by Sam Lang about 8 years ago
- Status changed from 7 to Fix Under Review
With the latest changes to the mds merged to master, and the fix from #4637, I was able to get a successful run of fsstress and mds_thrasher (sans an issue with client unmount hanging). I retooled the fixes into wip-4582b and submitted a pull request.