messenger: failed Pipe::connect::assert(m) in Hadoop client
We have logs and a core dump from the QA run: http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-30_23:12:01-hadoop-next-testing-basic-multi/627428/
Given that it's the connect function, we're clearly opening up a connection to somebody who thinks that we've spoken to them before, but the seq they send back to us is too large (we don't have that many messages queued up). I wonder if we could possibly have done something bizarre like reuse the client entity across instances, but I think that's protected against in other ways.
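To make the failure mode concrete, here is a minimal sketch of the kind of check that fails. It is a toy model, not the actual Pipe::connect code: `QueuedMsg`, `skip_acked`, and the parameter names are hypothetical, and the function returns false where the real code would hit assert(m).

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a queued outgoing message (not Ceph's Message).
struct QueuedMsg {
  uint64_t seq;
};

// On reconnect the peer reports the highest seq it has already received.
// Skip past every queued message up to that seq. Returns false if the peer
// claims to have seen more messages than we have queued, which is the
// condition the failed assert(m) protects against.
bool skip_acked(std::deque<QueuedMsg>& pending, uint64_t& out_seq,
                uint64_t peer_acked_seq) {
  while (peer_acked_seq > out_seq) {
    if (pending.empty()) {
      return false;  // peer is ahead of us: nothing left to skip
    }
    pending.pop_front();  // peer already has this message
    ++out_seq;
  }
  return true;
}
```

In this model, a client that has reset its own state (empty queue, out_seq of 0) and then reconnects to a peer that still remembers a nonzero seq fails the check immediately.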
#1 Updated by Sage Weil over 4 years ago
the new assert for wip-10057 would trigger this.
this looks like a corner case in the session close + reopen sequence. the client clears its seq state but the mds hasn't yet... but iirc the client isn't supposed to do that until it gets a positive "i closed this session" reply from the mds. :/
#2 Updated by Greg Farnum over 4 years ago
Hmm, the client only calls _closed_mds_session if:
1) it gets back a session close
2) the session goes stale
2a) which is set only if we get a remote reset while the session is open
3) it gets a remote reset while the state is closing or opening
Although it will also call mark_down() on the ConnectionRef of MDSes
1) which have CommandOps in progress and are named as laggy or nonexistent in the map
2) which are listed as not up, or whose gid does not match the current incarnation of that rank
So yeah, maybe if there are MDSes failing back and forth and we close a session without the MDS seeing the maps which forced it? If so I think this is related to #10080, although looking at the resetcheck policies I don't think the fix for that will have an impact here.
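The numbered conditions for calling _closed_mds_session in this comment can be written as a small predicate. This is only a restatement of the list under assumed names: the `SessionState` and `Event` enums and `calls_closed_mds_session` are hypothetical, not the client's real types, and any case the list does not mention is treated as "not called".

```cpp
// Hypothetical session states and events, named after the cases in the list.
enum class SessionState { Opening, Open, Closing, Stale };
enum class Event { SessionCloseReply, RemoteReset };

// Returns true when, per the list in the comment, the client would end up
// calling _closed_mds_session for this state/event pair.
bool calls_closed_mds_session(SessionState state, Event ev) {
  switch (ev) {
    case Event::SessionCloseReply:
      return true;  // case 1: the MDS sent back a session close
    case Event::RemoteReset:
      // case 2/2a: a reset while open marks the session stale, which closes it
      // case 3: a reset while the state is closing or opening
      return state == SessionState::Open ||
             state == SessionState::Opening ||
             state == SessionState::Closing;
  }
  return false;  // anything not listed in the comment
}
```

The mark_down() cases are deliberately left out of the sketch, since the comment lists them separately from the _closed_mds_session paths.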