Bug #10248

messenger: failed Pipe::connect::assert(m) in Hadoop client

Added by Greg Farnum over 9 years ago. Updated over 7 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Code Hygiene
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have logs and a core dump from the QA run: http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-30_23:12:01-hadoop-next-testing-basic-multi/627428/

Given that it's the connect function, we're clearly opening up a connection to somebody who thinks that we've spoken to them before, but the seq they send back to us is too large (we don't have that many messages queued up). I wonder if we could possibly have done something bizarre like reuse the client entity across instances, but I think that's protected against in other ways.
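To make the failure mode concrete, here is a minimal sketch of the reconnect bookkeeping described above. It is not the actual Pipe::connect code: `OutgoingState`, `discard_acked`, and the use of strings as messages are hypothetical names chosen for illustration. The only assumption taken from the report is that the peer returns a seq for messages it believes it already received, and that the connecting side drops that many messages from its queue.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>

// Hypothetical stand-in for the sending side of one connection.
struct OutgoingState {
  uint64_t out_seq = 0;            // seq of the last message counted as sent
  std::deque<std::string> queued;  // messages queued for (re)send, oldest first
};

// Drop the messages the peer claims to have received already.
// Returns false when the peer's seq is ahead of everything we have queued,
// which is the situation the failed assert(m) in this report describes.
bool discard_acked(OutgoingState& s, uint64_t peer_acked_seq) {
  while (peer_acked_seq > s.out_seq) {
    if (s.queued.empty()) {
      return false;  // nothing left to discard: the peer's seq is too large
    }
    s.queued.pop_front();
    ++s.out_seq;
  }
  // Post-condition of the loop: we have caught up with the peer's seq.
  assert(s.out_seq >= peer_acked_seq);
  return true;
}
```

With three queued messages and a peer seq of 2, the function discards two messages and succeeds. With one queued message and a peer seq of 5, it runs out of messages and reports the mismatch instead of aborting.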

History

#1 Updated by Sage Weil over 9 years ago

the new assert for wip-10057 would trigger this.

this looks like a corner case in the session close + reopen sequence. the client clears its seq state but the mds hasn't yet... but iirc the client isn't supposed to do that until it gets a positive "i closed this session" reply from the mds. :/

#2 Updated by Greg Farnum over 9 years ago

Hmm, the client only calls _closed_mds_session if:
1) it gets back a session close
2) the session goes stale
2a) which is set only if we get a remote reset while the session is open
3) it gets a remote reset while the state is closing or opening

Although it will also call mark_down() on the ConnectionRef of MDSes
1) which have CommandOps in progress and are named as laggy or nonexistent in the map
2) which are listed as not up, or whose gid does not match the current incarnation of that rank

So yeah, maybe if there are MDSes failing back and forth and we close a session without the MDS seeing the maps which forced it? If so I think this is related to #10080, although looking at the resetcheck policies I don't think the fix for that will have an impact here.
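The three cases in which the client calls _closed_mds_session, as listed at the top of this comment, can be sketched as a small predicate. This is only an illustration of the list, not the client's real code: the `SessionState` and `SessionEvent` enums and the function name `triggers_closed_session` are hypothetical, and the stale case is folded in by treating a remote reset on an open session as leading to the close path.

```cpp
#include <cassert>

// Hypothetical session states and events, named only for this sketch.
enum class SessionState { Opening, Open, Closing, Closed };
enum class SessionEvent { SessionCloseReply, RemoteReset };

// Returns true when, per the list above, the client would end up
// calling _closed_mds_session for this state and event.
bool triggers_closed_session(SessionState state, SessionEvent ev) {
  switch (ev) {
    case SessionEvent::SessionCloseReply:
      // Case 1: the MDS sent back a session close.
      return true;
    case SessionEvent::RemoteReset:
      // Cases 2/2a: a reset on an open session marks it stale, then closed.
      // Case 3: a reset while the session is opening or closing.
      return state == SessionState::Open ||
             state == SessionState::Opening ||
             state == SessionState::Closing;
  }
  return false;
}
```

Under this reading, none of the paths clears the session without either an explicit close from the MDS or a remote reset, which is why the mark_down() cases are the more likely source of the mismatch.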

#3 Updated by Loïc Dachary almost 9 years ago

  • Regression set to No

is it still valid ?

#4 Updated by Greg Farnum almost 9 years ago

Yes, the related issue chain has been seen a few times, more recently.

#5 Updated by Sage Weil almost 9 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (msgr)

#6 Updated by Greg Farnum over 7 years ago

  • Category set to Code Hygiene
