Bug #18646

mds: rejoin_import_cap FAILED assert(session)

Added by Patrick Donnelly 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: multi-MDS
Target version: -
Start date: 01/24/2017
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Component(FS): MDS
Needs Doc: No

Description

2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: In function 'Capability* MDCache::rejoin_import_cap(CInode*, client_t, const cap_reconnect_t&, mds_rank_t)' thread 7f28bb8fd700 time 2017-01-17 01:22:49.270835
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: 5555: FAILED assert(session)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: ceph version 11.1.0-6678-gda73c09 (da73c09995c9be5fca8d078223e0e9f3d071b2ab)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7e) [0x7f28c1db3b0e]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 2: (MDCache::rejoin_import_cap(CInode*, client_t, cap_reconnect_t const&, int)+0x23d) [0x5632a76f2ead]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 3: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1991) [0x5632a772dd11]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 4: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x5632a77322ab]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 5: (MDCache::dispatch(Message*)+0xa5) [0x5632a7737685]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 6: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x5632a762db2c]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 7: (MDSRank::_dispatch(Message*, bool)+0x20c) [0x5632a76372bc]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x5632a7638485]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x5632a7625d03]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 10: (DispatchQueue::entry()+0x7a2) [0x7f28c1e0fff2]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f28c1ea046d]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 12: (()+0x7dc5) [0x7f28c06b6dc5]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 13: (clone()+0x6d) [0x7f28bf58c73d]

From: http://pulpito.ceph.com/pdonnell-2017-01-16_23:40:01-multimds:thrash-wip-multimds-thrasher-testing-basic-mira/723494/

To me it looks like this bug is caused by a stopping MDS that has removed a client session but (due to another MDS failing) then receives an MMDSCacheRejoin message for the client that has been removed. This causes a session lookup failure in rejoin_import_cap:

  Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(client.v));
  assert(session);

[I think we could also see this if a client hasn't contacted an MDS which is importing caps (without any MDS failures). Is that reasonable?]
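
To make the failure mode concrete, here is a minimal, self-contained sketch of the lookup pattern. The types below (Session, Capability, SessionMap) and the function try_import_cap are hypothetical stand-ins, not the real Ceph classes, and skipping the cap when the session is missing is only an assumption about one possible way to tolerate the condition. It is not the fix that was applied for this bug, and whether skipping is safe in the real MDS is not established here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical stand-ins for the MDS types; NOT the real Ceph classes.
struct Session { int64_t client_id; };
struct Capability { int64_t client_id; };

class SessionMap {
  std::map<int64_t, std::unique_ptr<Session>> sessions_;
public:
  void add_session(int64_t client) {
    sessions_[client] = std::make_unique<Session>(Session{client});
  }
  // Models a stopping or importing MDS dropping the client's session.
  void remove_session(int64_t client) { sessions_.erase(client); }
  // Returns nullptr when the client has no session on this MDS.
  Session* get_session(int64_t client) const {
    auto it = sessions_.find(client);
    return it == sessions_.end() ? nullptr : it->second.get();
  }
};

// Assumed tolerant variant: return nullptr instead of asserting when the
// session is gone, so a late rejoin message cannot abort the daemon.
std::unique_ptr<Capability> try_import_cap(const SessionMap& sm, int64_t client) {
  Session* session = sm.get_session(client);
  if (!session) {
    return nullptr;  // session already removed; nothing to import
  }
  return std::make_unique<Capability>(Capability{session->client_id});
}
```

With a session present, try_import_cap returns a capability. After remove_session it returns nullptr, which is the point at which the assert(session) above fires.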

History

#1 Updated by Zheng Yan 4 months ago

Yes, it's reasonable. If the client did not close the session voluntarily, it's likely the session was killed (due to timeout) by the importing MDS.

#2 Updated by John Spray 4 months ago

I've been thinking a bit about how we handle eviction in the multimds case, and whether we perhaps ought to centralize client eviction (during reconnect phase or on timeout generally) on rank 0. Otherwise we're going to have lots of these weird situations where some MDSs have evicted a client and the other MDSs haven't.

#4 Updated by Zheng Yan 4 months ago

  • Status changed from New to Need Review

#5 Updated by Zheng Yan 3 months ago

  • Status changed from Need Review to Resolved
