Bug #18646: mds: rejoin_import_cap FAILED assert(session) - CephFS - Ceph

Actions

Copy link

Bug #18646

closed

mds: rejoin_import_cap FAILED assert(session)

Added by Patrick Donnelly over 7 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Development

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

multimds

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: In function 'Capability* MDCache::rejoin_import_cap(CInode*, client_t, const cap_reconnect_t&, mds_rank_t)' thread 7f28bb8fd700 time 2017-01-17 01:22:49.270835
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: 5555: FAILED assert(session)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: ceph version 11.1.0-6678-gda73c09 (da73c09995c9be5fca8d078223e0e9f3d071b2ab)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7e) [0x7f28c1db3b0e]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 2: (MDCache::rejoin_import_cap(CInode*, client_t, cap_reconnect_t const&, int)+0x23d) [0x5632a76f2ead]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 3: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1991) [0x5632a772dd11]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 4: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x5632a77322ab]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 5: (MDCache::dispatch(Message*)+0xa5) [0x5632a7737685]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 6: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x5632a762db2c]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 7: (MDSRank::_dispatch(Message*, bool)+0x20c) [0x5632a76372bc]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x5632a7638485]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x5632a7625d03]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 10: (DispatchQueue::entry()+0x7a2) [0x7f28c1e0fff2]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f28c1ea046d]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 12: (()+0x7dc5) [0x7f28c06b6dc5]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 13: (clone()+0x6d) [0x7f28bf58c73d]

From: http://pulpito.ceph.com/pdonnell-2017-01-16_23:40:01-multimds:thrash-wip-multimds-thrasher-testing-basic-mira/723494/

To me it looks like this bug is caused by a stopping MDS that has removed a client session but (due to another MDS failing) gets a MMDSCacheRejoin message for the client that's been removed. This causes a session lookup failure in rejoin_import_cap:

  Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(client.v));
  assert(session);

[I think we could also see this if a client hasn't contacted an MDS which is importing caps (without any MDS failures). Is that reasonable?]

Actions

Copy link

Updated by Zheng Yan over 7 years ago

Yes, it's reasonable. If client did not close the session volunteerly. It's likely the session was killed (due to timeout) by the importing mds,

Actions

Copy link

Updated by John Spray over 7 years ago

I've been thinking a bit about how we handle eviction in the multimds case, and whether we perhaps ought to centralize client eviction (during reconnect phase or on timeout generally) on rank 0. Otherwise we're going to have lots of these weird situations where some MDSs have evicted a client and the other MDSs haven't.

Actions

Copy link