Bug #18646

closed

mds: rejoin_import_cap FAILED assert(session)

Added by Patrick Donnelly over 7 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: In function 'Capability* MDCache::rejoin_import_cap(CInode*, client_t, const cap_reconnect_t&, mds_rank_t)' thread 7f28bb8fd700 time 2017-01-17 01:22:49.270835
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: 5555: FAILED assert(session)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: ceph version 11.1.0-6678-gda73c09 (da73c09995c9be5fca8d078223e0e9f3d071b2ab)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7e) [0x7f28c1db3b0e]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 2: (MDCache::rejoin_import_cap(CInode*, client_t, cap_reconnect_t const&, int)+0x23d) [0x5632a76f2ead]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 3: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1991) [0x5632a772dd11]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 4: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x5632a77322ab]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 5: (MDCache::dispatch(Message*)+0xa5) [0x5632a7737685]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 6: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x5632a762db2c]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 7: (MDSRank::_dispatch(Message*, bool)+0x20c) [0x5632a76372bc]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x5632a7638485]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x5632a7625d03]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 10: (DispatchQueue::entry()+0x7a2) [0x7f28c1e0fff2]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f28c1ea046d]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 12: (()+0x7dc5) [0x7f28c06b6dc5]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 13: (clone()+0x6d) [0x7f28bf58c73d]

From: http://pulpito.ceph.com/pdonnell-2017-01-16_23:40:01-multimds:thrash-wip-multimds-thrasher-testing-basic-mira/723494/

To me it looks like this bug is caused by a stopping MDS that has already removed a client session but, because another MDS failed, then receives an MMDSCacheRejoin message that refers to the removed client. The session lookup in rejoin_import_cap then returns no session and the assert fires:

  Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(client.v));
  assert(session);

[I think we could also see this without any MDS failure, if a client hasn't yet contacted an MDS that is importing caps. Is that reasonable?]
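
To make the suspected sequence concrete, here is a minimal toy model of it. This is not Ceph code: ToySession, ToySessionMap, import_cap_asserting and import_cap_checked are hypothetical names invented for illustration, and the "checked" variant only shows one possible way a caller could tolerate a missing session, not what the actual fix should be.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy stand-ins for illustration only; these are NOT the real Ceph classes.
struct ToySession {
  std::int64_t client_id;
  int caps_imported;
};

class ToySessionMap {
public:
  void add_session(std::int64_t client_id) {
    sessions_.emplace(client_id, ToySession{client_id, 0});
  }

  // Models a stopping MDS dropping a client session.
  void remove_session(std::int64_t client_id) { sessions_.erase(client_id); }

  // Returns nullptr when the client has no session.
  ToySession* get_session(std::int64_t client_id) {
    auto it = sessions_.find(client_id);
    return it == sessions_.end() ? nullptr : &it->second;
  }

private:
  std::unordered_map<std::int64_t, ToySession> sessions_;
};

// Same shape as the snippet in the description: look up, then assert.
// Aborts (in a build with asserts enabled) if the session was removed.
ToySession* import_cap_asserting(ToySessionMap& map, std::int64_t client_id) {
  ToySession* session = map.get_session(client_id);
  assert(session);
  ++session->caps_imported;
  return session;
}

// Hypothetical tolerant variant: report the missing session to the caller
// instead of asserting, so the caller can decide how to handle it.
ToySession* import_cap_checked(ToySessionMap& map, std::int64_t client_id) {
  ToySession* session = map.get_session(client_id);
  if (!session) {
    return nullptr;
  }
  ++session->caps_imported;
  return session;
}
```

In this model, calling remove_session for a client and then import_cap_asserting for the same client reproduces the failed assert, while import_cap_checked returns nullptr. Whether skipping the import is actually safe in the real rejoin path is a separate question that this sketch does not answer.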
