Bug #18646
closed
mds: rejoin_import_cap FAILED assert(session)
Description
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: In function 'Capability* MDCache::rejoin_import_cap(CInode*, client_t, const cap_reconnect_t&, mds_rank_t)' thread 7f28bb8fd700 time 2017-01-17 01:22:49.270835
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6678-gda73c09/rpm/el7/BUILD/ceph-11.1.0-6678-gda73c09/src/mds/MDCache.cc: 5555: FAILED assert(session)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: ceph version 11.1.0-6678-gda73c09 (da73c09995c9be5fca8d078223e0e9f3d071b2ab)
2017-01-17T01:22:49.274 INFO:tasks.ceph.mds.b.mira101.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7e) [0x7f28c1db3b0e]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 2: (MDCache::rejoin_import_cap(CInode*, client_t, cap_reconnect_t const&, int)+0x23d) [0x5632a76f2ead]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 3: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x1991) [0x5632a772dd11]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 4: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x5632a77322ab]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 5: (MDCache::dispatch(Message*)+0xa5) [0x5632a7737685]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 6: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x5632a762db2c]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 7: (MDSRank::_dispatch(Message*, bool)+0x20c) [0x5632a76372bc]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x5632a7638485]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x5632a7625d03]
2017-01-17T01:22:49.275 INFO:tasks.ceph.mds.b.mira101.stderr: 10: (DispatchQueue::entry()+0x7a2) [0x7f28c1e0fff2]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f28c1ea046d]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 12: (()+0x7dc5) [0x7f28c06b6dc5]
2017-01-17T01:22:49.276 INFO:tasks.ceph.mds.b.mira101.stderr: 13: (clone()+0x6d) [0x7f28bf58c73d]
To me it looks like this bug is caused by a stopping MDS that has removed a client session but (due to another MDS failing) gets a MMDSCacheRejoin message for the client that's been removed. This causes a session lookup failure in rejoin_import_cap:
Session *session = mds->sessionmap.get_session(entity_name_t::CLIENT(client.v));
assert(session);
[I think we could also see this if a client hasn't contacted an MDS which is importing caps (without any MDS failures). Is that reasonable?]
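To make the failure mode concrete, here is a minimal standalone sketch of the lookup-then-assert pattern and one defensive alternative: return a null capability when the session is gone, so the caller can skip that client. The types and names below (MiniSessionMap, MiniCache, rejoin_import_cap_sketch) are hypothetical stand-ins, not the real Ceph classes, and this is not necessarily how the issue was actually fixed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-ins for the MDS types (not Ceph's classes).
struct Session { int64_t client_id; };
struct Capability { int64_t client_id; };

class MiniSessionMap {
  std::map<int64_t, Session> sessions_;
public:
  void add(int64_t client) { sessions_[client] = Session{client}; }
  void remove(int64_t client) { sessions_.erase(client); }
  // Returns nullptr when the client has no session, e.g. after it was
  // removed by a stopping MDS or killed on timeout.
  Session* get_session(int64_t client) {
    auto it = sessions_.find(client);
    return it == sessions_.end() ? nullptr : &it->second;
  }
};

struct MiniCache {
  MiniSessionMap sessionmap;
  std::map<int64_t, Capability> caps;

  // Assumption: instead of assert(session), report the missing session to
  // the caller by returning nullptr and importing nothing.
  Capability* rejoin_import_cap_sketch(int64_t client) {
    Session* session = sessionmap.get_session(client);
    if (!session) {
      return nullptr;  // session already gone: skip this client's caps
    }
    auto res = caps.emplace(client, Capability{session->client_id});
    return &res.first->second;  // map nodes are stable, pointer stays valid
  }
};
```

With this shape, a rejoin message that names a removed client results in a skipped import rather than a daemon abort; whether silently skipping is acceptable in the real MDS depends on what the peer expects in its rejoin ack, which this sketch does not model.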
Updated by Zheng Yan over 7 years ago
Yes, it's reasonable. If the client did not close the session voluntarily, it's likely the session was killed (due to timeout) by the importing MDS.
Updated by John Spray over 7 years ago
I've been thinking a bit about how we handle eviction in the multimds case, and whether we perhaps ought to centralize client eviction (during reconnect phase or on timeout generally) on rank 0. Otherwise we're going to have lots of these weird situations where some MDSs have evicted a client and the other MDSs haven't.
Updated by Zheng Yan about 7 years ago
- Status changed from New to Fix Under Review
Updated by Zheng Yan about 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Patrick Donnelly about 5 years ago
- Category deleted (90)
- Labels (FS) multimds added