Bug #21777
src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())
Description
An MDS may send an MMDSCacheRejoin(MMDSCacheRejoin::OP_WEAK) message to an MDS that is not in the rejoin, active, or stopping state. When the receiving MDS handles the message, it fails the assertion:
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
The MDS should be more tolerant of these messages when it's not active.
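To illustrate what "more tolerant" could mean, here is a minimal, self-contained sketch of a state check that drops an unexpected weak rejoin instead of asserting. It is not the Ceph code and not the actual fix: RankState, RejoinAction and on_weak_rejoin are hypothetical names invented for this example, and whether such a message should be dropped, deferred, or handled some other way is an assumption.

```cpp
// Hypothetical, simplified stand-ins; these are NOT the real Ceph types.
enum class RankState { Replay, Resolve, Reconnect, Rejoin, Active, Stopping };

enum class RejoinAction { Process, Drop };

// Decide what to do with a weak cache-rejoin message based on the
// receiver's state, rather than asserting on an unexpected state.
// Assumption: rejoin/active/stopping are the states in which the
// message is expected, as stated in the description above.
inline RejoinAction on_weak_rejoin(RankState state) {
    switch (state) {
    case RankState::Rejoin:
    case RankState::Active:
    case RankState::Stopping:
        return RejoinAction::Process;
    default:
        // Any other state (for example Resolve): ignore the message.
        return RejoinAction::Drop;
    }
}
```

With this shape, a receiver in the resolve state would get RejoinAction::Drop and keep running, whereas a hard assert takes the daemon down.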
Related issues
History
#1 Updated by Patrick Donnelly over 5 years ago
- Assignee set to Patrick Donnelly
#2 Updated by Patrick Donnelly over 5 years ago
- Status changed from New to Fix Under Review
#3 Updated by Patrick Donnelly over 5 years ago
- Status changed from Fix Under Review to Need More Info
This is NMI because we weren't able to reproduce the actual problem. We'll have to wait for QE to reproduce it again with complete logs.
#4 Updated by Patrick Donnelly over 5 years ago
- Priority changed from Urgent to Normal
Reducing priority since we can't seem to get this reproduced.
#5 Updated by Patrick Donnelly about 5 years ago
- Related to Bug #23826: mds: assert after daemon restart added
#6 Updated by Patrick Donnelly about 5 years ago
- Priority changed from Normal to Urgent
- Target version set to v13.0.0
#7 Updated by Patrick Donnelly about 5 years ago
- Status changed from Need More Info to New
- Assignee deleted (Patrick Donnelly)
- Target version changed from v13.0.0 to v13.2.0
- Source set to Q/A
- Labels (FS) multimds added
Deleted: see #24047.
#8 Updated by Patrick Donnelly about 5 years ago
<deleted/>
#9 Updated by Zheng Yan about 5 years ago
- Status changed from New to In Progress
#10 Updated by Zheng Yan about 5 years ago
#11 Updated by Zheng Yan about 5 years ago
- Status changed from In Progress to Fix Under Review
#12 Updated by Patrick Donnelly about 5 years ago
- Status changed from Fix Under Review to New
- Labels (FS) crash added
#13 Updated by Patrick Donnelly about 5 years ago
- Assignee set to Zheng Yan
- Priority changed from Urgent to Immediate
- Target version changed from v13.2.0 to v14.0.0
- Backport changed from luminous to mimic,luminous
Zheng, do you think this is also resolved by the fix to #23826?
#14 Updated by Patrick Donnelly over 4 years ago
- Status changed from New to Need More Info
- Assignee deleted (Zheng Yan)
- Priority changed from Immediate to Normal
Dropping priority on this as there has been no known recurrence.
#15 Updated by Patrick Donnelly about 4 years ago
- Target version changed from v14.0.0 to v15.0.0
#16 Updated by Patrick Donnelly over 3 years ago
- Target version deleted (v15.0.0)
#17 Updated by haitao chen almost 3 years ago
- File mds_assert_rejoin.rar added
mds.node188-2 (rank 2) received an MMDSCacheRejoin message from mds.node185-0 (rank 1) while mds.node188-2 was in the resolve state, which triggered the assert. The assert occurred on 2020-8-24.