Bug #21777

src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

Added by Patrick Donnelly over 5 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An MDS may send an MMDSCacheRejoin(MMDSCacheRejoin::OP_WEAK) message to an MDS that is not in the rejoin, active, or stopping state. When the receiving MDS handles the message, it fails the assertion:

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

The MDS should be more tolerant of these messages when it's not active.
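
One way to picture "more tolerant" is the toy model below. It is only a sketch and not the Ceph code: RankModel, WeakRejoin, Verdict and every method name are hypothetical, and deferring the message until the rank reaches rejoin is an assumed policy (dropping the message would be another option). The original assert is kept as an internal invariant, but it is only reached in states where it holds.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical, simplified model. None of these names come from the Ceph sources.
enum class MdsState { Replay, Resolve, Reconnect, Rejoin, Active, Stopping };
enum class Verdict { HandledAsSurvivor, HandledInRejoin, Deferred };

struct WeakRejoin {
  int from_rank;  // rank of the sending MDS
};

class RankModel {
public:
  explicit RankModel(MdsState s) : state_(s) {}

  // Tolerant entry point: a weak rejoin that arrives too early is queued
  // instead of tripping an assert. (Assumed policy, see the note above.)
  Verdict handle_weak_rejoin(const WeakRejoin& m) {
    if (!accepts_weak_rejoin()) {
      deferred_.push_back(m);
      return Verdict::Deferred;
    }
    return process(m);
  }

  // Change state and replay queued messages once the new state can
  // accept them. Returns how many messages were replayed.
  std::size_t set_state(MdsState next) {
    state_ = next;
    std::size_t replayed = 0;
    while (accepts_weak_rejoin() && !deferred_.empty()) {
      WeakRejoin m = deferred_.front();
      deferred_.pop_front();
      process(m);
      ++replayed;
    }
    return replayed;
  }

  std::size_t deferred_count() const { return deferred_.size(); }

private:
  // The three states named in the description: rejoin, active, stopping.
  bool accepts_weak_rejoin() const {
    return state_ == MdsState::Rejoin || state_ == MdsState::Active ||
           state_ == MdsState::Stopping;
  }

  Verdict process(const WeakRejoin&) {
    // Invariant kept from the report; callers check the state first.
    assert(accepts_weak_rejoin());
    if (state_ == MdsState::Rejoin) return Verdict::HandledInRejoin;
    return Verdict::HandledAsSurvivor;
  }

  MdsState state_;
  std::deque<WeakRejoin> deferred_;
};
```

In this model, a rank constructed in MdsState::Resolve returns Verdict::Deferred for an incoming WeakRejoin, and a later set_state(MdsState::Rejoin) replays the queued message rather than aborting.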

mds_assert_rejoin.rar (404 KB) haitao chen, 08/28/2020 07:55 AM


Related issues

Related to CephFS - Bug #23826: mds: assert after daemon restart Duplicate 04/23/2018

History

#1 Updated by Patrick Donnelly over 5 years ago

  • Assignee set to Patrick Donnelly

#2 Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Fix Under Review

#3 Updated by Patrick Donnelly over 5 years ago

  • Status changed from Fix Under Review to Need More Info

This is NMI because we weren't able to reproduce the actual problem. We'll have to wait for QE to reproduce again with complete logs.

#4 Updated by Patrick Donnelly over 5 years ago

  • Priority changed from Urgent to Normal

Reducing priority since we can't seem to get this reproduced.

#5 Updated by Patrick Donnelly about 5 years ago

  • Related to Bug #23826: mds: assert after daemon restart added

#6 Updated by Patrick Donnelly about 5 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v13.0.0

#7 Updated by Patrick Donnelly about 5 years ago

  • Status changed from Need More Info to New
  • Assignee deleted (Patrick Donnelly)
  • Target version changed from v13.0.0 to v13.2.0
  • Source set to Q/A
  • Labels (FS) multimds added

Deleted: see #24047.

#8 Updated by Patrick Donnelly about 5 years ago

<deleted/>

#9 Updated by Zheng Yan about 5 years ago

  • Status changed from New to In Progress

#11 Updated by Zheng Yan about 5 years ago

  • Status changed from In Progress to Fix Under Review

#12 Updated by Patrick Donnelly about 5 years ago

  • Status changed from Fix Under Review to New
  • Labels (FS) crash added

#13 Updated by Patrick Donnelly about 5 years ago

  • Assignee set to Zheng Yan
  • Priority changed from Urgent to Immediate
  • Target version changed from v13.2.0 to v14.0.0
  • Backport changed from luminous to mimic,luminous

Zheng, do you think this is also resolved by the fix to #23826?

#14 Updated by Patrick Donnelly over 4 years ago

  • Status changed from New to Need More Info
  • Assignee deleted (Zheng Yan)
  • Priority changed from Immediate to Normal

Dropping priority on this as there has been no known recurrence.

#15 Updated by Patrick Donnelly about 4 years ago

  • Target version changed from v14.0.0 to v15.0.0

#16 Updated by Patrick Donnelly over 3 years ago

  • Target version deleted (v15.0.0)

#17 Updated by haitao chen almost 3 years ago

mds.node188-2 (rank 2) received an MMDSCacheRejoin message from mds.node185-0 (rank 1) while mds.node188-2 was in the resolve state, which triggered the assert. The assert occurred on 2020-08-24.
