Bug #21777

open

src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

Added by Patrick Donnelly over 6 years ago. Updated over 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An MDS may send an MMDSCacheRejoin(MMDSCacheRejoin::OP_WEAK) message to an MDS that is not in the rejoin, active, or stopping state. When that MDS receives the message, it fails the assertion:

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

The MDS should be more tolerant of these messages when it's not active.
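
To illustrate what "more tolerant" could mean, here is a minimal, self-contained sketch in which a weak rejoin that arrives too early is queued rather than asserted on. It is a simplified model, not Ceph code: `MdsState`, `WeakRejoinMsg`, `RejoinGate`, and all of its members are hypothetical names, and deferring (as opposed to dropping) the message is an assumption, not the fix that was reviewed for this bug.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical, simplified model of the MDS states mentioned in this report.
enum class MdsState { Replay, Resolve, Reconnect, Rejoin, Active, Stopping };

enum class Outcome { Processed, Deferred };

// Stand-in for MMDSCacheRejoin(OP_WEAK); only the sender rank is modelled.
struct WeakRejoinMsg { int from_rank; };

class RejoinGate {
public:
  explicit RejoinGate(MdsState s) : state_(s) {}

  // States in which a weak rejoin can be handled, per the description above.
  static bool can_process(MdsState s) {
    return s == MdsState::Rejoin || s == MdsState::Active ||
           s == MdsState::Stopping;
  }

  // Tolerant handler: queue early messages instead of aborting the daemon.
  Outcome handle(const WeakRejoinMsg& m) {
    if (!can_process(state_)) {
      waiting_.push_back(m);  // assumption: retry later rather than drop
      return Outcome::Deferred;
    }
    assert(can_process(state_));  // invariant now holds by construction
    ++processed_;
    return Outcome::Processed;
  }

  // On a state change, replay queued messages if the new state allows it.
  // Returns the number of messages replayed.
  std::size_t set_state(MdsState s) {
    state_ = s;
    if (!can_process(state_)) return 0;
    std::size_t replayed = 0;
    while (!waiting_.empty()) {
      waiting_.pop_front();
      ++processed_;
      ++replayed;
    }
    return replayed;
  }

  std::size_t pending() const { return waiting_.size(); }
  std::size_t processed() const { return processed_; }

private:
  MdsState state_;
  std::deque<WeakRejoinMsg> waiting_;
  std::size_t processed_ = 0;
};
```

In this model, a gate constructed in `MdsState::Resolve` returns `Outcome::Deferred` from `handle()`, and a later `set_state(MdsState::Rejoin)` replays the queued message. Whether a real MDS should defer, drop, or reply with an error in that situation is a design question this sketch does not settle.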


Files

mds_assert_rejoin.rar (404 KB), haitao chen, 08/28/2020 07:55 AM

Related issues 1 (0 open, 1 closed)

Related to CephFS - Bug #23826: mds: assert after daemon restart | Duplicate | Patrick Donnelly | 04/23/2018

Actions #1

Updated by Patrick Donnelly over 6 years ago

  • Assignee set to Patrick Donnelly
Actions #2

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to Fix Under Review
Actions #3

Updated by Patrick Donnelly over 6 years ago

  • Status changed from Fix Under Review to Need More Info

This is NMI because we weren't able to reproduce the actual problem. We'll have to wait for QE to reproduce it again with complete logs.

Actions #4

Updated by Patrick Donnelly over 6 years ago

  • Priority changed from Urgent to Normal

Reducing priority since we can't seem to get this reproduced.

Actions #5

Updated by Patrick Donnelly almost 6 years ago

  • Related to Bug #23826: mds: assert after daemon restart added
Actions #6

Updated by Patrick Donnelly almost 6 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v13.0.0
Actions #7

Updated by Patrick Donnelly almost 6 years ago

  • Status changed from Need More Info to New
  • Assignee deleted (Patrick Donnelly)
  • Target version changed from v13.0.0 to v13.2.0
  • Source set to Q/A
  • Labels (FS) multimds added

Deleted: see #24047.

Actions #8

Updated by Patrick Donnelly almost 6 years ago

<deleted/>

Actions #9

Updated by Zheng Yan almost 6 years ago

  • Status changed from New to In Progress
Actions #11

Updated by Zheng Yan almost 6 years ago

  • Status changed from In Progress to Fix Under Review
Actions #12

Updated by Patrick Donnelly almost 6 years ago

  • Status changed from Fix Under Review to New
  • Labels (FS) crash added
Actions #13

Updated by Patrick Donnelly almost 6 years ago

  • Assignee set to Zheng Yan
  • Priority changed from Urgent to Immediate
  • Target version changed from v13.2.0 to v14.0.0
  • Backport changed from luminous to mimic,luminous

Zheng, do you think this is also resolved by the fix to #23826?

Actions #14

Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Need More Info
  • Assignee deleted (Zheng Yan)
  • Priority changed from Immediate to Normal

Dropping priority on this as there have been no known recurrences.

Actions #15

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #16

Updated by Patrick Donnelly over 4 years ago

  • Target version deleted (v15.0.0)
Actions #17

Updated by haitao chen over 3 years ago

mds.node188-2 (rank 2) received an MMDSCacheRejoin message from mds.node185-0 (rank 1) while mds.node188-2 was in the resolve state, which triggered the assert. The assert occurred on 2020-08-24.
