Bug #21777

src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

Added by Patrick Donnelly about 1 year ago. Updated 3 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version:
Start date: 10/12/2017
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, multimds
Pull request ID:

Description

An MDS may send an MMDSCacheRejoin(MMDSCacheRejoin::OP_WEAK) message to an MDS that is not in the rejoin, active, or stopping state. When that MDS receives the message, it fails the assertion:

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

The MDS should be more tolerant of these messages when it's not active.
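
As a rough illustration of what "more tolerant" could mean here, the sketch below drops a weak rejoin message that arrives in an unexpected state instead of asserting. It is not the Ceph code and not a proposed patch: `MdsState`, `RejoinOutcome`, `can_handle_weak_rejoin` and `handle_weak_rejoin` are hypothetical stand-ins invented for this example, and the set of accepted states is taken only from the description above.

```cpp
// Hypothetical, simplified stand-ins; these are NOT the real Ceph types.
enum class MdsState { Replay, Resolve, Reconnect, Rejoin, Active, Stopping };

enum class RejoinOutcome { Handled, Dropped };

// Assumption based on the description: a weak rejoin is only expected
// while the receiving MDS is in rejoin, active, or stopping.
inline bool can_handle_weak_rejoin(MdsState s) {
  return s == MdsState::Rejoin ||
         s == MdsState::Active ||
         s == MdsState::Stopping;
}

// Tolerant handler: report that the message was dropped rather than
// aborting the daemon with a failed assertion.
inline RejoinOutcome handle_weak_rejoin(MdsState current) {
  if (!can_handle_weak_rejoin(current)) {
    // A real daemon would presumably log the sender and the current
    // state here before discarding the message.
    return RejoinOutcome::Dropped;
  }
  // ... normal weak-rejoin processing would go here ...
  return RejoinOutcome::Handled;
}
```

With this shape, `handle_weak_rejoin(MdsState::Replay)` yields `RejoinOutcome::Dropped` and the process keeps running. Whether dropping is safe, or whether the sender must retry later, depends on the rejoin protocol and is not settled by this report.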


Related issues

Related to fs - Bug #23826: mds: assert after daemon restart (status: Duplicate, 04/23/2018)

History

#1 Updated by Patrick Donnelly about 1 year ago

  • Assignee set to Patrick Donnelly

#2 Updated by Patrick Donnelly about 1 year ago

  • Status changed from New to Need Review

#3 Updated by Patrick Donnelly about 1 year ago

  • Status changed from Need Review to Need More Info

This is NMI because we weren't able to reproduce the actual problem. We'll have to wait for QE to reproduce again with complete logs.

#4 Updated by Patrick Donnelly about 1 year ago

  • Priority changed from Urgent to Normal

Reducing priority since we can't seem to get this reproduced.

#5 Updated by Patrick Donnelly 8 months ago

  • Related to Bug #23826: mds: assert after daemon restart added

#6 Updated by Patrick Donnelly 8 months ago

  • Priority changed from Normal to Urgent
  • Target version set to v13.0.0

#7 Updated by Patrick Donnelly 7 months ago

  • Status changed from Need More Info to New
  • Assignee deleted (Patrick Donnelly)
  • Target version changed from v13.0.0 to v13.2.0
  • Source set to Q/A
  • Labels (FS) multimds added

Deleted: see #24047.

#8 Updated by Patrick Donnelly 7 months ago

<deleted/>

#9 Updated by Zheng Yan 7 months ago

  • Status changed from New to In Progress

#11 Updated by Zheng Yan 7 months ago

  • Status changed from In Progress to Need Review

#12 Updated by Patrick Donnelly 7 months ago

  • Status changed from Need Review to New
  • Labels (FS) crash added

#13 Updated by Patrick Donnelly 7 months ago

  • Assignee set to Zheng Yan
  • Priority changed from Urgent to Immediate
  • Target version changed from v13.2.0 to v14.0.0
  • Backport changed from luminous to mimic,luminous

Zheng, do you think this is also resolved by the fix to #23826?

#14 Updated by Patrick Donnelly 3 months ago

  • Status changed from New to Need More Info
  • Assignee deleted (Zheng Yan)
  • Priority changed from Immediate to Normal

Dropping priority on this as there has been no known recurrence.
