Bug #21777: src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin()) - CephFS - Ceph

Bug #21777

open

src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

Added by Patrick Donnelly over 6 years ago. Updated over 3 years ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done:
Source: Q/A
Tags:
Backport: mimic,luminous
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An MDS may send an MMDSCacheRejoin(MMDSCacheRejoin::OP_WEAK) message to a peer MDS that is not in the rejoin, active, or stopping state. When that MDS receives the message, it fails the assertion:

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

The MDS should be more tolerant of these messages when it's not active.
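
A minimal sketch of what "more tolerant" could look like, assuming the receiver simply drops a weak rejoin that arrives in an unexpected state. Every name below (MdsState, WeakRejoinMsg, Outcome, and the two handler functions) is hypothetical and heavily simplified; this is not the MDCache code and not the change proposed in any linked pull request. Whether dropping the message is actually safe for the real rejoin protocol is not established here.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins; these are NOT the real Ceph types.
enum class MdsState { Resolve, Rejoin, Active, Stopping };

enum class Outcome { Processed, Dropped };

struct WeakRejoinMsg {
    int from_rank;  // rank of the sending MDS
};

// States in which the description above expects a weak rejoin to arrive.
bool can_handle_weak_rejoin(MdsState s) {
    return s == MdsState::Rejoin || s == MdsState::Active ||
           s == MdsState::Stopping;
}

// Strict variant: aborts on an unexpected state, similar in spirit to the
// failed assert in the backtrace.
Outcome handle_weak_rejoin_strict(MdsState s, const WeakRejoinMsg&) {
    assert(can_handle_weak_rejoin(s));
    return Outcome::Processed;
}

// Tolerant variant (assumption): report the message as dropped instead of
// bringing the daemon down.
Outcome handle_weak_rejoin_tolerant(MdsState s, const WeakRejoinMsg&) {
    if (!can_handle_weak_rejoin(s)) {
        return Outcome::Dropped;
    }
    return Outcome::Processed;
}
```

With this sketch, a receiver in the resolve state would return Outcome::Dropped from the tolerant handler, where the strict handler would abort.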

Files

mds_assert_rejoin.rar (404 KB)

haitao chen, 08/28/2020 07:55 AM

Related issues 1 (0 open — 1 closed)

Updated by Patrick Donnelly over 6 years ago

Assignee set to Patrick Donnelly

Updated by Patrick Donnelly over 6 years ago

Status changed from New to Fix Under Review

https://github.com/ceph/ceph/pull/18278

Updated by Patrick Donnelly over 6 years ago

Status changed from Fix Under Review to Need More Info

This is NMI because we weren't able to reproduce the actual problem. We'll have to wait for QE to reproduce it again with complete logs.

Updated by Patrick Donnelly over 6 years ago

Priority changed from Urgent to Normal

Reducing priority since we can't seem to get this reproduced.

Updated by Patrick Donnelly almost 6 years ago

Related to Bug #23826: mds: assert after daemon restart added

Updated by Patrick Donnelly almost 6 years ago

Priority changed from Normal to Urgent
Target version set to v13.0.0

Updated by Patrick Donnelly almost 6 years ago

Status changed from Need More Info to New
Assignee deleted (~~Patrick Donnelly~~)
Target version changed from v13.0.0 to v13.2.0
Source set to Q/A
Labels (FS) multimds added

Deleted: see #24047.

Updated by Patrick Donnelly almost 6 years ago

Updated by Zheng Yan almost 6 years ago

Status changed from New to In Progress

#10

Updated by Zheng Yan almost 6 years ago

~~https://github.com/ceph/ceph/pull/21883~~

#11

Updated by Zheng Yan almost 6 years ago

Status changed from In Progress to Fix Under Review

#12

Updated by Patrick Donnelly almost 6 years ago

Status changed from Fix Under Review to New
Labels (FS) crash added

#13

Updated by Patrick Donnelly almost 6 years ago

Assignee set to Zheng Yan
Priority changed from Urgent to Immediate
Target version changed from v13.2.0 to v14.0.0
Backport changed from luminous to mimic,luminous

Zheng, do you think this is also resolved by the fix to #23826?

#14

Updated by Patrick Donnelly over 5 years ago

Status changed from New to Need More Info
Assignee deleted (~~Zheng Yan~~)
Priority changed from Immediate to Normal

Dropping priority on this as there have been no known recurrences.

#15

Updated by Patrick Donnelly about 5 years ago

Target version changed from v14.0.0 to v15.0.0

#16

Updated by Patrick Donnelly over 4 years ago

Target version deleted (~~v15.0.0~~)

#17

Updated by haitao chen over 3 years ago

File mds_assert_rejoin.rar added

mds.node188-2 (rank 2) received an MMDSCacheRejoin message from mds.node185-0 (rank 1) while mds.node188-2 was still in the resolve state, which triggered the assert. The assert occurred on 2020-08-24.
