Bug #23826

closed

mds: assert after daemon restart

Added by Patrick Donnelly almost 6 years ago. Updated almost 6 years ago.

Status: Duplicate
Priority: Urgent
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Support
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 5080: FAILED assert(isolated_inodes.empty())

 ceph version 12.2.1-46.el7cp (b6f6f1b141c306a43f669b974971b9ec44914cb0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x564975ec7b40]
 2: (MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)+0x25a0) [0x564975cb4e60]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x213) [0x564975cc12a3]
 4: (MDCache::dispatch(Message*)+0xa5) [0x564975cc6905]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x564975baf734]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x564975bbcd43]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x564975bbdb85]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x564975ba7023]
 9: (DispatchQueue::entry()+0x792) [0x5649761ab952]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x564975f4dfbd]
 11: (()+0x7dd5) [0x7f577c615dd5]
 12: (clone()+0x6d) [0x7f577b6f5b3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1570597


Files

ceph-mds.magna058.log.gz (52 KB), Patrick Donnelly, 04/30/2018 07:00 PM

Related issues 2 (1 open, 1 closed)

Related to CephFS - Bug #21777: src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin()) (Need More Info)

Is duplicate of CephFS - Bug #24047: MDCache.cc: 5317: FAILED assert(mds->is_rejoin()) (Resolved, Zheng Yan, 05/08/2018)

Actions #1

Updated by Patrick Donnelly almost 6 years ago

Adding log from failed MDS.

It looks like the MDS is receiving a cache rejoin ack (handled in handle_cache_rejoin_ack) while it is still in replay.

Actions #2

Updated by Patrick Donnelly almost 6 years ago

  • Related to Bug #21777: src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin()) added
Actions #3

Updated by Patrick Donnelly almost 6 years ago

Here's one possible way this could happen, I think:

1. All MDSs are in rejoin or later.
2. An up:rejoin MDS does:
3. handle_mds_map
4. MDCache::rejoin_start
5. MDCache::process_imported_caps
6. open_ino(p->first, (int64_t)-1, new C_MDC_RejoinOpenInoFinish(this, p->first), false);
7. finisher calls mdcache->rejoin_open_ino_finish(ino, r);
8. MDCache::rejoin_gather_finish();
9. MDCache::rejoin_send_acks(); which sends the ACKs

I don't see this path protected anywhere by a check of MDSMap::is_rejoining().

Actions #4

Updated by Zheng Yan almost 6 years ago

Checking MDSMap::is_rejoining() is not required here. If there are recovering MDSs that haven't entered the rejoin state, the MDCache::rejoin_gather set cannot be empty.
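
To make the argument above concrete, here is a minimal sketch of the invariant, not the real MDCache code. It assumes that the gather set holds the ranks a rejoining MDS is still waiting on, and that acks are sent only once that set drains. The type RejoinGatherModel and the helpers send_acks and handle_rejoin_from are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <set>

// Hypothetical, simplified model of the gather-set invariant.
// This is not the actual MDCache implementation.
struct RejoinGatherModel {
    std::set<int> rejoin_gather;  // ranks we are still waiting on
    bool acks_sent = false;

    // Acks are only valid once nothing is left to gather.
    void send_acks() {
        assert(rejoin_gather.empty());
        acks_sent = true;
    }

    // Called after the rejoin message from `rank` has been processed.
    void handle_rejoin_from(int rank) {
        rejoin_gather.erase(rank);
        if (rejoin_gather.empty())
            send_acks();
    }
};
```

In this model, a peer that has not yet reached rejoin stays in the set, so no ack can go out through this path while it is still recovering.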

Actions #5

Updated by Patrick Donnelly almost 6 years ago

  • Priority changed from High to Urgent
Actions #6

Updated by Patrick Donnelly almost 6 years ago

  • Assignee changed from Zheng Yan to Patrick Donnelly
  • Target version changed from v13.0.0 to v13.2.0
Actions #7

Updated by Zheng Yan almost 6 years ago

The finish context of MDCache::open_undef_inodes_dirfrags() calls rejoin_gather_finish() without checking rejoin_gather. I think that can explain this crash.

https://github.com/ceph/ceph/pull/21883/commits/0a38a499b86c0ee13aa0e783a8359bcce0876088
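
A rough sketch of the difference between an unguarded and a guarded finish context follows. It is an illustration under assumptions, not the code from the linked commit, which may take a different approach. RejoinFinishModel, unguarded_finisher and guarded_finisher are hypothetical names, and the ack counter stands in for actually sending messages.

```cpp
#include <functional>
#include <set>

// Hypothetical, simplified model. This is not the actual MDCache code.
struct RejoinFinishModel {
    std::set<int> rejoin_gather;  // ranks we are still waiting on
    int acks_sent = 0;            // stand-in for sending rejoin acks

    void rejoin_gather_finish() { ++acks_sent; }

    // Unguarded completion: finishes even while peers are still pending.
    std::function<void()> unguarded_finisher() {
        return [this] { rejoin_gather_finish(); };
    }

    // Guarded completion: finishes only once the gather set is empty.
    std::function<void()> guarded_finisher() {
        return [this] {
            if (rejoin_gather.empty())
                rejoin_gather_finish();
        };
    }
};
```

With a non-empty gather set, running the unguarded finisher still bumps the ack counter, which corresponds to acks reaching a peer that is not yet in rejoin. The guarded variant does nothing until the set is empty.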

Actions #8

Updated by Zheng Yan almost 6 years ago

  • Status changed from New to Duplicate
Actions #9

Updated by Patrick Donnelly almost 6 years ago

  • Is duplicate of Bug #24047: MDCache.cc: 5317: FAILED assert(mds->is_rejoin()) added