Bug #23826: mds: assert after daemon restart - CephFS - Ceph

Bug #23826

closed

mds: assert after daemon restart

Added by Patrick Donnelly almost 6 years ago. Updated almost 6 years ago.

Status: Duplicate

Priority: Urgent

Assignee: Patrick Donnelly

Category: Correctness/Safety

Target version: Ceph - v13.2.0

% Done:

Source: Support

Tags:

Backport: luminous

Regression:

Severity: 3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS): MDS

Labels (FS): crash

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description


/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 5080: FAILED assert(isolated_inodes.empty())

 ceph version 12.2.1-46.el7cp (b6f6f1b141c306a43f669b974971b9ec44914cb0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x564975ec7b40]
 2: (MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)+0x25a0) [0x564975cb4e60]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x213) [0x564975cc12a3]
 4: (MDCache::dispatch(Message*)+0xa5) [0x564975cc6905]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x564975baf734]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x564975bbcd43]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x564975bbdb85]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x564975ba7023]
 9: (DispatchQueue::entry()+0x792) [0x5649761ab952]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x564975f4dfbd]
 11: (()+0x7dd5) [0x7f577c615dd5]
 12: (clone()+0x6d) [0x7f577b6f5b3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1570597

Files

ceph-mds.magna058.log.gz (52 KB)

Patrick Donnelly, 04/30/2018 07:00 PM

Related issues 2 (1 open — 1 closed)

Updated by Patrick Donnelly almost 6 years ago

File ceph-mds.magna058.log.gz added

Adding log from failed MDS.

It looks like the MDS is receiving a cache rejoin ACK (handle_cache_rejoin_ack) while in replay.

Updated by Patrick Donnelly almost 6 years ago

Related to Bug #21777: src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin()) added

Updated by Patrick Donnelly almost 6 years ago

Here's one way I think this could happen:

1. All MDSs are in up:rejoin or later.
2. An up:rejoin MDS does the following:
3. handle_mds_map
4. MDCache::rejoin_start
5. MDCache::process_imported_caps
6. open_ino(p->first, (int64_t)-1, new C_MDC_RejoinOpenInoFinish(this, p->first), false);
7. The finisher calls mdcache->rejoin_open_ino_finish(ino, r);
8. MDCache::rejoin_gather_finish();
9. MDCache::rejoin_send_acks();

The last step sends the ACKs. I don't see this protected anywhere by MDSMap::is_rejoining().
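To make the concern concrete, here is a toy model of it, not Ceph code. It assumes the failure needs an ACK to be sent to a peer that has not yet reached rejoin. `ToyCluster`, the reduced `State` enum, and both helper functions are hypothetical names invented for this sketch.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model only: hypothetical types, not the Ceph MDS classes.
// The real MDS has more states; this is a simplified, ordered subset.
enum class State { Replay, Resolve, Rejoin, Active };

struct ToyCluster {
  std::map<int, State> state;  // rank -> current state

  // Analogue of the kind of check discussed above:
  // true only if every rank is in rejoin or later.
  bool all_rejoin_or_later() const {
    for (const auto& kv : state)
      if (kv.second < State::Rejoin) return false;
    return true;
  }

  // Ranks that would receive an ACK before reaching rejoin if `sender`
  // sent ACKs to every peer without checking peer state first.
  std::vector<int> unsafe_ack_targets(int sender) const {
    assert(state.count(sender) == 1);  // sender must be a known rank
    std::vector<int> out;
    for (const auto& kv : state)
      if (kv.first != sender && kv.second < State::Rejoin)
        out.push_back(kv.first);
    return out;
  }
};
```

In this model, a guard equivalent to `all_rejoin_or_later()` before the send would leave `unsafe_ack_targets()` empty. Whether the real code path can reach the send in that situation is exactly the open question here.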

Updated by Zheng Yan almost 6 years ago

Checking MDSMap::is_rejoining() is not required here. If there are recovering MDSs that haven't entered the rejoin state, the MDCache::rejoin_gather set cannot be empty.
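A minimal sketch of that argument, as a toy model rather than the actual MDCache code. It assumes the finish path only proceeds to sending ACKs once the gather set is empty. `ToyRejoin` and its members are hypothetical names for illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Toy model only: hypothetical names, not the Ceph implementation.
struct ToyRejoin {
  std::set<int> gather;    // ranks we still expect a rejoin message from
  bool acks_sent = false;

  explicit ToyRejoin(std::set<int> recovering)
      : gather(std::move(recovering)) {}

  // A peer's rejoin message has been processed.
  void on_peer_rejoin(int rank) {
    gather.erase(rank);
    maybe_finish();
  }

  // A local asynchronous step (for example an inode open) has completed.
  void on_local_step_done() { maybe_finish(); }

  // Assumed gate: finishing is a no-op while any peer is outstanding.
  void maybe_finish() {
    if (gather.empty()) send_acks();
  }

  void send_acks() {
    assert(gather.empty());  // invariant: never send with peers pending
    acks_sent = true;
  }
};
```

Under this assumption, a local completion arriving early cannot trigger the ACKs on its own, because the set still holds every recovering rank that has not yet sent its rejoin.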
