Bug #20122

open

Ceph MDS crash with assert failure

Added by James Eckersall almost 7 years ago. Updated almost 7 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The cluster is running Kraken on CentOS 7.3 and has three MDS servers. 01 was up:active and is the one that crashed, as shown in the stack trace below; 02 was in up:standby-replay and 03 was in up:standby.
After the crash, 01 came back up in up:standby and 02 changed to up:replay, but 02 logged nothing for two and a half hours and stayed stuck in up:replay for that whole time. At that point, two and a half hours after the initial crash on 01, one of our engineers killed the MDS daemon process on 02. 03 then changed from up:standby to up:standby-replay and then to up:active, which restored service. 01 changed to up:standby-replay.

2017-05-30 22:12:00.933446 7f27cf42c700 1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f27cf42c700 time 2017-05-30 22:12:00.906195
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/mds/CDir.cc: 698: FAILED assert(dn->get_linkage()->is_null())

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f27db4df9c5]
2: (CDir::try_remove_dentries_for_stray()+0x1c0) [0x7f27db3424c0]
3: (StrayManager::__eval_stray(CDentry*, bool)+0x8a9) [0x7f27db2c60e9]
4: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f27db2c665e]
5: (MutationImpl::drop_pins()+0xc1) [0x7f27db20e8c1]
6: (MDCache::request_cleanup(std::shared_ptr<MDRequestImpl>&)+0x171) [0x7f27db236151]
7: (MDCache::request_finish(std::shared_ptr<MDRequestImpl>&)+0x160) [0x7f27db236590]
8: (Server::reply_client_request(std::shared_ptr<MDRequestImpl>&, MClientReply*)+0x223) [0x7f27db1b43b3]
9: (Server::respond_to_request(std::shared_ptr<MDRequestImpl>&, int)+0x411) [0x7f27db1b4fc1]
10: (Server::_unlink_local_finish(std::shared_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x312) [0x7f27db1befa2]
11: (MDSIOContextBase::complete(int)+0xa4) [0x7f27db3c3164]
12: (MDSLogContextBase::complete(int)+0x3c) [0x7f27db3c360c]
13: (Finisher::finisher_thread_entry()+0x1f6) [0x7f27db4deba6]
14: (()+0x7dc5) [0x7f27d91ecdc5]
15: (clone()+0x6d) [0x7f27d82d873d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
<10000 recent entries >
--- end dump of recent events ---
2017-05-30 22:12:00.962721 7f27cf42c700 -1 ** Caught signal (Aborted) *
in thread 7f27cf42c700 thread_name:fn_anonymous
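
For anyone triaging the backtrace above, here is a minimal sketch of how the numbered frames could be split into function name, offset, and address, for example to compare this crash against other reports. The helper name parse_frames, the regular expression, and the "<unresolved>" placeholder are illustrative assumptions, not part of Ceph or its tooling; the sample frames are copied from the trace in this report.

```python
import re

# A few frames copied verbatim from the backtrace in this report.
SAMPLE = """\
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f27db4df9c5]
2: (CDir::try_remove_dentries_for_stray()+0x1c0) [0x7f27db3424c0]
3: (StrayManager::__eval_stray(CDentry*, bool)+0x8a9) [0x7f27db2c60e9]
14: (()+0x7dc5) [0x7f27d91ecdc5]
"""

# Assumed frame layout: "<n>: (<symbol>+0x<offset>) [0x<address>]"
FRAME_RE = re.compile(r"^(\d+): \((.*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]$")


def parse_frames(text):
    """Return one dict per backtrace frame found in text (hypothetical helper)."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a frame line (timestamps, notes, blank lines)
        index, symbol, offset, address = match.groups()
        # Drop the argument list; frames without a symbol name appear as "()".
        function = symbol.split("(", 1)[0] or "<unresolved>"
        frames.append({
            "index": int(index),
            "function": function,
            "offset": int(offset, 16),
            "address": int(address, 16),
        })
    return frames


frames = parse_frames(SAMPLE)
for frame in frames:
    print(frame["index"], frame["function"], hex(frame["offset"]))
```

Frame 2 is the function named in the failed assert, and frames 3 onward show the call path that led to it, from the unlink completion in Server::_unlink_local_finish down through the stray evaluation.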

Please let me know if there is any further information you require.
