Bug #36035

mds: MDCache.cc: 11673: abort()

Added by Patrick Donnelly about 2 months ago. Updated 24 days ago.

Status:
Need Review
Priority:
Urgent
Assignee:
-
Category:
Correctness/Safety
Target version:
Start date:
09/17/2018
Due date:
% Done:
0%

Source:
Q/A
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

2018-09-16T08:41:01.302 INFO:tasks.ceph.mds.i.smithi155.stderr:/build/ceph-14.0.0-3252-g561ad6d/src/mds/MDCache.cc: 11673: abort()
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr:
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: ceph version 14.0.0-3252-g561ad6d (561ad6d7a7950727f2a31290c28698fcd1355c37) nautilus (dev)
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x82) [0x7f8ea4ac8140]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 2: (MDCache::handle_fragment_notify(boost::intrusive_ptr<MMDSFragmentNotify const> const&)+0x380) [0x5d1a90]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 3: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x147) [0x5f75e7]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 4: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x171) [0x4df551]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 5: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x68b) [0x4e940b]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 6: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x15) [0x4e9bd5]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 7: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xff) [0x4d727f]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 8: (DispatchQueue::entry()+0xe6a) [0x7f8ea4ca6d7a]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f8ea4d3da1d]
2018-09-16T08:41:01.303 INFO:tasks.ceph.mds.i.smithi155.stderr: 10: (()+0x76ba) [0x7f8ea438d6ba]
2018-09-16T08:41:01.304 INFO:tasks.ceph.mds.i.smithi155.stderr: 11: (clone()+0x6d) [0x7f8ea3bb641d]

From: /ceph/teuthology-archive/pdonnell-2018-09-13_04:59:57-multimds-wip-pdonnell-testing-20180913.024004-distro-basic-smithi/3014469/teuthology.log

Unfortunately, no coredumps or logs are available.

History

#1 Updated by Patrick Donnelly about 1 month ago

Another: /ceph/teuthology-archive/pdonnell-2018-10-09_01:07:48-multimds-wip-pdonnell-testing-20181008.224656-distro-basic-smithi/3119047/teuthology.log

#2 Updated by Zheng Yan 29 days ago

I reproduced this locally.

Dirfrag A is a subtree root, and its parent inode is inode A. The auth MDS of dirfrag A is mds.a; the auth MDS of inode A is mds.b. Both dirfrag A and inode A are replicated to mds.c. The following sequence of events can trigger the crash.

1. mds.a finishes fragmenting dirfrag A and sends fragment_notify to mds.c.
2. mds.b wants to readlock the fragtreelock of inode A, so it sends lock(a=sync idft...) to mds.a (lock state is mix->sync).
3. mds.a receives the lock sync message and sends lock(a=syncack idft...) to mds.b.
4. mds.b receives the syncack message and calls Locker::scatter_writebehind().
5. mds.b sends lock(a=sync idft...) to mds.a and mds.c (triggered by scatter_writebehind_finish()).
6. mds.c receives the lock sync message from mds.b (sent in step 5); the fragtreelock state of inode A becomes SYNC.
7. mds.c trims inode A.
8. mds.c receives the fragment_notify message (sent in step 1) for an inode it no longer has in cache, hitting the abort in MDCache::handle_fragment_notify().

#4 Updated by Zheng Yan 29 days ago

  • Status changed from New to Need Review

#5 Updated by Patrick Donnelly 24 days ago

In Mimic: /ceph/teuthology-archive/yuriw-2018-10-18_15:37:57-multimds-wip-yuri4-testing-2018-10-17-2308-mimic-testing-basic-smithi/3158009/teuthology.log
