Bug #36349

closed

mds: src/mds/MDCache.cc: 1637: FAILED ceph_assert(follows >= realm->get_newest_seq())

Added by Patrick Donnelly over 5 years ago. Updated almost 4 years ago.

Status: Can't reproduce
Priority: Immediate
Assignee: -
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr:/build/ceph-14.0.0-3907-g276c86e/src/mds/MDCache.cc: In function 'void MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)' thread 7fb68a8be700 time 2018-10-06 17:13:20.042739
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr:/build/ceph-14.0.0-3907-g276c86e/src/mds/MDCache.cc: 1637: FAILED ceph_assert(follows >= realm->get_newest_seq())
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: ceph version 14.0.0-3907-g276c86e (276c86e4890a25e5e74b3f30dbb94987ede03b5a) nautilus (dev)
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fb692749f4f]
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7fb69274a12c]
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0xee1) [0x5b33a1]
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0xc0) [0x5b3470]
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 5: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x631) [0x5b9271]
2018-10-06T17:13:20.928 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 6: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x65d) [0x5597ad]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 7: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0xb76) [0x55f906]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xc55) [0x579585]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 9: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x9c) [0x6207cc]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 10: (MDSInternalContextBase::complete(int)+0x72) [0x787432]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 11: (MDSCacheObject::finish_waiting(unsigned long, int)+0x283) [0x7a0db3]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 12: (SimpleLock::finish_waiters(unsigned long, int)+0xb8) [0x691eb8]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 13: (Locker::eval_gather(SimpleLock*, bool, bool*, std::vector<MDSInternalContextBase*, std::allocator<MDSInternalContextBase*> >*)+0x1121) [0x67e601]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 14: (Locker::handle_file_lock(ScatterLock*, boost::intrusive_ptr<MLock const> const&)+0x9b3) [0x68f9c3]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 15: (Locker::handle_lock(boost::intrusive_ptr<MLock const> const&)+0xa9) [0x690959]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 16: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0xbc) [0x690d0c]
2018-10-06T17:13:20.929 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 17: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x367) [0x4ea067]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x6fb) [0x4eca0b]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x15) [0x4ed1b5]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xfc) [0x4d9edc]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 21: (DispatchQueue::entry()+0x1669) [0x7fb692978999]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb692a258bd]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 23: (()+0x76ba) [0x7fb69200f6ba]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr: 24: (clone()+0x6d) [0x7fb69183841d]
2018-10-06T17:13:20.930 INFO:tasks.ceph.mds.l-s.smithi099.stderr:*** Caught signal (Aborted) **

From: /ceph/teuthology-archive/pdonnell-2018-10-06_01:15:54-multimds-wip-pdonnell-testing-20181005.225845-distro-basic-smithi/3106880/teuthology.log

Core: /ceph/teuthology-archive/pdonnell-2018-10-06_01:15:54-multimds-wip-pdonnell-testing-20181005.225845-distro-basic-smithi/3106880/remote/smithi099/coredump/1538846000.16983.core

Branch: https://github.com/batrick/ceph/tree/wip-pdonnell-testing-20181005.225845


Related issues 1 (1 open, 0 closed)

Related to Messengers - Bug #36540: msg: messages are queued but not sent (New)

Actions #1

Updated by Zheng Yan over 5 years ago

It looks like the mds_table_request(snaptable server_ready) message got lost. It was the first message that mds.0 sent to mds.3:

2018-10-06 17:11:23.844 7fec0858a700  1 -- 172.21.15.99:6817/2278901200 _send_to--> mds 172.21.15.99:6818/120697649 -- mds_table_request(snaptable server_ready) v1 -- ?+0 0x1546d00

...

2018-10-06 17:11:23.844 7fec0858a700  1 -- 172.21.15.99:6817/2278901200 <== mds.3 172.21.15.99:6818/120697649 2 ==== discover(1 0x1.* ) v1 ==== 35+0+0 (3238743458 0 0) 0x1546d00 con 0x2253c00

...

2018-10-06 17:11:23.844 7fec0858a700  1 -- 172.21.15.99:6817/2278901200 --> 172.21.15.99:6818/120697649 -- discover_reply(1 0x1) v2 -- 0x21c0580 con 0

I can't find the corresponding message in mds.3's log. The first message it received from mds.0 is the discover_reply:

2018-10-06 17:11:23.844 7fb68a8be700  1 -- 172.21.15.99:6818/120697649 <== mds.0 172.21.15.99:6817/2278901200 1 ==== discover_reply(1 0x1) v2 ==== 910+0+0 (3908223777 0 0) 0x2e34a00 con 0x2ecb100
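
For anyone repeating this check on other runs, here is a minimal sketch of the comparison described above: collect every message the sender's log says it sent, then report the ones with no matching receive in the peer's log. The regular expressions are assumptions inferred only from the three log lines quoted in this comment; other Ceph versions or debug levels may format these lines differently. The helper name find_unmatched_sends is hypothetical and not part of any Ceph tooling.

```python
import re
from collections import Counter

# Patterns inferred only from the messenger log excerpts quoted in this
# comment (an assumption); real logs may need adjusted expressions.
ADDR = r"\d+\.\d+\.\d+\.\d+:\d+/\d+"
SEND_RE = re.compile(
    rf"-- (?P<src>{ADDR}) (?:_send_to)?--> (?:\S+ )?(?P<dst>{ADDR})"
    r" -- (?P<msg>.+?) v\d+ -- "
)
RECV_RE = re.compile(
    rf"-- (?P<dst>{ADDR}) <== \S+ (?P<src>{ADDR}) \d+"
    r" ==== (?P<msg>.+?) v\d+ ==== "
)


def find_unmatched_sends(sender_lines, receiver_lines):
    """Return (src, dst, msg) for each logged send with no logged receive."""
    # Count receives so that repeated identical messages are matched one-to-one.
    received = Counter()
    for line in receiver_lines:
        m = RECV_RE.search(line)
        if m:
            received[(m["src"], m["dst"], m["msg"])] += 1

    unmatched = []
    for line in sender_lines:
        m = SEND_RE.search(line)
        if not m:
            continue  # not a send line (for example, a receive on the sender)
        key = (m["src"], m["dst"], m["msg"])
        if received[key] > 0:
            received[key] -= 1  # consume one matching receive
        else:
            unmatched.append(key)
    return unmatched


# The lines quoted in this comment: two sends from mds.0, one receive on mds.3.
SENDER_LOG = [
    "2018-10-06 17:11:23.844 7fec0858a700  1 -- 172.21.15.99:6817/2278901200 "
    "_send_to--> mds 172.21.15.99:6818/120697649 -- "
    "mds_table_request(snaptable server_ready) v1 -- ?+0 0x1546d00",
    "2018-10-06 17:11:23.844 7fec0858a700  1 -- 172.21.15.99:6817/2278901200 "
    "--> 172.21.15.99:6818/120697649 -- discover_reply(1 0x1) v2 -- "
    "0x21c0580 con 0",
]
RECEIVER_LOG = [
    "2018-10-06 17:11:23.844 7fb68a8be700  1 -- 172.21.15.99:6818/120697649 "
    "<== mds.0 172.21.15.99:6817/2278901200 1 ==== discover_reply(1 0x1) v2 "
    "==== 910+0+0 (3908223777 0 0) 0x2e34a00 con 0x2ecb100",
]

for src, dst, msg in find_unmatched_sends(SENDER_LOG, RECEIVER_LOG):
    print(f"no receive logged for {msg}: {src} -> {dst}")
```

On a full mds.0 log, sends addressed to other peers would also show up as unmatched, so the result should be filtered by destination address before drawing conclusions. The matching also assumes both sides print the message description identically, which holds for the discover_reply lines above but is not guaranteed in general.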

Actions #2

Updated by Patrick Donnelly over 5 years ago

  • Related to Bug #36540: msg: messages are queued but not sent added
Actions #3

Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Can't reproduce

I haven't seen this since, so I'm closing it as can't reproduce. Probably noise from the messenger changes?

Actions #4

Updated by Wido den Hollander almost 4 years ago

I'm currently seeing this crash on a Nautilus 14.2.10 cluster that had 6 active MDS daemons.

I'm running the CephFS data scans now and waiting to see whether that fixes it.
