Bug #16768: multimds: check_rstat assertion failure - CephFS - Ceph

Actions

Copy link

Bug #16768

closed

multimds: check_rstat assertion failure

Added by Patrick Donnelly over 7 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

Category:

Target version:

Ceph - v12.0.0

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

multimds

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

2016-07-19T11:43:54.392 INFO:tasks.ceph.mds.e.mira027.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-709-g12c0683/src/mds/CDir.cc: In function 'bool CDir::check_rstats(bool)' thread 7fe0b191e700 time 2016-07-19 18:43:55.424696
2016-07-19T11:43:54.392 INFO:tasks.ceph.mds.e.mira027.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-709-g12c0683/src/mds/CDir.cc: 289: FAILED assert(nest_info.rbytes == fnode.rstat.rbytes)
2016-07-19T11:43:54.392 INFO:tasks.ceph.mds.e.mira027.stderr: ceph version v11.0.0-709-g12c0683 (12c068365c43a140fe1fe23bf68318342710e84d)
2016-07-19T11:43:54.393 INFO:tasks.ceph.mds.e.mira027.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x1878453]
2016-07-19T11:43:54.393 INFO:tasks.ceph.mds.e.mira027.stderr: 2: (CDir::check_rstats(bool)+0x16e8) [0x164daf8]
2016-07-19T11:43:54.393 INFO:tasks.ceph.mds.e.mira027.stderr: 3: (MDCache::predirty_journal_parents(std::shared_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x2640) [0x14bed80]
2016-07-19T11:43:54.393 INFO:tasks.ceph.mds.e.mira027.stderr: 4: (Server::_rename_prepare(std::shared_ptr<MDRequestImpl>&, EMetaBlob*, ceph::buffer::list*, CDentry*, CDentry*, CDentry*)+0x1a86) [0x143199e]
2016-07-19T11:43:54.394 INFO:tasks.ceph.mds.e.mira027.stderr: 5: (Server::handle_slave_rename_prep(std::shared_ptr<MDRequestImpl>&)+0x1fc2) [0x1436548]
2016-07-19T11:43:54.394 INFO:tasks.ceph.mds.e.mira027.stderr: 6: (Server::dispatch_slave_request(std::shared_ptr<MDRequestImpl>&)+0xc33) [0x14034db]
2016-07-19T11:43:54.394 INFO:tasks.ceph.mds.e.mira027.stderr: 7: (Server::_slave_rename_sessions_flushed(std::shared_ptr<MDRequestImpl>&)+0x22f) [0x143cb19]
2016-07-19T11:43:54.395 INFO:tasks.ceph.mds.e.mira027.stderr: 8: (C_MDS_SlaveRenameSessionsFlushed::finish(int)+0x2a) [0x1451d3e]
2016-07-19T11:43:54.395 INFO:tasks.ceph.mds.e.mira027.stderr: 9: (Context::complete(int)+0x27) [0x137daf5]
2016-07-19T11:43:54.395 INFO:tasks.ceph.mds.e.mira027.stderr: 10: (MDSInternalContextBase::complete(int)+0x1c6) [0x1705596]
2016-07-19T11:43:54.395 INFO:tasks.ceph.mds.e.mira027.stderr: 11: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::delete_me()+0x41) [0x13d48a3]
2016-07-19T11:43:54.396 INFO:tasks.ceph.mds.e.mira027.stderr: 12: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::sub_finish(MDSInternalContextBase*, int)+0x2a8) [0x13e0d5a]
2016-07-19T11:43:54.396 INFO:tasks.ceph.mds.e.mira027.stderr: 13: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::finish(int)+0x29) [0x13e08ab]
2016-07-19T11:43:54.396 INFO:tasks.ceph.mds.e.mira027.stderr: 14: (Context::complete(int)+0x27) [0x137daf5]
2016-07-19T11:43:54.397 INFO:tasks.ceph.mds.e.mira027.stderr: 15: (C_GatherBase<MDSInternalContextBase, MDSInternalContextGather>::C_GatherSub::complete(int)+0x20) [0x13e096a]
2016-07-19T11:43:54.397 INFO:tasks.ceph.mds.e.mira027.stderr: 16: (MDSRank::_advance_queues()+0x4c3) [0x13a9047]
2016-07-19T11:43:54.398 INFO:tasks.ceph.mds.e.mira027.stderr: 17: (MDSRank::_dispatch(Message*, bool)+0x55d) [0x13a6463]
2016-07-19T11:43:54.398 INFO:tasks.ceph.mds.e.mira027.stderr: 18: (MDSRankDispatcher::ms_dispatch(Message*)+0x34) [0x13a5ef0]
2016-07-19T11:43:54.398 INFO:tasks.ceph.mds.e.mira027.stderr: 19: (MDSDaemon::ms_dispatch(Message*)+0x21d) [0x13782bf]
2016-07-19T11:43:54.401 INFO:tasks.ceph.mds.e.mira027.stderr: 20: (Messenger::ms_deliver_dispatch(Message*)+0x98) [0x1af71bc]
2016-07-19T11:43:54.402 INFO:tasks.ceph.mds.e.mira027.stderr: 21: (DispatchQueue::entry()+0x5dd) [0x1af62d9]
2016-07-19T11:43:54.402 INFO:tasks.ceph.mds.e.mira027.stderr: 22: (DispatchQueue::DispatchThread::entry()+0x1c) [0x1902046]
2016-07-19T11:43:54.403 INFO:tasks.ceph.mds.e.mira027.stderr: 23: (Thread::entry_wrapper()+0xc1) [0x19ef733]
2016-07-19T11:43:54.404 INFO:tasks.ceph.mds.e.mira027.stderr: 24: (Thread::_entry_func(void*)+0x18) [0x19ef668]
2016-07-19T11:43:54.404 INFO:tasks.ceph.mds.e.mira027.stderr: 25: (()+0x8182) [0x7fe0b6730182]
2016-07-19T11:43:54.406 INFO:tasks.ceph.mds.e.mira027.stderr: 26: (clone()+0x6d) [0x7fe0b583147d]
2016-07-19T11:43:54.407 INFO:tasks.ceph.mds.e.mira027.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This test was marked dead due to timeout (I think?):

2016-07-19T11:43:55.737 INFO:tasks.workunit.client.0.mira034.stdout:5/839: dread d12/d16/d24/d32/de7/df5/ff7 [0,4194304] 0
2016-07-19T11:43:55.744 INFO:tasks.workunit.client.0.mira034.stdout:5/840: mkdir d12/d6d/d9b/ddd/d12a 0
2016-07-19T11:43:55.757 INFO:tasks.workunit.client.0.mira034.stdout:5/841: dwrite d12/f11b [0,4194304] 0
2016-07-19T11:43:55.767 INFO:tasks.workunit.client.0.mira034.stdout:5/842: unlink d12/d16/d24/d32/d5b/fb5 0
2016-07-19T11:43:55.777 INFO:tasks.workunit.client.0.mira034.stdout:5/843: dread d12/d16/d24/fa9 [0,4194304] 0
2016-07-19T14:42:53.022 INFO:tasks.workunit.client.0.mira034.stderr:/home/ubuntu/cephtest/workunit.client.0/suites/fsstress.sh: line 1: 13555 Terminated              $command

Killed after 3 hours.

From: http://pulpito.ceph.com/pdonnell-2016-07-18_20:02:54-multimds-master---basic-mira/321809/

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by John Spray over 7 years ago

Component(FS) MDS added

Actions

Copy link

Updated by Patrick Donnelly over 7 years ago

Another of the same:

Dead: 2016-07-24T02:04:40.579 INFO:tasks.workunit.client.0.mira082.stderr:/home/ubuntu/cephtest/workunit.client.0/suites/fsstress.sh: line 1: 14917 Terminated              $command
ceph version v11.0.0-820-ga0294e6 (a0294e64507a7916fdd9707ae22ba40b0d7b65d1)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x91be5b]
 2: (CDir::check_rstats(bool)+0x14a6) [0x7a8176]
 3: (MDCache::predirty_journal_parents(std::shared_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0xf73) [0x6fe913]
 4: (Server::_rename_prepare(std::shared_ptr<MDRequestImpl>&, EMetaBlob*, ceph::buffer::list*, CDentry*, CDentry*, CDentry*)+0x6b9) [0x659579]
 5: (Server::handle_slave_rename_prep(std::shared_ptr<MDRequestImpl>&)+0xffb) [0x6653fb]
 6: (Server::dispatch_slave_request(std::shared_ptr<MDRequestImpl>&)+0x70b) [0x666dab]
 7: (Server::handle_slave_request(MMDSSlaveRequest*)+0x8fc) [0x67019c]
 8: (Server::dispatch(Message*)+0x69b) [0x670efb]
 9: (MDSRank::handle_deferrable_message(Message*)+0x80c) [0x5f607c]
 10: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x5ffd91]
 11: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x600ee5]
 12: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x5e8043]
 13: (DispatchQueue::entry()+0x78b) [0xab7d0b]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x97b07d]
 15: (()+0x8182) [0x7f89e6bec182]
 16: (clone()+0x6d) [0x7f89e5ced47d]
1 jobs: ['327574']

From: http://pulpito.ceph.com/pdonnell-2016-07-21_13:20:27-multimds-master---basic-mira/327574/

Actions

Copy link

Updated by Patrick Donnelly over 7 years ago

Segmentation fault in another test which may be related:

Dead: 2016-07-19T01:26:57.608 INFO:tasks.workunit.client.0.mira064.stderr:/home/ubuntu/cephtest/workunit.client.0/suites/fsstress.sh: line 1:  3178 Terminated              $command
ceph version v11.0.0-709-g12c0683 (12c068365c43a140fe1fe23bf68318342710e84d)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x17b61e7]
 2: ceph-mds() [0x17b549f]
 3: (()+0x10340) [0x7f65805d6340]
 4: (MDSCacheObject::get(int)+0x13) [0x14454d5]
 5: (MutationImpl::pin(MDSCacheObject*)+0x42) [0x14a3c8c]
 6: (Server::handle_slave_rename_prep(std::shared_ptr<MDRequestImpl>&)+0x92c) [0x1434eb2]
 7: (Server::dispatch_slave_request(std::shared_ptr<MDRequestImpl>&)+0xc33) [0x14034db]
 8: (Server::handle_slave_request(MMDSSlaveRequest*)+0x1137) [0x140187f]
 9: (Server::dispatch(Message*)+0x976) [0x13f1b84]
 10: (MDSRank::handle_deferrable_message(Message*)+0xa71) [0x13a8057]
 11: (MDSRank::_dispatch(Message*, bool)+0x3bc) [0x13a62c2]
 12: (MDSRankDispatcher::ms_dispatch(Message*)+0x34) [0x13a5ef0]
 13: (MDSDaemon::ms_dispatch(Message*)+0x21d) [0x13782bf]
 14: (Messenger::ms_deliver_dispatch(Message*)+0x98) [0x1af71bc]
 15: (DispatchQueue::entry()+0x5dd) [0x1af62d9]
 16: (DispatchQueue::DispatchThread::entry()+0x1c) [0x1902046]
 17: (Thread::entry_wrapper()+0xc1) [0x19ef733]
 18: (Thread::_entry_func(void*)+0x18) [0x19ef668]
 19: (()+0x8182) [0x7f65805ce182]
 20: (clone()+0x6d) [0x7f657f6cf47d]
1 jobs: ['321707']
suites: ['clusters/3-mds.yaml', 'debug/mds_client.yaml', 'fs/btrfs.yaml', 'inline/no.yaml', 'mount/cfuse.yaml', 'multimds/basic/{ceph/base.yaml', 'overrides/whitelist_wrongly_marked_down.yaml', 'tasks/suites_fsstress.yaml}']

http://pulpito.ceph.com/pdonnell-2016-07-18_20:02:54-multimds-master---basic-mira/321707/

Actions

Copy link

Updated by Zheng Yan over 7 years ago

please enable mds_debug

Actions

Copy link

Updated by Patrick Donnelly over 7 years ago

Zheng, which setting is that and how do I enable it? Sorry...

Actions

Copy link

Updated by John Spray over 7 years ago

Related to Bug #16807: Crash in handle_slave_rename_prep added

Actions

Copy link

Updated by John Spray over 7 years ago

I've opened a separate ticket for the segfault, seems likely to be it's own issue (http://tracker.ceph.com/issues/16807)

Actions

Copy link

Updated by Zheng Yan over 7 years ago

please add a line "debug mds = 10" to ceph.conf

Actions

Copy link

Updated by Patrick Donnelly over 7 years ago

Zheng, I think we already have "debug mds = 20", right? From the config for this run: http://pulpito.ceph.com/pdonnell-2016-07-18_20:02:54-multimds-master---basic-mira/321707/

Actions

Copy link

#10

Updated by Patrick Donnelly over 7 years ago

Here's another instance of the assertion failure on a more recent master branch:

http://qa-proxy.ceph.com/teuthology/pdonnell-2016-07-29_12:44:23-multimds-master---basic-mira/340811/teuthology.log

Actions

Copy link

#11

Updated by Zheng Yan over 7 years ago

Patrick Donnelly wrote:

Zheng, I think we already have "debug mds = 20", right? From the config for this run: http://pulpito.ceph.com/pdonnell-2016-07-18_20:02:54-multimds-master---basic-mira/321707/

the problem is that there is no log in http://qa-proxy.ceph.com/teuthology/pdonnell-2016-07-18_20:02:54-multimds-master---basic-mira/321707/. (I think all multimds runs do not have no log, which makes diagnose impossible)

Actions

Copy link

#12

Updated by Zheng Yan over 7 years ago

Status changed from New to Need More Info

Actions

Copy link

#13

Updated by John Spray over 7 years ago

Priority changed from Normal to High
Target version set to v12.0.0

Actions

Copy link

#14

Updated by John Spray about 7 years ago

Another instance:
http://qa-proxy.ceph.com/teuthology/jspray-2017-01-19_17:59:40-multimds-wip-jcsp-testing-20170119b-testing-basic-smithi/731150/teuthology.log

Actions

Copy link

#15

Updated by John Spray about 7 years ago

I noticed that failure while the test was still stuck trying to unmount the kernel client, so I went in and killed the ssh connection that was running umount so that it will (hopefully) proceed to gather the logs for us.

Actions

Copy link

#16

Updated by John Spray about 7 years ago

Related to Bug #8090: multimds: mds crash in check_rstats added

Actions

Copy link

#17

Updated by John Spray about 7 years ago

Related to deleted (Bug #8090: multimds: mds crash in check_rstats )

Actions

Copy link

#18

Updated by John Spray about 7 years ago

Has duplicate Bug #8090: multimds: mds crash in check_rstats added

Actions

Copy link

#19

Updated by John Spray about 7 years ago

Hmm, well it didn't grab the logs for some reason but I did get the crashing MDS's log before the test tore down. It's in /home/jspray/16768 on teuthology.

Actions

Copy link

#20