Bug #1047 (closed)

mds: crash on anchor table query

Added by Fyodor Ustinov almost 13 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

Attempted to delete a directory containing a backup created by BackupPC.

2011-05-03 14:12:37.288166 7fcc61d6f700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:37.288184 7fcc61d6f700 mds0.cache.dir(10000064c90) total of child dentrys: n(v0 1=0+1)
2011-05-03 14:12:37.288195 7fcc61d6f700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.282326 b-2704 1=0+1)
2011-05-03 14:12:37.288215 7fcc61d6f700 mds0.cache.dir(10000064c90) [dentry #1/dcvolia/backup03/pc/tmg.net.ua/new/f%2f/fusr/flocal/fwww [2,head] auth (dn xlock x=1 by 0x44c98a0) (dversion lock w=1 last_client=4118) pv=6823 v=5626 ap=2+2 inode=0x7fcbee887b90 | lock inodepin authpin 0x7fcc09f615a0] n(v0 1=0+1)
2011-05-03 14:12:44.639291 7fcc60c6c700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:44.639312 7fcc60c6c700 mds0.cache.dir(10000064c90) total of child dentrys: n()
2011-05-03 14:12:44.639323 7fcc60c6c700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.288150 b-2704)
2011-05-03 14:12:44.639330 7fcc60c6c700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:44.639336 7fcc60c6c700 mds0.cache.dir(10000064c90) total of child dentrys: n()
2011-05-03 14:12:44.639343 7fcc60c6c700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.288150 b-2704)
mds/AnchorServer.cc: In function 'virtual void AnchorServer::handle_query(MMDSTableRequest*)', in thread '0x7fcc61d6f700'
mds/AnchorServer.cc: 242: FAILED assert(anchor_map.count(curino) == 1)
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
2: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
3: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
4: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
5: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
7: (()+0x6d8c) [0x7fcc63fa5d8c]
8: (clone()+0x6d) [0x7fcc62e5804d]
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
2: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
3: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
4: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
5: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
7: (()+0x6d8c) [0x7fcc63fa5d8c]
8: (clone()+0x6d) [0x7fcc62e5804d]
*** Caught signal (Aborted) **
 in thread 0x7fcc61d6f700
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: /usr/bin/cmds() [0x70fc5e]
 2: (()+0xfc60) [0x7fcc63faec60]
 3: (gsignal()+0x35) [0x7fcc62da5d05]
 4: (abort()+0x186) [0x7fcc62da9ab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fcc6365c6dd]
 6: (()+0xb9926) [0x7fcc6365a926]
 7: (()+0xb9953) [0x7fcc6365a953]
 8: (()+0xb9a5e) [0x7fcc6365aa5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x36a) [0x6f3b4a]
 10: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
 11: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
 12: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
 13: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
 14: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
 15: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
 16: (()+0x6d8c) [0x7fcc63fa5d8c]
 17: (clone()+0x6d) [0x7fcc62e5804d]
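
For context on the failed assert: the trace suggests handle_query resolves an inode's location by walking from the queried inode up through its ancestors in the anchor table, asserting that every ancestor entry is present. Below is a minimal, self-contained sketch of such a walk; apart from the identifiers visible in the trace (anchor_map, curino, AnchorServer::handle_query), the struct layout, field names, and root sentinel are assumptions, not the actual Ceph source.

// Minimal sketch (NOT the actual Ceph source) of the ancestor walk the
// assert text implies: resolve a queried inode by following parent links
// up the anchor table until the root.
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using inodeno_t = uint64_t;

struct Anchor {
  inodeno_t ino;     // the anchored inode
  inodeno_t dirino;  // parent directory it is anchored under (0 = root here)
};

std::map<inodeno_t, Anchor> anchor_map;

// Build the path-to-root trace for a queried inode. Every ancestor is
// expected to be in the table; this assert is the analogue of
// "FAILED assert(anchor_map.count(curino) == 1)" in the crash above.
std::vector<Anchor> query_trace(inodeno_t ino) {
  std::vector<Anchor> trace;
  inodeno_t curino = ino;
  while (curino) {
    assert(anchor_map.count(curino) == 1);  // fires if an ancestor is missing
    Anchor &a = anchor_map[curino];
    trace.insert(trace.begin(), a);
    curino = a.dirino;
  }
  return trace;
}

int main() {
  anchor_map[2] = {2, 0};  // dir 2 anchored under the root
  anchor_map[3] = {3, 2};  // file 3 anchored under dir 2
  query_trace(3);          // consistent chain: succeeds

  // If entry 2 is (incorrectly) dropped while 3 still references it,
  // the next query for 3 aborts, matching the reported crash.
  anchor_map.erase(2);
  query_trace(3);
}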

Files

mds.log (8.87 KB) - Fyodor Ustinov, 05/06/2011 04:42 AM
mds-crash-anchor_map-full.log.bz2 (155 KB) - Amon Ott, 12/03/2011 01:01 AM
mds-crash.log (11 KB) - Amon Ott, 01/02/2012 05:47 AM

#1

Updated by Fyodor Ustinov almost 13 years ago

Now I have a directory in which running 'ls' crashes the MDS.

#2

Updated by Sage Weil almost 13 years ago

  • Status changed from New to 4

It's the crash above, I take it?

Can you include a full mds log that ends with an 'ls' and crash?

It sounds like we screwed up the anchor table with all the hard links backuppc was making. We pass our basic link tests, though. Hopefully the original error is something we can reproduce as well (once we deal with this).

Thanks!
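
As a side note on reproduction: BackupPC pools identical files by hard-linking them across directories, and hard-linked inodes are what the anchor table tracks, so a test workload mainly needs lots of cross-directory hard links. Below is a rough sketch of such a workload; it is not a confirmed reproducer (simple local tests later in this thread did not trigger the bug), and all paths are hypothetical.

// Rough sketch of a BackupPC-like hard-link workload; paths hypothetical.
#include <cstdio>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  // Assumes a CephFS client mount at /mnt/ceph (hypothetical path).
  mkdir("/mnt/ceph/pool", 0755);
  mkdir("/mnt/ceph/backup", 0755);

  for (int i = 0; i < 1000; ++i) {
    std::string orig = "/mnt/ceph/pool/file" + std::to_string(i);
    std::string alt  = "/mnt/ceph/backup/file" + std::to_string(i);

    FILE *f = fopen(orig.c_str(), "w");  // write the "original" copy
    if (!f) return 1;
    fputs("data\n", f);
    fclose(f);

    link(orig.c_str(), alt.c_str());     // pool it via a hard link
  }
  // Remounting the client (to empty its cache) and then listing
  // /mnt/ceph/backup should force anchor table queries for the link
  // targets, per the workaround note in the next comment.
  return 0;
}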

#3

Updated by Sage Weil almost 13 years ago

BTW, if you know which directory the hard link(s) in this directory reference (it'll be the one where the linked files were originally written), you can work around this bug by doing an ls in that old/original dir first, and then doing the ls in the bad dir. Because the link targets will already be in cache, the anchor table won't be queried at all.

#4

Updated by Fyodor Ustinov almost 13 years ago

I have already recreated the cluster "from scratch" many times, but the log is still available.

Full mds log, from the start after the previous crash through to the new crash.

#5

Updated by Sage Weil almost 13 years ago

  • Status changed from 4 to Can't reproduce

The log doesn't have enough info. If anyone sees this again, let's reopen!

#6

Updated by Sage Weil over 12 years ago

  • Subject changed from mds crash to mds: crash on anchor table query
  • Status changed from Can't reproduce to Need More Info

Amon Ott just hit this one.

#7

Updated by Amon Ott over 12 years ago

Attached a log of a full run up to the crash. The MDS tries to recover from some problem, replays, and crashes.

#8

Updated by Amon Ott over 12 years ago

Updated Ceph to 0.39 and the bug seems to be gone.

#9

Updated by Amon Ott over 12 years ago

Got it again with 0.39. Still there.

#10

Updated by Greg Farnum over 12 years ago

Unfortunately there's not enough info in this log either. We're going to need a log with (at minimum) level 10 mds debugging, maybe higher.

Or a reliable reproducer we can run locally, of course!

#11

Updated by Amon Ott over 12 years ago

Here is what the MDS logs with debug 20. No idea if it really helps. The cluster is still in the broken state; should I try to reproduce with a recreated Ceph fs and debug 20? This could be GBs of logs.

#12

Updated by Greg Farnum over 12 years ago

This log is less illuminating than I'd hoped since it's just replay commits. :(

However, it did give me one idea to look into (are replayed ops failing if they got committed while removing an inode and we didn't notice?); let me check into it and I'll post back here shortly.

#13

Updated by Greg Farnum over 12 years ago

After looking into this a lot today, none of my ideas panned out. I also tried a few simple tests to reproduce locally and was unable to do so. :(

#14

Updated by Greg Farnum over 12 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Greg Farnum

All right, looking through a log that Amon got me after I specified the right debug arguments. I see that there are two separate anchor destroy events happening, and that's obviously incorrect. My guess is that there's a race which allows it to get marked for destruction more than once, but I haven't found it yet...
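
To make the suspected race concrete, here is a hedged, self-contained illustration (not the actual AnchorServer code): assume destroys are prepared (journaled) and then committed, with each commit decrementing the refcount of every ancestor in the destroyed inode's chain and erasing entries that hit zero. If the same destroy gets prepared twice before either commit lands, the second commit over-decrements an ancestor and erases it while a sibling still references it, which is exactly the missing-ancestor state that trips the handle_query assert.

#include <cstdint>
#include <map>
#include <vector>

using inodeno_t = uint64_t;

struct Anchor {
  inodeno_t dirino;  // parent directory (0 = root here)
  int nref;          // references from anchors beneath this one
};

std::map<inodeno_t, Anchor> anchor_map;

// "Prepare" a destroy: record the chain of inos whose refcounts the
// commit will decrement (a stand-in for the journaled table update).
std::vector<inodeno_t> prepare_destroy(inodeno_t ino) {
  std::vector<inodeno_t> chain;
  for (inodeno_t cur = ino; cur && anchor_map.count(cur);
       cur = anchor_map[cur].dirino)
    chain.push_back(cur);
  return chain;
}

// "Commit" it: apply the decrements, erasing entries that reach zero.
void commit_destroy(const std::vector<inodeno_t> &chain) {
  for (inodeno_t cur : chain) {
    auto it = anchor_map.find(cur);
    if (it != anchor_map.end() && --it->second.nref <= 0)
      anchor_map.erase(it);
  }
}

int main() {
  // Files 3 and 4 both anchored under directory 2, so nref(2) == 2.
  anchor_map[2] = {0, 2};
  anchor_map[3] = {2, 1};
  anchor_map[4] = {2, 1};

  // The suspected race: the same destroy is journaled twice before
  // either commit is applied.
  auto first  = prepare_destroy(3);
  auto second = prepare_destroy(3);

  commit_destroy(first);   // nref(2) -> 1, entry 3 erased: still consistent
  commit_destroy(second);  // nref(2) -> 0, entry 2 erased: anchor 4 now
                           // points at a missing ancestor, so the next
                           // query for 4 would hit the failed assert.
}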

#15

Updated by Sage Weil about 12 years ago

  • Status changed from In Progress to 12

#16

Updated by Amon Ott almost 12 years ago

Just hit this bug again with 0.47.2 on ext4, so it is still an issue.

#17

Updated by Amon Ott almost 12 years ago

This bug seems to be triggered together with the deadlock described on the mailing list as "OSD deadlock with cephfs client and OSD on same machine"; it is probably related to a partial sync/commit of the OSD journal or storage.

#18

Updated by Amon Ott almost 12 years ago

We are auto-creating up to 200 preconfigured home directories on all four nodes; each home dir consists of ca. 400 dirs and files with ca. 16 MB of data. AFAIK, there are no hard links involved. So it is a massively parallel creation of many small files, with probably lots of metadata for them.

#19

Updated by Greg Farnum almost 12 years ago

Amon, are you sure you're hitting exactly this bug with your users? This particular one requires hard links to be in use to even be triggered.

Anyway, this is still in our queue but will require some in-depth investigation that we unfortunately haven't had time for.

#20

Updated by Amon Ott almost 12 years ago

No, I am not sure about that. I only saw the same assert message and a similar trace, so I assumed it was the same bug.

#21

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

#22

Updated by Samuel Hassine over 11 years ago

Hi,

Still have the problem here.

mds/AnchorServer.cc: 249: FAILED assert(anchor_map.count(curino) == 1)

 ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: (AnchorServer::handle_query(MMDSTableRequest*)+0x749) [0x678d89]
 2: (MDS::handle_deferrable_message(Message*)+0xe84) [0x4af644]
 3: (MDS::_dispatch(Message*)+0xc23) [0x4c24c3]
 4: (MDS::ms_dispatch(Message*)+0x1eb) [0x4c34ab]
 5: (SimpleMessenger::DispatchQueue::entry()+0x903) [0x7579a3]
 6: (SimpleMessenger::dispatch_entry()+0x24) [0x758744]
 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x717a5d]
 8: (()+0x7e9a) [0x7fe878223e9a]
 9: (clone()+0x6d) [0x7fe8770c4cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -4> 2012-11-28 03:42:30.334561 7fe86ff34700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 118 (234 -> 352)
    -3> 2012-11-28 03:42:30.334689 7fe870c3a700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 2820 (352 -> 3172)
    -2> 2012-11-28 03:42:30.334736 7fe86fd32700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 117 (3172 -> 3289)
    -1> 2012-11-28 03:42:30.338286 7fe870136700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 117 (3289 -> 3406)
     0> 2012-11-28 03:42:30.338311 7fe870136700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 119 (3406 -> 3525)
--- end dump of recent events ---
2012-11-28 03:42:30.343327 7fe872e40700 -1 *** Caught signal (Aborted) **
 in thread 7fe872e40700

 ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: /usr/bin/ceph-mds() [0x7e315a]
 2: (()+0xfcb0) [0x7fe87822bcb0]
 3: (gsignal()+0x35) [0x7fe877007425]
 4: (abort()+0x17b) [0x7fe87700ab8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe87795969d]
 6: (()+0xb5846) [0x7fe877957846]
 7: (()+0xb5873) [0x7fe877957873]
 8: (()+0xb596e) [0x7fe87795796e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1de) [0x7771fe]
 10: (AnchorServer::handle_query(MMDSTableRequest*)+0x749) [0x678d89]
 11: (MDS::handle_deferrable_message(Message*)+0xe84) [0x4af644]
 12: (MDS::_dispatch(Message*)+0xc23) [0x4c24c3]
 13: (MDS::ms_dispatch(Message*)+0x1eb) [0x4c34ab]
 14: (SimpleMessenger::DispatchQueue::entry()+0x903) [0x7579a3]
 15: (SimpleMessenger::dispatch_entry()+0x24) [0x758744]
 16: (SimpleMessenger::DispatchThread::entry()+0xd) [0x717a5d]
 17: (()+0x7e9a) [0x7fe878223e9a]
 18: (clone()+0x6d) [0x7fe8770c4cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Each time I try to restart the mds server, it crashes again; does someone have an idea?

Regards.
Sam

#23

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Resolved

most likely fixed by yan's patch

#24

Updated by Ian Colle over 11 years ago

Sage Weil wrote:

most likely fixed by yan's patch

commit id?

#25

Updated by Greg Farnum over 11 years ago

commit:a3aad3c3c58ffd7cd556275cd1c7dbc69b3eb4dd, which is in v0.55.
