Bug #1047 (closed)

mds: crash on anchor table query

Added by Fyodor Ustinov almost 13 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

Attempted to delete a directory containing a backup created by BackupPC.

2011-05-03 14:12:37.288166 7fcc61d6f700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:37.288184 7fcc61d6f700 mds0.cache.dir(10000064c90) total of child dentrys: n(v0 1=0+1)
2011-05-03 14:12:37.288195 7fcc61d6f700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.282326 b-2704 1=0+1)
2011-05-03 14:12:37.288215 7fcc61d6f700 mds0.cache.dir(10000064c90) [dentry #1/dcvolia/backup03/pc/tmg.net.ua/new/f%2f/fusr/flocal/fwww [2,head] auth (dn xlock x=1 by 0x44c98a0) (dversion lock w=1 last_client=4118) pv=6823 v=5626 ap=2+2 inode=0x7fcbee887b90 | lock inodepin authpin 0x7fcc09f615a0] n(v0 1=0+1)
2011-05-03 14:12:44.639291 7fcc60c6c700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:44.639312 7fcc60c6c700 mds0.cache.dir(10000064c90) total of child dentrys: n()
2011-05-03 14:12:44.639323 7fcc60c6c700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.288150 b-2704)
2011-05-03 14:12:44.639330 7fcc60c6c700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:44.639336 7fcc60c6c700 mds0.cache.dir(10000064c90) total of child dentrys: n()
2011-05-03 14:12:44.639343 7fcc60c6c700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.288150 b-2704)
mds/AnchorServer.cc: In function 'virtual void AnchorServer::handle_query(MMDSTableRequest*)', in thread '0x7fcc61d6f700'
mds/AnchorServer.cc: 242: FAILED assert(anchor_map.count(curino) == 1)
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
2: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
3: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
4: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
5: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
7: (()+0x6d8c) [0x7fcc63fa5d8c]
8: (clone()+0x6d) [0x7fcc62e5804d]
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
2: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
3: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
4: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
5: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
7: (()+0x6d8c) [0x7fcc63fa5d8c]
8: (clone()+0x6d) [0x7fcc62e5804d]
*** Caught signal (Aborted) **
 in thread 0x7fcc61d6f700
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: /usr/bin/cmds() [0x70fc5e]
 2: (()+0xfc60) [0x7fcc63faec60]
 3: (gsignal()+0x35) [0x7fcc62da5d05]
 4: (abort()+0x186) [0x7fcc62da9ab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fcc6365c6dd]
 6: (()+0xb9926) [0x7fcc6365a926]
 7: (()+0xb9953) [0x7fcc6365a953]
 8: (()+0xb9a5e) [0x7fcc6365aa5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x36a) [0x6f3b4a]
 10: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
 11: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
 12: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
 13: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
 14: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
 15: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
 16: (()+0x6d8c) [0x7fcc63fa5d8c]
 17: (clone()+0x6d) [0x7fcc62e5804d]
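
For context on the failed assert: the trace suggests handle_query resolves an inode's location by walking from the queried inode up through its ancestors in the anchor table, asserting that every ancestor entry is present. Below is a minimal, self-contained sketch of such a walk; apart from the identifiers visible in the trace (anchor_map, curino, AnchorServer::handle_query), the struct layout, field names, and root sentinel are assumptions, not the actual Ceph source.

// Minimal sketch (NOT the actual Ceph source) of the ancestor walk the
// assert text implies: resolve a queried inode by following parent links
// up the anchor table until the root.
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using inodeno_t = uint64_t;

struct Anchor {
  inodeno_t ino;     // the anchored inode
  inodeno_t dirino;  // parent directory it is anchored under (0 = root here)
};

std::map<inodeno_t, Anchor> anchor_map;

// Build the path-to-root trace for a queried inode. Every ancestor is
// expected to be in the table; this assert is the analogue of
// "FAILED assert(anchor_map.count(curino) == 1)" in the crash above.
std::vector<Anchor> query_trace(inodeno_t ino) {
  std::vector<Anchor> trace;
  inodeno_t curino = ino;
  while (curino) {
    assert(anchor_map.count(curino) == 1);  // fires if an ancestor is missing
    Anchor &a = anchor_map[curino];
    trace.insert(trace.begin(), a);
    curino = a.dirino;
  }
  return trace;
}

int main() {
  anchor_map[2] = {2, 0};  // dir 2 anchored under the root
  anchor_map[3] = {3, 2};  // file 3 anchored under dir 2
  query_trace(3);          // consistent chain: succeeds

  // If entry 2 is (incorrectly) dropped while 3 still references it,
  // the next query for 3 aborts, matching the reported crash.
  anchor_map.erase(2);
  query_trace(3);
}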

Files

mds.log (8.87 KB) - Fyodor Ustinov, 05/06/2011 04:42 AM
mds-crash-anchor_map-full.log.bz2 (155 KB) - Amon Ott, 12/03/2011 01:01 AM
mds-crash.log (11 KB) - Amon Ott, 01/02/2012 05:47 AM

#1

Updated by Fyodor Ustinov almost 13 years ago

Now I have a directory in which running 'ls' crashes the MDS.

#2

Updated by Sage Weil almost 13 years ago

  • Status changed from New to 4

It's the crash above, I take it?

Can you include a full mds log that ends with an 'ls' and crash?

It sounds like we screwed up the anchor table with all the hard links backuppc was making. We pass our basic link tests, though. Hopefully the original error is something we can reproduce as well (once we deal with this).

Thanks!
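
As a side note on reproduction: BackupPC pools identical files by hard-linking them across directories, and hard-linked inodes are what the anchor table tracks, so a test workload mainly needs lots of cross-directory hard links. Below is a rough sketch of such a workload; it is not a confirmed reproducer (simple local tests later in this thread did not trigger the bug), and all paths are hypothetical.

// Rough sketch of a BackupPC-like hard-link workload; paths hypothetical.
#include <cstdio>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  // Assumes a CephFS client mount at /mnt/ceph (hypothetical path).
  mkdir("/mnt/ceph/pool", 0755);
  mkdir("/mnt/ceph/backup", 0755);

  for (int i = 0; i < 1000; ++i) {
    std::string orig = "/mnt/ceph/pool/file" + std::to_string(i);
    std::string alt  = "/mnt/ceph/backup/file" + std::to_string(i);

    FILE *f = fopen(orig.c_str(), "w");  // write the "original" copy
    if (!f) return 1;
    fputs("data\n", f);
    fclose(f);

    link(orig.c_str(), alt.c_str());     // pool it via a hard link
  }
  // Remounting the client (to empty its cache) and then listing
  // /mnt/ceph/backup should force anchor table queries for the link
  // targets, per the workaround note in the next comment.
  return 0;
}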

#3

Updated by Sage Weil almost 13 years ago

BTW, if you know which directory the hard link(s) in this directory reference (it'll be the one where the linked files were originally written), you can work around this bug by doing an ls in that old/original dir first, and then doing the ls in the bad dir. Because the link targets will already be in cache, the anchor table won't be queried at all.

#4

Updated by Fyodor Ustinov almost 13 years ago

I have already recreated the cluster "from scratch" many times, but the log is still available.

Full mds log, from the start after the previous crash through to the new crash.

#5

Updated by Sage Weil almost 13 years ago

  • Status changed from 4 to Can't reproduce

The log doesn't have enough info. If anyone sees this again, let's reopen!

#6

Updated by Sage Weil over 12 years ago

  • Subject changed from mds crash to mds: crash on anchor table query
  • Status changed from Can't reproduce to Need More Info

Amon Ott just hit this one.

#7

Updated by Amon Ott over 12 years ago

Attached a log of a full run up to the crash. The MDS tries to recover from some problem, replays, and crashes.

#8

Updated by Amon Ott over 12 years ago

Updated Ceph to 0.39 and the bug seems to be gone.

#9

Updated by Amon Ott over 12 years ago

Got it again with 0.39. Still there.

#10

Updated by Greg Farnum over 12 years ago

Unfortunately there's not enough info in this log either. We're going to need a log with (at minimum) level 10 mds debugging, maybe higher.

Or a reliable reproducer we can run locally, of course!

#11

Updated by Amon Ott over 12 years ago

Here is what the MDS logs with debug 20. No idea if it really helps. The cluster is still in the broken state; should I try to reproduce with a recreated Ceph fs and debug 20? This could be GBs of logs.

#12

Updated by Greg Farnum over 12 years ago

This log is less illuminating than I'd hoped since it's just replay commits. :(

However, it did give me one idea to look into (are replayed ops failing if they got committed while removing an inode and we didn't notice?); let me check into it and I'll post back here shortly.

#13

Updated by Greg Farnum over 12 years ago

After looking into this a lot today, none of my ideas panned out. I also tried a few simple tests to reproduce locally and was unable to do so. :(

#14

Updated by Greg Farnum over 12 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Greg Farnum

All right, looking through a log that Amon got me after I specified the right debug arguments. I see that there are two separate anchor destroy events happening, and that's obviously incorrect. My guess is that there's a race which allows it to get marked for destruction more than once, but I haven't found it yet...
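
To make the suspected race concrete, here is a hedged, self-contained illustration (not the actual AnchorServer code): assume destroys are prepared (journaled) and then committed, with each commit decrementing the refcount of every ancestor in the destroyed inode's chain and erasing entries that hit zero. If the same destroy gets prepared twice before either commit lands, the second commit over-decrements an ancestor and erases it while a sibling still references it, which is exactly the missing-ancestor state that trips the handle_query assert.

#include <cstdint>
#include <map>
#include <vector>

using inodeno_t = uint64_t;

struct Anchor {
  inodeno_t dirino;  // parent directory (0 = root here)
  int nref;          // references from anchors beneath this one
};

std::map<inodeno_t, Anchor> anchor_map;

// "Prepare" a destroy: record the chain of inos whose refcounts the
// commit will decrement (a stand-in for the journaled table update).
std::vector<inodeno_t> prepare_destroy(inodeno_t ino) {
  std::vector<inodeno_t> chain;
  for (inodeno_t cur = ino; cur && anchor_map.count(cur);
       cur = anchor_map[cur].dirino)
    chain.push_back(cur);
  return chain;
}

// "Commit" it: apply the decrements, erasing entries that reach zero.
void commit_destroy(const std::vector<inodeno_t> &chain) {
  for (inodeno_t cur : chain) {
    auto it = anchor_map.find(cur);
    if (it != anchor_map.end() && --it->second.nref <= 0)
      anchor_map.erase(it);
  }
}

int main() {
  // Files 3 and 4 both anchored under directory 2, so nref(2) == 2.
  anchor_map[2] = {0, 2};
  anchor_map[3] = {2, 1};
  anchor_map[4] = {2, 1};

  // The suspected race: the same destroy is journaled twice before
  // either commit is applied.
  auto first  = prepare_destroy(3);
  auto second = prepare_destroy(3);

  commit_destroy(first);   // nref(2) -> 1, entry 3 erased: still consistent
  commit_destroy(second);  // nref(2) -> 0, entry 2 erased: anchor 4 now
                           // points at a missing ancestor, so the next
                           // query for 4 would hit the failed assert.
}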

#15

Updated by Sage Weil about 12 years ago

  • Status changed from In Progress to 12

#16

Updated by Amon Ott almost 12 years ago

Just hit this bug again with 0.47.2 on ext4, so it is still an issue.

#17

Updated by Amon Ott almost 12 years ago

This bug seems to be triggered together with the deadlock described on the mailing list as "OSD deadlock with cephfs client and OSD on same machine"; it is probably related to a partial sync/commit of the OSD journal or storage.

#18

Updated by Amon Ott almost 12 years ago

We are auto-creating up to 200 preconfigured home directories on all four nodes; each home dir consists of ca. 400 dirs and files with ca. 16 MB of data. AFAIK, there are no hard links involved. So it is a massively parallel creation of many small files, with probably lots of metadata for them.

#19

Updated by Greg Farnum almost 12 years ago

Amon, are you sure you're hitting exactly this bug with your users? This particular one requires hard links to be in use to even be triggered.

Anyway, this is still in our queue but will require some in-depth investigation that we unfortunately haven't had time for.

#20

Updated by Amon Ott almost 12 years ago

No, I am not sure about that. I only saw the same assert message and a similar trace, so I assumed it was the same bug.

#21

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

#22

Updated by Samuel Hassine over 11 years ago

Hi,

Still have the problem here.

mds/AnchorServer.cc: 249: FAILED assert(anchor_map.count(curino) == 1)

 ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: (AnchorServer::handle_query(MMDSTableRequest*)+0x749) [0x678d89]
 2: (MDS::handle_deferrable_message(Message*)+0xe84) [0x4af644]
 3: (MDS::_dispatch(Message*)+0xc23) [0x4c24c3]
 4: (MDS::ms_dispatch(Message*)+0x1eb) [0x4c34ab]
 5: (SimpleMessenger::DispatchQueue::entry()+0x903) [0x7579a3]
 6: (SimpleMessenger::dispatch_entry()+0x24) [0x758744]
 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x717a5d]
 8: (()+0x7e9a) [0x7fe878223e9a]
 9: (clone()+0x6d) [0x7fe8770c4cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -4> 2012-11-28 03:42:30.334561 7fe86ff34700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 118 (234 -> 352)
    -3> 2012-11-28 03:42:30.334689 7fe870c3a700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 2820 (352 -> 3172)
    -2> 2012-11-28 03:42:30.334736 7fe86fd32700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 117 (3172 -> 3289)
    -1> 2012-11-28 03:42:30.338286 7fe870136700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 117 (3289 -> 3406)
     0> 2012-11-28 03:42:30.338311 7fe870136700  5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 119 (3406 -> 3525)
--- end dump of recent events ---
2012-11-28 03:42:30.343327 7fe872e40700 -1 *** Caught signal (Aborted) **
 in thread 7fe872e40700

 ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: /usr/bin/ceph-mds() [0x7e315a]
 2: (()+0xfcb0) [0x7fe87822bcb0]
 3: (gsignal()+0x35) [0x7fe877007425]
 4: (abort()+0x17b) [0x7fe87700ab8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe87795969d]
 6: (()+0xb5846) [0x7fe877957846]
 7: (()+0xb5873) [0x7fe877957873]
 8: (()+0xb596e) [0x7fe87795796e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1de) [0x7771fe]
 10: (AnchorServer::handle_query(MMDSTableRequest*)+0x749) [0x678d89]
 11: (MDS::handle_deferrable_message(Message*)+0xe84) [0x4af644]
 12: (MDS::_dispatch(Message*)+0xc23) [0x4c24c3]
 13: (MDS::ms_dispatch(Message*)+0x1eb) [0x4c34ab]
 14: (SimpleMessenger::DispatchQueue::entry()+0x903) [0x7579a3]
 15: (SimpleMessenger::dispatch_entry()+0x24) [0x758744]
 16: (SimpleMessenger::DispatchThread::entry()+0xd) [0x717a5d]
 17: (()+0x7e9a) [0x7fe878223e9a]
 18: (clone()+0x6d) [0x7fe8770c4cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Each time I try to restart the mds server, it crashes again; does someone have an idea?

Regards.
Sam

#23

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Resolved

most likely fixed by yan's patch

#24

Updated by Ian Colle over 11 years ago

Sage Weil wrote:

most likely fixed by yan's patch

commit id?

#25

Updated by Greg Farnum over 11 years ago

commit:a3aad3c3c58ffd7cd556275cd1c7dbc69b3eb4dd, which is in v0.55.
