Bug #1047
mds: crash on anchor table query
Status: Closed
Description
Attempted to delete a directory containing a backup created by BackupPC.
2011-05-03 14:12:37.288166 7fcc61d6f700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:37.288184 7fcc61d6f700 mds0.cache.dir(10000064c90) total of child dentrys: n(v0 1=0+1)
2011-05-03 14:12:37.288195 7fcc61d6f700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.282326 b-2704 1=0+1)
2011-05-03 14:12:37.288215 7fcc61d6f700 mds0.cache.dir(10000064c90) [dentry #1/dcvolia/backup03/pc/tmg.net.ua/new/f%2f/fusr/flocal/fwww [2,head] auth (dn xlock x=1 by 0x44c98a0) (dversion lock w=1 last_client=4118) pv=6823 v=5626 ap=2+2 inode=0x7fcbee887b90 | lock inodepin authpin 0x7fcc09f615a0] n(v0 1=0+1)
2011-05-03 14:12:44.639291 7fcc60c6c700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:44.639312 7fcc60c6c700 mds0.cache.dir(10000064c90) total of child dentrys: n()
2011-05-03 14:12:44.639323 7fcc60c6c700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.288150 b-2704)
2011-05-03 14:12:44.639330 7fcc60c6c700 mds0.cache.dir(10000064c90) mismatch between child accounted_rstats and my rstats!
2011-05-03 14:12:44.639336 7fcc60c6c700 mds0.cache.dir(10000064c90) total of child dentrys: n()
2011-05-03 14:12:44.639343 7fcc60c6c700 mds0.cache.dir(10000064c90) my rstats: n(v397 rc2011-05-03 14:12:37.288150 b-2704)
mds/AnchorServer.cc: In function 'virtual void AnchorServer::handle_query(MMDSTableRequest*)', in thread '0x7fcc61d6f700'
mds/AnchorServer.cc: 242: FAILED assert(anchor_map.count(curino) == 1)
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
2: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
3: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
4: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
5: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
7: (()+0x6d8c) [0x7fcc63fa5d8c]
8: (clone()+0x6d) [0x7fcc62e5804d]
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
2: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
3: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
4: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
5: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
7: (()+0x6d8c) [0x7fcc63fa5d8c]
8: (clone()+0x6d) [0x7fcc62e5804d]
*** Caught signal (Aborted) **
in thread 0x7fcc61d6f700
ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
1: /usr/bin/cmds() [0x70fc5e]
2: (()+0xfc60) [0x7fcc63faec60]
3: (gsignal()+0x35) [0x7fcc62da5d05]
4: (abort()+0x186) [0x7fcc62da9ab6]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fcc6365c6dd]
6: (()+0xb9926) [0x7fcc6365a926]
7: (()+0xb9953) [0x7fcc6365a953]
8: (()+0xb9a5e) [0x7fcc6365aa5e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x36a) [0x6f3b4a]
10: (AnchorServer::handle_query(MMDSTableRequest*)+0x115) [0x664c65]
11: (MDS::handle_deferrable_message(Message*)+0xae4) [0x4a0564]
12: (MDS::_dispatch(Message*)+0x146e) [0x4b132e]
13: (MDS::ms_dispatch(Message*)+0x5b) [0x4b1aeb]
14: (SimpleMessenger::dispatch_entry()+0x7ea) [0x483eba]
15: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x47a95c]
16: (()+0x6d8c) [0x7fcc63fa5d8c]
17: (clone()+0x6d) [0x7fcc62e5804d]
Files
Updated by Fyodor Ustinov almost 13 years ago
Now I have a directory in which running 'ls' makes the mds crash.
Updated by Sage Weil almost 13 years ago
- Status changed from New to 4
It's the crash above, I take it?
Can you include a full mds log that ends with an 'ls' and crash?
It sounds like we screwed up the anchor table with all the hard links backuppc was making. We pass our basic link tests, though. Hopefully the original error is something we can reproduce as well (once we deal with this).
Thanks!
Updated by Sage Weil almost 13 years ago
BTW, if you know which directory the hard link(s) in this directory reference (it will be the directory where the linked files in the bad dir were originally written), you can work around this bug by doing an ls in that old/original dir first, and then doing the ls in the bad dir. Because the link targets will already be in cache, the anchor table won't be queried at all.
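The ordering behind that workaround can be sketched as follows. All paths and file names here are hypothetical (the report does not give the real ones); the self-contained demo just builds a hard-linked layout locally to show the order of the two listings:

```shell
# Hypothetical layout: files in "bad" are hard links to files that were
# first written under "orig" (as BackupPC's pooling would produce).
mkdir -p demo/orig demo/bad
echo data > demo/orig/file1
ln demo/orig/file1 demo/bad/file1   # hard link into the "bad" directory

ls demo/orig   # list the original dir first, so the link targets get cached
ls demo/bad    # only then list the problem directory
```

On a real CephFS mount, the first ls is what primes the MDS cache so the second one never has to query the anchor table.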
Updated by Fyodor Ustinov almost 13 years ago
I have already recreated the cluster "from scratch" many times, but the log is still available.
Full mds log from start after previous crash and to new crash.
Updated by Sage Weil almost 13 years ago
- Status changed from 4 to Can't reproduce
The log doesn't have enough info. If anyone sees this again, let's reopen!
Updated by Sage Weil over 12 years ago
- Subject changed from mds crash to mds: crash on anchor table query
- Status changed from Can't reproduce to Need More Info
Amon Ott just hit this one.
Updated by Amon Ott over 12 years ago
Attached a log of a full run up to the crash. MDS tries to recover from some problem, replays and crashes.
Updated by Amon Ott over 12 years ago
Updated Ceph to 0.39 and the bug seems to be gone.
Updated by Greg Farnum over 12 years ago
Unfortunately there's not enough info in this log either. We're going to need a log with (at minimum) level 10 mds debugging, maybe higher.
Or a reliable reproducer we can run locally, of course!
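For reference, MDS debug logging at that level is typically enabled through ceph.conf. The option names below are the standard Ceph debug settings; the file path is a local example rather than a live /etc/ceph/ceph.conf:

```shell
# Append MDS debug settings to a local example config.
# On a real node this would go in /etc/ceph/ceph.conf, followed by an mds restart.
cat >> ceph.conf.example <<'EOF'
[mds]
    debug mds = 20
    debug ms = 1
EOF
```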
Updated by Amon Ott over 12 years ago
- File mds-crash.log mds-crash.log added
Here is what the MDS logs with debug 20. No idea if it really helps. The cluster is still in the broken state; should I try to reproduce with a recreated ceph fs and debug 20? That could be GBs of logs.
Updated by Greg Farnum over 12 years ago
This log is less illuminating than I'd hoped since it's just replay commits. :(
However, it did give me one idea to look into (are replayed ops failing if they got committed while removing an inode and we didn't notice?); let me check into it and I'll post back here shortly.
Updated by Greg Farnum over 12 years ago
After looking into this a lot today, none of my ideas panned out. I also tried a few simple tests to reproduce locally and was unable to do so. :(
Updated by Greg Farnum over 12 years ago
- Status changed from Need More Info to In Progress
- Assignee set to Greg Farnum
All right, looking through a log that Amon got me after I specified the right debug arguments. I see that there are two separate anchor destroy events happening, and that's obviously incorrect. My guess is that there's a race which allows it to get marked for destruction more than once, but I haven't found it yet...
Updated by Amon Ott almost 12 years ago
Just hit this bug again with 0.47.2 on ext4, so it is still an issue.
Updated by Amon Ott almost 12 years ago
This bug seems to be triggered together with the deadlock described on the mailing list as "OSD deadlock with cephfs client and OSD on same machine"; it is probably related to a partial sync/commit of the OSD journal or storage.
Updated by Amon Ott almost 12 years ago
We are auto-creating up to 200 preconfigured home directories on all four nodes; each home dir consists of ca. 400 dirs and files with ca. 16 MB of data. AFAIK, there are no hard links involved. So it is a massively parallel creation of many small files, probably with lots of metadata for them.
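A small-scale sketch of that workload (counts scaled down for illustration; the real runs used ~200 home dirs of ~400 entries each, created in parallel across four nodes):

```shell
# Create several "home" directories in parallel, each with many small files.
# Sizes and counts are illustrative, not the real workload's.
mkdir -p workload
for u in $(seq 1 4); do
  (
    mkdir -p "workload/home$u"
    for f in $(seq 1 10); do
      echo "data" > "workload/home$u/file$f"
    done
  ) &
done
wait   # all per-directory jobs finish before we continue
```

Each background subshell generates its own directory tree, which is what makes the metadata operations land on the MDS concurrently.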
Updated by Greg Farnum almost 12 years ago
Amon, are you sure you're hitting exactly this bug with your users? This particular one requires hard links to be in use to even be triggered.
Anyway, this is still in our queue but will require some in-depth investigation that we unfortunately haven't had time for.
Updated by Amon Ott almost 12 years ago
No, I am not sure about that. Only saw the same assert message and a similar trace, so I assumed it to be the same bug.
Updated by Sage Weil over 11 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Updated by Samuel Hassine over 11 years ago
Hi,
Still have the problem here.
mds/AnchorServer.cc: 249: FAILED assert(anchor_map.count(curino) == 1)
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: (AnchorServer::handle_query(MMDSTableRequest*)+0x749) [0x678d89]
2: (MDS::handle_deferrable_message(Message*)+0xe84) [0x4af644]
3: (MDS::_dispatch(Message*)+0xc23) [0x4c24c3]
4: (MDS::ms_dispatch(Message*)+0x1eb) [0x4c34ab]
5: (SimpleMessenger::DispatchQueue::entry()+0x903) [0x7579a3]
6: (SimpleMessenger::dispatch_entry()+0x24) [0x758744]
7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x717a5d]
8: (()+0x7e9a) [0x7fe878223e9a]
9: (clone()+0x6d) [0x7fe8770c4cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-4> 2012-11-28 03:42:30.334561 7fe86ff34700 5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 118 (234 -> 352)
-3> 2012-11-28 03:42:30.334689 7fe870c3a700 5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 2820 (352 -> 3172)
-2> 2012-11-28 03:42:30.334736 7fe86fd32700 5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 117 (3172 -> 3289)
-1> 2012-11-28 03:42:30.338286 7fe870136700 5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 117 (3289 -> 3406)
0> 2012-11-28 03:42:30.338311 7fe870136700 5 throttle(msgr_dispatch_throttler-mds 0x233f9e0) get 119 (3406 -> 3525)
--- end dump of recent events ---
2012-11-28 03:42:30.343327 7fe872e40700 -1 *** Caught signal (Aborted) **
in thread 7fe872e40700
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: /usr/bin/ceph-mds() [0x7e315a]
2: (()+0xfcb0) [0x7fe87822bcb0]
3: (gsignal()+0x35) [0x7fe877007425]
4: (abort()+0x17b) [0x7fe87700ab8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe87795969d]
6: (()+0xb5846) [0x7fe877957846]
7: (()+0xb5873) [0x7fe877957873]
8: (()+0xb596e) [0x7fe87795796e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1de) [0x7771fe]
10: (AnchorServer::handle_query(MMDSTableRequest*)+0x749) [0x678d89]
11: (MDS::handle_deferrable_message(Message*)+0xe84) [0x4af644]
12: (MDS::_dispatch(Message*)+0xc23) [0x4c24c3]
13: (MDS::ms_dispatch(Message*)+0x1eb) [0x4c34ab]
14: (SimpleMessenger::DispatchQueue::entry()+0x903) [0x7579a3]
15: (SimpleMessenger::dispatch_entry()+0x24) [0x758744]
16: (SimpleMessenger::DispatchThread::entry()+0xd) [0x717a5d]
17: (()+0x7e9a) [0x7fe878223e9a]
18: (clone()+0x6d) [0x7fe8770c4cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Each time I restart the mds server it crashes again. Does anyone have an idea?
Regards.
Sam
Updated by Sage Weil over 11 years ago
- Status changed from 12 to Resolved
Most likely fixed by Yan's patch.
Updated by Ian Colle over 11 years ago
Sage Weil wrote:
most likely fixed by yan's patch
commit id?
Updated by Greg Farnum over 11 years ago
commit:a3aad3c3c58ffd7cd556275cd1c7dbc69b3eb4dd, which is in v0.55.