Bug #1548

closed

metadata inconsistencies and mds crashes

Added by Alexandre Oliva over 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%


Description

Three issues are covered in this report: empty directories that can't be removed, files that crash the mds when modified in any way, and those same files being listed twice in snapshots.

The directories are empty and do not contain snapshots of their own (enclosing directories are snapshotted, but this shouldn't prevent their removal). Some of them report a negative total size, which ls displays as a huge unsigned 64-bit number, clearly indicating some earlier state corruption:

  # ls -lAR /media/shared/.dropme
    /media/shared/.dropme:
    total 4
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:46 drivers
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:46 include
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 linux
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 net
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 os
    drwxr-xr-x 1 aoliva aoliva 18446744073709550692 Sep 19 18:15 sh64
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 staging
    drwxr-xr-x 1 aoliva aoliva 18446744073709539551 Sep 19 18:46 trunk

    /media/shared/.dropme/drivers:
    total 0

    /media/shared/.dropme/include:
    total 0

    /media/shared/.dropme/linux:
    total 0

    /media/shared/.dropme/net:
    total 0

    /media/shared/.dropme/os:
    total 0

    /media/shared/.dropme/sh64:
    total 0

    /media/shared/.dropme/staging:
    total 0

    /media/shared/.dropme/trunk:
    total 0

  # ls -la /media/shared/.dropme/*/..snap..
    /media/shared/.dropme/drivers/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:46 .
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:46 _20110406_1099511627777

    /media/shared/.dropme/include/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:46 .
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:46 _20110406_1099511627777

    /media/shared/.dropme/linux/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:15 .
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 _20110406_1099511627777

    /media/shared/.dropme/net/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:15 .
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 _20110406_1099511627777

    /media/shared/.dropme/os/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:15 .
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 _20110406_1099511627777

    /media/shared/.dropme/sh64/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:15 .
    drwxr-xr-x 1 aoliva aoliva 18446744073709550692 Sep 19 18:15 _20110406_1099511627777

    /media/shared/.dropme/staging/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:15 .
    drwxr-xr-x 1 aoliva aoliva 0 Sep 19 18:15 _20110406_1099511627777

    /media/shared/.dropme/trunk/..snap..:
    total 1
    drwxr-xr-x 0 aoliva aoliva 0 Sep 19 18:46 .
    drwxr-xr-x 1 aoliva aoliva 18446744073709539551 Sep 19 18:46 _20110406_1099511627777
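
The sizes shown for sh64 and trunk in the listings above are what a small negative number looks like when a 64-bit counter is printed as unsigned. A minimal Python sketch of that reinterpretation follows; the helper name `as_signed64` is made up for illustration and is not part of Ceph, and the two values are copied from the ls output.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set represent negative numbers.
    return value - 2**64 if value >= 2**63 else value


# Sizes reported by ls for the two suspicious directories.
reported = {
    "sh64": 18446744073709550692,
    "trunk": 18446744073709539551,
}

for name, size in reported.items():
    print(name, as_signed64(size))
```

This prints -924 for sh64 and -12065 for trunk, which suggests the directory size accounting was decremented below zero at some earlier point.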

The attached mds log, captured with debug logging set to 100, corresponds to the following shell commands:

  # service ceph start mds
    === mds.2 ===
    Starting Ceph mds.2 on frit...
    * WARNING: Ceph is still under development. Any feedback can be directed *
    * at or http://ceph.newdream.net/. *
    starting mds.2 at 0.0.0.0:6800/15440
  # mount /media/shared
    * WARNING: Ceph is still under development. Any feedback can be directed *
    * at or http://ceph.newdream.net/. *
    cfuse17223: starting ceph client
    cfuse17223: starting fuse
  # rm -rf /media/shared/.dropme
    rm: cannot remove `/media/shared/.dropme/drivers': Directory not empty
    rm: cannot remove `/media/shared/.dropme/include': Directory not empty
    rm: cannot remove `/media/shared/.dropme/linux': Directory not empty
    rm: cannot remove `/media/shared/.dropme/net': Directory not empty
    rm: cannot remove `/media/shared/.dropme/os': Directory not empty
    rm: cannot remove `/media/shared/.dropme/sh64': Directory not empty
    rm: cannot remove `/media/shared/.dropme/staging': Directory not empty
    rm: cannot remove `/media/shared/.dropme/trunk': Directory not empty
  # ls /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_*
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
  # ls /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_*
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
  # ls /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../*/perf_*
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../_20110406_1099511627777/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../_20110406_1099511627777/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../_20110406_1099511627779/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../_20110406_1099511627779/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
  # ls /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../*/perf_*
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../_20110406_1099511627777/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/..snap../_20110406_1099511627779/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
  # ls /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_*
    /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch
  # rm /media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_*
    rm: remove regular file `/media/shared/l/aoliva/priv/fsfla/svn/software/linux-libre/freed-ora/current/f13/perf_events-fix-perf_counter_mmap-hook-in-mprotect.patch'? y

At this point, the mds crashed and rm would not complete. Killing cfuse enables a new mds to become active without crashing right away.
Note how the directories that were empty couldn't be removed; how the perf_* file was present in two snapshots of enclosing directories, yet shell glob expansion listed it twice the first time each snapshot subdir was accessed; and (from the logs) how the removal caused the mds to crash with this backtrace:

  # gdb /l/tmp/build/rpmbuild/BUILD/ceph-0.34/src/cmds /core.15441
    GNU gdb (GDB) Fedora (7.2-51.fc14)
    Copyright (C) 2010 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-redhat-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /l/tmp/build/rpmbuild/BUILD/ceph-0.34/src/cmds...done.
    [New Thread 15447]
    [New Thread 15448]
    [New Thread 15452]
    [New Thread 20114]
    [New Thread 15449]
    [New Thread 15482]
    [New Thread 15445]
    [New Thread 15446]
    [New Thread 15444]
    [New Thread 20119]
    [New Thread 15453]
    [New Thread 15485]
    [New Thread 15441]
    [New Thread 15443]
    Missing separate debuginfo for
    Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/3a/8fe6cb0063d56fc9be76ecd085c05f1b8a76e6
    Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
    [Thread debugging using libthread_db enabled]
    Loaded symbols for /lib64/libpthread.so.0
    Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
    Loaded symbols for /lib64/librt.so.1
    Reading symbols from /usr/lib64/libtcmalloc.so.0...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib64/libtcmalloc.so.0
    Reading symbols from /usr/lib64/libcryptopp.so.6...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib64/libcryptopp.so.6
    Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib64/libstdc++.so.6
    Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
    Loaded symbols for /lib64/libm.so.6
    Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
    Loaded symbols for /lib64/libgcc_s.so.1
    Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
    Loaded symbols for /lib64/libc.so.6
    Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
    Loaded symbols for /lib64/ld-linux-x86-64.so.2
    Reading symbols from /usr/lib64/libunwind.so.7...(no debugging symbols found)...done.
    Loaded symbols for /usr/lib64/libunwind.so.7
    Core was generated by `/usr/bin/cmds -i 2 -c /etc/ceph/ceph.conf'.
    Program terminated with signal 11, Segmentation fault.
    #0 0x0000003c51088e83 in __memcpy_sse2 () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install cryptopp-5.6.1-3.fc14.x86_64 glibc-2.14-4.x86_64 google-perftools-1.7-3.fc14.x86_64 libgcc-4.5.1-4.fc14.x86_64 libstdc++-4.5.1-4.fc14.x86_64 libunwind-0.99-0.13.20090430betagit4b8404d1.fc13.x86_64
    (gdb) where
    #0 0x0000003c51088e83 in __memcpy_sse2 () from /lib64/libc.so.6
    #1 0x00000000007742c4 in ceph::BackTrace::print (this=0x7f9c6958fef0, out=...)
    at common/BackTrace.cc:37
    #2 0x00000000007bb7d1 in handle_fatal_signal (signum=11)
    at global/signal_handler.cc:103
    #3 <signal handler called>
    #4 0x0000003c51088e83 in __memcpy_sse2 () from /lib64/libc.so.6
    #5 0x00000000007742c4 in ceph::BackTrace::print (this=0x7f9c69590d30, out=...)
    at common/BackTrace.cc:37
    #6 0x00000000007bb7d1 in handle_fatal_signal (signum=6)
    at global/signal_handler.cc:103
    #7 <signal handler called>
    #8 0x0000003c510352d5 in raise () from /lib64/libc.so.6
    #9 0x0000003c51036beb in abort () from /lib64/libc.so.6
    #10 0x0000003c574bc08d in __gnu_cxx::__verbose_terminate_handler() ()
    from /usr/lib64/libstdc++.so.6
    #11 0x0000003c574ba2a6 in ?? () from /usr/lib64/libstdc++.so.6
    #12 0x0000003c574ba2d3 in std::terminate() () from /usr/lib64/libstdc++.so.6
    #13 0x0000003c574ba3de in __cxa_throw () from /usr/lib64/libstdc++.so.6
    #14 0x0000000000747f57 in ceph::__ceph_assert_fail (
    assertion=<value optimized out>, file=<value optimized out>,
    line=<value optimized out>, func=<value optimized out>)
    at common/assert.cc:70
    #15 0x000000000055efe1 in MDCache::add_inode (this=0x1938000, in=0xe3428e0)
    at mds/MDCache.cc:202
    #16 0x0000000000578dac in MDCache::cow_inode (this=0x1938000, in=0xe32d180,
    last=...) at mds/MDCache.cc:1427
    #17 0x0000000000579f25 in MDCache::journal_cow_dentry (this=0x1938000,
    mut=0x119d0000, metablob=0x1889428, dn=0xcb76050, follows=...,
    pcow_inode=0x0, dnl=0xcb76148) at mds/MDCache.cc:1533
    #18 0x0000000000524803 in Server::_unlink_local (this=0x18800e0,
    mdr=0x119d0000, dn=0xcb76050, straydn=0xcb7abe8) at mds/Server.cc:4563
    #19 0x00000000005357a1 in Server::handle_client_unlink (this=0x18800e0,
    mdr=0x119d0000) at mds/Server.cc:4488
    #20 0x0000000000546121 in Server::dispatch_client_request (this=0x18800e0,
    mdr=0x119d0000) at mds/Server.cc:1186
    #21 0x0000000000568276 in MDCache::dispatch_request (this=0x1938000,
    mdr=0x119d0000) at mds/MDCache.cc:7470
    #22 0x000000000054a0d1 in C_MDS_RetryRequest::finish(int) ()
    #23 0x00000000004b3a8a in Context::complete (this=0x298aa00,
    r=<value optimized out>) at ./include/Context.h:41
    #24 0x000000000055a0d1 in finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int) ()
    #25 0x00000000005f6998 in finish_waiting (this=0x1831d40, lock=0xcb761a0,
    first=<value optimized out>, pneed_issue=0x0, pfinishers=0x0)
    at mds/mdstypes.h:1528
    #26 finish_waiters (this=0x1831d40, lock=0xcb761a0,
    first=<value optimized out>, pneed_issue=0x0, pfinishers=0x0)
    at mds/SimpleLock.h:304
    #27 Locker::eval_gather (this=0x1831d40, lock=0xcb761a0,
    first=<value optimized out>, pneed_issue=0x0, pfinishers=0x0)
    at mds/Locker.cc:720
    #28 0x0000000000642a33 in CDentry::remove_client_lease (this=0xcb76050,
    l=<value optimized out>, locker=0x1831d40) at mds/CDentry.cc:573
    #29 0x00000000005e831c in Locker::handle_client_lease (this=0x1831d40,
    m=0x1962f20) at mds/Locker.cc:2854
    #30 0x00000000004c2d7f in MDS::handle_deferrable_message (this=0x18a8500,
    m=0x1962f20) at mds/MDS.cc:1758
    #31 0x00000000004d0601 in MDS::_dispatch (this=0x18a8500, m=0x1962f20)
    at mds/MDS.cc:1810
    #32 0x00000000004d1c11 in MDS::ms_dispatch (this=0x18a8500, m=0x1962f20)
    at mds/MDS.cc:1618
    #33 0x000000000073597b in ms_deliver_dispatch (this=0x18a8a00)
    at msg/Messenger.h:102
    #34 SimpleMessenger::dispatch_entry (this=0x18a8a00)
    at msg/SimpleMessenger.cc:356
    #35 0x00000000004ad17c in SimpleMessenger::DispatchThread::entry (
    this=<value optimized out>) at ./msg/SimpleMessenger.h:546
    #36 0x0000003c51807af1 in start_thread () from /lib64/libpthread.so.0
    #37 0x0000003c510dfb7d in clone () from /lib64/libc.so.6
    (gdb)
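
As a reading aid for the backtrace above, here is a minimal Python sketch that splits gdb `where` output into frames and picks out the first frame whose source file lies under `mds/`. The helper names `parse_frames` and `first_frame_under` are hypothetical, and the sample frames are copied from this report; it has only been reasoned through against this style of output, not against gdb output in general.

```python
import re


def parse_frames(backtrace: str):
    """Return (number, function, file, line) tuples from gdb 'where' output.

    file and line are None for frames without source information.
    """
    frames = []
    # A new frame starts at a line beginning with "#<number> ".
    for chunk in re.split(r"(?m)^[ \t]*(?=#\d+ )", backtrace):
        text = " ".join(chunk.split())  # fold continuation lines into one
        head = re.match(r"#(\d+) (?:0x[0-9a-f]+ in )?(.+?) \(", text)
        if not head:
            continue  # e.g. "<signal handler called>" or gdb chatter
        loc = re.search(r" at (\S+):(\d+)$", text)
        frames.append((
            int(head.group(1)),
            head.group(2),
            loc.group(1) if loc else None,
            int(loc.group(2)) if loc else None,
        ))
    return frames


def first_frame_under(frames, prefix: str):
    """Return the innermost frame whose source path starts with prefix."""
    for frame in frames:
        if frame[2] is not None and frame[2].startswith(prefix):
            return frame
    return None


SAMPLE = """\
#9 0x0000003c51036beb in abort () from /lib64/libc.so.6
#15 0x000000000055efe1 in MDCache::add_inode (this=0x1938000, in=0xe3428e0)
    at mds/MDCache.cc:202
#16 0x0000000000578dac in MDCache::cow_inode (this=0x1938000, in=0xe32d180,
    last=...) at mds/MDCache.cc:1427
"""

print(first_frame_under(parse_frames(SAMPLE), "mds/"))
```

For the frames in this report, the first match is MDCache::add_inode at mds/MDCache.cc:202, reached from MDCache::cow_inode while handling the unlink. That is the frame directly below the assertion failure; the later segfault in frames #0 to #2 happens inside the signal handler while it prints the backtrace.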

I'll retain the core file and the filesystem for a while, though I'm planning on starting over from scratch for 0.35, to try to get a clean slate without any internal inconsistencies that might show up only at a later time.

(I'm getting HTTP 500 errors when attaching the 16MB mds log, so I'll open the bug first and try to attach the file later.)

Actions #1

Updated by Sage Weil about 12 years ago

  • Category set to 1
Actions #2

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved

commit:44bc687d98f931b15538805d3923492d62dca779
