Bug #16829

ceph-mds crashing constantly

Added by Tomasz Torcz almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm using Ceph packages from Fedora 24: ceph-mds-10.2.2-2.fc24.x86_64

I created a simple CephFS once and stored some data inside. Then I removed the filesystem using "ceph fs rm" and created a new one (a rough sketch of that sequence follows the log below). Now I cannot access this filesystem; the ceph-mds process crashes just after startup:

ceph-mds8599: starting mds.dashboardpc at :/0
ceph-mds8599: 2016-07-27 14:34:34.420858 7f27934bb700 -1 log_channel(cluster) log [ERR] : replayed stray Session close event for client.14334 3.193.149.4:0/2612743178 from time 2016-06-05 09:23:48.286189, ignoring
ceph-mds8599: 2016-07-27 14:34:39.440266 7f27958ca700 -1 log_channel(cluster) log [ERR] : loaded dup inode 10000000000 [2,head] v18767 at /dane, but inode 10000000000.head v77080 already exists at /tmp2
ceph-mds8599: mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f27960cb700 time 2016-07-27 14:34:39.528175
ceph-mds8599: mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
ceph-mds8599: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
ceph-mds8599: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f2f9cc72a0]
ceph-mds8599: 2: (CDir::try_remove_dentries_for_stray()+0x150) [0x55f2f9a83220]
ceph-mds8599: 3: (StrayManager::__eval_stray(CDentry*, bool)+0x8e9) [0x55f2f9a02589]
ceph-mds8599: 4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f2f9a02a12]
ceph-mds8599: 5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f2f99571ad]
ceph-mds8599: 6: (MDSInternalContextBase::complete(int)+0x20b) [0x55f2f9b0dffb]
ceph-mds8599: 7: (MDSRank::_advance_queues()+0x66b) [0x55f2f98b7a0b]
ceph-mds8599: 8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f2f98b7f0a]
ceph-mds8599: 9: (()+0x75ca) [0x7f279f6bc5ca]
ceph-mds8599: 10: (clone()+0x6d) [0x7f279e0fbead]
ceph-mds8599: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ceph-mds8599: 2016-07-27 14:34:39.531349 7f27960cb700 -1 mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f27960cb700 time 2016-07-27 14:34:39.528175
ceph-mds8599: mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
ceph-mds8599: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
ceph-mds8599: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f2f9cc72a0]
ceph-mds8599: 2: (CDir::try_remove_dentries_for_stray()+0x150) [0x55f2f9a83220]
ceph-mds8599: 3: (StrayManager::__eval_stray(CDentry*, bool)+0x8e9) [0x55f2f9a02589]
ceph-mds8599: 4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f2f9a02a12]
ceph-mds8599: 5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f2f99571ad]
ceph-mds8599: 6: (MDSInternalContextBase::complete(int)+0x20b) [0x55f2f9b0dffb]
ceph-mds8599: 7: (MDSRank::_advance_queues()+0x66b) [0x55f2f98b7a0b]
ceph-mds8599: 8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f2f98b7f0a]
ceph-mds8599: 9: (()+0x75ca) [0x7f279f6bc5ca]
ceph-mds8599: 10: (clone()+0x6d) [0x7f279e0fbead]
ceph-mds8599: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
audit8660: ANOM_ABEND auid=4294967295 uid=167 gid=167 ses=4294967295 subj=system_u:system_r:ceph_t:s0 pid=8660 comm="mds_rank_progr" exe="/usr/bin/ceph-mds" sig=6
ceph-mds8599: -451> 2016-07-27 14:34:34.420858 7f27934bb700 -1 log_channel(cluster) log [ERR] : replayed stray Session close event for client.14334 3.193.149.4:0/2612743178 from time 2016-06-05 09:23:48.286189, ignoring
ceph-mds8599: -329> 2016-07-27 14:34:39.440266 7f27958ca700 -1 log_channel(cluster) log [ERR] : loaded dup inode 10000000000 [2,head] v18767 at /dane, but inode 10000000000.head v77080 already exists at /tmp2
ceph-mds8599: 0> 2016-07-27 14:34:39.531349 7f27960cb700 -1 mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f27960cb700 time 2016-07-27 14:34:39.528175
ceph-mds8599: mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
ceph-mds8599: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
ceph-mds8599: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f2f9cc72a0]
ceph-mds8599: 2: (CDir::try_remove_dentries_for_stray()+0x150) [0x55f2f9a83220]
ceph-mds8599: 3: (StrayManager::__eval_stray(CDentry*, bool)+0x8e9) [0x55f2f9a02589]
ceph-mds8599: 4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f2f9a02a12]
ceph-mds8599: 5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f2f99571ad]
ceph-mds8599: 6: (MDSInternalContextBase::complete(int)+0x20b) [0x55f2f9b0dffb]
ceph-mds8599: 7: (MDSRank::_advance_queues()+0x66b) [0x55f2f98b7a0b]
ceph-mds8599: 8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f2f98b7f0a]
ceph-mds8599: 9: (()+0x75ca) [0x7f279f6bc5ca]
ceph-mds8599: 10: (clone()+0x6d) [0x7f279e0fbead]
ceph-mds8599: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ceph-mds8599: *** Caught signal (Aborted) **
ceph-mds8599: in thread 7f27960cb700 thread_name:mds_rank_progr
ceph-mds8599: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
ceph-mds8599: 1: (()+0x514a2e) [0x55f2f9bb9a2e]
ceph-mds8599: 2: (()+0x10c30) [0x7f279f6c5c30]
ceph-mds8599: 3: (gsignal()+0x35) [0x7f279e02d6f5]
ceph-mds8599: 4: (abort()+0x16a) [0x7f279e02f2fa]
ceph-mds8599: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55f2f9cc748b]
ceph-mds8599: 6: (CDir::try_remove_dentries_for_stray()+0x150) [0x55f2f9a83220]
ceph-mds8599: 7: (StrayManager::__eval_stray(CDentry*, bool)+0x8e9) [0x55f2f9a02589]
ceph-mds8599: 8: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f2f9a02a12]
ceph-mds8599: 9: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f2f99571ad]
ceph-mds8599: 10: (MDSInternalContextBase::complete(int)+0x20b) [0x55f2f9b0dffb]
ceph-mds8599: 11: (MDSRank::_advance_queues()+0x66b) [0x55f2f98b7a0b]
ceph-mds8599: 12: (MDSRank::ProgressThread::entry()+0x4a) [0x55f2f98b7f0a]
ceph-mds8599: 13: (()+0x75ca) [0x7f279f6bc5ca]
ceph-mds8599: 14: (clone()+0x6d) [0x7f279e0fbead]
ceph-mds8599: 2016-07-27 14:34:39.535832 7f27960cb700 -1 *** Caught signal (Aborted) **
ceph-mds8599: in thread 7f27960cb700 thread_name:mds_rank_progr
ceph-mds8599: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
ceph-mds8599: 1: (()+0x514a2e) [0x55f2f9bb9a2e]
ceph-mds8599: 2: (()+0x10c30) [0x7f279f6c5c30]
ceph-mds8599: 3: (gsignal()+0x35) [0x7f279e02d6f5]
ceph-mds8599: 4: (abort()+0x16a) [0x7f279e02f2fa]
ceph-mds8599: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55f2f9cc748b]
ceph-mds8599: 6: (CDir::try_remove_dentries_for_stray()+0x150) [0x55f2f9a83220]
ceph-mds8599: 7: (StrayManager::__eval_stray(CDentry*, bool)+0x8e9) [0x55f2f9a02589]
ceph-mds8599: 8: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f2f9a02a12]
ceph-mds8599: 9: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f2f99571ad]
ceph-mds8599: 10: (MDSInternalContextBase::complete(int)+0x20b) [0x55f2f9b0dffb]
ceph-mds8599: 11: (MDSRank::_advance_queues()+0x66b) [0x55f2f98b7a0b]
ceph-mds8599: 12: (MDSRank::ProgressThread::entry()+0x4a) [0x55f2f98b7f0a]
ceph-mds8599: 13: (()+0x75ca) [0x7f279f6bc5ca]
ceph-mds8599: 14: (clone()+0x6d) [0x7f279e0fbead]
ceph-mds8599: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ceph-mds8599: 0> 2016-07-27 14:34:39.535832 7f27960cb700 -1 *** Caught signal (Aborted) **
ceph-mds8599: in thread 7f27960cb700 thread_name:mds_rank_progr
ceph-mds8599: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
ceph-mds8599: 1: (()+0x514a2e) [0x55f2f9bb9a2e]
ceph-mds8599: 2: (()+0x10c30) [0x7f279f6c5c30]
ceph-mds8599: 3: (gsignal()+0x35) [0x7f279e02d6f5]
ceph-mds8599: 4: (abort()+0x16a) [0x7f279e02f2fa]
ceph-mds8599: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55f2f9cc748b]
ceph-mds8599: 6: (CDir::try_remove_dentries_for_stray()+0x150) [0x55f2f9a83220]
ceph-mds8599: 7: (StrayManager::__eval_stray(CDentry*, bool)+0x8e9) [0x55f2f9a02589]
ceph-mds8599: 8: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f2f9a02a12]
ceph-mds8599: 9: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f2f99571ad]
ceph-mds8599: 10: (MDSInternalContextBase::complete(int)+0x20b) [0x55f2f9b0dffb]
ceph-mds8599: 11: (MDSRank::_advance_queues()+0x66b) [0x55f2f98b7a0b]
ceph-mds8599: 12: (MDSRank::ProgressThread::entry()+0x4a) [0x55f2f98b7f0a]
ceph-mds8599: 13: (()+0x75ca) [0x7f279f6bc5ca]
ceph-mds8599: 14: (clone()+0x6d) [0x7f279e0fbead]
ceph-mds8599: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
systemd1: : Main process exited, code=killed, status=6/ABRT
systemd1: : Unit entered failed state.
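
For reference, the sequence described above was roughly the following. This is a hedged reconstruction only: the filesystem and pool names are taken from the "ceph fs ls" output below, and the exact flags and commands used at the time are not known.

    # Approximate reconstruction of the reported sequence (names and flags are assumptions)
    # MDS daemons have to be stopped/failed before "fs rm" will proceed
    ceph fs rm CoEPH --yes-i-really-mean-it
    ceph fs new CoEPH CoEPH_metadata CoEPH_data_ec   # reuses the old, non-empty metadata pool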

# ceph fs ls
name: CoEPH, metadata pool: CoEPH_metadata, data pools: [CoEPH_data_ec ]

# ceph fs dump
dumped fsmap epoch 1081503
e1081503
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table}

Filesystem 'CoEPH' (5)
fs_name CoEPH
epoch 1081503
flags 4
created 2016-06-02 09:47:20.833036
modified 2016-07-25 09:06:15.506987
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 608353
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table}
max_mds 1
in 0,1
up {0=22021782,1=21854109}
failed
damaged
stopped
data_pools 3
metadata_pool 1
inline_data enabled
22021782: 3.193.150.32:6800/19218 'dashboardpc' mds.0.1081499 up:active seq 5
21854109: 3.193.148.63:6800/885 'switcheroo' mds.1.957438 up:active seq 8

# ceph df
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED
    5733G  3885G  1845G     32.18
POOLS:
    NAME                   ID  USED    %USED  MAX AVAIL  OBJECTS
    rbd                    0   142G    4.98   1293G      36578
    CoEPH_metadata         1   66134k  0      862G       94
    CoEPH_data_ec          3   858G    22.47  1725G      224082
    CoEPH_data_replicated  5   81496M  4.16   862G       25780

CoEPH_data_replicated is a cache tier pool in front of CoEPH_data_ec.
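
For context, a cache tier like that is typically wired up along the following lines. This is only a sketch of the usual commands; the cache mode and any tuning actually used on this cluster are not known.

    # Typical cache-tier setup for an erasure-coded base pool (sketch; writeback mode is an assumption)
    ceph osd tier add CoEPH_data_ec CoEPH_data_replicated
    ceph osd tier cache-mode CoEPH_data_replicated writeback
    ceph osd tier set-overlay CoEPH_data_ec CoEPH_data_replicated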

#1

Updated by Nathan Cutler almost 8 years ago

  • Project changed from Ceph to CephFS
#2

Updated by Greg Farnum over 7 years ago

  • Status changed from New to Closed

It looks like you did "fs rm" and "fs new" but kept the same metadata pool in RADOS. That doesn't work; you can resolve this by doing "fs rm" again, deleting the pools, and then creating new ones.
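
A hedged sketch of that sequence, using the pool names from the report (the confirmation flags, PG counts, and the new data pool name are assumptions; as discussed below, the existing data pools may have to be kept if they hold non-CephFS data):

    # Sketch only: tear down the filesystem and its pools, then start fresh.
    # MDS daemons need to be stopped before "fs rm" will proceed.
    ceph fs rm CoEPH --yes-i-really-mean-it
    # A base pool with an attached cache tier may have to be detached first,
    # e.g. with "ceph osd tier remove-overlay" / "ceph osd tier remove".
    ceph osd pool delete CoEPH_metadata CoEPH_metadata --yes-i-really-really-mean-it
    ceph osd pool delete CoEPH_data_ec CoEPH_data_ec --yes-i-really-really-mean-it
    # Recreate empty pools (names and PG counts are placeholders) and a new filesystem.
    ceph osd pool create CoEPH_metadata 64
    ceph osd pool create CoEPH_data 64
    ceph fs new CoEPH CoEPH_metadata CoEPH_data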

We have #11124 to prevent users from doing this.

#3

Updated by Tomasz Torcz over 7 years ago

So, two things:

1) Can ceph-mds be made more resilient when it finds data from non-existent filesystems?

2) I cannot delete all the pools, because I've created and used RBD images in the erasure-coded pool, with the replicated pool as its cache (I cannot leave the storage unused while CephFS is unavailable).
Will deleting and recreating only the metadata pool help?
Is it possible to delete only the CephFS RADOS objects from the pool?

#4

Updated by Greg Farnum over 7 years ago

1) Possibly, but we're more likely to just lock out pools which have data in them.

2) You mean you intermingled RBD and CephFS data within a single pool? That's unfortunate. Deleting only the metadata won't be sufficient, as you'll have random file data lying around. If you can't get rid of it, you could always create a different pool for your new CephFS.

If you want to go to more effort, you can run some scripts that find and either keep or delete RADOS objects based on whether they're FS or RBD objects. FS objects will have names like 10000000000.00000000; I'm not sure how the RBD ones are named, but they do look different, so you should be able to tell. If you do that, you can probably reuse the pool, but I really don't recommend it...
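
A minimal sketch of that kind of cleanup, assuming the pool name from the report and matching only the CephFS data-object naming pattern mentioned above (hex inode, dot, hex offset). This is a dry-run-first approach; the cache tier may hold copies of the same objects, so verify the match list carefully before deleting anything.

    # List objects in the shared pool and keep only names that look like CephFS
    # data objects, e.g. 10000000000.00000000 (hex inode, dot, 8 hex digits).
    rados -p CoEPH_data_ec ls | grep -E '^[0-9a-f]+\.[0-9a-f]{8}$' > cephfs_objects.txt

    # Review cephfs_objects.txt by hand, then remove the matched objects:
    while read -r obj; do
        rados -p CoEPH_data_ec rm "$obj"
    done < cephfs_objects.txt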
