Bug #18816

closed

MDS crashes with log disabled

Added by Ahmed Akhuraidah about 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Note the "mds_log = false" below. If you do that, this happens:

The MDS daemon crashed while executing various payload tests, producing the following dump:

--- begin dump of recent events ---
0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal (Aborted) **
in thread 7f7e8ec5e700 thread_name:ms_dispatch

ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)
1: (()+0x5142a2) [0x557c51e092a2]
2: (()+0x10b00) [0x7f7e95df2b00]
3: (gsignal()+0x37) [0x7f7e93ccb8d7]
4: (abort()+0x13a) [0x7f7e93ccccaa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x557c51f133d5]
6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x557c51b2ccf9]
8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x557c51ca38f1]
10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x557c51ca424d]
11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc) [0x557c51ca449c]
12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x557c51b33d3c]
13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
18: (()+0x8734) [0x7f7e95dea734]
19: (clone()+0x6d) [0x7f7e93d80d3d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

How to reproduce the issue:

The first option is an FIO test. Run the FIO payload on the client machine with the job file below, then execute "echo 3 > /proc/sys/vm/drop_caches" and run the test again (see the sketch after the job file).

[test]
blocksize=64k
filename=/mnt/mycephfs/payload3G
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=8
runtime=300
filesize=3G
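
For clarity, a minimal sketch of the full first reproduction sequence, assuming the job file above is saved as payload.fio (that filename is just an assumption) and the commands are run as root on the client:

# First run against the CephFS mount
fio payload.fio
# Drop the page, dentry and inode caches on the client
echo 3 > /proc/sys/vm/drop_caches
# Second run; with mds_log = false the MDS aborts (see the dump above)
fio payload.fio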

The second option is an "inodes payload". Download the latest Joomla distribution, unzip it, and perform this:

time for i in `seq 100`; do cp -a joomla /mnt/mycephfs/joomla${i}; done

My sandbox is pretty simple and consists of a server (cephnode below) and a client (payload below) machine:

cephnode:~ # ceph -s
cluster c848af4a-98ea-498c-87d6-059ebf609287
health HEALTH_WARN
mds cephnode is laggy
monmap e1: 1 mons at {cephnode=192.168.10.20:6789/0}
election epoch 9, quorum 0 cephnode
fsmap e96: 1/1/1 up {0=cephnode=up:active(laggy or crashed)}
osdmap e96: 1 osds: 1 up, 1 in
flags sortbitwise,require_jewel_osds
pgmap v1832: 204 pgs, 3 pools, 3072 MB data, 787 objects
3117 MB used, 396 GB / 399 GB avail
204 active+clean

cephnode:~ # cat /etc/ceph/ceph.conf
[global]
fsid = c848af4a-98ea-498c-87d6-059ebf609287
mon_initial_members = cephnode
mon_host = 192.168.10.20
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
mds_log = false
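
Note the mds_log = false override above. A quick way to confirm the running MDS actually picked it up is to query the admin socket on the MDS host (a sketch; the daemon name mds.cephnode is taken from this setup):

# Show the effective value of mds_log for the running daemon
ceph daemon mds.cephnode config show | grep '"mds_log"'
# expected output, roughly: "mds_log": "false"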

cephnode:~ # lsb_release -a
LSB Version: n/a
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP2
Release: 12.2
Codename: n/a

cephnode:~ # uname -a
Linux cephnode 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux

cephnode:~ # ceph -v
ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)

payload:~ # ceph -v
ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)

payload:~ # mount
...
192.168.10.20:6789:/ on /mnt/mycephfs type ceph (rw,relatime,name=admin,secret=<hidden>,acl)

payload:~ # lsb_release -a
LSB Version: n/a
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP2
Release: 12.2
Codename: n/a

payload:~ # uname -a
Linux payload 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
"

Priority: High
Affected Versions: 10.2.4
ceph-qa-suite: 'fs' and 'kcephfs'
Release: jewel

Actions #1

Updated by Ahmed Akhuraidah about 7 years ago

test

Actions #2

Updated by Shinobu Kinjo about 7 years ago

Quoting an update from Ahmed Akhuraidah on the mailing list:

The issue can be reproduced with upstream Ceph packages.

ahmed@ubcephnode:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

ahmed@ubcephnode:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty

ahmed@ubcephnode:~$ uname -a
Linux ubcephnode 4.4.0-62-generic #83~14.04.1-Ubuntu SMP Wed Jan 18 18:10:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ahmed@ubpayload:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

ahmed@ubpayload:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty

ahmed@ubpayload:~$ uname -a
Linux ubpayload 4.4.0-62-generic #83~14.04.1-Ubuntu SMP Wed Jan 18 18:10:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ahmed@ubcephnode:~$ cat /etc/ceph/ceph.conf
[global]
fsid = 7c39c59a-4951-4798-9c42-59da474afd26
mon_initial_members = ubcephnode
mon_host = 192.168.10.120
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
mds_log = false

ahmed@ubpayload:~$ mount
..
192.168.10.120:6789:/ on /mnt/mycephfs type ceph (name=admin,key=client.admin)

ahmed@ubcephnode:~$ ceph -s
cluster 7c39c59a-4951-4798-9c42-59da474afd26
health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
monmap e1: 1 mons at {ubcephnode=192.168.10.120:6789/0}
election epoch 3, quorum 0 ubcephnode
fsmap e11: 0/1/1 up, 1 up:standby, 1 damaged
osdmap e12: 1 osds: 1 up, 1 in
flags sortbitwise,require_jewel_osds
pgmap v32: 204 pgs, 3 pools, 3072 MB data, 787 objects
3109 MB used, 48064 MB / 51173 MB avail
204 active+clean

--- begin dump of recent events ---
0> 2017-02-08 06:50:16.206926 7f306a642700 -1 *** Caught signal (Aborted) **
in thread 7f306a642700 thread_name:ms_dispatch

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (()+0x4f62b2) [0x556839b472b2]
2: (()+0x10330) [0x7f307084a330]
3: (gsignal()+0x37) [0x7f306ecd2c37]
4: (abort()+0x148) [0x7f306ecd6028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x556839c3d135]
6: (MutationImpl::~MutationImpl()+0x28e) [0x5568398f7b5e]
7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x55683986ac49]
8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x5568399e0947]
9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x5568399e1ae1]
10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x5568399e243d]
11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1dc) [0x5568399e269c]
12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x556839871dac]
13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x55683987aa01]
14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55683987bb55]
15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x556839863653]
16: (DispatchQueue::entry()+0x78b) [0x556839d3772b]
17: (DispatchQueue::DispatchThread::entry()+0xd) [0x556839c2280d]
18: (()+0x8184) [0x7f3070842184]
19: (clone()+0x6d) [0x7f306ed9637d]

Actions #3

Updated by Greg Farnum about 7 years ago

  • Subject changed from MDS crush: thread_name:ms_dispatch to MDS crashes with log disabled
  • Description updated (diff)

For some reason we still let people disable the MDS log. That's...bad. I think it only existed for some cheap benchmarking a decade ago and the config option should get thrown out.

Not sure why either of you were testing with this option, but I'm quite sure that's the problem and you shouldn't bother. :)
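
In the meantime, the practical workaround is simply not to override mds_log (the default is true). A sketch, using the daemon name from the report above and an assumed systemd unit name:

# 1. Remove or comment out the override in /etc/ceph/ceph.conf on the MDS host:
#      mds_log = false
# 2. Restart the MDS so it comes back with the default (mds_log = true):
systemctl restart ceph-mds@cephnode    # unit name assumed; adjust to your deployment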

Actions #4

Updated by Josh Durgin almost 7 years ago

  • Project changed from Ceph to CephFS

Actions #5

Updated by John Spray almost 7 years ago

  • Status changed from New to Fix Under Review

I'm proposing that we rip out this configuration option, it's a trap for the unwary:
https://github.com/ceph/ceph/pull/14652

Actions #6

Updated by John Spray almost 7 years ago

  • Status changed from Fix Under Review to Resolved