Bug #18816: MDS crashes with log disabled
Status: Closed
Description
Note the "mds_log = false" below. If you do that, this happens:
The MDS daemon crashed while running various payload tests, producing the following dump:
--- begin dump of recent events ---
0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal (Aborted) **
in thread 7f7e8ec5e700 thread_name:ms_dispatch
ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)
1: (()+0x5142a2) [0x557c51e092a2]
2: (()+0x10b00) [0x7f7e95df2b00]
3: (gsignal()+0x37) [0x7f7e93ccb8d7]
4: (abort()+0x13a) [0x7f7e93ccccaa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x557c51f133d5]
6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x557c51b2ccf9]
8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x557c51ca38f1]
10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x557c51ca424d]
11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc) [0x557c51ca449c]
12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x557c51b33d3c]
13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
18: (()+0x8734) [0x7f7e95dea734]
19: (clone()+0x6d) [0x7f7e93d80d3d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
How to reproduce the issue:
The first option is an FIO test. Run the FIO payload on the client machine with the config below, then execute "echo 3 > /proc/sys/vm/drop_caches" and run the test again.
[test]
blocksize=64k
filename=/mnt/mycephfs/payload3G
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=8
runtime=300
filesize=3G
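The first reproduction option can be sketched as a short command sequence. The job-file name payload.fio is an assumption (the report does not name the file), and /mnt/mycephfs matches the client mount shown further below:

```shell
# Sketch of reproduction option 1, run as root on the client machine.
# Assumptions: the fio job above is saved as payload.fio (name not given
# in the report); CephFS is mounted at /mnt/mycephfs.
fio payload.fio                      # first run creates /mnt/mycephfs/payload3G
echo 3 > /proc/sys/vm/drop_caches    # drop caches so the client releases its caps
fio payload.fio                      # second run; the MDS aborts when mds_log = false
```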
The second option is an "inodes payload". Download the latest Joomla distribution, unzip it, and perform this:
time for i in `seq 100`; do cp -a joomla /mnt/mycephfs/joomla${i}; done
My sandbox is pretty simple and consists of a server (cephnode below) and a client (payload below) machine:
cephnode:~ # ceph -s
cluster c848af4a-98ea-498c-87d6-059ebf609287
health HEALTH_WARN
mds cephnode is laggy
monmap e1: 1 mons at {cephnode=192.168.10.20:6789/0}
election epoch 9, quorum 0 cephnode
fsmap e96: 1/1/1 up {0=cephnode=up:active(laggy or crashed)}
osdmap e96: 1 osds: 1 up, 1 in
flags sortbitwise,require_jewel_osds
pgmap v1832: 204 pgs, 3 pools, 3072 MB data, 787 objects
3117 MB used, 396 GB / 399 GB avail
204 active+clean
cephnode:~ # cat /etc/ceph/ceph.conf
[global]
fsid = c848af4a-98ea-498c-87d6-059ebf609287
mon_initial_members = cephnode
mon_host = 192.168.10.20
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
mds_log = false
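As a sanity check (my own suggestion, not a step the reporter took), the running MDS's view of the option can be queried through its admin socket on the server node:

```shell
# Hypothetical verification step: ask the live MDS daemon for its current
# value of mds_log via the admin socket (run on cephnode, where the MDS runs).
ceph daemon mds.cephnode config get mds_log
```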
cephnode:~ # lsb_release -a
LSB Version: n/a
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP2
Release: 12.2
Codename: n/a
cephnode:~ # uname -a
Linux cephnode 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
cephnode:~ # ceph -v
ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)
payload:~ # ceph -v
ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318ed0dcedc8734)
payload:~ # mount
...
192.168.10.20:6789:/ on /mnt/mycephfs type ceph (rw,relatime,name=admin,secret=<hidden>,acl)
payload:~ # lsb_release -a
LSB Version: n/a
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP2
Release: 12.2
Codename: n/a
payload:~ # uname -a
Linux payload 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
Priority: High
Affected Versions: 10.2.4
ceph-qa-suite: 'fs' and 'kcephfs'
Release: jewel
Updated by Shinobu Kinjo about 7 years ago
Quoting a report from Ahmed Akhuraidah on the mailing list:
The issue can be reproduced with upstream Ceph packages.
ahmed@ubcephnode:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
ahmed@ubcephnode:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
ahmed@ubcephnode:~$ uname -a
Linux ubcephnode 4.4.0-62-generic #83~14.04.1-Ubuntu SMP Wed Jan 18 18:10:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
ahmed@ubpayload:~$ ceph -v
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
ahmed@ubpayload:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
ahmed@ubpayload:~$ uname -a
Linux ubpayload 4.4.0-62-generic #83~14.04.1-Ubuntu SMP Wed Jan 18 18:10:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
ahmed@ubcephnode:~$ cat /etc/ceph/ceph.conf
[global]
fsid = 7c39c59a-4951-4798-9c42-59da474afd26
mon_initial_members = ubcephnode
mon_host = 192.168.10.120
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
mds_log = false
ahmed@ubpayload:~$ mount
..
192.168.10.120:6789:/ on /mnt/mycephfs type ceph (name=admin,key=client.admin)
ahmed@ubcephnode:~$ ceph -s
cluster 7c39c59a-4951-4798-9c42-59da474afd26
health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
monmap e1: 1 mons at {ubcephnode=192.168.10.120:6789/0}
election epoch 3, quorum 0 ubcephnode
fsmap e11: 0/1/1 up, 1 up:standby, 1 damaged
osdmap e12: 1 osds: 1 up, 1 in
flags sortbitwise,require_jewel_osds
pgmap v32: 204 pgs, 3 pools, 3072 MB data, 787 objects
3109 MB used, 48064 MB / 51173 MB avail
204 active+clean
--- begin dump of recent events ---
0> 2017-02-08 06:50:16.206926 7f306a642700 -1 *** Caught signal (Aborted) **
in thread 7f306a642700 thread_name:ms_dispatch
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (()+0x4f62b2) [0x556839b472b2]
2: (()+0x10330) [0x7f307084a330]
3: (gsignal()+0x37) [0x7f306ecd2c37]
4: (abort()+0x148) [0x7f306ecd6028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x556839c3d135]
6: (MutationImpl::~MutationImpl()+0x28e) [0x5568398f7b5e]
7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x55683986ac49]
8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool, unsigned long, utime_t)+0x9a7) [0x5568399e0947]
9: (Locker::remove_client_cap(CInode*, client_t)+0xb1) [0x5568399e1ae1]
10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long, unsigned int, unsigned int)+0x90d) [0x5568399e243d]
11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1dc) [0x5568399e269c]
12: (MDSRank::handle_deferrable_message(Message*)+0xc1c) [0x556839871dac]
13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x55683987aa01]
14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55683987bb55]
15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x556839863653]
16: (DispatchQueue::entry()+0x78b) [0x556839d3772b]
17: (DispatchQueue::DispatchThread::entry()+0xd) [0x556839c2280d]
18: (()+0x8184) [0x7f3070842184]
19: (clone()+0x6d) [0x7f306ed9637d]
Updated by Greg Farnum about 7 years ago
- Subject changed from MDS crush: thread_name:ms_dispatch to MDS crashes with log disabled
- Description updated (diff)
For some reason we still let people disable the MDS log. That's...bad. I think it only existed for some cheap benchmarking a decade ago and the config option should get thrown out.
Not sure why either of you were testing with this option, but I'm quite sure that's the problem and you shouldn't bother. :)
Updated by John Spray about 7 years ago
- Status changed from New to Fix Under Review
I'm proposing that we rip out this configuration option; it's a trap for the unwary:
https://github.com/ceph/ceph/pull/14652
Updated by John Spray about 7 years ago
- Status changed from Fix Under Review to Resolved