Bug #23972
Ceph MDS crash from client mounting aufs over cephfs
Description
Here is a rough outline of my topology
https://pastebin.com/HQqbMxyj
---
I can reliably crash all of my CephFS MDS daemons (in my case 2) from a client by trying to mount CephFS under AUFS. I am not sure what the client is doing to cause this, but the MDS will refuse to start until I 1) reboot the client to stop any further requests, and 2) mark the currently active MDS as failed.
`ceph -s` will report that the monitors are up, but the ceph-mds processes will be dead on both MDS servers:
Ceph health prior to trying to mount cephfs as an aufs branch
----------------------------------------------
ceph -s
cluster:
id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
health: HEALTH_OK
services:
mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
mgr: kh08-8(active)
mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
osd: 570 osds: 570 up, 570 in
Client tries to mount aufs (no output here; it just hangs):
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
Monitors now report health_warn state
----------------------------------------------
root@kh08-8:~# ceph -s
cluster:
id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
health: HEALTH_WARN
insufficient standby MDS daemons available
services:
mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
mgr: kh08-8(active)
mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
At this point all mounts hang until I stop the client, mark the mds servers as failed, and restart the mds servers.
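The recovery sequence described above amounts to something like the following (a sketch, assuming systemd-managed daemons and MDS rank 0; the `ceph-mds@<hostname>` unit names are from my hosts and may differ on other clusters):

```shell
# Stop (or reboot) the client first so no further requests reach the MDS.

# From a monitor node, mark the stuck active MDS rank as failed:
ceph mds fail 0

# Restart the MDS daemons on both servers:
systemctl restart ceph-mds@kh09-8   # run on kh09-8
systemctl restart ceph-mds@kh10-8   # run on kh10-8

# Watch until one MDS comes back up:active
ceph -s
```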
I tried installing the following packages: ceph-mds-dbg, ceph-mgr-dbg, ceph-mon-dbg, ceph-osd-dbg, ceph-test-dbg.
kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
The log files are pretty large (one 4.1GB and the other 200MB):
kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
I am trying to mount aufs over the cephfs directory /aufstest, so here are the last few lines from kh10-8 (the secondary MDS server at the time) around the aufs mention.
Updated by John Spray almost 6 years ago
- Project changed from Ceph to CephFS
Any chance you can reproduce this with debuginfo packages installed, so that we can get meaningful backtraces?
Updated by Patrick Donnelly almost 6 years ago
- Target version changed from v12.2.5 to v14.0.0
- Source set to Community (user)
- Tags deleted (mds, cephfs, crash)
- Affected Versions deleted (v12.2.4, v12.2.5)
- ceph-qa-suite deleted (fs)
- Component(FS) MDS added
- Labels (FS) crash added
Updated by Sean Sullivan almost 6 years ago
John Spray wrote:
Any chance you can reproduce this with debuginfo packages installed, so that we can get meaningful backtraces?
Hopefully this helps. I'm a dummy and not exactly sure how to do this well. I also missed this reply, sorry again. I have all of the ceph-*-dbg packages installed, but this time I attached gdb to ceph-mds prior to the crash:
https://pastebin.com/kw4bZVZT -- kh09-9
https://pastebin.com/sYZQx0ER -- kh10-9
-----------------------------------------------
List of dbg packages installed on one of the mds servers (same installed on both):
root@kh09-8:~# dpkg -l | grep -i dbg
ii ceph-base-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-base
ii ceph-common-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-common
ii ceph-fuse-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-fuse
ii ceph-mds-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-mds
ii ceph-mgr-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-mgr
ii ceph-mon-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-mon
ii ceph-osd-dbg 12.2.5-1xenial amd64 debugging symbols for ceph-osd
ii libc6-dbg:amd64 2.23-0ubuntu10 amd64 GNU C Library: detached debugging symbols
ii libcephfs2-dbg 12.2.5-1xenial amd64 debugging symbols for libcephfs2
ii librados2-dbg 12.2.5-1xenial amd64 debugging symbols for librados
ii librbd1-dbg 12.2.5-1xenial amd64 debugging symbols for librbd1
ii librgw2-dbg 12.2.5-1xenial amd64 debugging symbols for librbd1
ii radosgw-dbg 12.2.5-1xenial amd64 debugging symbols for radosgw
ii rbd-fuse-dbg 12.2.5-1xenial amd64 debugging symbols for rbd-fuse
ii rbd-mirror-dbg 12.2.5-1xenial amd64 debugging symbols for rbd-mirror
ii rbd-nbd-dbg 12.2.5-1xenial amd64 debugging symbols for rbd-nbd
so I'm not sure why the symbols are not loaded in the original traces. I hope these new traces help.
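For anyone else capturing this, attaching gdb before the crash looks roughly like this (a sketch assuming a single ceph-mds process per host; the gdb session commands are shown as comments):

```shell
# On each MDS server, attach gdb to the running ceph-mds process
# before reproducing the crash from the client:
gdb -p "$(pidof ceph-mds)"

# Inside gdb:
#   (gdb) continue                 # let the daemon keep running
#   ... trigger the aufs mount on the client ...
#   (gdb) thread apply all bt      # after the crash, dump all thread backtraces
```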
Updated by Zheng Yan almost 6 years ago
The crash is at `mdr->tracedn = mdr->dn[0].back()`, because `mdr->dn[0]` is empty (calling `back()` on an empty vector is undefined behavior). The request that triggered the crash is something like `lookup #0x1//`:

    dout(10) << "reply to stat on " << *req << dendl;
    mdr->tracei = ref;
    if (is_lookup)
      mdr->tracedn = mdr->dn[0].back();
    respond_to_request(mdr, 0);
The following patch prevents the kclient from sending the malformed lookup request. But the real bug is in aufs: it should never revalidate the root dentry.
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index f1d9c6cc0491..3c2b1b553654 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1197,6 +1197,9 @@ static int ceph_d_revalidate(struct dentry *dentry, unsigned int flags)
 	struct dentry *parent;
 	struct inode *dir;
 
+	if (IS_ROOT(dentry))
+		return 1;
+
 	if (flags & LOOKUP_RCU) {
 		parent = READ_ONCE(dentry->d_parent);
 		dir = d_inode_rcu(parent);
Updated by Patrick Donnelly about 5 years ago
- Target version changed from v14.0.0 to v15.0.0