Support #17171: Ceph-fuse client hangs on unmount
Status: Closed
Description
We use autofs/automount to mount and unmount ceph-fuse mounts, and from time to time the ceph-fuse client hangs on umount: it never exits and keeps consuming memory. gdb revealed this trace:
```
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007f323b432b32 in WaitUntil (when=..., mutex=..., this=0x7f32461ab410) at ./common/Cond.h:72
#2 Client::unmount (this=0x7f32461aa5a0) at client/Client.cc:5614
#3 0x00007f323b3bbba1 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at ceph_fuse.cc:266
```
Ceph-fuse client version: 10.2.1, platform Debian Jessie, 64bit
Updated by Greg Farnum over 7 years ago
- Tracker changed from Bug to Support
- Project changed from Ceph to CephFS
When are you doing this unmount? If it's on shutdown, and it happens to be unmounted after networking gets shut down, ceph-fuse is going to get stuck.
Otherwise, can you provide more details about exactly what's happening, and upload a log using ceph-post-file?
Updated by Arturas Moskvinas over 7 years ago
This happens when the automount/autofs process decides to unmount the filesystem after no process has used it for a couple of minutes. Networking is always up at that moment. I'll attach a log a bit later.
Updated by Arturas Moskvinas over 7 years ago
Hmm, the logs are pretty useless at the moment; they only contain entries like these:
```
2016-09-01 07:15:01.319054 7fe081227e80 -1 init, newargv = 0x7fe08b128870 newargc=13
2016-09-01 08:15:01.502778 7f60f5c31e80 0 ceph version 10.2.1-2 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269), process ceph-fuse, pid 22722
2016-09-01 08:15:01.508655 7f60f5c31e80 -1 asok(0x7f60ff00d9a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.cephfs.asok': (17) File exists
2016-09-01 08:15:01.509217 7f60f5c31e80 -1 init, newargv = 0x7f60ff067870 newargc=13
```
I'll enable more logging and wait until similar situation happens.
Updated by Arturas Moskvinas over 7 years ago
Actually, after several checks it seems that Ceph itself is not responding from time to time, due to very high load/disk failures, and ceph-fuse can't unmount the FS while it is stuck in that state. Not sure whether this is a feature or a bug in ceph-fuse. Please triage if you want to dive deeper, or close it as WONTFIX...
Updated by John Spray over 7 years ago
To be clear, you're saying that while the server cluster is unresponsive, ceph-fuse hangs on unmount? That is expected behaviour. However, if the cluster becomes responsive again, the unmount should eventually complete; does that match what you're seeing?
Updated by Arturas Moskvinas over 7 years ago
Actually, even when it becomes responsive the unmount still hangs, and only `kill -9` helps.
Updated by John Spray over 7 years ago
Hmm, so if that's reproducible, then it sounds like we could reproduce it by killing an MDS, invoking umount, seeing it block, waiting a bit, and then starting the MDS up again.
Updated by Arturas Moskvinas over 7 years ago
We can probably close this issue; we'll reopen it or create a new one when we're able to reliably reproduce the issue.