Support #17171

closed

Ceph-fuse client hangs on unmount

Added by Arturas Moskvinas over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

We use autofs/automount to mount and unmount ceph-fuse mounts. From time to time the ceph-fuse client hangs on umount and never exits, while continuing to hold memory and other resources. gdb shows the following trace:
```
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007f323b432b32 in WaitUntil (when=..., mutex=..., this=0x7f32461ab410) at ./common/Cond.h:72
#2 Client::unmount (this=0x7f32461aa5a0) at client/Client.cc:5614
#3 0x00007f323b3bbba1 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ceph_fuse.cc:266
```

Ceph-fuse client version: 10.2.1, platform Debian Jessie, 64bit
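
For readers unfamiliar with this kind of trace: frames #1 and #2 show `Client::unmount` parked in a timed condition-variable wait. The trace shows only a single wait, so the sketch below assumes it sits inside a recheck loop. It is not Ceph's code; `wait_until_done`, `is_done` and `give_up_after` are hypothetical names. It illustrates how such a loop never returns when the condition it waits for never becomes true and no overall deadline is set.

```python
import threading
import time


def wait_until_done(cond, is_done, poll_interval, give_up_after=None):
    """Block until is_done() is true, rechecking every poll_interval seconds.

    Returns True once the predicate holds. Returns False only if an
    overall deadline (give_up_after, in seconds) was given and elapsed.
    With give_up_after=None the loop can wait forever.
    """
    start = time.monotonic()
    with cond:
        while not is_done():
            if give_up_after is not None and time.monotonic() - start >= give_up_after:
                return False
            # Timed wait: wakes on notify or after poll_interval,
            # then the loop rechecks the predicate.
            cond.wait(timeout=poll_interval)
    return True
```

Without `give_up_after`, a predicate that depends on an unresponsive peer keeps the caller in this loop indefinitely, which would look like the trace above from the outside.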

Actions #1

Updated by Greg Farnum over 7 years ago

  • Tracker changed from Bug to Support
  • Project changed from Ceph to CephFS

When are you doing this unmount? If it's on shutdown, and it happens to be unmounted after networking gets shut down, ceph-fuse is going to get stuck.

Otherwise, can you provide more details about exactly what's happening, and upload a log using ceph-post-file?

Actions #2

Updated by Arturas Moskvinas over 7 years ago

This happens when automount/autofs decides to unmount the filesystem after no process has used it for a couple of minutes. Networking is always up at that point. I'll attach a log a bit later.

Actions #3

Updated by Arturas Moskvinas over 7 years ago

Hmm, the logs are pretty useless at the moment; they only contain entries like these:

2016-09-01 07:15:01.319054 7fe081227e80 -1 init, newargv = 0x7fe08b128870 newargc=13
2016-09-01 08:15:01.502778 7f60f5c31e80  0 ceph version 10.2.1-2 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269), process ceph-fuse, pid 22722
2016-09-01 08:15:01.508655 7f60f5c31e80 -1 asok(0x7f60ff00d9a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.cephfs.asok': (17) File exists
2016-09-01 08:15:01.509217 7f60f5c31e80 -1 init, newargv = 0x7f60ff067870 newargc=13

I'll enable more logging and wait until a similar situation happens.
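
The `(17) File exists` entry in the log above says the admin socket path was already present when the new ceph-fuse started. That could mean an earlier ceph-fuse with the same client name still holds the socket (for example one hung in unmount), or that a dead process left a stale file behind; the log alone does not say which. A minimal sketch for telling the two apart, assuming a plain connect attempt is enough; `unix_socket_state` is a hypothetical helper, not part of Ceph:

```python
import os
import socket


def unix_socket_state(path, timeout=1.0):
    """Classify a UNIX domain socket path as 'missing', 'live', 'stale' or 'unknown'.

    'live' only means some process still holds the listening socket;
    it does not prove that the process is responsive.
    """
    if not os.path.exists(path):
        return "missing"
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
    except (ConnectionRefusedError, FileNotFoundError):
        # The file exists (or just vanished) but nothing is listening on it.
        return "stale"
    except OSError:
        return "unknown"
    finally:
        s.close()
    return "live"
```

For the path in the log this would be called as `unix_socket_state("/var/run/ceph/ceph-client.cephfs.asok")`.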

Actions #4

Updated by Arturas Moskvinas over 7 years ago

Actually, after several checks it seems that the Ceph cluster itself stops responding from time to time due to very high load or disk failures, and ceph-fuse can't unmount the FS while the cluster is stuck in that state. Not sure whether this is a feature or a bug in ceph-fuse. Please triage if you want to dive deeper, or close it as WONTFIX.

Actions #5

Updated by John Spray over 7 years ago

To be clear, you're saying that while the server cluster is unresponsive, ceph-fuse hangs on unmount? That is expected behaviour. However, if the cluster becomes responsive again, the unmount should eventually complete. Does that match what you're seeing?

Actions #6

Updated by Arturas Moskvinas over 7 years ago

Actually, when it becomes responsive again the unmount still hangs, and only `kill -9` helps.
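
Because autofs triggers the unmount unattended, the `kill -9` workaround could be automated by giving the ceph-fuse process a grace period to exit and sending SIGKILL only afterwards. A minimal sketch, assuming the PID of the ceph-fuse process is already known and the caller is allowed to signal it; `pid_alive` and `wait_then_kill` are hypothetical helpers:

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def wait_then_kill(pid, grace_seconds, poll=0.1):
    """Wait up to grace_seconds for pid to exit, then send SIGKILL.

    Returns 'exited' if the process went away on its own, 'killed' otherwise.
    """
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "exited"
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "exited"  # it exited between the last check and the kill
    return "killed"
```

Note that SIGKILL discards whatever state the client had not flushed, so this only makes sense as a last resort after a generous grace period.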

Actions #7

Updated by John Spray over 7 years ago

Hmm, so if that's reproducible then it sounds like we could reproduce it by killing an MDS, invoking umount, seeing it block, waiting a bit and then starting the MDS up again.
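
The reproduction steps above could be scripted roughly as follows. This is only a sketch under assumptions: it presumes the MDS is managed by a systemd unit (the unit name passed in is a placeholder for your deployment), that `umount <mountpoint>` is the command that blocks, and that the script runs with sufficient privileges. `repro_plan` and `execute` are hypothetical helpers.

```python
import subprocess
import time


def repro_plan(mds_unit, mountpoint, blocked_for):
    """Return the steps as data: stop the MDS, start umount, wait, restart the MDS."""
    return [
        ("run", ["systemctl", "stop", mds_unit]),
        ("spawn", ["umount", mountpoint]),   # expected to block while the MDS is down
        ("sleep", blocked_for),
        ("run", ["systemctl", "start", mds_unit]),
        ("join", None),                      # does umount finish once the MDS is back?
    ]


def execute(plan, join_timeout=300):
    """Run the plan. Returns True if the spawned command finished in time, else False."""
    background = None
    finished = None
    for action, arg in plan:
        if action == "run":
            subprocess.run(arg, check=True)
        elif action == "spawn":
            background = subprocess.Popen(arg)
        elif action == "sleep":
            time.sleep(arg)
        elif action == "join":
            try:
                background.wait(timeout=join_timeout)
                finished = True
            except subprocess.TimeoutExpired:
                # Leave the process running so it can be inspected with gdb.
                finished = False
    return finished
```

A `False` result from `execute` would correspond to the behaviour reported in this ticket: the unmount still hanging after the cluster is responsive again.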

Actions #8

Updated by Arturas Moskvinas over 7 years ago

We can probably close this issue. We'll reopen it or create a new one once we're able to reproduce the problem reliably.

Actions #9

Updated by John Spray over 7 years ago

  • Status changed from New to Closed