Bug #56725
open file hang using vim with ceph-fuse client
Description
The Ceph version is 16.2.9.
Opening a file with vim hangs on a ceph-fuse client. By default, vim creates temporary swp and swpx files when it opens a file and closes them later. ceph-fuse successfully closed the swpx file but failed on the swp file.
On a Pacific cluster, ceph-fuse produced the following debug output:
unique: 152, opcode: CREATE (35), nodeid: 1, insize: 68, pid: 3868383
unique: 152, success, outsize: 160
unique: 154, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 154, error: -2 (No such file or directory), outsize: 16
unique: 156, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 156, error: -2 (No such file or directory), outsize: 16
unique: 158, opcode: CREATE (35), nodeid: 1, insize: 69, pid: 3868383
unique: 158, success, outsize: 160
unique: 160, opcode: GETATTR (3), nodeid: 1099511627776, insize: 56, pid: 3868383
unique: 160, success, outsize: 120
unique: 162, opcode: GETATTR (3), nodeid: 1099511627777, insize: 56, pid: 3868383
unique: 162, success, outsize: 120
unique: 164, opcode: FLUSH (25), nodeid: 1099511627777, insize: 64, pid: 3868383
unique: 164, success, outsize: 16
unique: 166, opcode: RELEASE (18), nodeid: 1099511627777, insize: 64, pid: 0
unique: 166, success, outsize: 16
unique: 168, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 168, success, outsize: 144
unique: 170, opcode: UNLINK (10), nodeid: 1, insize: 53, pid: 3868383
unique: 170, success, outsize: 16
unique: 172, opcode: FLUSH (25), nodeid: 1099511627776, insize: 64, pid: 3868383
unique: 172, success, outsize: 16
unique: 174, opcode: FORGET (2), nodeid: 1099511627777, insize: 48, pid: 0
On a Nautilus cluster, ceph-fuse produced the following debug output:
unique: 83, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 83, error: -2 (No such file or directory), outsize: 16
unique: 84, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 84, error: -2 (No such file or directory), outsize: 16
unique: 85, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 85, error: -2 (No such file or directory), outsize: 16
unique: 86, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 86, error: -2 (No such file or directory), outsize: 16
unique: 87, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 87, error: -2 (No such file or directory), outsize: 16
unique: 88, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 89, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 629807
unique: 89, success, outsize: 120
unique: 90, opcode: STATFS (17), nodeid: 1, insize: 40, pid: 629807
unique: 88, error: -2 (No such file or directory), outsize: 16
unique: 91, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 90, success, outsize: 96
unique: 92, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 629807
unique: 92, success, outsize: 120
unique: 91, error: -2 (No such file or directory), outsize: 16
unique: 93, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 93, error: -2 (No such file or directory), outsize: 16
unique: 94, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 94, error: -2 (No such file or directory), outsize: 16
unique: 95, opcode: CREATE (35), nodeid: 1, insize: 73, pid: 629803
unique: 95, success, outsize: 160
unique: 96, opcode: LOOKUP (1), nodeid: 1, insize: 58, pid: 629803
unique: 96, error: -2 (No such file or directory), outsize: 16
unique: 97, opcode: LOOKUP (1), nodeid: 1, insize: 58, pid: 629803
unique: 97, error: -2 (No such file or directory), outsize: 16
unique: 98, opcode: CREATE (35), nodeid: 1, insize: 74, pid: 629803
unique: 98, success, outsize: 160
unique: 99, opcode: GETATTR (3), nodeid: 1099511628777, insize: 56, pid: 629803
unique: 99, success, outsize: 120
unique: 100, opcode: GETATTR (3), nodeid: 1099511628778, insize: 56, pid: 629803
unique: 100, success, outsize: 120
unique: 101, opcode: FLUSH (25), nodeid: 1099511628778, insize: 64, pid: 629803
unique: 101, success, outsize: 16
unique: 102, opcode: RELEASE (18), nodeid: 1099511628778, insize: 64, pid: 0
unique: 103, opcode: LOOKUP (1), nodeid: 1, insize: 58, pid: 629803
unique: 102, success, outsize: 16
unique: 103, success, outsize: 144
unique: 104, opcode: UNLINK (10), nodeid: 1, insize: 58, pid: 629803
unique: 104, success, outsize: 16
unique: 105, opcode: FLUSH (25), nodeid: 1099511628777, insize: 64, pid: 629803
unique: 106, opcode: FORGET (2), nodeid: 1099511628778, insize: 48, pid: 0
unique: 105, success, outsize: 16
unique: 107, opcode: RELEASE (18), nodeid: 1099511628777, insize: 64, pid: 0
unique: 108, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 107, success, outsize: 16
unique: 108, success, outsize: 144
unique: 109, opcode: UNLINK (10), nodeid: 1, insize: 57, pid: 629803
unique: 109, success, outsize: 16
unique: 110, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 111, opcode: FORGET (2), nodeid: 1099511628777, insize: 48, pid: 0
It seems that on Pacific no RELEASE request reaches ceph-fuse after the last FLUSH (nodeid 1099511627776), and this causes the hang.
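The comparison of the two traces can also be done mechanically. Below is a minimal Python sketch, not part of ceph or libfuse: the function name summarize and the regular expression are assumptions derived only from the debug lines quoted above. It lists requests that never got a reply and nodeids that were FLUSHed but never RELEASEd. FORGET is skipped when looking for missing replies, because the FUSE protocol sends no reply to it.

```python
import re

# One match per request or reply; finditer() also copes with logs where
# several entries ended up on the same line.
TOKEN = re.compile(
    r"unique: (?P<unique>\d+), "
    r"(?:opcode: (?P<opcode>\w+) \(\d+\), nodeid: (?P<nodeid>\d+)"
    r"|(?P<status>success|error))"
)


def summarize(log_text):
    """Return (pending, unreleased) for a ceph-fuse debug log.

    pending    -- {unique: opcode} for requests without a reply
    unreleased -- nodeids that saw a FLUSH but no RELEASE, in log order
    """
    pending = {}
    flushed = []
    released = set()
    for m in TOKEN.finditer(log_text):
        unique = int(m["unique"])
        if m["status"]:
            # Reply line: the matching request is no longer pending.
            pending.pop(unique, None)
            continue
        opcode, nodeid = m["opcode"], int(m["nodeid"])
        if opcode == "FORGET":
            continue  # FORGET never gets a reply
        pending[unique] = opcode
        if opcode == "FLUSH" and nodeid not in flushed:
            flushed.append(nodeid)
        elif opcode == "RELEASE":
            released.add(nodeid)
    unreleased = [n for n in flushed if n not in released]
    return pending, unreleased
```

Fed the Pacific trace above, this should report nodeid 1099511627776 as flushed but not released, while the Nautilus trace shows a RELEASE for both nodeids.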
History
#1 Updated by Xiubo Li over 1 year ago
BTW, can you reproduce this easily?
#2 Updated by Bin Zhao over 1 year ago
This issue can be reproduced reliably.
With noswapfile set in vim, the file can be opened and edited successfully, but ceph-fuse hangs after the file is removed.
#3 Updated by Zhi Zhang over 1 year ago
We can reproduce this issue on Pacific version too.
kernel 4.14.105
libfuse version 3.6.1.
#4 Updated by Zhi Zhang over 1 year ago
With fuse3 on the N (Nautilus) version, the client calls the function below.
#define fuse_session_loop_mt(se, clone_fd) fuse_session_loop_mt_31(se, clone_fd)
fuse_session_loop_mt_31 sets max_idle_threads to 10 by default.
With fuse3 on the P (Pacific) version, however, the client calls the function below.
#define fuse_session_loop_mt(se, config) fuse_session_loop_mt_32(se, config)
fuse_session_loop_mt_32 takes max_idle_threads from fuse_loop_config, which ceph initializes to 0. After a FUSE FORGET op, no thread is left to pick up the next FUSE op, so the whole client hangs forever.
The fix is below.
diff --git a/src/client/fuse_ll.cc b/src/client/fuse_ll.cc
index b42a7cc970..ef2df10ee5 100644
--- a/src/client/fuse_ll.cc
+++ b/src/client/fuse_ll.cc
@@ -1633,6 +1633,7 @@ int CephFuse::Handle::loop()
struct fuse_loop_config conf = { 0 };
conf.clone_fd = opts.clone_fd;
+ conf.max_idle_threads = client->cct->_conf.get_val<int64_t>("fuse_max_idle_threads");
return fuse_session_loop_mt(se, &conf);
}
#else
diff --git a/src/common/options.cc b/src/common/options.cc
index 341d7fc0bc..8bb50c84bc 100755
--- a/src/common/options.cc
+++ b/src/common/options.cc
@@ -9335,6 +9335,10 @@ std::vector<Option> get_mds_client_options() {
.set_default(false)
.set_description(""),
+ Option("fuse_max_idle_threads", Option::TYPE_INT, Option::LEVEL_ADVANCED)
+ .set_default(10)
+ .set_description(""),
+
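To illustrate why max_idle_threads = 0 leaves no worker after a FORGET, here is a simplified Python model of the worker accounting. It is a sketch under stated assumptions, not libfuse code: the class LoopModel and its method names are hypothetical, requests are handled one at a time, and the bookkeeping (FORGET does not mark a worker busy, a new worker is started only when none is available, and a worker exits when the number of available workers exceeds max_idle) is my reading of the libfuse multithreaded loop.

```python
class LoopModel:
    """Toy model of worker-thread accounting in a multithreaded FUSE loop."""

    def __init__(self, max_idle):
        self.max_idle = max_idle
        self.numavail = 0    # workers waiting for a request
        self.numworker = 0   # workers alive
        self._start_worker()

    def _start_worker(self):
        self.numworker += 1
        self.numavail += 1

    def handle(self, opcode):
        """Process one request; return False if no worker is left to read it."""
        if self.numworker == 0:
            return False
        is_forget = opcode == "FORGET"
        # Assumption: FORGET is not counted as making the worker busy,
        # so it never triggers the start of a replacement worker.
        if not is_forget:
            self.numavail -= 1
        if self.numavail == 0:
            self._start_worker()
        # ... the request itself would be processed here ...
        if not is_forget:
            self.numavail += 1
        # Trim: the current worker exits if too many workers are idle.
        if self.numavail > self.max_idle:
            self.numavail -= 1
            self.numworker -= 1
        return True


# Hypothetical replay of the tail of the Pacific trace.
model = LoopModel(max_idle=0)
for op in ["UNLINK", "FLUSH", "FORGET", "RELEASE"]:
    handled = model.handle(op)
    print(op, "handled" if handled else "stuck", "workers:", model.numworker)
```

In this model, a regular request with max_idle=0 starts a replacement worker before the current one exits, so one worker always survives; a FORGET skips that step and the only worker exits. With max_idle=10 no worker is trimmed in this sequence.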
#5 Updated by Mathew Clarke about 1 year ago
I'm also experiencing this issue. I'm running "ceph-fuse version 17.2.0" with a "ceph version 17.2.3" cluster. Do you know when this fix is likely to be merged?
#6 Updated by George Fedorov 9 months ago
For the most recent ceph-fuse client available to date for Ubuntu Jammy,
# ceph-fuse --version
ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
the issue is still present. By the looks of it, it is fixed in the repository, so now it is the Ubuntu package maintainer's turn.
Zhi Zhang's comment nails it: any FORGET operation will essentially make ceph-fuse freeze.
See the libfuse main loop code ( https://github.com/libfuse/libfuse/blob/b45d66cafb3b8de191d44b3a5705637319f7a552/lib/fuse_loop_mt.c#L127 ) and older versions of ceph-fuse ( https://github.com/ceph/ceph/blob/41ad4cd0bd429a0054881c54d106a1090c55870d/src/client/fuse_ll.cc#L1268 ).
The code below reproduces the problem, hanging at the first open() attempt of the second iteration:
#!/usr/bin/python3
import os,sys,time
from contextlib import contextmanager

FILENAME = '/mnt/cephtest/crashtest'
OPEN_FLAGS_RO = os.O_RDONLY
OPEN_FLAGS_RW = os.O_RDWR | os.O_CREAT
OPEN_MODE = 0o600

def perr(*args):
    print(*args, file=sys.stderr)

def try_os_call(fn, *args, **kwargs):
    # Log the call with its arguments, run it, and return its result
    # (or the negated errno if it raises OSError).
    try:
        arg1 = f'{args[0]!r}' if isinstance(args[0], str) else str(args[0])
        arg2 = (', ' + f'0x{args[1]:08X}') if args[1:] else ''
        arg3 = (', ' + f'0o{args[2]:04o}') if args[2:] else ''
        perr( f" --> {fn.__name__}({arg1}{arg2}{arg3})" )
        ret = fn(*args, **kwargs)
    except OSError as e:
        print(f"OS Error: {e.args}", file=sys.stderr)
        ret = -e.errno
    perr( f" <= {ret}" )
    return ret

def pause():
    discardme = input('press <Enter>')

for n in range(2):
    # The read-only open fails with ENOENT while the file does not exist.
    fd_ro = try_os_call(os.open, FILENAME, OPEN_FLAGS_RO)
    # Create the file, close it and unlink it again.
    fd_rw = try_os_call(os.open, FILENAME, OPEN_FLAGS_RW, OPEN_MODE)
    if (fd_rw >= 0):
        try_os_call( os.close, fd_rw )
        try_os_call(os.unlink, FILENAME)
    pause()
ceph-fuse debug output will confirm that it freezes after the first FORGET:
unique: 8, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 3430
unique: 8, error: -2 (No such file or directory), outsize: 16
unique: 10, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 3430
unique: 10, error: -2 (No such file or directory), outsize: 16
unique: 12, opcode: CREATE (35), nodeid: 1, insize: 66, pid: 3430
unique: 12, success, outsize: 160
unique: 14, opcode: FLUSH (25), nodeid: 1099511627829, insize: 64, pid: 3430
unique: 14, success, outsize: 16
unique: 16, opcode: RELEASE (18), nodeid: 1099511627829, insize: 64, pid: 0
unique: 16, success, outsize: 16
unique: 18, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 3430
unique: 18, success, outsize: 144
unique: 20, opcode: UNLINK (10), nodeid: 1, insize: 50, pid: 3430
unique: 20, success, outsize: 16
unique: 22, opcode: FORGET (2), nodeid: 1099511627829, insize: 48, pid: 0
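For unattended testing (for example, to check a rebuilt ceph-fuse), the same create/unlink/open cycle can be run in a child process with a timeout, so the caller does not block on a hung mount. This is only a sketch: the function name probe, the one-second sleep and the default timeout are assumptions, and a process blocked on a FUSE request may not exit immediately when killed.

```python
import os
import subprocess
import sys

# Child program: create, close and unlink the file, wait briefly so the
# kernel can send FORGET, then try a read-only open of the same path.
CHILD = r"""
import os, sys, time
path = sys.argv[1]
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.close(fd)
os.unlink(path)
time.sleep(1)
try:
    os.close(os.open(path, os.O_RDONLY))
except FileNotFoundError:
    pass  # expected: the file was just unlinked
"""


def probe(path, timeout=10.0):
    """Return 'ok', 'hang' or 'error' for one create/unlink/open cycle."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD, path])
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only; do not wait for the child, it may stay blocked.
        proc.kill()
        return "hang"
    return "ok" if rc == 0 else "error"
```

Calling probe('/mnt/cephtest/crashtest') on an affected mount is expected to return 'hang', and 'ok' once the client is fixed or runs with the workaround; I have not verified the timing on every kernel, so the sleep may need adjusting.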
Finally, as a workaround, one can add fuse_multithreaded = false ( https://github.com/ceph/ceph/blob/ee1ae6cbd04079ff722f2466208af0813466da72/src/client/fuse_ll.cc#L1601 ) to e.g. the "[global]" section of the "ceph.conf" file fed to the ceph-fuse executable; this has the obvious drawbacks of running single-threaded code, but at least it will not hang.
#7 Updated by George Fedorov 8 months ago
Created a PR ( https://github.com/ceph/ceph/pull/50668 ) to backport commit https://github.com/ceph/ceph/commit/70425c75df1161befe4b4f35739d1432aa0e3505 into Quincy, which effectively applies something very similar to Zhi Zhang's patch (it just uses the fuse multithreading settings instead of a new Ceph fuse parameter).
(Tested and confirmed that it solves the current issue.)
Note: one can also build ceph-fuse from either the quincy or the quincy-release branch but link it against libfuse2 instead of libfuse3; this resolves the issue as well.