Bug #56725

open file hang using vim with ceph-fuse client

Added by Bin Zhao over 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version is 16.2.9.
Opening a file with vim hangs on a ceph-fuse client. By default, opening a file with vim creates temporary swp and swpx files and closes them later. ceph-fuse closed the swpx file successfully but failed to close the swp file.

On the Pacific cluster, ceph-fuse produced the following debug output:
unique: 152, opcode: CREATE (35), nodeid: 1, insize: 68, pid: 3868383
unique: 152, success, outsize: 160
unique: 154, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 154, error: -2 (No such file or directory), outsize: 16
unique: 156, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 156, error: -2 (No such file or directory), outsize: 16
unique: 158, opcode: CREATE (35), nodeid: 1, insize: 69, pid: 3868383
unique: 158, success, outsize: 160
unique: 160, opcode: GETATTR (3), nodeid: 1099511627776, insize: 56, pid: 3868383
unique: 160, success, outsize: 120
unique: 162, opcode: GETATTR (3), nodeid: 1099511627777, insize: 56, pid: 3868383
unique: 162, success, outsize: 120
unique: 164, opcode: FLUSH (25), nodeid: 1099511627777, insize: 64, pid: 3868383
unique: 164, success, outsize: 16
unique: 166, opcode: RELEASE (18), nodeid: 1099511627777, insize: 64, pid: 0
unique: 166, success, outsize: 16
unique: 168, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 168, success, outsize: 144
unique: 170, opcode: UNLINK (10), nodeid: 1, insize: 53, pid: 3868383
unique: 170, success, outsize: 16
unique: 172, opcode: FLUSH (25), nodeid: 1099511627776, insize: 64, pid: 3868383
unique: 172, success, outsize: 16
unique: 174, opcode: FORGET (2), nodeid: 1099511627777, insize: 48, pid: 0

On the Nautilus cluster, ceph-fuse produced the following debug output:
unique: 83, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 83, error: -2 (No such file or directory), outsize: 16
unique: 84, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 84, error: -2 (No such file or directory), outsize: 16
unique: 85, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 85, error: -2 (No such file or directory), outsize: 16
unique: 86, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 86, error: -2 (No such file or directory), outsize: 16
unique: 87, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 87, error: -2 (No such file or directory), outsize: 16
unique: 88, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 89, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 629807
unique: 89, success, outsize: 120
unique: 90, opcode: STATFS (17), nodeid: 1, insize: 40, pid: 629807
unique: 88, error: -2 (No such file or directory), outsize: 16
unique: 91, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 629803
unique: 90, success, outsize: 96
unique: 92, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 629807
unique: 92, success, outsize: 120
unique: 91, error: -2 (No such file or directory), outsize: 16
unique: 93, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 93, error: -2 (No such file or directory), outsize: 16
unique: 94, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 94, error: -2 (No such file or directory), outsize: 16
unique: 95, opcode: CREATE (35), nodeid: 1, insize: 73, pid: 629803
unique: 95, success, outsize: 160
unique: 96, opcode: LOOKUP (1), nodeid: 1, insize: 58, pid: 629803
unique: 96, error: -2 (No such file or directory), outsize: 16
unique: 97, opcode: LOOKUP (1), nodeid: 1, insize: 58, pid: 629803
unique: 97, error: -2 (No such file or directory), outsize: 16
unique: 98, opcode: CREATE (35), nodeid: 1, insize: 74, pid: 629803
unique: 98, success, outsize: 160
unique: 99, opcode: GETATTR (3), nodeid: 1099511628777, insize: 56, pid: 629803
unique: 99, success, outsize: 120
unique: 100, opcode: GETATTR (3), nodeid: 1099511628778, insize: 56, pid: 629803
unique: 100, success, outsize: 120
unique: 101, opcode: FLUSH (25), nodeid: 1099511628778, insize: 64, pid: 629803
unique: 101, success, outsize: 16
unique: 102, opcode: RELEASE (18), nodeid: 1099511628778, insize: 64, pid: 0
unique: 103, opcode: LOOKUP (1), nodeid: 1, insize: 58, pid: 629803
unique: 102, success, outsize: 16
unique: 103, success, outsize: 144
unique: 104, opcode: UNLINK (10), nodeid: 1, insize: 58, pid: 629803
unique: 104, success, outsize: 16
unique: 105, opcode: FLUSH (25), nodeid: 1099511628777, insize: 64, pid: 629803
unique: 106, opcode: FORGET (2), nodeid: 1099511628778, insize: 48, pid: 0
unique: 105, success, outsize: 16
unique: 107, opcode: RELEASE (18), nodeid: 1099511628777, insize: 64, pid: 0
unique: 108, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 107, success, outsize: 16
unique: 108, success, outsize: 144
unique: 109, opcode: UNLINK (10), nodeid: 1, insize: 57, pid: 629803
unique: 109, success, outsize: 16
unique: 110, opcode: LOOKUP (1), nodeid: 1, insize: 57, pid: 629803
unique: 111, opcode: FORGET (2), nodeid: 1099511628777, insize: 48, pid: 0

It seems that no RELEASE request was sent to ceph-fuse on Pacific, and that this caused the hang.
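
To check a trace for this pattern without reading it by eye, something like the following could be used. It is only a sketch: it assumes the log lines have exactly the format shown above, and the function name `flushed_but_not_released` is made up for illustration. It lists every nodeid that received a FLUSH but never a RELEASE.

```python
import re

# Matches request lines such as:
#   unique: 172, opcode: FLUSH (25), nodeid: 1099511627776, insize: 64, pid: 3868383
REQUEST_RE = re.compile(r"unique: (\d+), opcode: (\w+) \((\d+)\), nodeid: (\d+)")


def flushed_but_not_released(log_text):
    """Return nodeids (in order of first FLUSH) that never got a RELEASE."""
    flushed = []
    released = set()
    for line in log_text.splitlines():
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # reply lines ("success", "error") carry no opcode
        opcode = match.group(2)
        nodeid = int(match.group(4))
        if opcode == "FLUSH" and nodeid not in flushed:
            flushed.append(nodeid)
        elif opcode == "RELEASE":
            released.add(nodeid)
    return [nodeid for nodeid in flushed if nodeid not in released]


# Tail of the Pacific trace quoted above.
PACIFIC_TAIL = """\
unique: 164, opcode: FLUSH (25), nodeid: 1099511627777, insize: 64, pid: 3868383
unique: 164, success, outsize: 16
unique: 166, opcode: RELEASE (18), nodeid: 1099511627777, insize: 64, pid: 0
unique: 166, success, outsize: 16
unique: 168, opcode: LOOKUP (1), nodeid: 1, insize: 53, pid: 3868383
unique: 168, success, outsize: 144
unique: 170, opcode: UNLINK (10), nodeid: 1, insize: 53, pid: 3868383
unique: 170, success, outsize: 16
unique: 172, opcode: FLUSH (25), nodeid: 1099511627776, insize: 64, pid: 3868383
unique: 172, success, outsize: 16
unique: 174, opcode: FORGET (2), nodeid: 1099511627777, insize: 48, pid: 0
"""

print(flushed_but_not_released(PACIFIC_TAIL))
```

On the Pacific excerpt this reports nodeid 1099511627776, the file that was flushed but never released.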

Actions #1

Updated by Xiubo Li over 1 year ago

BTW, can you reproduce this easily?

Actions #2

Updated by Bin Zhao over 1 year ago

This issue can be reproduced reliably.

With vim set to noswapfile, the file can be opened and edited successfully. ceph-fuse hangs after the file is removed.

Actions #3

Updated by Zhi Zhang over 1 year ago

We can reproduce this issue on the Pacific version too.

kernel 4.14.105
libfuse version 3.6.1.

Actions #4

Updated by Zhi Zhang over 1 year ago

On fuse3 with the N (Nautilus) version, the client calls the function below.
#define fuse_session_loop_mt(se, clone_fd) fuse_session_loop_mt_31(se, clone_fd)

fuse_session_loop_mt_31 sets max_idle_threads to 10 by default.

But on fuse3 with the P (Pacific) version, the client calls the function below.
#define fuse_session_loop_mt(se, config) fuse_session_loop_mt_32(se, config)

fuse_session_loop_mt_32 sets max_idle_threads from fuse_loop_config, which ceph initializes to 0. After a FUSE FORGET op, no thread is left to pick up the next FUSE op, so the whole client hangs forever.
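
The effect of max_idle_threads = 0 can be illustrated with a small sequential model of the worker bookkeeping. This is a rough sketch and not libfuse code: it assumes that a worker handling FORGET is not counted as busy (so no replacement thread is started for it) and that a worker exits once the number of available workers exceeds max_idle_threads. The names `simulate_worker_pool` and `REQUESTS` are made up for illustration.

```python
FORGET_OPCODES = {"FORGET", "BATCH_FORGET"}


def simulate_worker_pool(opcodes, max_idle_threads):
    """Model the worker counters for a sequence of FUSE requests.

    Returns (index of the first request with no worker left, or None,
             number of workers remaining at that point).
    """
    num_workers = 1  # the loop starts with a single worker thread
    num_avail = 1    # workers currently waiting for a request

    for index, opcode in enumerate(opcodes):
        if num_workers == 0:
            return index, num_workers  # nobody left to read this request

        is_forget = opcode in FORGET_OPCODES

        # Assumption: FORGET does not mark the worker as busy, so it
        # never triggers the start of a replacement thread.
        if not is_forget:
            num_avail -= 1
        if num_avail == 0:
            num_workers += 1
            num_avail += 1

        # ... the request itself would be processed here ...

        if not is_forget:
            num_avail += 1
        # Assumption: a worker exits when too many workers are idle.
        if num_avail > max_idle_threads:
            num_avail -= 1
            num_workers -= 1

    return None, num_workers


# Request sequence taken from the traces in this report.
REQUESTS = ["LOOKUP", "CREATE", "FLUSH", "RELEASE",
            "LOOKUP", "UNLINK", "FORGET", "LOOKUP"]

print(simulate_worker_pool(REQUESTS, max_idle_threads=0))
print(simulate_worker_pool(REQUESTS, max_idle_threads=10))
```

Under these assumptions, with max_idle_threads = 0 the only worker exits right after FORGET and the following LOOKUP is never served. With a value of 10 the pool keeps its workers.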

The fix is below.

diff --git a/src/client/fuse_ll.cc b/src/client/fuse_ll.cc
index b42a7cc970..ef2df10ee5 100644
--- a/src/client/fuse_ll.cc
+++ b/src/client/fuse_ll.cc
@@ -1633,6 +1633,7 @@ int CephFuse::Handle::loop()
       struct fuse_loop_config conf = { 0 };

       conf.clone_fd = opts.clone_fd;
+      conf.max_idle_threads = client->cct->_conf.get_val<int64_t>("fuse_max_idle_threads");
       return fuse_session_loop_mt(se, &conf);
     }
 #else
diff --git a/src/common/options.cc b/src/common/options.cc
index 341d7fc0bc..8bb50c84bc 100755
--- a/src/common/options.cc
+++ b/src/common/options.cc
@@ -9335,6 +9335,10 @@ std::vector<Option> get_mds_client_options() {
     .set_default(false)
     .set_description(""),

+    Option("fuse_max_idle_threads", Option::TYPE_INT, Option::LEVEL_ADVANCED)
+    .set_default(10)
+    .set_description(""),
+
Actions #5

Updated by Mathew Clarke over 1 year ago

I'm also experiencing this issue. I'm running "ceph-fuse version 17.2.0" with a "ceph version 17.2.3" cluster. Do you know when this fix is likely to be merged?

Actions #6

Updated by George Fedorov about 1 year ago

For the most recent ceph-fuse client, available to date for Ubuntu Jammy,

# ceph-fuse --version
ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)

The issue is still present. By the looks of it, it is fixed in the repository, so now it is the Ubuntu package maintainers' turn.

Zhi Zhang's comment nails it -- any FORGET operation will essentially cause ceph-fuse to freeze.

See the libfuse main loop code ( https://github.com/libfuse/libfuse/blob/b45d66cafb3b8de191d44b3a5705637319f7a552/lib/fuse_loop_mt.c#L127 ) and older versions of ceph-fuse ( https://github.com/ceph/ceph/blob/41ad4cd0bd429a0054881c54d106a1090c55870d/src/client/fuse_ll.cc#L1268 ).

The code below reproduces the problem, hanging at the first (read-only) open() attempt of the second iteration:

#!/usr/bin/python3

import os
import sys

FILENAME = '/mnt/cephtest/crashtest'
OPEN_FLAGS_RO      = os.O_RDONLY
OPEN_FLAGS_RW      = os.O_RDWR   | os.O_CREAT
OPEN_MODE = 0o600

def perr(*args):
    print(*args, file=sys.stderr)

def try_os_call(fn, *args, **kwargs):

    try:
        arg1 = f'{args[0]!r}' if isinstance(args[0], str) else str(args[0])
        arg2 = (', ' + f'0x{args[1]:08X}') if args[1:] else ''
        arg3 = (', ' + f'0o{args[2]:04o}') if args[2:] else ''
        perr( f" --> {fn.__name__}({arg1}{arg2}{arg3})" )
        ret = fn(*args, **kwargs)
    except OSError as e:
        print(f"OS Error: {e.args}", file=sys.stderr)
        ret = -e.errno

    perr( f" <= {ret}" )

    return ret

def pause():
    discardme = input('press <Enter>')

for n in range(2):

    fd_ro = try_os_call(os.open, FILENAME, OPEN_FLAGS_RO)
    fd_rw = try_os_call(os.open, FILENAME, OPEN_FLAGS_RW, OPEN_MODE)

    if (fd_rw >= 0):
        try_os_call( os.close, fd_rw  )
        try_os_call(os.unlink, FILENAME)

    pause()

ceph-fuse debug output will confirm that it freezes after the first FORGET:

unique: 8, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 3430
   unique: 8, error: -2 (No such file or directory), outsize: 16
unique: 10, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 3430
   unique: 10, error: -2 (No such file or directory), outsize: 16
unique: 12, opcode: CREATE (35), nodeid: 1, insize: 66, pid: 3430
   unique: 12, success, outsize: 160
unique: 14, opcode: FLUSH (25), nodeid: 1099511627829, insize: 64, pid: 3430
   unique: 14, success, outsize: 16
unique: 16, opcode: RELEASE (18), nodeid: 1099511627829, insize: 64, pid: 0
   unique: 16, success, outsize: 16
unique: 18, opcode: LOOKUP (1), nodeid: 1, insize: 50, pid: 3430
   unique: 18, success, outsize: 144
unique: 20, opcode: UNLINK (10), nodeid: 1, insize: 50, pid: 3430
   unique: 20, success, outsize: 16
unique: 22, opcode: FORGET (2), nodeid: 1099511627829, insize: 48, pid: 0

Finally, as a workaround, one can add fuse_multithreaded = false ( https://github.com/ceph/ceph/blob/ee1ae6cbd04079ff722f2466208af0813466da72/src/client/fuse_ll.cc#L1601 ) to e.g. the "[global]" section of the "ceph.conf" file passed to the ceph-fuse executable; this has the obvious drawbacks of running single-threaded code, but at least it will not hang.
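
As a config fragment, the workaround would look roughly like this. The section placement follows the suggestion above; any other settings already in the file stay as they are.

```
[global]
fuse_multithreaded = false
```

ceph-fuse has to be restarted (remounted) to pick up the change.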

Actions #7

Updated by George Fedorov about 1 year ago

Created a PR ( https://github.com/ceph/ceph/pull/50668 ) to backport commit https://github.com/ceph/ceph/commit/70425c75df1161befe4b4f35739d1432aa0e3505 into Quincy -- which effectively applies something very similar to Zhi Zhang's patch (just uses fuse multithreading settings instead of a new Ceph fuse parameter).

(Tested and confirmed that it solves the current issue.)

Note: one can also build ceph-fuse from either quincy or quincy-release branch, but link it against libfuse2 instead of libfuse3 -- this will resolve the issue as well.
