Bug #65455

read operation hung in Client::get_caps

Added by tod chen 21 days ago. Updated 3 days ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

How to reproduce the issue
1. I set up two NFS servers (server1, server2), each running nfs-ganesha with libcephfs, and exported the same CephFS subdirectory from both (an example export configuration is sketched after these steps)

2. One NFS client (client1) mounts the export from server1 over NFSv3, and another client (client2) mounts it from server2, also over NFSv3

3. Next, write a file through client2:
dd if=/dev/zero of=testfile bs=1M count=10000 oflag=direct status=progress

4. After the write completes, read the file back through client1; the read may hang:
dd if=testfile of=/dev/null bs=4k iflag=direct count=1000000 status=progress

5. Stack trace of the hung reader thread on server1 (see the note following these steps for a decode of the cap bits requested in the Client::get_caps frame):

Thread 36 (Thread 0x7f8546ce5700 (LWP 83644)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f8546ce24d8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f85c5f4c0f0, cond=0x7f8546ce24b0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f8546ce24b0, mutex=0x7f85c5f4c0f0) at pthread_cond_wait.c:655
#3  0x00007f85c67773bc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f85bc053d66 in Client::wait_on_list (this=0x7f85c5f4c000, ls=Python Exception <class 'ValueError'> Cannot find type class std::__cxx11::list<std::condition_variable*, std::allocator<std::condition_variable*> >::_Node:
std::__cxx11::list) at ./src/client/Client.cc:3832
#5  0x00007f85bc0b131f in Client::get_caps (this=0x7f85c5f4c000, fh=0x7f85a7ce7900, need=2048, want=1024, phave=0x7f8546ce26fc, endoff=-1) at ./src/client/Client.cc:3342
#6  0x00007f85bc0c40fb in Client::_read (this=0x7f85c5f4c000, f=0x7f85a7ce7900, offset=49152, size=4096, bl=0x7f8546ce2830) at ./src/client/Client.cc:9296
#7  0x00007f85bc0c4a53 in Client::ll_read (this=0x7f85c5f4c000, fh=0x7f85a7ce7900, off=off@entry=49152, len=4096, bl=bl@entry=0x7f8546ce2830) at ./src/client/Client.cc:13487
#8  0x00007f85bc04ba7a in ceph_ll_read (cmount=<optimized out>, filehandle=<optimized out>, off=off@entry=49152, len=<optimized out>, buf=0x7f8557860000 "") at ./src/libcephfs.cc:1670
#9  0x00007f85bc16a47b in ceph_fsal_read2 (obj_hdl=0x7f85a8154cc0, bypass=<optimized out>, done_cb=0x7f85c7481460 <mdc_read_cb>, read_arg=0x7f85a3282e80, caller_arg=0x7f85a32343e0) at ./src/FSAL/FSAL_CEPH/handle.c:1890
#10 0x00007f85c7482a30 in mdcache_read2 (obj_hdl=0x7f85a8047d38, bypass=<optimized out>, done_cb=<optimized out>, read_arg=0x7f85a3282e80, caller_arg=<optimized out>) at ./src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:620
#11 0x00007f85c7467dca in nfs3_read (arg=<optimized out>, req=0x7f85a3256000, res=0x7f83277b5500) at ./src/Protocols/NFS/nfs3_read.c:367
#12 0x00007f85c73ad145 in nfs_rpc_process_request (reqdata=0x7f85a3256000, retry=<optimized out>) at ./src/MainNFSD/nfs_worker_thread.c:1518
#13 0x00007f85c732f5cb in svc_request (xprt=0x7f85a840ac00, xdrs=0x7f85a85418c0) at ./src/svc_rqst.c:1209
#14 0x00007f85c732c8b1 in svc_rqst_xprt_task_recv (wpe=<optimized out>) at ./src/svc_rqst.c:1183
#15 0x00007f85c732d298 in svc_rqst_epoll_loop (wpe=0x7f85c5f53600) at ./src/svc_rqst.c:1571
#16 0x00007f85c7338778 in work_pool_thread (arg=0x7f85a7c00960) at ./src/work_pool.c:183
#17 0x00007f85c6be4fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#18 0x00007f85c693e06f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

6. The caps held by the two servers:

    "client_caps": [
        {
            "client_id": 36812824,      -> server2
            "pending": "pAsLsXsFrw",
            "issued": "pAsLsXsFrw",
            "wanted": "pAsxXsxFsxcrwb",
            "last_sent": 18169114
        },
        {
            "client_id": 39570896,      -> server1
            "pending": "pAsLsXsFr",
            "issued": "pAsLsXsFr",
            "wanted": "pFscr",
            "last_sent": 5239872
        }
    ],
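
For reference, the nfs-ganesha side of step 1 would typically be an FSAL_CEPH export defined identically on server1 and server2. The block below is only an illustrative sketch; the path, Export_Id and cephx user are placeholders, not the exact configuration used in this report.

EXPORT {
    Export_Id = 1;
    Path = "/shared/subdir";         # CephFS subdirectory exported by both servers (placeholder)
    Pseudo = "/shared/subdir";
    Protocols = 3;                   # NFSv3, matching the mounts in step 2
    Access_Type = RW;
    Squash = No_Root_Squash;
    FSAL {
        Name = CEPH;                 # libcephfs-backed FSAL
        User_Id = "admin";           # cephx user (placeholder)
    }
}

With such an export in place, the clients in step 2 would mount it roughly as follows (hostnames and mount points are again placeholders):

mount -t nfs -o vers=3 server1:/shared/subdir /mnt/export    # on client1
mount -t nfs -o vers=3 server2:/shared/subdir /mnt/export    # on client2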
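
A note on reading the trace in step 5: in the Client::get_caps frame, need=2048 and want=1024 decode, following the cap bit definitions in Ceph's src/include/ceph_fs.h, to CEPH_CAP_FILE_RD (Fr) and CEPH_CAP_FILE_CACHE (Fc), i.e. the reader is blocked until the Fr capability can be satisfied. The small standalone C++ sketch below just reproduces that decoding; the constants are copied by hand for illustration rather than pulled from the Ceph headers.

#include <cstdio>

// Cap bit layout mirroring src/include/ceph_fs.h (values reproduced by hand):
constexpr unsigned CEPH_CAP_GCACHE = 4;   // Fc: client can cache reads
constexpr unsigned CEPH_CAP_GRD    = 8;   // Fr: client can read
constexpr unsigned CEPH_CAP_SFILE  = 8;   // shift that moves generic bits into the "file" position

constexpr unsigned CEPH_CAP_FILE_RD    = CEPH_CAP_GRD    << CEPH_CAP_SFILE;  // 8 << 8 = 2048
constexpr unsigned CEPH_CAP_FILE_CACHE = CEPH_CAP_GCACHE << CEPH_CAP_SFILE;  // 4 << 8 = 1024

int main() {
    // The hung frame is Client::get_caps(..., need=2048, want=1024, ...):
    std::printf("need=2048 is Fr (CEPH_CAP_FILE_RD):    %s\n", 2048 == CEPH_CAP_FILE_RD    ? "yes" : "no");
    std::printf("want=1024 is Fc (CEPH_CAP_FILE_CACHE): %s\n", 1024 == CEPH_CAP_FILE_CACHE ? "yes" : "no");
    return 0;
}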

Actions #1

Updated by tod chen 21 days ago

The Ceph versions are 15.2.17 and 16.2.14.

Actions #2

Updated by Venky Shankar 18 days ago

  • Status changed from New to Need More Info

tod chen wrote in #note-1:

The Ceph versions are 15.2.17 and 16.2.14.

Ceph 15.x is EOL and unsupported. Could you test this with a recent, supported Ceph release (Quincy or Reef) and see if it can be reproduced?

Actions #3

Updated by Venky Shankar 3 days ago

  • Status changed from Need More Info to Rejected

Please reopen the ticket if this is reproducible on supported Ceph versions.
