Bug #43711
Ceph commands hang when ms_type=async+rdma is used
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
rdma
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
If cluster public communication is switched to RDMA with
ms_type=async+rdma
... every 'ceph' command invocation hangs after it has completed its work. The 'rdma-polling' thread can be seen spinning at 100% CPU usage in e.g. htop.
Stracing the 'ceph -s' command, for example, reveals the following after the command has successfully printed the status report:
munmap(0x7f216872c000, 262144) = 0
munmap(0x7f21686ec000, 262144) = 0
munmap(0x7f216862c000, 262144) = 0
munmap(0x7f21686ac000, 262144) = 0
clone(child_stack=0x7f21cb0c8fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f21cb0c99d0, tls=0x7f21cb0c9700, child_tidptr=0x7f21cb0c99d0) = 3250559
futex(0x55f4b88971a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55f4b88971a0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55f4b89fffd0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55f4b88971a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55f4b899b2c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
The last line repeats ad infinitum.
'perf record' and 'perf report' give the following trace data for the hung-up process:
21,53%  rdma-polling  libceph-common.so.0  [.] PerfCounters::set
16,52%  rdma-polling  libpthread-2.27.so   [.] pthread_spin_lock
13,02%  rdma-polling  libceph-common.so.0  [.] RDMADispatcher::polling
 8,39%  rdma-polling  libpthread-2.27.so   [.] __pthread_mutex_lock
 8,18%  rdma-polling  libceph-common.so.0  [.] Cycles::to_nanoseconds
 7,54%  rdma-polling  libpthread-2.27.so   [.] __pthread_mutex_unlock
 3,68%  rdma-polling  libceph-common.so.0  [.] Infiniband::CompletionQueue::poll_cq
 3,66%  rdma-polling  libceph-common.so.0  [.] Mutex::lock
 3,64%  rdma-polling  libceph-common.so.0  [.] Mutex::unlock
 2,49%  rdma-polling  libceph-common.so.0  [.] Cycles::to_microseconds
 1,13%  rdma-polling  libc-2.27.so         [.] pthread_mutex_lock
 1,12%  rdma-polling  libceph-common.so.0  [.] pthread_self@plt
 0,83%  rdma-polling  libc-2.27.so         [.] pthread_self
Please let me know the exact debugging commands you would like me to run if you require extra information to troubleshoot this issue.