Bug #43711

open

Ceph commands hang when ms_type=async+rdma is used

Added by Mikko Tanner over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags: rdma
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If cluster public communication is switched to RDMA with

ms_type=async+rdma

... every 'ceph' command invocation hangs after it has completed its work instead of exiting. The 'rdma-polling' thread can be seen spinning at 100% CPU usage, for example in htop.

Running strace on, for example, the 'ceph -s' command reveals the following after the command has printed out the status report successfully:

munmap(0x7f216872c000, 262144)          = 0
munmap(0x7f21686ec000, 262144)          = 0
munmap(0x7f216862c000, 262144)          = 0
munmap(0x7f21686ac000, 262144)          = 0
clone(child_stack=0x7f21cb0c8fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f21cb0c99d0, tls=0x7f21cb0c9700, child_tidptr=0x7f21cb0c99d0) = 3250559
futex(0x55f4b88971a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55f4b88971a0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55f4b89fffd0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55f4b88971a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55f4b899b2c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)

The last line repeats ad infinitum.
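
The select() timeouts above double from 1 ms until they reach a 50 ms cap, which looks like an exponential-backoff polling loop rather than a blocking wait. That pattern would be consistent with a timed wait in the Python 2 threading module (the 'ceph' CLI is a Python script), but that attribution is an assumption and is not confirmed by the trace. A minimal sketch that reproduces the observed sequence; the function name and default values are illustrative only:

```python
def backoff_delays_usec(count, initial_usec=1000, cap_usec=50000):
    """Return `count` sleep intervals that double each round up to a cap.

    The defaults are taken from the tv_usec values in the strace output;
    they are not read from any Ceph or Python source.
    """
    delays = []
    delay = initial_usec
    for _ in range(count):
        delays.append(delay)
        # Double the interval, but never exceed the cap.
        delay = min(delay * 2, cap_usec)
    return delays


if __name__ == "__main__":
    print(backoff_delays_usec(9))
```

If this reading is right, the main thread is only waiting for something else to finish, and the thread that never finishes is the one to look at.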

'perf record' and 'perf report' give the following trace data for the hung process:

  21,53%  rdma-polling  libceph-common.so.0  [.] PerfCounters::set
  16,52%  rdma-polling  libpthread-2.27.so   [.] pthread_spin_lock
  13,02%  rdma-polling  libceph-common.so.0  [.] RDMADispatcher::polling
   8,39%  rdma-polling  libpthread-2.27.so   [.] __pthread_mutex_lock
   8,18%  rdma-polling  libceph-common.so.0  [.] Cycles::to_nanoseconds
   7,54%  rdma-polling  libpthread-2.27.so   [.] __pthread_mutex_unlock
   3,68%  rdma-polling  libceph-common.so.0  [.] Infiniband::CompletionQueue::poll_cq
   3,66%  rdma-polling  libceph-common.so.0  [.] Mutex::lock
   3,64%  rdma-polling  libceph-common.so.0  [.] Mutex::unlock
   2,49%  rdma-polling  libceph-common.so.0  [.] Cycles::to_microseconds
   1,13%  rdma-polling  libc-2.27.so         [.] pthread_mutex_lock
   1,12%  rdma-polling  libceph-common.so.0  [.] pthread_self@plt
   0,83%  rdma-polling  libc-2.27.so         [.] pthread_self
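
To make the profile above easier to read, the samples can be summed per shared object. The following is only a sketch: the helper name is made up for this note, and it assumes the five-column layout and comma decimal separator shown in the pasted report.

```python
from collections import defaultdict

# The 'perf report' lines exactly as pasted in this report.
PERF_LINES = """\
  21,53%  rdma-polling  libceph-common.so.0  [.] PerfCounters::set
  16,52%  rdma-polling  libpthread-2.27.so   [.] pthread_spin_lock
  13,02%  rdma-polling  libceph-common.so.0  [.] RDMADispatcher::polling
   8,39%  rdma-polling  libpthread-2.27.so   [.] __pthread_mutex_lock
   8,18%  rdma-polling  libceph-common.so.0  [.] Cycles::to_nanoseconds
   7,54%  rdma-polling  libpthread-2.27.so   [.] __pthread_mutex_unlock
   3,68%  rdma-polling  libceph-common.so.0  [.] Infiniband::CompletionQueue::poll_cq
   3,66%  rdma-polling  libceph-common.so.0  [.] Mutex::lock
   3,64%  rdma-polling  libceph-common.so.0  [.] Mutex::unlock
   2,49%  rdma-polling  libceph-common.so.0  [.] Cycles::to_microseconds
   1,13%  rdma-polling  libc-2.27.so         [.] pthread_mutex_lock
   1,12%  rdma-polling  libceph-common.so.0  [.] pthread_self@plt
   0,83%  rdma-polling  libc-2.27.so         [.] pthread_self
"""


def overhead_by_object(report):
    """Sum the overhead column per shared object (third column)."""
    totals = defaultdict(float)
    for line in report.splitlines():
        fields = line.split()
        # Skip anything that is not a "<pct>% <comm> <dso> [.] <symbol>" row.
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue
        # The report uses a comma as the decimal separator.
        percent = float(fields[0].rstrip("%").replace(",", "."))
        totals[fields[2]] += percent
    return dict(totals)


if __name__ == "__main__":
    totals = overhead_by_object(PERF_LINES)
    for shared_object, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%-22s %6.2f%%" % (shared_object, percent))
```

Summed this way, roughly 57% of the samples fall in libceph-common.so.0 and about 32% in libpthread locking calls, all in the 'rdma-polling' thread. That fits a polling loop that keeps running after the command itself has finished.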

Please let me know the exact debugging commands you would like me to run if you require extra information to troubleshoot this issue.

Actions #1

Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (ceph cli)