Backport #37591

luminous: RDMAStack: do not destroy QP if SQ is not fully consumed

Added by Roman Penyaev over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
Release:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The bug is readily reproduced on QLogic hardware. The following is debug output from the qedr uverbs driver and from gdb:

[qelr_modify_qp:1150]QP Modify state 1->1, rc=0
[qelr_destroy_qp:1188]destroy qp: 0x55555fbc0000
[qelr_destroy_qp:1206]destroy qp: successfully destroyed 0x55555fbc0000

Thread 44 "msgr-worker-0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe4d3e700 (LWP 161083)]
0x00007fffe629b567 in qelr_poll_cq_req (cq=<optimized out>, req=0x7ffff7e38fa0, req=0x7ffff7e38fa0, wc=0x7fffe4d3c110, num_entries=32, qp=0x55555fbc0000) at /root/rpen/fastlinq-8.37.15.0/rdma-core/providers/qedr/qelr_verbs.c:2302
2302                    DP_ERR(cxt->dbg_fp,
(gdb) bt
#0  0x00007fffe629b567 in qelr_poll_cq_req (cq=<optimized out>, req=0x7ffff7e38fa0, req=0x7ffff7e38fa0, wc=0x7fffe4d3c110, num_entries=32, qp=0x55555fbc0000) at /root/rpen/fastlinq-8.37.15.0/rdma-core/providers/qedr/qelr_verbs.c:2302
#1  qelr_poll_cq (ibcq=0x55555f445e00, num_entries=32, wc=0x7fffe4d3c110) at /root/rpen/fastlinq-8.37.15.0/rdma-core/providers/qedr/qelr_verbs.c:2654
#2  0x000055555628c7cc in ibv_poll_cq (wc=0x7fffe4d3c110, num_entries=32, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1908
#3  Infiniband::CompletionQueue::poll_cq (this=0x55555fbf4090, num_entries=num_entries@entry=32, ret_wc_array=ret_wc_array@entry=0x7fffe4d3c110) at /usr/src/debug/ceph-12.2.8-467-g080f2248ff/src/msg/async/rdma/Infiniband.cc:452
#4  0x0000555556084435 in RDMADispatcher::polling (this=0x55555f49cd80) at /usr/src/debug/ceph-12.2.8-467-g080f2248ff/src/msg/async/rdma/RDMAStack.cc:157
#5  0x00007ffff5bf0d50 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x00007ffff62d9724 in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff535fe8d in clone () from /lib64/libc.so.6

From the output it is obvious that we keep receiving CQEs from the QP which was just destroyed (see the "[qelr_destroy_qp:1206]" log line).
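
The fix described by the title amounts to deferring ibv_destroy_qp() until every send WR posted to the SQ has been consumed from the completion queue. Below is a minimal C++ sketch of such a guard; the QueuePairGuard wrapper and the tx_inflight counter are illustrative assumptions, not Ceph's actual code (the master-branch fix listed below uses a single atomic counter of inflight Tx CQEs):

// Minimal sketch (illustrative, not Ceph's actual code): guard QP
// destruction on an atomic count of send WRs that have not yet been
// consumed from the completion queue.
#include <atomic>
#include <cstdint>
#include <infiniband/verbs.h>

struct QueuePairGuard {                      // hypothetical wrapper
  ibv_qp *qp = nullptr;
  std::atomic<uint32_t> tx_inflight{0};      // WRs posted but not yet completed

  int post_send(ibv_send_wr *wr, ibv_send_wr **bad_wr) {
    tx_inflight.fetch_add(1, std::memory_order_acq_rel);
    int rc = ibv_post_send(qp, wr, bad_wr);
    if (rc)
      tx_inflight.fetch_sub(1, std::memory_order_acq_rel);  // never entered the SQ
    return rc;
  }

  // Called by the polling thread for every Tx CQE belonging to this QP.
  void on_tx_cqe() {
    tx_inflight.fetch_sub(1, std::memory_order_acq_rel);
  }

  // Destroy only once the SQ is fully consumed; otherwise a later
  // ibv_poll_cq() can hand out a CQE whose qp context is already freed,
  // which is exactly the SIGSEGV in qelr_poll_cq_req() above.
  bool try_destroy() {
    if (tx_inflight.load(std::memory_order_acquire) != 0)
      return false;  // defer destruction
    return ibv_destroy_qp(qp) == 0;
  }
};

In such a scheme the polling loop would simply retry try_destroy() for each dead QP until it succeeds.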

The following commits in master partially fix the problem (they should be applied from bottom to top; a sketch of the erase()-during-iteration fix from 9d9593f3bec follows the list):

15f833d0e2b msg/async/rdma: a tiny typo fix.
9d9593f3bec msg/async/rdma: fix a coredump bug which is introduced by PR #18053, where the iterator is not working properly after erase().
cbb3bd46dbb Addressing CR comments from alex-mikheev (Alex Mikheev), to use a single atomic counter for inflight Tx CQEs.
c90588f0bca Addressing CR comments from tchaikov (Kefu Chai).
ec605b26f6f msg/async/rdma: fix Tx buffer leakage which can introduce "heartbeat no reply" due to out of Tx buffers, this can be reproduced by marking some OSDs down in a big Ceph cluster, say 300+ OSDs.
54e98167201 msg/async/rdma: fix a potential coredump when handling tx_buffers under heavy RDMA traffic, there are chances to access a current_chunk which can be beyond the range of pre-allocated Tx buffer pool thus causes a coredump
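
For reference, the coredump fixed by 9d9593f3bec comes from continuing to use an iterator that was just passed to erase(). A hedged sketch of the standard-idiomatic pattern, with illustrative names (DeadQp, reap_dead_qps) rather than Ceph's own:

// Hedged sketch of the erase()-during-iteration fix mentioned in commit
// 9d9593f3bec; the names here are illustrative, not Ceph's.
#include <cstdint>
#include <map>

struct DeadQp { uint32_t tx_inflight; };  // stand-in for a dead queue pair

void reap_dead_qps(std::map<uint32_t, DeadQp> &dead_qps) {
  for (auto it = dead_qps.begin(); it != dead_qps.end(); /* advanced below */) {
    if (it->second.tx_inflight == 0)
      it = dead_qps.erase(it);  // erase() returns the next valid iterator
    else
      ++it;                     // SQ not fully consumed yet; keep deferring
  }
}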

#1

Updated by Nathan Cutler over 5 years ago

  • Status changed from New to Closed

This was a misunderstanding.
