Bug #42859

open

ceph utils hang when executing rados.shutdown in RoCE transport

Added by yong xing over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The transport layer is async+rdma. When running a ceph utility from the command line, I get the expected results, but the tool then hangs with high probability on shutdown (a minimal reproduction sketch follows the log excerpt). The logs are as follows:

2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: waiting for dispatch queue
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: dispatch queue is stopped
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 shutdown_connections
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: done.
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 wait complete.
2019-11-18 11:39:14.677 7fe2410e9700 1 librados: shutdown
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 msgr2=0x7fe23c053440 unknown :-1 s=STATE_NONE l=0).mark_down
2019-11-18 11:39:14.677 7fe2410e9700 20 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 2 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 10 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).discard_out_queue started
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 RDMAStack ~RDMADispatcher destructing rdma dispatcher
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
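
The hang is presumably reproducible with any librados client, since the command-line utilities link the same library. Below is a minimal sketch (not the exact command I ran), assuming /etc/ceph/ceph.conf selects the RDMA messenger via ms_type = async+rdma; the point is only to show the rados.shutdown() call that gets stuck:

#include <rados/librados.hpp>
#include <iostream>

int main() {
    librados::Rados rados;

    // Connect as client.admin using the local ceph.conf, which is assumed
    // to select the RDMA messenger (ms_type = async+rdma).
    if (rados.init("admin") < 0) {
        std::cerr << "init failed" << std::endl;
        return 1;
    }
    rados.conf_read_file("/etc/ceph/ceph.conf");
    if (rados.connect() < 0) {
        std::cerr << "connect failed" << std::endl;
        return 1;
    }

    // A simple cluster operation completes and returns the expected result.
    librados::cluster_stat_t st;
    if (rados.cluster_stat(st) == 0) {
        std::cout << "cluster kb: " << st.kb << std::endl;
    }

    // With high probability the process then hangs here instead of exiting.
    rados.shutdown();
    return 0;
}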

It looks like there is a bug in the shutdown logic of the async RDMA messenger: an in-flight request never receives a response?
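
To make that suspicion concrete: the shape of the hang would be a shutdown path that blocks until all posted work requests have drained, while the completion that would drain them is only ever reaped by a polling loop that has already been told to stop or keeps bypassing the queue pair, as the repeated "polling bypass ... tx_wr=1" lines above suggest. The following is only a simplified illustration of that deadlock pattern in plain C++, not the actual RDMADispatcher code:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified illustration of the suspected deadlock: shutdown() waits for the
// in-flight work-request count to reach zero, but the only place it would be
// decremented is the polling loop, which never reaps the pending completion.
struct FakeDispatcher {
    std::mutex lock;
    std::condition_variable cond;
    int inflight_tx_wr = 1;        // like the "tx_wr=1" stuck in the log
    std::atomic<bool> stopped{false};

    void polling_loop() {
        while (!stopped) {
            // Real code would poll the completion queue here and, on a
            // completion, decrement inflight_tx_wr and notify the waiter.
            // In the suspected failure the completion never arrives.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }

    void shutdown() {
        stopped = true;
        std::unique_lock<std::mutex> l(lock);
        // Blocks forever: nothing will ever bring inflight_tx_wr back to zero.
        cond.wait(l, [this] { return inflight_tx_wr == 0; });
    }
};

int main() {
    FakeDispatcher d;
    std::thread poller(&FakeDispatcher::polling_loop, &d);
    d.shutdown();                  // never returns, same symptom as the stuck tool
    poller.join();
    return 0;
}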
