Bug #42859

open

ceph utils hung when executing rados.shutdown in RoCE transport

Added by yong xing over 4 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The transport layer is async+rdma. When I run the ceph utility on the command line I get the expected results, but the tool then hangs with high probability. The logs are as follows:

2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: waiting for dispatch queue
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: dispatch queue is stopped
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 shutdown_connections
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: done.
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 wait complete.
2019-11-18 11:39:14.677 7fe2410e9700 1 librados: shutdown
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 msgr2=0x7fe23c053440 unknown :-1 s=STATE_NONE l=0).mark_down
2019-11-18 11:39:14.677 7fe2410e9700 20 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 2 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 10 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).discard_out_queue started
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 RDMAStack ~RDMADispatcher destructing rdma dispatcher
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1

It seems there is a bug in the shutdown logic of the async RDMA messenger: an in-flight request never receives a response?
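
For reference, the same shutdown path can be exercised without the CLI wrapper by driving librados directly. Below is a minimal sketch using the rados Python bindings; the conffile path and the "status" mon command are assumptions, adjust for your setup:

import json
import rados

# Connect over whatever messenger the conf selects (async+rdma here).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Any mon command will do; "status" mirrors what `ceph status` issues.
ret, outbuf, outs = cluster.mon_command(
    json.dumps({"prefix": "status", "format": "json"}), b'')
print(ret, outs)

# The reported hang occurs here, during messenger / RDMA teardown.
cluster.shutdown()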

Actions #1

Updated by yong xing over 4 years ago

When I comment out lines 1265 and 1266 in /usr/bin/ceph (this skips the shutdown logic), the tool works fine (a timeout-guarded alternative is sketched after the snippet).

1262 if __name__ == '__main__':
1263 retval = main()
1264 # shutdown explicitly; Rados() does not
1265 if cluster_handle:
1266 run_in_thread(cluster_handle.shutdown)
1267 sys.exit(retval)
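
Rather than removing the shutdown entirely, a timeout-guarded shutdown can keep the CLI from hanging indefinitely while the RDMA teardown is being debugged. This is only a sketch, not the actual /usr/bin/ceph code; cluster_handle is assumed to be the rados.Rados instance the CLI created:

import threading

def shutdown_with_timeout(handle, timeout=10.0):
    # Run shutdown on a daemon thread so a hung RDMA teardown cannot
    # block process exit; report if it does not finish in time.
    t = threading.Thread(target=handle.shutdown)
    t.daemon = True
    t.start()
    t.join(timeout)
    if t.is_alive():
        print('rados shutdown still hung after %.1fs' % timeout)

# Hypothetical replacement for lines 1265-1266 above:
#     if cluster_handle:
#         shutdown_with_timeout(cluster_handle)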
Actions #2

Updated by yong xing over 4 years ago

Sometimes the command "ceph status" hangs for several seconds, then exits with the following debug logs.

2019-11-25 08:12:51.229 7f79b4b79700 20 Infiniband rearm_notify started.
2019-11-25 08:12:51.229 7f79b4b79700 20 Infiniband rearm_notify started.
2019-11-25 08:12:51.229 7f79b4b79700 20 RDMAStack polling tx completion queue got 2 responses.
2019-11-25 08:12:51.229 7f79b4b79700 1 RDMAStack handle_tx_event Work Request Flushed Error: this connection's qp=17135 should be down while this WR=140160755516176 still in flight.
2019-11-25 08:12:51.229 7f79b4b79700 1 RDMAStack handle_tx_event Work Request Flushed Error: this connection's qp=17135 should be down while this WR=140160755591296 still in flight.
2019-11-25 08:12:51.229 7f79b4b79700 1 RDMAStack handle_tx_event sending of the disconnect msg completed
2019-11-25 08:12:51.229 7f79b4b79700 10 RDMAStack polling finally delete qp=0x7f79b80c6880
2019-11-25 08:12:51.229 7f79b4b79700 20 Infiniband ~QueuePair destroy qp=0x7f79b80c6908
