Bug #42859
openceph utils hang when executing rados.shutdown over the RoCE transport
Description
The transport layer is async+rdma. When I run the ceph utility from the command line I get the expected results, but afterwards the tool hangs with high probability. The logs are as follows:
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: waiting for dispatch queue
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: dispatch queue is stopped
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 shutdown_connections
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: done.
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 wait complete.
2019-11-18 11:39:14.677 7fe2410e9700 1 librados: shutdown
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 msgr2=0x7fe23c053440 unknown :-1 s=STATE_NONE l=0).mark_down
2019-11-18 11:39:14.677 7fe2410e9700 20 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 2 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 10 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).discard_out_queue started
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 RDMAStack ~RDMADispatcher destructing rdma dispatcher
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
It looks like a bug in the shutdown logic of the async RDMA messenger: an in-flight request never receives its completion, so shutdown never finishes? Note the repeated RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1 lines above: the polling thread is still busy 100 ms after ~RDMADispatcher logged its destruction, apparently because one tx work request (tx_wr=1) is still outstanding.
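To make the suspected mechanism concrete, here is a hedged sketch, not Ceph's actual code (the names tx_wr_inflight and shutdown_requested are hypothetical): a polling thread that gates its exit on outstanding transmit work requests will spin forever if one completion is never delivered.

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Hypothetical illustration only -- not the real RDMADispatcher.
    std::atomic<int>  tx_wr_inflight{1};        // one WR posted; its completion is lost
    std::atomic<bool> shutdown_requested{true}; // shutdown has been called

    void polling_loop() {
        while (true) {
            // A real poller would drain the completion queue here and decrement
            // tx_wr_inflight once per completion; if a completion never arrives,
            // the counter stays at 1 and the exit condition below never holds.
            if (shutdown_requested && tx_wr_inflight == 0)
                break;
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }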
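For reference, a minimal librados client along these lines should exercise the same shutdown path as the CLI tools (a sketch, not the exact tool I ran; it assumes ceph.conf selects ms_type = async+rdma and the RoCE device options):

    #include <rados/librados.hpp>
    #include <iostream>

    int main() {
        librados::Rados cluster;

        if (cluster.init(nullptr) < 0)                         // default client.admin
            return 1;
        if (cluster.conf_read_file("/etc/ceph/ceph.conf") < 0) // async+rdma configured here
            return 1;
        if (cluster.connect() < 0)
            return 1;

        // Any trivial command completes and prints its result before this point...
        // ...but the process then gets stuck inside shutdown():
        cluster.shutdown();
        std::cout << "shutdown returned" << std::endl;         // not reached on hang
        return 0;
    }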