Bug #42859
Ceph utils hang when executing rados.shutdown over the RoCE transport
Description
The transport layer is async+rdma. When I run the ceph utility from the command line, I get the expected results, but the tool then hangs with high probability. The logs are as follows:
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: waiting for dispatch queue
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: dispatch queue is stopped
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 shutdown_connections
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 10 -- 172.16.26.34:0/1717388727 wait: done.
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 wait complete.
2019-11-18 11:39:14.677 7fe2410e9700 1 librados: shutdown
2019-11-18 11:39:14.677 7fe2410e9700 1 -- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 msgr2=0x7fe23c053440 unknown :-1 s=STATE_NONE l=0).mark_down
2019-11-18 11:39:14.677 7fe2410e9700 20 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 2 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).stop
2019-11-18 11:39:14.677 7fe2410e9700 10 --1- 172.16.26.34:0/1717388727 >> 172.16.26.34:0/1717388727 conn(0x7fe23c053040 0x7fe23c053440 :-1 s=NONE pgs=0 cs=0 l=0).discard_out_queue started
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06f1b0 nevent=5000 time_id=1).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c070880 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.677 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.677 7fe2410e9700 20 Event(0x7fe23c06ab80 nevent=5000 time_id=5).wakeup
2019-11-18 11:39:14.677 7fe2410e9700 20 RDMAStack ~RDMADispatcher destructing rdma dispatcher
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 Infiniband rearm_notify started.
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
2019-11-18 11:39:14.777 7fe1e1376700 20 RDMAStack polling bypass qp=0x7fe230013940 tx_wr=1
It seems there is a bug in the shutdown logic of the async RDMA messenger: an in-flight request never receives a response?
Updated by yong xing over 4 years ago
When I comment out lines 1265 and 1266 in /usr/bin/ceph (which skips the shutdown logic), the tool functions well.
1262 if __name__ == '__main__':
1263 retval = main()
1264 # shutdown explicitly; Rados() does not
1265 if cluster_handle:
1266 run_in_thread(cluster_handle.shutdown)
1267 sys.exit(retval)
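Since the hang occurs only inside the final shutdown call, one possible workaround (a sketch, not part of the original report; `run_with_timeout_guard` is a hypothetical helper, not a Ceph API) is to run the shutdown in a daemon thread with a timeout, so the process can still exit even if the RDMA teardown never completes:

```python
import threading

def run_with_timeout_guard(fn, timeout=10.0):
    """Run fn in a daemon thread and wait at most `timeout` seconds.

    Returns True if fn finished in time, False if it is still running.
    Because the thread is a daemon, a stuck fn (e.g. a hanging
    cluster_handle.shutdown) no longer blocks interpreter exit.
    """
    t = threading.Thread(target=fn, daemon=True)
    t.start()
    t.join(timeout)
    return not t.is_alive()

if __name__ == '__main__':
    # Stand-in for cluster_handle.shutdown; a real call would go here.
    print(run_with_timeout_guard(lambda: None))
```

This does not fix the underlying messenger bug: the leaked in-flight work requests are simply abandoned rather than reaped, but the command-line tool regains a bounded exit time.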
Updated by yong xing over 4 years ago
Sometimes the command "ceph status" hung for several seconds and then exited with the following debug logs.
2019-11-25 08:12:51.229 7f79b4b79700 20 Infiniband rearm_notify started.
2019-11-25 08:12:51.229 7f79b4b79700 20 Infiniband rearm_notify started.
2019-11-25 08:12:51.229 7f79b4b79700 20 RDMAStack polling tx completion queue got 2 responses.
2019-11-25 08:12:51.229 7f79b4b79700 1 RDMAStack handle_tx_event Work Request Flushed Error: this connection's qp=17135 should be down while this WR=140160755516176 still in flight.
2019-11-25 08:12:51.229 7f79b4b79700 1 RDMAStack handle_tx_event Work Request Flushed Error: this connection's qp=17135 should be down while this WR=140160755591296 still in flight.
2019-11-25 08:12:51.229 7f79b4b79700 1 RDMAStack handle_tx_event sending of the disconnect msg completed
2019-11-25 08:12:51.229 7f79b4b79700 10 RDMAStack polling finally delete qp=0x7f79b80c6880
2019-11-25 08:12:51.229 7f79b4b79700 20 Infiniband ~QueuePair destroy qp=0x7f79b80c6908