Bug #19867

closed

SimpleMessenger: osd cluster messenger pipe connection is only one direction

Added by Peng Xie almost 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
teuthology
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently I found an interesting SimpleMessenger pipe connection problem on the OSDs' cluster network.

My test environment is 2 Ceph nodes with 2 OSDs in total, one OSD per node, replica size 2, running ceph-10.2.2.
osd.0 on node xt1, with cluster ip: 192.168.100.1, public ip: 192.168.10.88
osd.1 on node xt2, with cluster ip: 192.168.100.2, public ip: 192.168.10.89

netstat shows that there is only one connection between the two OSDs on the
cluster network (192.168.100.0/24), listen port 6800:

osd.0 (tcp client: 192.168.100.1:55380) -----> osd.1 (tcp server: 192.168.100.2:6800)

We would expect another pipe connection from
osd.1 (192.168.100.2:xxxxx) to osd.0 (192.168.100.1:6800), but there is none.

The netstat output is shown below:

[root@xt1 ]# netstat -ntpa |grep ceph
tcp 0 0 192.168.10.88:6789 0.0.0.0:* LISTEN 17395/ceph-mon
tcp 0 0 192.168.100.1:6800 0.0.0.0:* LISTEN 17435/ceph-osd
tcp 0 0 192.168.10.88:6800 0.0.0.0:* LISTEN 17435/ceph-osd
tcp 0 0 192.168.10.88:6801 0.0.0.0:* LISTEN 17435/ceph-osd
tcp 0 0 192.168.100.1:6801 0.0.0.0:* LISTEN 17435/ceph-osd
tcp 0 0 192.168.100.1:48532 192.168.100.2:6800 ESTABLISHED 17435/ceph-osd
tcp 0 0 192.168.100.1:6801 192.168.100.2:43908 ESTABLISHED 17435/ceph-osd
tcp 0 0 192.168.100.1:51819 192.168.100.2:6801 ESTABLISHED 17435/ceph-osd
tcp 0 0 192.168.10.88:6801 192.168.10.89:32938 ESTABLISHED 17435/ceph-osd
tcp 0 9 192.168.10.88:6789 192.168.10.88:32834 ESTABLISHED 17395/ceph-mon
tcp 0 0 192.168.10.88:38788 192.168.10.89:6801 ESTABLISHED 17435/ceph-osd
tcp 0 0 192.168.10.88:6789 192.168.10.89:45039 ESTABLISHED 17395/ceph-mon
tcp 0 0 192.168.10.88:32834 192.168.10.88:6789 ESTABLISHED 17435/ceph-osd

[root@xt2]# netstat -ntpa |grep ceph
tcp 0 0 192.168.100.2:6800 0.0.0.0:* LISTEN 7642/ceph-osd
tcp 0 0 192.168.10.89:6800 0.0.0.0:* LISTEN 7642/ceph-osd
tcp 0 0 192.168.10.89:6801 0.0.0.0:* LISTEN 7642/ceph-osd
tcp 0 0 192.168.100.2:6801 0.0.0.0:* LISTEN 7642/ceph-osd
tcp 0 0 192.168.10.89:45039 192.168.10.88:6789 ESTABLISHED 7642/ceph-osd
tcp 0 0 192.168.100.2:6800 192.168.100.1:48532 ESTABLISHED 7642/ceph-osd
tcp 0 0 192.168.10.89:6801 192.168.10.88:38788 ESTABLISHED 7642/ceph-osd
tcp 0 0 192.168.100.2:43908 192.168.100.1:6801 ESTABLISHED 7642/ceph-osd
tcp 0 0 192.168.100.2:6801 192.168.100.1:51819 ESTABLISHED 7642/ceph-osd
tcp 0 0 192.168.10.89:32938 192.168.10.88:6801 ESTABLISHED 7642/ceph-osd

After analyzing the OSD logs, the timeline of the faulty logic is as follows:

1. During osd.1 startup, it first creates and registers a new Pipe-C in the cluster_messenger's
rank_pipe map as ["192.168.100.1:6800", Pipe-C], and then calls Pipe::connect() to osd.0.
2. At the same time, osd.1, acting as the cluster_messenger TCP server side, calls Pipe::accept() for the
connection request from osd.0 and receives the peer_addr "192.168.100.1:6800".
When it calls _lookup_pipe() on the rank_pipe map, it finds the still-connecting Pipe-C from step 1,
so it closes Pipe-C and replaces it with the new Pipe-A, leaving the map entry as
["192.168.100.1:6800", Pipe-A].
3. Since Pipe-C was closed, osd.0's Pipe::accept() encounters an error and gives up that connection.
4. Subsequent IO from osd.1 is replicated to osd.0 through osd.1's server-side Pipe-A,
whereas it was supposed to use the separate client-side Pipe-C to transfer the internal IO.

As my analysis shows, this will hurt the performance of the internal data transfer between the OSDs.
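To make the race in steps 1-3 easier to follow, here is a minimal, self-contained C++ sketch of the decision. It is not the actual Ceph SimpleMessenger code: the Pipe struct, the rank_pipe map, and the accept_from() helper below are simplified stand-ins for Pipe, the messenger's rank_pipe map, and the relevant part of Pipe::accept().

#include <iostream>
#include <map>
#include <memory>
#include <string>

enum class PipeState { Connecting, Open, Closed };

struct Pipe {
  std::string peer_addr;
  PipeState state = PipeState::Connecting;
  int connect_seq = 0;
};

// rank_pipe: peer address -> registered pipe (one entry per peer)
std::map<std::string, std::shared_ptr<Pipe>> rank_pipe;

// Called when an incoming connection from peer_addr reaches accept().
// Returns the pipe that will carry traffic to and from that peer.
std::shared_ptr<Pipe> accept_from(const std::string& peer_addr) {
  auto incoming = std::make_shared<Pipe>();
  incoming->peer_addr = peer_addr;
  incoming->state = PipeState::Open;

  auto it = rank_pipe.find(peer_addr);
  if (it != rank_pipe.end()) {
    auto existing = it->second;
    // Connection race: our own connect() to this peer has not finished yet
    // (still Connecting, connect_seq == 0), so the accepted pipe wins and the
    // outgoing attempt is stopped -- the "replacing my attempt" path in the log.
    if (existing->state == PipeState::Connecting && existing->connect_seq == 0) {
      existing->state = PipeState::Closed;  // stop + unregister the old pipe
      std::cout << "race: replacing outgoing attempt to " << peer_addr << "\n";
    }
  }
  rank_pipe[peer_addr] = incoming;          // register the winner
  return incoming;
}

int main() {
  // osd.1's point of view: time-1, register the outgoing pipe (Pipe-C) ...
  auto pipe_c = std::make_shared<Pipe>();
  pipe_c->peer_addr = "192.168.100.1:6800";
  rank_pipe[pipe_c->peer_addr] = pipe_c;

  // ... time-3, accept the peer's incoming connection (Pipe-A) before our own
  // connect() completes, so the incoming pipe replaces Pipe-C.
  auto pipe_a = accept_from("192.168.100.1:6800");

  std::cout << "Pipe-C is now "
            << (pipe_c->state == PipeState::Closed ? "closed" : "open")
            << "; the single Pipe-A carries both directions\n";
  return pipe_a == rank_pipe["192.168.100.1:6800"] ? 0 : 1;
}

The end state of the sketch matches what netstat shows above: only one established connection per peer pair on the cluster messenger, with the replaced outgoing attempt torn down.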


The logs in timeline order are shown below:

time-1
in Node-xt2 (lines marked with <<<--- are my comments):
2017-04-19 14:15:19.843828 7f3e79b19700 10 -- 192.168.100.2:6800/2283 connect_rank to 192.168.100.1:6800/12758, creating pipe and registering
2017-04-19 14:15:19.843866 7f3e79b19700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea24ce000 sd=-1 :0 s=1 pgs=0 cs=0 l=0 c=0x7f3ea13f2a00).register_pipe
<<<--- the osd.1 cluster_messenger client-side pipe is registered in rank_pipe as ["192.168.100.1:6800", Pipe-C]
and starts to call Pipe::connect().
.......
.......
2017-04-19 14:15:19.845668 7f3e6e2fb700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea24ce000 sd=-1 :0 s=1 pgs=0 cs=0 l=0 c=0x7f3ea13f2a00).connect 0
2017-04-19 14:15:19.845678 7f3e6e2fb700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea24ce000 sd=53 :0 s=1 pgs=0 cs=0 l=0 c=0x7f3ea13f2a00).connecting to 192.168.100.1:6800/12758 .
.....

time-2
in Node-xt1:

2017-04-19 06:15:19.843643 7f2385164700 10 -- 192.168.100.1:6800/12758 >> 192.168.100.2:6800/2283 pipe(0x7f23b97eb400 sd=158 :0 s=1 pgs=0 cs=0 l=0 c=0x7f23b970d900).connecting to 192.168.100.2:6800/2283
<<<--- osd.0's cluster_messenger client-side pipe makes its first call to Pipe::connect() to osd.1
.....
2017-04-19 06:15:19.848022 7f2384e61700 10 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :0 s=0 pgs=0 cs=0 l=0 c=0x7f23b97ca300).accept
<<<--- osd.0's cluster_messenger server side starts Pipe::accept() for the osd.1 connect request; the accepted fd is 161.
2017-04-19 06:15:19.848133 7f2384e61700 1 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f23b97ca300).accept sd=161 192.168.100.2:53220/0
<<<--- osd.0 got the peer .

time-3
in Node-xt2 :

2017-04-19 14:15:19.846220 7f3e6e3fc700 1 -- 192.168.100.2:6800/2283 >> :/0 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept sd=45 192.168.100.1:45300/0
2017-04-19 14:15:19.846233 7f3e6e800700 10 -- 192.168.100.2:6801/2283 >> :/0 pipe(0x7f3ea0942000 sd=50 :0 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1871380).accept
<<<--- osd.1's cluster_messenger server-side pipe starts Pipe::accept() for the osd.0 connection request.
2017-04-19 14:15:19.849302 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept peer addr is 192.168.100.1:6800/12758
2017-04-19 14:15:19.849306 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept of host_type 4, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2017-04-19 14:15:19.849308 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept my proto 10, their proto 10
2017-04-19 14:15:19.849369 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept: setting up session_security.
2017-04-19 14:15:19.849373 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept existing 0x7f3ea24ce000.gseq 0 <= 1, looks ok .
<<<--- please note that 0x7f3ea24ce000 is our Pipe-C from ["192.168.100.1:6800", Pipe-C] in time-1.
2017-04-19 14:15:19.849376 7f3e6e3fc700 0 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept connect_seq 0 vs existing 0 state connecting
2017-04-19 14:15:19.849378 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept connection race, existing 0x7f3ea24ce000.cseq 0 == 0, or we are server, replacing my attempt .
2017-04-19 14:15:19.849381 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea2bca800 sd=45 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f3ea1b5e180).accept replacing 0x7f3ea24ce000 . <<<---- will replace and stop the ongoing osd.1 cluster_messenger's client-side Pipe connection
2017-04-19 14:15:19.849383 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea24ce000 sd=53 :53220 s=1 pgs=0 cs=0 l=0 c=0x7f3ea13f2a00).stop .
2017-04-19 14:15:19.849444 7f3e6e3fc700 10 -- 192.168.100.2:6800/2283 >> 192.168.100.1:6800/12758 pipe(0x7f3ea24ce000 sd=53 :53220 s=4 pgs=0 cs=0 l=0 c=0x7f3ea13f2a00).unregister_pipe.
......
2017-04-19 14:15:19.849619 7f3e8c4d4700 10 -- 192.168.100.2:6800/2283 reaper deleted pipe 0x7f3ea24ce000 . <<<<--- finally the reaper thread will destroy the Pipe-C structure .

time-4
in Node-xt1 :

2017-04-19 06:15:19.851553 7f2384e61700 10 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f23b97ca300).accept couldn't read banner
<<<--- because the connection was torn down by osd.1 in time-3
2017-04-19 06:15:19.855975 7f2384e61700 10 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f23b97ca300). queued = 0
2017-04-19 06:15:19.855994 7f2384e61700 2 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :6800 s=4 pgs=0 cs=0 l=0 c=0x7f23b97ca300).fault (0) Success
2017-04-19 06:15:19.856009 7f2384e61700 10 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :6800 s=4 pgs=0 cs=0 l=0 c=0x7f23b97ca300).fault already closed|closing
2017-04-19 06:15:19.856043 7f2384e61700 10 -- 192.168.100.1:6800/12758 queue_reap 0x7f23b9894000
2017-04-19 06:15:19.856050 7f2384e61700 10 -- 192.168.100.1:6800/12758 >> :/0 pipe(0x7f23b9894000 sd=161 :6800 s=4 pgs=0 cs=0 l=0 c=0x7f23b97ca300).reader done

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Closed
  • Priority changed from Urgent to Normal

I don't see anything in here that's a bug. The OSDs turned on and had racing connection attempts, so one of them was declared the winner and is now used for two-way communication between the daemons. This is expected behavior; you say it will hurt performance but I don't see any evidence of that? (Keep in mind connections are long-lived and we aren't concerned about microsecond-level latencies on first message.)

Actions #2

Updated by Peng Xie almost 7 years ago

hi greg,

why doesn't the OSDs' internal cluster messenger use two separate pipes, as shown below:

osd.0               osd.1
pipe-c  ------>  pipe-a
pipe-a  <------  pipe-c

using separate fds and threads to exchange the internal data.

That seems more straightforward and more efficient to me.

Actions #3

Updated by Peng Xie almost 7 years ago

Greg Farnum wrote:

I don't see anything in here that's a bug. The OSDs turned on and had racing connection attempts, so one of them was declared the winner and is now used for two-way communication between the daemons. This is expected behavior; you say it will hurt performance but I don't see any evidence of that? (Keep in mind connections are long-lived and we aren't concerned about microsecond-level latencies on first message.)

hi greg,

why doesn't the OSDs' internal cluster messenger use two separate pipes, as shown below:

osd.0               osd.1
pipe-c  ------>  pipe-a
pipe-a  <------  pipe-c

using separate fds and threads to exchange the internal data.

That seems more straightforward and more efficient to me.

Actions #4

Updated by Greg Farnum almost 7 years ago

That's just not really how the network stack works. You aren't increasing efficiency by setting up separate connections for each direction, and you're vastly increasing complexity in failure situations. SimpleMessenger already has separate reader and writer threads (though they do share some locking); AsyncMessenger is even better about concurrency.
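To illustrate that point, here is a small self-contained POSIX C++ sketch. It is an illustration only, not SimpleMessenger code: a single full-duplex stream connection (a local socketpair() standing in for the real cluster-network TCP connection) is serviced by an independent writer thread and reader thread on each side, so both endpoints send and receive over the same fd.

#include <cstdio>
#include <sys/socket.h>
#include <thread>
#include <unistd.h>

int main() {
  // One bidirectional stream connection between two "daemons".
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
    perror("socketpair");
    return 1;
  }

  // "osd.0" endpoint: one writer thread and one reader thread on the same fd.
  std::thread writer0([&] {
    const char msg[] = "replica op from osd.0";
    (void)write(sv[0], msg, sizeof(msg));
  });
  std::thread reader0([&] {
    char buf[64] = {};
    if (read(sv[0], buf, sizeof(buf) - 1) > 0)
      printf("osd.0 received: %s\n", buf);
  });

  // "osd.1" endpoint: likewise, both directions over the same fd.
  std::thread writer1([&] {
    const char msg[] = "replica op from osd.1";
    (void)write(sv[1], msg, sizeof(msg));
  });
  std::thread reader1([&] {
    char buf[64] = {};
    if (read(sv[1], buf, sizeof(buf) - 1) > 0)
      printf("osd.1 received: %s\n", buf);
  });

  writer0.join(); reader0.join(); writer1.join(); reader1.join();
  close(sv[0]); close(sv[1]);
  return 0;
}

Both directions flow over the one connection concurrently; adding a second connection per peer pair would not add bandwidth or reduce latency, it would only add more fds and more failure cases to reconcile.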
