Bug #17807: async messenger should rebind using the new nonce - Ceph - Ceph

Added by Kefu Chai over 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: High
Source: Community (dev)
Backport: jewel
Severity: 3 - minor

Updated by Kefu Chai over 7 years ago

Status changed from In Progress to Fix Under Review
Source changed from other to Community (dev)

https://github.com/ceph/ceph/pull/11804

Updated by Kefu Chai over 7 years ago

@tchaikov

The failing command is:
`ceph tell osd.0 version`
while running:
`src/test/cephtool-test-mon.sh`

Running the command manually on the command line gives:

sudo bin/ceph -c src/test/td/t-7202/ceph.conf tell osd.0 version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2016-11-30 17:14:38.311288 801c16f00 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-11-30 17:14:38.332033 801c1ab00 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-11-30 17:14:38.363600 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:38.565084 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:38.973728 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:39.803632 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:41.429071 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:44.660891 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!

This last line continues until the command gets cancelled or osd.0 goes away.
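
To make the mismatch in the client log explicit, here is a minimal sketch that pulls the two addresses out of a "wrong node" line and compares them. The regular expression and the helper names (`split_addr`, `nonce_mismatch`) are my own, written against the log format shown above; they are not part of Ceph.

```python
import re

# Matches the "wrong node" message as it appears in the client log above.
WRONG_NODE = re.compile(
    r"connect claims to be (?P<claimed>\S+) not (?P<expected>\S+) - wrong node!"
)

def split_addr(addr):
    """Split 'ip:port/nonce' into (ip, port, nonce)."""
    hostport, nonce = addr.rsplit("/", 1)
    ip, port = hostport.rsplit(":", 1)
    return ip, int(port), int(nonce)

def nonce_mismatch(line):
    """Return (expected_nonce, claimed_nonce) when the line reports a peer
    whose IP and port match but whose nonce differs; otherwise None."""
    m = WRONG_NODE.search(line)
    if m is None:
        return None
    c_ip, c_port, c_nonce = split_addr(m.group("claimed"))
    e_ip, e_port, e_nonce = split_addr(m.group("expected"))
    if (c_ip, c_port) != (e_ip, e_port) or c_nonce == e_nonce:
        return None
    return e_nonce, c_nonce

# Tail of one of the client log lines quoted above.
line = ("._process_connection connect claims to be "
        "127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!")
print(nonce_mismatch(line))  # (30470, 1030470)
```

For the lines above, the IP and port are identical and only the nonce differs (30470 expected, 1030470 claimed).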

In the OSD log I find repeats of:

2016-11-30 17:17:44.924579 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=51 -
2016-11-30 17:17:44.925175 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_bulk peer close file descriptor 51
2016-11-30 17:17:44.925234 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_until read failed
2016-11-30 17:17:44.925330 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection read peer banner and addr failed
2016-11-30 17:17:44.925420 b9c2d80  0 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2016-11-30 17:17:44.925559 b9c2d80  1 -- 127.0.0.1:6800/1030470 reap_dead start

You would need to raise the logging level to get more detail, but the logs already show that the OSD has moved to `127.0.0.1:6800/1030470` as its address (same IP and port, new nonce), while the client still expects `127.0.0.1:6800/30470`. The rest of the cluster does not seem to have been updated with the new address. My fix does get the new address out to the rest of the daemons, and Ceph is then able to pick it up.
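
As a rough model of why the client keeps failing, assume a peer is identified by IP, port, and nonce together, and that a connection is only accepted when all three match what the client was told. The sketch below is an illustration under that assumption; `EntityAddr` and `peer_is_expected` are hypothetical names, not Ceph's actual classes or functions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityAddr:
    """Hypothetical stand-in for a messenger address: ip:port/nonce."""
    ip: str
    port: int
    nonce: int

def peer_is_expected(expected, claimed):
    """Accept the peer only if IP, port, and nonce all match."""
    return expected == claimed

# Address the client still holds for osd.0 (from the client log).
stale = EntityAddr("127.0.0.1", 6800, 30470)
# Address the OSD announces for itself (from the OSD log).
rebound = EntityAddr("127.0.0.1", 6800, 1030470)

# With the stale address the check fails on every retry ("wrong node").
print(peer_is_expected(stale, rebound))    # False
# Once the new address has been propagated, the same check passes.
print(peer_is_expected(rebound, rebound))  # True
```

In this model retrying cannot help: the check only starts passing once the client learns the new nonce.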

BTW:
This also shows why I'd like to get rid of all these DEVELOPER MODE and feature warnings while developing.
They cause huge logfile pollution.
