Bug #17807

async messenger should rebind using the new nonce

Added by Kefu Chai over 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Actions #1

Updated by Kefu Chai over 7 years ago

  • Status changed from In Progress to Fix Under Review
  • Source changed from other to Community (dev)
Actions #2

Updated by Kefu Chai over 7 years ago

@tchaikov

The command failing is:
`ceph tell osd.0 version`
while running:
`src/test/cephtool-test-mon.sh`

Running the command manually on the command line gives:

sudo bin/ceph -c src/test/td/t-7202/ceph.conf tell osd.0 version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2016-11-30 17:14:38.311288 801c16f00 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-11-30 17:14:38.332033 801c1ab00 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-11-30 17:14:38.363600 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:38.565084 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:38.973728 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:39.803632 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:41.429071 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:44.660891 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!

This last line repeats until the command is cancelled or osd.0 goes away.

In the OSD log I find repeats of:

2016-11-30 17:17:44.924579 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=51 -
2016-11-30 17:17:44.925175 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_bulk peer close file descriptor 51
2016-11-30 17:17:44.925234 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_until read failed
2016-11-30 17:17:44.925330 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection read peer banner and addr failed
2016-11-30 17:17:44.925420 b9c2d80  0 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2016-11-30 17:17:44.925559 b9c2d80  1 -- 127.0.0.1:6800/1030470 reap_dead start

You'd need to raise the logging level to get more verbose output, but it tells you that the OSD has moved to `127.0.0.1:6800/1030470` as the instance of its socket, while the rest of the world does not seem to have been updated. My fix actually gets the message out to the rest of the daemons, and ceph is able to pick up on that.
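To make the mismatch concrete, here is a minimal Python sketch that pulls the two addresses out of a "wrong node" client log line and reports which component differs. The names `EntityAddr`, `parse_addr` and `diagnose` are made up for this illustration and are not Ceph APIs. It assumes the `ip:port/nonce` layout seen in the logs above.

```python
import re
from typing import NamedTuple, Optional


class EntityAddr(NamedTuple):
    """Hypothetical helper type mirroring the ip:port/nonce form in the logs."""
    ip: str
    port: int
    nonce: int


def parse_addr(text: str) -> EntityAddr:
    """Parse an address printed as 'ip:port/nonce' (IPv4 only in this sketch)."""
    m = re.fullmatch(r"(\d+\.\d+\.\d+\.\d+):(\d+)/(\d+)", text)
    if m is None:
        raise ValueError(f"not an ip:port/nonce address: {text!r}")
    return EntityAddr(m.group(1), int(m.group(2)), int(m.group(3)))


# Pattern taken from the client log lines quoted in this comment.
WRONG_NODE = re.compile(r"connect claims to be (\S+) not (\S+) - wrong node!")


def diagnose(log_line: str) -> Optional[dict]:
    """Return {field: (expected, claimed)} for each differing address field,
    or None if the line is not a 'wrong node' message."""
    m = WRONG_NODE.search(log_line)
    if m is None:
        return None
    claimed = parse_addr(m.group(1))
    expected = parse_addr(m.group(2))
    return {
        field: (getattr(expected, field), getattr(claimed, field))
        for field in EntityAddr._fields
        if getattr(expected, field) != getattr(claimed, field)
    }


line = (
    "_process_connection connect claims to be 127.0.0.1:6800/1030470 "
    "not 127.0.0.1:6800/30470 - wrong node!"
)
print(diagnose(line))  # {'nonce': (30470, 1030470)}
```

For the lines above, only the nonce differs: the client still expects 30470 while the OSD identifies itself with 1030470, which matches the address shown in the OSD log.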

BTW:
It also shows why I'd like to get rid of all these DeveloperMode and feature warnings while developing.
Huge logfile pollution.

Actions #3

Updated by Kefu Chai over 7 years ago

  • Status changed from Fix Under Review to New
  • Assignee deleted (Kefu Chai)

Not able to reproduce it on Linux. Unassigning myself.

Actions #4

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Closed

I think this was a thing from the FreeBSD port, but AFAIK that's going fine now so this must have been resolved. Poke at things if it's not, Kefu. :)
