Project

General

Profile

Actions

Bug #4246

closed

osd/msgr: rebound to an addr a peer osd already had open

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On osd.1, we see

2013-02-20 04:38:48.655074 7f564df38700  1 -- 10.200.63.133:6801/21178 <== mon.2 10.200.63.133:6789/0 75988 ==== osd_map(791..791 src has 541..791) v3 ==== 169+0+0 (1670149618 0 0) 0xb70ac00 con 0x1e21a20
2013-02-20 04:38:48.656066 7f5645424700  0 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=2 pgs=5 cs=1 l=0).fault, initiating reconnect
2013-02-20 04:38:48.656105 7f5644414700  1 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=1 pgs=5 cs=2 l=0).do_sendmsg error Broken pipe
2013-02-20 04:38:48.656163 7f5644414700  1 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=1 pgs=5 cs=2 l=0).writer error sending 0xcf70a00, 32: Broken pipe
2013-02-20 04:38:48.660974 7f564df38700  0 log [WRN] : map e791 wrongly marked me down
2013-02-20 04:38:48.673923 7f564df38700  1 -- 192.168.254.133:6801/21178 rebind avoid 6803
2013-02-20 04:38:48.673941 7f564df38700  1 -- 192.168.254.133:6801/21178 mark_down_all
2013-02-20 04:38:48.677875 7f5644414700  0 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=1 pgs=5 cs=2 l=0).fault
2013-02-20 04:38:48.680007 7f564df38700  1 accepter.accepter.rebind avoid 6803
2013-02-20 04:38:48.688725 7f564df38700  1 -- 192.168.254.133:0/21178 learned my addr 192.168.254.133:0/21178
2013-02-20 04:38:48.688737 7f564df38700  1 accepter.accepter.bind my_inst.addr is 192.168.254.133:6800/21178 need_addr=0
...

but on osd.0, when it tries to connect, we see:
2013-02-20 04:38:48.754363 7f108a00f700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 4 state connecting
2013-02-20 04:38:48.754379 7f108a00f700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=0 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
2013-02-20 04:38:50.838721 7f108a00f700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=2 pgs=22 cs=1 l=0).reader got old message 1 <= 2345560 0xa3ddc00 osd_map(791..793 src has 541..793) v3, discarding
2013-02-20 04:38:50.839576 7f108a00f700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=2 pgs=22 cs=1 l=0).reader got old message 2 <= 2345560 0x965d6c0 pg_notify(2.13(15),2.d(15),0.f(15),1.e(15),0.d(15),1.c(15) epoch 793) v4, discarding
2013-02-20 04:38:50.839944 7f108a00f700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=2 pgs=22 cs=1 l=0).reader got old message 3 <= 2345560 0x965d6c0 pg_notify(2.b(15),0.8(15),2.5(15),0.7(15) epoch 793) v4, discarding
...

That addr should not be open there.

I don't think its the case here, since osd.0 should be caught up on the map, but in more crazy cases, it might be possible that we are reusing a port from a previous restart, and that this connection is legitimately still open. I suspect the msgr should be better about detecting that possibility and force a reset (of some variety). connect_seq 0 vs 4 should make it pretty obvious!


Related issues 1 (0 open1 closed)

Is duplicate of Ceph - Bug #4006: osd: repeating 'wrong node' message in logResolvedSage Weil02/04/2013

Actions
Actions #1

Updated by Sage Weil about 11 years ago

see e229f8451d37913225c49481b2ce2896ca6788a2. i think maybe an explicit handshake is needed here on each connection before connect_seq can get bumped.

Actions #2

Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to Resolved
Actions

Also available in: Atom PDF