Actions
Bug #4246
closedosd/msgr: rebound to an addr a peer osd already had open
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
On osd.1, we see
2013-02-20 04:38:48.655074 7f564df38700 1 -- 10.200.63.133:6801/21178 <== mon.2 10.200.63.133:6789/0 75988 ==== osd_map(791..791 src has 541..791) v3 ==== 169+0+0 (1670149618 0 0) 0xb70ac00 con 0x1e21a20 2013-02-20 04:38:48.656066 7f5645424700 0 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=2 pgs=5 cs=1 l=0).fault, initiating reconnect 2013-02-20 04:38:48.656105 7f5644414700 1 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=1 pgs=5 cs=2 l=0).do_sendmsg error Broken pipe 2013-02-20 04:38:48.656163 7f5644414700 1 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=1 pgs=5 cs=2 l=0).writer error sending 0xcf70a00, 32: Broken pipe 2013-02-20 04:38:48.660974 7f564df38700 0 log [WRN] : map e791 wrongly marked me down 2013-02-20 04:38:48.673923 7f564df38700 1 -- 192.168.254.133:6801/21178 rebind avoid 6803 2013-02-20 04:38:48.673941 7f564df38700 1 -- 192.168.254.133:6801/21178 mark_down_all 2013-02-20 04:38:48.677875 7f5644414700 0 -- 192.168.254.133:6801/21178 >> 192.168.254.132:6802/18444 pipe(0xb583400 sd=30 :45392 s=1 pgs=5 cs=2 l=0).fault 2013-02-20 04:38:48.680007 7f564df38700 1 accepter.accepter.rebind avoid 6803 2013-02-20 04:38:48.688725 7f564df38700 1 -- 192.168.254.133:0/21178 learned my addr 192.168.254.133:0/21178 2013-02-20 04:38:48.688737 7f564df38700 1 accepter.accepter.bind my_inst.addr is 192.168.254.133:6800/21178 need_addr=0 ...
but on osd.0, when it tries to connect, we see:
2013-02-20 04:38:48.754363 7f108a00f700 0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 4 state connecting 2013-02-20 04:38:48.754379 7f108a00f700 0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=0 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing 2013-02-20 04:38:50.838721 7f108a00f700 0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=2 pgs=22 cs=1 l=0).reader got old message 1 <= 2345560 0xa3ddc00 osd_map(791..793 src has 541..793) v3, discarding 2013-02-20 04:38:50.839576 7f108a00f700 0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=2 pgs=22 cs=1 l=0).reader got old message 2 <= 2345560 0x965d6c0 pg_notify(2.13(15),2.d(15),0.f(15),1.e(15),0.d(15),1.c(15) epoch 793) v4, discarding 2013-02-20 04:38:50.839944 7f108a00f700 0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0x9054c80 sd=34 :6802 s=2 pgs=22 cs=1 l=0).reader got old message 3 <= 2345560 0x965d6c0 pg_notify(2.b(15),0.8(15),2.5(15),0.7(15) epoch 793) v4, discarding ...
That addr should not be open there.
I don't think its the case here, since osd.0 should be caught up on the map, but in more crazy cases, it might be possible that we are reusing a port from a previous restart, and that this connection is legitimately still open. I suspect the msgr should be better about detecting that possibility and force a reset (of some variety). connect_seq 0 vs 4 should make it pretty obvious!
Updated by Sage Weil about 11 years ago
see e229f8451d37913225c49481b2ce2896ca6788a2. i think maybe an explicit handshake is needed here on each connection before connect_seq can get bumped.
Updated by Sage Weil about 11 years ago
- Status changed from Need More Info to Resolved
Actions