Bug #1747
closedmsgr: osd connection originates from wrong port
0%
Description
osd.2 sends a couple messages to osd.1:
2011-11-22 15:26:41.198593 7fe057902700 osd.2 4 _share_map_outgoing osd.1 10.3.14.215:6801/9190 don't know epoch, doing nothing 2011-11-22 15:26:41.198611 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.1 10.3.14.215:6801/9190 -- pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v1 -- ?+0 0x1bd41c0 2011-11-22 15:26:41.198665 7fe057902700 osd.2 4 do_queries querying osd.0 on 7 PGs 2011-11-22 15:26:41.198679 7fe057902700 osd.2 4 _share_map_outgoing osd.0 10.3.14.214:6801/29211 don't know epoch, doing nothing 2011-11-22 15:26:41.198694 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.0 10.3.14.214:6801/29211 -- pg_query(0.8,0.0p2,0.1p2,1.7,1.0p2,2.6,2.1p2 epoch 4) v1 -- ?+0 0x1bd4000 2011-11-22 15:26:41.198712 7fe057902700 osd.2 4 do_queries querying osd.1 on 8 PGs 2011-11-22 15:26:41.198725 7fe057902700 osd.2 4 _share_map_outgoing osd.1 10.3.14.215:6801/9190 don't know epoch, doing nothing 2011-11-22 15:26:41.198740 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.1 10.3.14.215:6801/9190 -- pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 -- ?+0 0x1c05a80
but when osd.1 gets them, the source port is weird:
2011-11-22 15:26:41.201640 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615 2011-11-22 15:26:41.201676 7fa791c82700 osd.1 4 new session 0x1aed240 con=0x1b01000 addr=10.3.14.136:49448/30505 2011-11-22 15:26:41.201699 7fa791c82700 osd.1 4 session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=) 2011-11-22 15:26:41.202020 7fa793586700 -- 10.3.14.215:6802/9190 <== osd.2 10.3.14.136:0/30505 1 ==== osd_ping(start_heartbeat e0 as_of 4) v1 ==== 61+0+0 (2688641745 0 0) 0x1afa700 con 0x1b01140 2011-11-22 15:26:41.202033 7fa793586700 osd.1 4 heartbeat_dispatch 0x1afa700 2011-11-22 15:26:41.202052 7fa793586700 osd.1 4 handle_osd_ping peer osd.2 10.3.14.136:0/30505 requesting heartbeats as_of 4 2011-11-22 15:26:41.202604 7fa794588700 -- 10.3.14.215:6801/9190 <== osd.2 10.3.14.136:49448/30505 1 ==== pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v2 ==== 5052+0+0 (1751659628 0 0) 0x1afa540 con 0x1b01000 2011-11-22 15:26:41.202626 7fa794588700 osd.1 4 _dispatch 0x1afa540 pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v2 2011-11-22 15:26:41.202635 7fa794588700 osd.1 4 handle_pg_notify from osd.2 2011-11-22 15:26:41.202645 7fa794588700 osd.1 4 require_same_or_newer_map 4 (i am 4) 0x1afa540 2011-11-22 15:26:41.202658 7fa794588700 osd.1 4 from dead osd.2, dropping, sharing map 2011-11-22 15:26:41.202674 7fa794588700 osd.1 4 send_incremental_map 4 -> 4 to osd.2 10.3.14.136:49448/30505 2011-11-22 15:26:41.202689 7fa794588700 -- 10.3.14.215:6801/9190 mark_down_on_empty 0x1b01000 -- 0x1ade780 closing (queue is empty) 2011-11-22 15:26:41.202714 7fa794588700 -- 10.3.14.215:6801/9190 mark_disposable 0x1b01000 -- 0x1ade780 2011-11-22 15:26:41.202752 7fa794588700 -- 10.3.14.215:6801/9190 <== osd.2 10.3.14.136:49448/30505 2 ==== pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 ==== 668+0+0 (2955722336 0 0) 0x1afaa80 con 0x1b01000 2011-11-22 15:26:41.202767 7fa794588700 osd.1 4 _dispatch 0x1afaa80 pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 2011-11-22 15:26:41.202779 7fa794588700 osd.1 4 handle_pg_query from osd.2 epoch 4 2011-11-22 15:26:41.202788 7fa794588700 osd.1 4 require_same_or_newer_map 4 (i am 4) 0x1afaa80 2011-11-22 15:26:41.202797 7fa794588700 osd.1 4 from dead osd.2, dropping, sharing map 2011-11-22 15:26:41.202807 7fa794588700 osd.1 4 send_incremental_map 4 -> 4 to osd.2 10.3.14.136:49448/30505 2011-11-22 15:26:41.202817 7fa794588700 -- 10.3.14.215:6801/9190 mark_down_on_empty 0x1b01000 -- 0x1ade780 closing (queue is empty) 2011-11-22 15:26:41.202829 7fa794588700 -- 10.3.14.215:6801/9190 mark_disposable 0x1b01000 -- 0x1ade780
and reconnects with the right port...
2011-11-22 15:26:41.203475 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615 2011-11-22 15:26:41.203500 7fa791c82700 osd.1 4 new session 0x1aed240 con=0x1b01780 addr=10.3.14.136:6801/30505 2011-11-22 15:26:41.203514 7fa791c82700 osd.1 4 session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=) 2011-11-22 15:26:41.203530 7fa791c82700 -- 10.3.14.215:6801/9190 >> 10.3.14.136:6801/30505 pipe(0x1aff500 sd=16 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION 2011-11-22 15:26:41.203869 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615 2011-11-22 15:26:41.203883 7fa791c82700 osd.1 4 session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)
I hit this after ~300 iterations of the mon thrashing test.. teuthology blocked on waiting for clean cluster (a few pgs didn't peer because of the dropped message above).
logs at metropolis:~sage/osd.[12].log
Updated by Sage Weil over 12 years ago
- Target version changed from v0.39 to v0.40
Updated by Sage Weil over 12 years ago
- Translation missing: en.field_position set to 11
Updated by Greg Farnum over 12 years ago
- Status changed from New to In Progress
- Assignee set to Greg Farnum
The problem here is somewhere on osd.2 — osd.1 is using the address that osd.2 is providing, and you can see that osd.2 hasn't filled in eg all of its heartbeat_addr fields in the logs:
2011-11-22 15:26:41.198879 7fe057902700 -- :/30505 --> 10.3.14.215:6802/9190 -- osd_ping(start_heartbeat e0 as_of 4) v1 -- ?+0 0x185dc40 con 0x1d39640
I'm looking into how it could have a blank address set.
Updated by Greg Farnum over 12 years ago
- Status changed from In Progress to Need More Info
- Assignee deleted (
Greg Farnum)
The blank address isn't a problem; it's due to the in_hbmsgr not being bound (deliberately). Unfortunately I've been through this backwards and forwards and can't find any way for the port to be set wrong like this. Diagnosing it is going to require some level of messenger logging, I suspect. :(
Updated by Sage Weil over 12 years ago
- Target version deleted (
v0.40) - Translation missing: en.field_position deleted (
59) - Translation missing: en.field_position set to 27
Updated by Anonymous about 12 years ago
We only saw this the once, but we believe the bug and want to keep it open.
Updated by Sage Weil about 12 years ago
just hit this again. osd.1:
2012-02-28 12:01:31.519213 7f6a5ffff700 -- 10.3.14.140:6801/11158 --> osd.2 10.3.14.137:6801/27749 -- pg_notify(2.0p2,2.1p2,1.1p2,0.0p2,1.0p2,0.1p2,1.8,0.9,0.8,2.4,1.7,0.6,2.7,2.6,1.5 epoch 3 query_epoch 3) v2 -- ?+0 0x11f6550 2012-02-28 12:01:31.519268 7f6a5ffff700 osd.1 3 do_queries querying osd.0 on 12 PGs 2012-02-28 12:01:31.519283 7f6a5ffff700 osd.1 3 _share_map_outgoing osd.0 10.3.14.139:6801/21393 don't know epoch, doing nothing 2012-02-28 12:01:31.519302 7f6a5ffff700 -- 10.3.14.140:6801/11158 --> osd.0 10.3.14.139:6801/21393 -- pg_query(0.2,0.4,0.0p1,0.1p1,1.1,1.3,1.0p1,1.1p1,2.0,2.2,2.0p1,2.1p1 epoch 3) v1 -- ?+0 0x11f6ca0 2012-02-28 12:01:31.519314 7f6a5ffff700 osd.1 3 do_queries querying osd.2 on 3 PGs 2012-02-28 12:01:31.519327 7f6a5ffff700 osd.1 3 _share_map_outgoing osd.2 10.3.14.137:6801/27749 don't know epoch, doing nothing 2012-02-28 12:01:31.519340 7f6a5ffff700 -- 10.3.14.140:6801/11158 --> osd.2 10.3.14.137:6801/27749 -- pg_query(0.7,1.6,2.5 epoch 3) v1 -- ?+0 0x11f6e80
but osd.2 sees:
2012-02-28 12:01:31.519302 7f4443a8f700 -- 10.3.14.137:6801/27749 >> :/0 pipe(0x1e93640 sd=25 pgs=0 cs=0 l=0).accept sd=25 2012-02-28 12:01:31.520206 7f4443a8f700 osd.2 3 OSD::ms_verify_authorizer name=osd.1 auid=18446744073709551615 2012-02-28 12:01:31.520240 7f4443a8f700 osd.2 3 new session 0x1e93c80 con=0x1e938b0 addr=10.3.14.140:42360/11158 2012-02-28 12:01:31.520290 7f4443a8f700 osd.2 3 session 0x1e93c80 has caps osdcaps(pools={} default allow= default_deny=) ... 2012-02-28 12:01:31.568604 7f443645d700 -- 10.3.14.137:6801/27749 <== osd.1 10.3.14.140:42360/11158 2 ==== pg_query(0.7,1.6,2.5 epoch 3) v1 ==== 273+0+0 (2655710564 0 0) 0x1e971b0 con 0x1e938b0 2012-02-28 12:01:31.568616 7f443645d700 osd.2 3 _dispatch 0x1e971b0 pg_query(0.7,1.6,2.5 epoch 3) v1 2012-02-28 12:01:31.568631 7f443645d700 osd.2 3 handle_pg_query from osd.1 epoch 3 2012-02-28 12:01:31.568640 7f443645d700 osd.2 3 require_same_or_newer_map 3 (i am 3) 0x1e971b0 2012-02-28 12:01:31.568649 7f443645d700 osd.2 3 from dead osd.1, dropping, sharing map 2012-02-28 12:01:31.568660 7f443645d700 osd.2 3 send_incremental_map 3 -> 3 to osd.1 10.3.14.140:42360/11158
Updated by Sage Weil about 12 years ago
- Status changed from 4 to Resolved
- Assignee set to Sage Weil
- Source set to Q/A
commit:b1f264406f93af35600786f58e75908c393cf2ed
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (
msgr) - Target version deleted (
v0.44)