Bug #1747

msgr: osd connection originates from wrong port

Added by Sage Weil over 12 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

osd.2 sends a couple of messages to osd.1:


2011-11-22 15:26:41.198593 7fe057902700 osd.2 4 _share_map_outgoing osd.1 10.3.14.215:6801/9190 don't know epoch, doing nothing
2011-11-22 15:26:41.198611 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.1 10.3.14.215:6801/9190 -- pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v1 -- ?+0 0x1bd41c0
2011-11-22 15:26:41.198665 7fe057902700 osd.2 4 do_queries querying osd.0 on 7 PGs
2011-11-22 15:26:41.198679 7fe057902700 osd.2 4 _share_map_outgoing osd.0 10.3.14.214:6801/29211 don't know epoch, doing nothing
2011-11-22 15:26:41.198694 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.0 10.3.14.214:6801/29211 -- pg_query(0.8,0.0p2,0.1p2,1.7,1.0p2,2.6,2.1p2 epoch 4) v1 -- ?+0 0x1bd4000
2011-11-22 15:26:41.198712 7fe057902700 osd.2 4 do_queries querying osd.1 on 8 PGs
2011-11-22 15:26:41.198725 7fe057902700 osd.2 4 _share_map_outgoing osd.1 10.3.14.215:6801/9190 don't know epoch, doing nothing
2011-11-22 15:26:41.198740 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.1 10.3.14.215:6801/9190 -- pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 -- ?+0 0x1c05a80

but when osd.1 gets them, the source port is weird:

2011-11-22 15:26:41.201640 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615
2011-11-22 15:26:41.201676 7fa791c82700 osd.1 4  new session 0x1aed240 con=0x1b01000 addr=10.3.14.136:49448/30505
2011-11-22 15:26:41.201699 7fa791c82700 osd.1 4  session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)
2011-11-22 15:26:41.202020 7fa793586700 -- 10.3.14.215:6802/9190 <== osd.2 10.3.14.136:0/30505 1 ==== osd_ping(start_heartbeat e0 as_of 4) v1 ==== 61+0+0 (2688641745 0 0) 0x1afa700 con 0x1b01140
2011-11-22 15:26:41.202033 7fa793586700 osd.1 4 heartbeat_dispatch 0x1afa700
2011-11-22 15:26:41.202052 7fa793586700 osd.1 4 handle_osd_ping peer osd.2 10.3.14.136:0/30505 requesting heartbeats as_of 4
2011-11-22 15:26:41.202604 7fa794588700 -- 10.3.14.215:6801/9190 <== osd.2 10.3.14.136:49448/30505 1 ==== pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v2 ==== 5052+0+0 (1751659628 0 0) 0x1afa540 con 0x1b01000
2011-11-22 15:26:41.202626 7fa794588700 osd.1 4 _dispatch 0x1afa540 pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v2
2011-11-22 15:26:41.202635 7fa794588700 osd.1 4 handle_pg_notify from osd.2
2011-11-22 15:26:41.202645 7fa794588700 osd.1 4 require_same_or_newer_map 4 (i am 4) 0x1afa540
2011-11-22 15:26:41.202658 7fa794588700 osd.1 4 from dead osd.2, dropping, sharing map
2011-11-22 15:26:41.202674 7fa794588700 osd.1 4 send_incremental_map 4 -> 4 to osd.2 10.3.14.136:49448/30505
2011-11-22 15:26:41.202689 7fa794588700 -- 10.3.14.215:6801/9190 mark_down_on_empty 0x1b01000 -- 0x1ade780 closing (queue is empty)
2011-11-22 15:26:41.202714 7fa794588700 -- 10.3.14.215:6801/9190 mark_disposable 0x1b01000 -- 0x1ade780
2011-11-22 15:26:41.202752 7fa794588700 -- 10.3.14.215:6801/9190 <== osd.2 10.3.14.136:49448/30505 2 ==== pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 ==== 668+0+0 (2955722336 0 0) 0x1afaa80 con 0x1b01000
2011-11-22 15:26:41.202767 7fa794588700 osd.1 4 _dispatch 0x1afaa80 pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1
2011-11-22 15:26:41.202779 7fa794588700 osd.1 4 handle_pg_query from osd.2 epoch 4
2011-11-22 15:26:41.202788 7fa794588700 osd.1 4 require_same_or_newer_map 4 (i am 4) 0x1afaa80
2011-11-22 15:26:41.202797 7fa794588700 osd.1 4 from dead osd.2, dropping, sharing map
2011-11-22 15:26:41.202807 7fa794588700 osd.1 4 send_incremental_map 4 -> 4 to osd.2 10.3.14.136:49448/30505
2011-11-22 15:26:41.202817 7fa794588700 -- 10.3.14.215:6801/9190 mark_down_on_empty 0x1b01000 -- 0x1ade780 closing (queue is empty)
2011-11-22 15:26:41.202829 7fa794588700 -- 10.3.14.215:6801/9190 mark_disposable 0x1b01000 -- 0x1ade780

and reconnects with the right port...
2011-11-22 15:26:41.203475 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615
2011-11-22 15:26:41.203500 7fa791c82700 osd.1 4  new session 0x1aed240 con=0x1b01780 addr=10.3.14.136:6801/30505
2011-11-22 15:26:41.203514 7fa791c82700 osd.1 4  session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)
2011-11-22 15:26:41.203530 7fa791c82700 -- 10.3.14.215:6801/9190 >> 10.3.14.136:6801/30505 pipe(0x1aff500 sd=16 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-11-22 15:26:41.203869 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615
2011-11-22 15:26:41.203883 7fa791c82700 osd.1 4  session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)

I hit this after ~300 iterations of the mon thrashing test; teuthology blocked waiting for a clean cluster (a few PGs didn't peer because of the dropped messages above).

logs at metropolis:~sage/osd.[12].log

Associated revisions

Revision b1f26440 (diff)
Added by Sage Weil about 12 years ago

msgr: fix race in learned_addr()

- two connect() threads
- both hit if (need_addr) check
- one takes lock, sets addr, need_addr = false, unlocks
- continues to ::encode(ms_addr, ...);
- meanwhile, the second thread sets ms_addr again, but copies the peer's
  port into place before adjusting it; the racing ::encode() sees the bad
  port and sends it to the peer.

Fix this two ways:

- don't copy bad port into place; set it first
- re-check need_addr after taking lock

Fixes: #1747
Signed-off-by: Sage Weil <>
Reviewed-by: Greg Farnum <>
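
The two fixes above can be sketched roughly like this (illustrative names and structure, not the actual SimpleMessenger code):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Minimal sketch of the learned_addr() race and the fix: hypothetical
// Addr/Messenger types standing in for the real messenger state.
struct Addr { uint32_t ip = 0; uint16_t port = 0; };

struct Messenger {
  std::mutex lock;
  bool need_addr = true;
  Addr ms_addr;          // address we advertise to peers
  uint16_t my_port;      // the port we actually bound

  explicit Messenger(uint16_t p) : my_port(p) {}

  // Called from each connecting thread with the address the peer saw us as.
  void learned_addr(Addr peer_saw_us_as) {
    if (!need_addr)                    // unlocked fast path, as in the bug
      return;
    std::lock_guard<std::mutex> g(lock);
    if (!need_addr)                    // fix 2: re-check after taking the lock
      return;
    // fix 1: build the full address (correct port included) in a local copy
    // *before* publishing it, so a racing reader of ms_addr never observes
    // the peer-supplied port.
    Addr a = peer_saw_us_as;
    a.port = my_port;
    ms_addr = a;
    need_addr = false;
  }
};
```

Without the re-check under the lock, a thread that lost the race would redo the assignment after the winner had already published; without building the corrected address locally first, a concurrent unlocked ::encode() could catch the window where the peer's port (e.g. the 49448 above) was in place.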

History

#1 Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40

#2 Updated by Sage Weil over 12 years ago

  • Position set to 11

#3 Updated by Greg Farnum over 12 years ago

  • Status changed from New to In Progress
  • Assignee set to Greg Farnum

The problem here is somewhere on osd.2: osd.1 is using the address that osd.2 is providing, and you can see in the logs that osd.2 hasn't filled in, e.g., all of its heartbeat_addr fields:

2011-11-22 15:26:41.198879 7fe057902700 -- :/30505 --> 10.3.14.215:6802/9190 -- osd_ping(start_heartbeat e0 as_of 4) v1 -- ?+0 0x185dc40 con 0x1d39640

I'm looking into how it could have a blank address set.

#4 Updated by Greg Farnum over 12 years ago

  • Status changed from In Progress to Need More Info
  • Assignee deleted (Greg Farnum)

The blank address isn't a problem; it's due to the in_hbmsgr not being bound (deliberately). Unfortunately I've been through this backwards and forwards and can't find any way for the port to be set wrong like this. Diagnosing it is going to require some level of messenger logging, I suspect. :(
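
(For reference, the blank ":/30505"-style address follows directly from standard socket behavior: until a socket is bound or connected, getsockname() reports the wildcard address with port 0. A minimal POSIX sketch, with an illustrative helper name, not Ceph code:)

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <sys/socket.h>
#include <unistd.h>

// Returns the local port getsockname() reports for fd, in host byte order.
// On an unbound, unconnected socket this is 0 -- which is why a messenger
// that deliberately skips bind() advertises a blank address.
static uint16_t local_port(int fd) {
  sockaddr_in sa{};
  socklen_t len = sizeof(sa);
  getsockname(fd, reinterpret_cast<sockaddr*>(&sa), &len);
  return ntohs(sa.sin_port);
}
```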

#5 Updated by Sage Weil about 12 years ago

  • Target version deleted (v0.40)
  • Position deleted (59)
  • Position set to 27

#6 Updated by Anonymous about 12 years ago

We only saw this once, but we believe the bug is real and want to keep it open.

#7 Updated by Sage Weil about 12 years ago

just hit this again. osd.1:


2012-02-28 12:01:31.519213 7f6a5ffff700 -- 10.3.14.140:6801/11158 --> osd.2 10.3.14.137:6801/27749 -- pg_notify(2.0p2,2.1p2,1.1p2,0.0p2,1.0p2,0.1p2,1.8,0.9,0.8,2.4,1.7,0.6,2.7,2.6,1.5 epoch 3 query_epoch 3) v2 -- ?+0 0x11f6550
2012-02-28 12:01:31.519268 7f6a5ffff700 osd.1 3 do_queries querying osd.0 on 12 PGs
2012-02-28 12:01:31.519283 7f6a5ffff700 osd.1 3 _share_map_outgoing osd.0 10.3.14.139:6801/21393 don't know epoch, doing nothing
2012-02-28 12:01:31.519302 7f6a5ffff700 -- 10.3.14.140:6801/11158 --> osd.0 10.3.14.139:6801/21393 -- pg_query(0.2,0.4,0.0p1,0.1p1,1.1,1.3,1.0p1,1.1p1,2.0,2.2,2.0p1,2.1p1 epoch 3) v1 -- ?+0 0x11f6ca0
2012-02-28 12:01:31.519314 7f6a5ffff700 osd.1 3 do_queries querying osd.2 on 3 PGs
2012-02-28 12:01:31.519327 7f6a5ffff700 osd.1 3 _share_map_outgoing osd.2 10.3.14.137:6801/27749 don't know epoch, doing nothing
2012-02-28 12:01:31.519340 7f6a5ffff700 -- 10.3.14.140:6801/11158 --> osd.2 10.3.14.137:6801/27749 -- pg_query(0.7,1.6,2.5 epoch 3) v1 -- ?+0 0x11f6e80

but osd.2 sees:
2012-02-28 12:01:31.519302 7f4443a8f700 -- 10.3.14.137:6801/27749 >> :/0 pipe(0x1e93640 sd=25 pgs=0 cs=0 l=0).accept sd=25
2012-02-28 12:01:31.520206 7f4443a8f700 osd.2 3 OSD::ms_verify_authorizer name=osd.1 auid=18446744073709551615
2012-02-28 12:01:31.520240 7f4443a8f700 osd.2 3  new session 0x1e93c80 con=0x1e938b0 addr=10.3.14.140:42360/11158
2012-02-28 12:01:31.520290 7f4443a8f700 osd.2 3  session 0x1e93c80 has caps osdcaps(pools={} default allow= default_deny=)
...
2012-02-28 12:01:31.568604 7f443645d700 -- 10.3.14.137:6801/27749 <== osd.1 10.3.14.140:42360/11158 2 ==== pg_query(0.7,1.6,2.5 epoch 3) v1 ==== 273+0+0 (2655710564 0 0) 0x1e971b0 con 0x1e938b0
2012-02-28 12:01:31.568616 7f443645d700 osd.2 3 _dispatch 0x1e971b0 pg_query(0.7,1.6,2.5 epoch 3) v1
2012-02-28 12:01:31.568631 7f443645d700 osd.2 3 handle_pg_query from osd.1 epoch 3
2012-02-28 12:01:31.568640 7f443645d700 osd.2 3 require_same_or_newer_map 3 (i am 3) 0x1e971b0
2012-02-28 12:01:31.568649 7f443645d700 osd.2 3 from dead osd.1, dropping, sharing map
2012-02-28 12:01:31.568660 7f443645d700 osd.2 3 send_incremental_map 3 -> 3 to osd.1 10.3.14.140:42360/11158

#8 Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to 4

wip-1747

#9 Updated by Sage Weil about 12 years ago

  • Target version set to v0.44

#10 Updated by Sage Weil about 12 years ago

  • Status changed from 4 to Resolved
  • Assignee set to Sage Weil
  • Source set to Q/A

commit:b1f264406f93af35600786f58e75908c393cf2ed

#11 Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)
  • Target version deleted (v0.44)
