Bug #1747


msgr: osd connection originates from wrong port

Added by Sage Weil over 12 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

osd.2 sends a couple of messages to osd.1:


2011-11-22 15:26:41.198593 7fe057902700 osd.2 4 _share_map_outgoing osd.1 10.3.14.215:6801/9190 don't know epoch, doing nothing
2011-11-22 15:26:41.198611 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.1 10.3.14.215:6801/9190 -- pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v1 -- ?+0 0x1bd41c0
2011-11-22 15:26:41.198665 7fe057902700 osd.2 4 do_queries querying osd.0 on 7 PGs
2011-11-22 15:26:41.198679 7fe057902700 osd.2 4 _share_map_outgoing osd.0 10.3.14.214:6801/29211 don't know epoch, doing nothing
2011-11-22 15:26:41.198694 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.0 10.3.14.214:6801/29211 -- pg_query(0.8,0.0p2,0.1p2,1.7,1.0p2,2.6,2.1p2 epoch 4) v1 -- ?+0 0x1bd4000
2011-11-22 15:26:41.198712 7fe057902700 osd.2 4 do_queries querying osd.1 on 8 PGs
2011-11-22 15:26:41.198725 7fe057902700 osd.2 4 _share_map_outgoing osd.1 10.3.14.215:6801/9190 don't know epoch, doing nothing
2011-11-22 15:26:41.198740 7fe057902700 -- 10.3.14.136:6801/30505 --> osd.1 10.3.14.215:6801/9190 -- pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 -- ?+0 0x1c05a80

but when osd.1 receives them, the messages arrive from an ephemeral source port (49448) rather than osd.2's bound port (6801):

2011-11-22 15:26:41.201640 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615
2011-11-22 15:26:41.201676 7fa791c82700 osd.1 4  new session 0x1aed240 con=0x1b01000 addr=10.3.14.136:49448/30505
2011-11-22 15:26:41.201699 7fa791c82700 osd.1 4  session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)
2011-11-22 15:26:41.202020 7fa793586700 -- 10.3.14.215:6802/9190 <== osd.2 10.3.14.136:0/30505 1 ==== osd_ping(start_heartbeat e0 as_of 4) v1 ==== 61+0+0 (2688641745 0 0) 0x1afa700 con 0x1b01140
2011-11-22 15:26:41.202033 7fa793586700 osd.1 4 heartbeat_dispatch 0x1afa700
2011-11-22 15:26:41.202052 7fa793586700 osd.1 4 handle_osd_ping peer osd.2 10.3.14.136:0/30505 requesting heartbeats as_of 4
2011-11-22 15:26:41.202604 7fa794588700 -- 10.3.14.215:6801/9190 <== osd.2 10.3.14.136:49448/30505 1 ==== pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v2 ==== 5052+0+0 (1751659628 0 0) 0x1afa540 con 0x1b01000
2011-11-22 15:26:41.202626 7fa794588700 osd.1 4 _dispatch 0x1afa540 pg_notify(1.0p1,0.1p1,1.1p1,0.0p1,2.1p1,2.0p1,2.5,1.6,0.7,0.4,2.0,1.3,0.2,2.2,1.1 epoch 4 query_epoch 4) v2
2011-11-22 15:26:41.202635 7fa794588700 osd.1 4 handle_pg_notify from osd.2
2011-11-22 15:26:41.202645 7fa794588700 osd.1 4 require_same_or_newer_map 4 (i am 4) 0x1afa540
2011-11-22 15:26:41.202658 7fa794588700 osd.1 4 from dead osd.2, dropping, sharing map
2011-11-22 15:26:41.202674 7fa794588700 osd.1 4 send_incremental_map 4 -> 4 to osd.2 10.3.14.136:49448/30505
2011-11-22 15:26:41.202689 7fa794588700 -- 10.3.14.215:6801/9190 mark_down_on_empty 0x1b01000 -- 0x1ade780 closing (queue is empty)
2011-11-22 15:26:41.202714 7fa794588700 -- 10.3.14.215:6801/9190 mark_disposable 0x1b01000 -- 0x1ade780
2011-11-22 15:26:41.202752 7fa794588700 -- 10.3.14.215:6801/9190 <== osd.2 10.3.14.136:49448/30505 2 ==== pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1 ==== 668+0+0 (2955722336 0 0) 0x1afaa80 con 0x1b01000
2011-11-22 15:26:41.202767 7fa794588700 osd.1 4 _dispatch 0x1afaa80 pg_query(0.6,0.9,1.5,1.8,1.1p2,2.4,2.7,2.0p2 epoch 4) v1
2011-11-22 15:26:41.202779 7fa794588700 osd.1 4 handle_pg_query from osd.2 epoch 4
2011-11-22 15:26:41.202788 7fa794588700 osd.1 4 require_same_or_newer_map 4 (i am 4) 0x1afaa80
2011-11-22 15:26:41.202797 7fa794588700 osd.1 4 from dead osd.2, dropping, sharing map
2011-11-22 15:26:41.202807 7fa794588700 osd.1 4 send_incremental_map 4 -> 4 to osd.2 10.3.14.136:49448/30505
2011-11-22 15:26:41.202817 7fa794588700 -- 10.3.14.215:6801/9190 mark_down_on_empty 0x1b01000 -- 0x1ade780 closing (queue is empty)
2011-11-22 15:26:41.202829 7fa794588700 -- 10.3.14.215:6801/9190 mark_disposable 0x1b01000 -- 0x1ade780
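
The "from dead osd.2, dropping" lines follow from a plain address comparison: osd.1's map says osd.2 lives at 10.3.14.136:6801/30505, but the connection's observed peer address is 10.3.14.136:49448/30505, so the port mismatch makes the sender look like a stale instance. A minimal sketch of that check (hypothetical names, not Ceph's actual code):

```python
from collections import namedtuple

# Hypothetical stand-in for Ceph's entity_addr_t: ip, port, nonce.
Addr = namedtuple("Addr", ["ip", "port", "nonce"])

# What osd.1 believes about osd.2, per its current OSDMap.
osdmap = {2: Addr("10.3.14.136", 6801, 30505)}

def is_live_peer(osd_id, conn_addr):
    # The whole address must match; a connection arriving from an
    # ephemeral source port fails even though ip and nonce agree,
    # so the messages get dropped as if from a dead osd.
    return osdmap.get(osd_id) == conn_addr

observed = Addr("10.3.14.136", 49448, 30505)  # port from the bad connection
```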

and then reconnects with the right port (6801)...
2011-11-22 15:26:41.203475 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615
2011-11-22 15:26:41.203500 7fa791c82700 osd.1 4  new session 0x1aed240 con=0x1b01780 addr=10.3.14.136:6801/30505
2011-11-22 15:26:41.203514 7fa791c82700 osd.1 4  session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)
2011-11-22 15:26:41.203530 7fa791c82700 -- 10.3.14.215:6801/9190 >> 10.3.14.136:6801/30505 pipe(0x1aff500 sd=16 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-11-22 15:26:41.203869 7fa791c82700 osd.1 4 OSD::ms_verify_authorizer name=osd.2 auid=18446744073709551615
2011-11-22 15:26:41.203883 7fa791c82700 osd.1 4  session 0x1aed240 has caps osdcaps(pools={} default allow= default_deny=)

I hit this after ~300 iterations of the mon thrashing test. Teuthology then blocked waiting for a clean cluster (a few PGs didn't peer because of the dropped messages above).
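
The symptom is consistent with an outgoing socket that was never bound before connect(): the kernel then assigns an ephemeral source port, so the peer sees a port that doesn't match the advertised address. A minimal sketch of the difference using plain Python sockets (not Ceph's messenger; the port numbers are arbitrary):

```python
import socket

def outgoing_source_port(bind_port=None):
    """Connect to a throwaway local listener and report the source
    port the kernel assigned to the outgoing connection."""
    # Listener standing in for the peer osd.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if bind_port is not None:
        # Binding before connect() pins the source port, the way an
        # osd's outgoing connections should originate from the port
        # in its advertised address.
        client.bind(("127.0.0.1", bind_port))
    client.connect(listener.getsockname())
    port = client.getsockname()[1]
    client.close()
    listener.close()
    return port

# Without an explicit bind, the kernel hands out an ephemeral port
# (like the 49448 in the log) rather than the advertised one.
ephemeral = outgoing_source_port()
pinned = outgoing_source_port(bind_port=16801)
```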

logs at metropolis:~sage/osd.[12].log
