Project

General

Profile

Actions

Bug #15010

closed

simple->async peer got "failed lossy con, dropping message"

Added by Sage Weil about 8 years ago. Updated about 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-03-08 05:42:17.669564 7f26b1d12700  0 -- 172.21.15.47:6805/2508 >> 172.21.15.64:6801/3976 pipe(0x7f26ee3b7400 sd=70 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f26ee341000).injecting socket failure
2016-03-08 05:42:17.669651 7f26b1d12700 10 osd.4 3  new session 0x7f26ee083ee0 con=0x7f26ee341000 addr=172.21.15.64:6801/3976
2016-03-08 05:42:17.669672 7f26b1d12700 10 osd.4 3  session 0x7f26ee083ee0 osd.0 has caps osdcap[grant(*)] 'allow *'
...
2016-03-08 05:42:17.671738 7f26cd767700  1 -- 172.21.15.47:6805/2508 --> 172.21.15.64:6801/3976 -- pg_query(0.15 epoch 3) v3 -- ?+0 0x7f26ee2b6e00 con 0x7f26ee341000
2016-03-08 05:42:17.671747 7f26cd767700  0 -- 172.21.15.47:6805/2508 submit_message pg_query(0.15 epoch 3) v3 remote, 172.21.15.64:6801/3976, failed lossy con, dropping message 0x7f26ee2b6e00

and on the other end (async),

2016-03-08 05:42:17.672869 7fa2ed769700  1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7fa30c8d4600 pg_query(0.c epoch 3) v3
2016-03-08 05:42:17.673922 7fa2f9102700  1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=69 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_bulk peer close file descriptor 69
2016-03-08 05:42:17.673944 7fa2f9102700  1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=69 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until read failed
2016-03-08 05:42:17.673957 7fa2f9102700  1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=69 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._process_connection read connect reply failed

Actions #1

Updated by Sage Weil about 8 years ago

/a/sage-2016-03-07_17:50:09-rados-wip-sage-testing---basic-smithi/45466

Actions #2

Updated by Haomai Wang about 8 years ago

  • Assignee set to Haomai Wang

Initial judge:

simple side:
1. accept async connect
2. inject happen when enter https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L359
3. because we have buffered message, so we can continue
4. register pipe then https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L728
5. met previous inject error https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L732'
6. because we don't have replace process and inqueue message, we will mark this close https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L787
7. we will clear pipe and connection relation and queue reset
8. osd try to get this connection and send message, soon it found it's a failed connection without pipe
9. message missed

I'm a little confused by the gap between we close/clear pipe and unregister pipe. I will try to improve debug level and reproduce this tomorrow

Actions #3

Updated by Greg Farnum about 8 years ago

I'm not following the first part of that story — if we injected an error that was detected in line 359, we'd jump to fail_unlocked at line 777 and skip over most of the other parts. Or did you mean the async messenger injected an error once we'd passed that line?

Actions #4

Updated by Haomai Wang about 8 years ago

@Greg Farnum

the clue is these lines:
2016-03-08 05:42:17.669564 7f26b1d12700 0 -- 172.21.15.47:6805/2508 >> 172.21.15.64:6801/3976 pipe(0x7f26ee3b7400 sd=70 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f26ee341000).injecting socket failure
2016-03-08 05:42:17.669651 7f26b1d12700 10 osd.4 3 new session 0x7f26ee083ee0 con=0x7f26ee341000 addr=172.21.15.64:6801/3976
2016-03-08 05:42:17.669672 7f26b1d12700 10 osd.4 3 session 0x7f26ee083ee0 osd.0 has caps osdcap[grant(*)] 'allow *'

pipe called verify_authorizer after we inject socket failure, so it showed pipe successfully got connect msg!

Actions #6

Updated by Haomai Wang about 8 years ago

  • Status changed from New to Need More Info
  • Priority changed from Urgent to High
Actions #7

Updated by Sage Weil about 8 years ago

  • Status changed from Need More Info to Can't reproduce
Actions

Also available in: Atom PDF