Bug #15010
closedsimple->async peer got "failed lossy con, dropping message"
0%
Description
2016-03-08 05:42:17.669564 7f26b1d12700 0 -- 172.21.15.47:6805/2508 >> 172.21.15.64:6801/3976 pipe(0x7f26ee3b7400 sd=70 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f26ee341000).injecting socket failure 2016-03-08 05:42:17.669651 7f26b1d12700 10 osd.4 3 new session 0x7f26ee083ee0 con=0x7f26ee341000 addr=172.21.15.64:6801/3976 2016-03-08 05:42:17.669672 7f26b1d12700 10 osd.4 3 session 0x7f26ee083ee0 osd.0 has caps osdcap[grant(*)] 'allow *' ... 2016-03-08 05:42:17.671738 7f26cd767700 1 -- 172.21.15.47:6805/2508 --> 172.21.15.64:6801/3976 -- pg_query(0.15 epoch 3) v3 -- ?+0 0x7f26ee2b6e00 con 0x7f26ee341000 2016-03-08 05:42:17.671747 7f26cd767700 0 -- 172.21.15.47:6805/2508 submit_message pg_query(0.15 epoch 3) v3 remote, 172.21.15.64:6801/3976, failed lossy con, dropping message 0x7f26ee2b6e00
and on the other end (async),
2016-03-08 05:42:17.672869 7fa2ed769700 1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7fa30c8d4600 pg_query(0.c epoch 3) v3 2016-03-08 05:42:17.673922 7fa2f9102700 1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=69 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_bulk peer close file descriptor 69 2016-03-08 05:42:17.673944 7fa2f9102700 1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=69 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until read failed 2016-03-08 05:42:17.673957 7fa2f9102700 1 -- 172.21.15.64:6801/3976 >> 172.21.15.47:6805/2508 conn(0x7fa30c9b8800 sd=69 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._process_connection read connect reply failed
Updated by Sage Weil about 8 years ago
/a/sage-2016-03-07_17:50:09-rados-wip-sage-testing---basic-smithi/45466
Updated by Haomai Wang about 8 years ago
- Assignee set to Haomai Wang
Initial judge:
simple side:
1. accept async connect
2. inject happen when enter https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L359
3. because we have buffered message, so we can continue
4. register pipe then https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L728
5. met previous inject error https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L732'
6. because we don't have replace process and inqueue message, we will mark this close https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L787
7. we will clear pipe and connection relation and queue reset
8. osd try to get this connection and send message, soon it found it's a failed connection without pipe
9. message missed
I'm a little confused by the gap between we close/clear pipe and unregister pipe. I will try to improve debug level and reproduce this tomorrow
Updated by Greg Farnum about 8 years ago
I'm not following the first part of that story — if we injected an error that was detected in line 359, we'd jump to fail_unlocked at line 777 and skip over most of the other parts. Or did you mean the async messenger injected an error once we'd passed that line?
Updated by Haomai Wang about 8 years ago
the clue is these lines:
2016-03-08 05:42:17.669564 7f26b1d12700 0 -- 172.21.15.47:6805/2508 >> 172.21.15.64:6801/3976 pipe(0x7f26ee3b7400 sd=70 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f26ee341000).injecting socket failure
2016-03-08 05:42:17.669651 7f26b1d12700 10 osd.4 3 new session 0x7f26ee083ee0 con=0x7f26ee341000 addr=172.21.15.64:6801/3976
2016-03-08 05:42:17.669672 7f26b1d12700 10 osd.4 3 session 0x7f26ee083ee0 osd.0 has caps osdcap[grant(*)] 'allow *'
pipe called verify_authorizer after we inject socket failure, so it showed pipe successfully got connect msg!
Updated by Haomai Wang about 8 years ago
Updated by Haomai Wang about 8 years ago
- Status changed from New to Need More Info
- Priority changed from Urgent to High
Updated by Sage Weil about 8 years ago
- Status changed from Need More Info to Can't reproduce