Project

General

Profile

Bug #20093

msg/async: dropped a message

Added by Sage Weil 6 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
-
Target version:
-
Start date:
05/26/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

osd.0 sees:

2017-05-26 08:52:41.063845 7f40e619c700  1 -- 172.21.15.60:6805/9925 <== osd.2 172.21.15.60:6809/11139 14 ==== pg_log(0.0 epoch 24 log log((17'1000,17'1007], crt=17'1007) pi (empty) query_epoch 24) v5 ==== 2210+0+0 (3600795338 0 0) 0x7f4101918700 con 0x7f4104ba9000
2017-05-26 08:52:41.063953 7f40e619c700  1 -- 172.21.15.60:6805/9925 <== osd.2 172.21.15.60:6809/11139 15 ==== pg_log(0.4 epoch 24 log log((17'1897,17'1998], crt=17'1998) pi (empty) query_epoch 24) v5 ==== 18942+0+0 (1925472980 0 0) 0x7f4101761500 con 0x7f4104ba9000
2017-05-26 08:52:41.064700 7f40dbac4700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_log(0.1 epoch 24 log log((17'975,22'2877], crt=22'2877) pi (empty) query_epoch 24) v5 -- 0x7f410181a000 con 0
2017-05-26 08:52:41.065170 7f40dc2c5700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_log(0.3 epoch 24 log log((17'1026,22'3136], crt=22'3136) pi (empty) query_epoch 24) v5 -- 0x7f4101908e00 con 0
2017-05-26 08:52:42.148310 7f40dbac4700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_notify((query:25 sent:25 0.3( v 22'3136 (17'1000,22'3136] local-lis/les=21/22 n=1026 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/24))=([21,23] intervals=([21,22] acting 0)) epoch 25) v6 -- 0x7f41025aeb20 con 0
2017-05-26 08:52:42.148332 7f40dc2c5700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_notify((query:25 sent:25 0.1( v 22'2877 (17'900,22'2877] local-lis/les=21/22 n=975 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/24))=([21,23] intervals=([21,22] acting 0)) epoch 25) v6 -- 0x7f41029c1860 con 0
2017-05-26 08:52:42.148599 7f40ef28f700  1 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4104ba9000 :6805 s=STATE_OPEN pgs=3 cs=3 l=0).read_bulk peer close file descriptor 43
2017-05-26 08:52:42.148622 7f40ef28f700  1 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4104ba9000 :6805 s=STATE_OPEN pgs=3 cs=3 l=0).read_until read failed
2017-05-26 08:52:42.148633 7f40ef28f700  1 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4104ba9000 :6805 s=STATE_OPEN pgs=3 cs=3 l=0).process read tag failed
2017-05-26 08:52:42.148713 7f40ef28f700  0 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4104ba9000 :6805 s=STATE_OPEN pgs=3 cs=3 l=0).fault initiating reconnect
2017-05-26 08:52:42.152066 7f40ef28f700  0 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4104ba9000 :6805 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=3 cs=4 l=0)._try_send injecting socket failure
2017-05-26 08:52:42.152101 7f40ef28f700  1 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4104ba9000 :6805 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=3 cs=4 l=0)._try_send send error: (32) Broken pipe
2017-05-26 08:52:42.155487 7f40dbac4700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_log(0.5 epoch 25 log log((17'2039,22'5984], crt=22'5984) pi (empty) query_epoch 25) v5 -- 0x7f4101761500 con 0
2017-05-26 08:52:42.160012 7f40dbac4700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_log(0.4 epoch 25 log log((17'1998,22'6038], crt=22'6038) pi (empty) query_epoch 25) v5 -- 0x7f4101918700 con 0
2017-05-26 08:52:42.162396 7f40dbac4700  1 -- 172.21.15.60:6805/9925 --> 172.21.15.60:6809/11139 -- pg_log(0.0 epoch 25 log log((17'1007,22'3049], crt=22'3049) pi (empty) query_epoch 25) v5 -- 0x7f4100df7500 con 0
2017-05-26 08:52:42.191196 7f40eea8e700 10 osd.0 25  new session 0x7f41041a6600 con=0x7f4100cf7800 addr=172.21.15.60:6809/11139
2017-05-26 08:52:42.191272 7f40eea8e700  0 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4100cf7800 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 3 vs existing csq=4 existing_state=STATE_CONNECTING
2017-05-26 08:52:42.191524 7f40eea8e700  0 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4100cf7800 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 5 vs existing csq=4 existing_state=STATE_CONNECTING
2017-05-26 08:52:42.191548 7f40eea8e700  1 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4100cf7800 :6805 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2017-05-26 08:52:42.193552 7f40ef28f700 10 osd.0 25  new session 0x7f4109befa00 con=0x7f4106ac3000 addr=172.21.15.60:6809/11139
2017-05-26 08:52:42.193612 7f40ef28f700  0 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4106ac3000 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 7 vs existing csq=6 existing_state=STATE_OPEN
2017-05-26 08:52:42.193645 7f40ef28f700  1 -- 172.21.15.60:6805/9925 >> 172.21.15.60:6809/11139 conn(0x7f4106ac3000 :6805 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing

note the second batch o fpg_logs we send.. 0.5, 0.4, then 0.0.

osd.0 sees:

2017-05-26 08:52:41.061284 7f0a1c59f700  1 -- 172.21.15.60:6809/11139 --> 172.21.15.60:6805/9925 -- pg_log(0.0 epoch 24 log log((17'1000,17'1007], crt=17'1007) pi (empty) query_epoch 24) v5 -- 0x7f0a40cfa700 con 0
2017-05-26 08:52:41.061405 7f0a1c59f700  1 -- 172.21.15.60:6809/11139 --> 172.21.15.60:6805/9925 -- pg_log(0.4 epoch 24 log log((17'1897,17'1998], crt=17'1998) pi (empty) query_epoch 24) v5 -- 0x7f0a40cfc300 con 0
2017-05-26 08:52:41.068272 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 11 ==== pg_log(0.1 epoch 24 log log((17'975,22'2877], crt=22'2877) pi (empty) query_epoch 24) v5 ==== 338246+0+0 (2959863701 0 0) 0x7f0a40cfc300 con 0x7f0a40c12000
2017-05-26 08:52:41.071337 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 12 ==== pg_log(0.3 epoch 24 log log((17'1026,22'3136], crt=22'3136) pi (empty) query_epoch 24) v5 ==== 375166+0+0 (3680419514 0 0) 0x7f0a40cfbc00 con 0x7f0a40c12000
2017-05-26 08:52:42.148521 7f0a2fd6a700  0 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=1947 cs=3 l=0).read_until injecting socket failure
2017-05-26 08:52:42.148731 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1947 cs=3 l=0).read_bulk peer close file descriptor 44
2017-05-26 08:52:42.148749 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1947 cs=3 l=0).read_until read failed
2017-05-26 08:52:42.148759 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1947 cs=3 l=0).process read tag failed
2017-05-26 08:52:42.148794 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 13 ==== pg_notify((query:25 sent:25 0.3( v 22'3136 (17'1000,22'3136] local-lis/les=21/22 n=1026 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/24))=([21,23] intervals=([21,22] acting 0)) epoch 25) v6 ==== 972+0+0 (1564670753 0 0) 0x7f0a40cf2960 con 0x7f0a40c12000
2017-05-26 08:52:42.148933 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 14 ==== pg_notify((query:25 sent:25 0.1( v 22'2877 (17'900,22'2877] local-lis/les=21/22 n=975 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/24))=([21,23] intervals=([21,22] acting 0)) epoch 25) v6 ==== 972+0+0 (3339023517 0 0) 0x7f0a40c76b40 con 0x7f0a40c12000
2017-05-26 08:52:42.152105 7f0a2ed68700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c85800 :6809 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 28
2017-05-26 08:52:42.152129 7f0a2ed68700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c85800 :6809 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2017-05-26 08:52:42.152141 7f0a2ed68700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c85800 :6809 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2017-05-26 08:52:42.190681 7f0a1c59f700  1 -- 172.21.15.60:6809/11139 --> 172.21.15.60:6805/9925 -- pg_notify((query:25 sent:25 0.5( v 17'2039 (17'1999,17'2039] local-lis/les=5/6 n=2039 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/5))=([5,23] intervals=([21,22] acting 0)) epoch 25) v6 -- 0x7f0a40402580 con 0
2017-05-26 08:52:42.191449 7f0a1bd9e700  1 -- 172.21.15.60:6809/11139 --> 172.21.15.60:6805/9925 -- pg_info((query:25 sent:25 0.1( v 22'2877 (17'900,22'2877] local-lis/les=24/25 n=975 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/24))=([5,23] intervals=([21,22] acting 0)) epoch 25) v5 -- 0x7f0a40cf25a0 con 0
2017-05-26 08:52:42.192392 7f0a1c59f700  1 -- 172.21.15.60:6809/11139 --> 172.21.15.60:6805/9925 -- pg_notify((query:25 sent:25 0.4( v 17'1998 (17'1897,17'1998] local-lis/les=5/6 n=1998 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/5))=([5,23] intervals=([21,22] acting 0)) (query:25 sent:25 0.0( v 17'1007 (17'1000,17'1007] local-lis/les=5/6 n=1007 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/5))=([5,23] intervals=([21,22] acting 0)) epoch 25) v6 -- 0x7f0a40401680 con 0
2017-05-26 08:52:42.192436 7f0a1c59f700  1 -- 172.21.15.60:6809/11139 --> 172.21.15.60:6805/9925 -- pg_info((query:25 sent:25 0.3( v 22'3136 (17'1000,22'3136] local-lis/les=24/25 n=1026 ec=1/1 lis/c 21/21 les/c/f 22/22/0 23/24/24))=([5,23] intervals=([21,22] acting 0)) epoch 25) v5 -- 0x7f0a40cf4b20 con 0
2017-05-26 08:52:42.193161 7f0a2fd6a700  0 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1949 cs=6 l=0)._try_send injecting socket failure
2017-05-26 08:52:42.193180 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1949 cs=6 l=0)._try_send send error: (32) Broken pipe
2017-05-26 08:52:42.193194 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1949 cs=6 l=0).handle_write send msg failed
2017-05-26 08:52:42.193230 7f0a2fd6a700  0 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1949 cs=6 l=0).fault initiating reconnect
2017-05-26 08:52:42.212609 7f0a2fd6a700  0 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1950 cs=8 l=0).process missed message?  skipped from seq 14 to 16
2017-05-26 08:52:42.212663 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 16 ==== pg_log(0.4 epoch 25 log log((17'1998,22'6038], crt=22'6038) pi (empty) query_epoch 25) v5 ==== 717372+0+0 (1173838626 0 0) 0x7f0a40cfa000 con 0x7f0a40c12000
2017-05-26 08:52:42.214929 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 17 ==== pg_log(0.0 epoch 25 log log((17'1007,22'3049], crt=22'3049) pi (empty) query_epoch 25) v5 ==== 363042+0+0 (2173604098 0 0) 0x7f0a41464e00 con 0x7f0a40c12000
2017-05-26 08:52:42.214935 7f0a2fd6a700  0 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN_MESSAGE_READ_FRONT pgs=1950 cs=8 l=0).read_until injecting socket failure
2017-05-26 08:52:42.214984 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 18 ==== pg_info((query:25 sent:25 0.3( v 22'3136 (17'1000,22'3136] local-lis/les=24/25 n=1026 ec=1/1 lis/c 24/21 les/c/f 25/22/0 23/24/24))=(empty) epoch 25) v5 ==== 910+0+0 (750799936 0 0) 0x7f0a40c77c20 con 0x7f0a40c12000
2017-05-26 08:52:42.214997 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1950 cs=8 l=0).read_bulk peer close file descriptor 28
2017-05-26 08:52:42.215014 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1950 cs=8 l=0).read_until read failed
2017-05-26 08:52:42.215018 7f0a2fd6a700  1 -- 172.21.15.60:6809/11139 >> 172.21.15.60:6805/9925 conn(0x7f0a40c12000 :-1 s=STATE_OPEN pgs=1950 cs=8 l=0).process read tag failed
2017-05-26 08:52:42.215076 7f0a26476700  1 -- 172.21.15.60:6809/11139 <== osd.0 172.21.15.60:6805/9925 19 ==== pg_info((query:25 sent:25 0.1( v 22'2877 (17'900,22'2877] local-lis/les=24/25 n=975 ec=1/1 lis/c 24/21 les/c/f 25/22/0 23/24/24))=(empty) epoch 25) v5 ==== 910+0+0 (2229513654 0 0) 0x7f0a40c78580 con 0x7f0a40c12000

we get the 0.4 and 0.0 after the reconnect, but not the 0.5.

/a/sage-2017-05-26_06:56:21-rados-wip-sage-testing---basic-smithi/1231378

(osd logs are in the test dir)

History

#1 Updated by Sage Weil 6 months ago

  • Status changed from Verified to Resolved

Also available in: Atom PDF