Project

General

Profile

Actions

Bug #19878

closed

msg/simple: reports lossy con on osd->osd connection

Added by Sage Weil almost 7 years ago. Updated almost 7 years ago.

Status:
Can't reproduce
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-05-06 06:43:21.889123 7fb9953cf700  1 -- 172.21.15.24:6806/6238 <== osd.4 172.21.15.24:6810/1092 2 ==== pg_query(0.6 epoch 31) v4 ==== 158+0+0 (2521712471 0 0) 0x7fb9b72a0400 con 0x7fb9b692c700
2017-05-06 06:43:21.916744 7fb98c3bd700  1 -- 172.21.15.24:6806/6238 --> 172.21.15.24:6810/1092 -- pg_notify((query:29 sent:29 0.6( empty local-lis/les=5/6 n=0 ec=1 lis/c 5/5 les/c/f 6/6/0 20/20/5))=([5,19] intervals=([5,19] acting 0,4,5)) (query:29 sent:29 1.3( v 19'32 (0'0,19'32] local-lis/les=8/9 n=20 ec=8 lis/c 8/8 les/c/f 9/10/0 20/20/8))=([8,19] intervals=([8,19] acting 3,4,5)) epoch 31) v6 -- ?+0 0x7fb9b7480780 con 0x7fb9b692c700
2017-05-06 06:43:21.916760 7fb98c3bd700  0 -- 172.21.15.24:6806/6238 submit_message pg_notify((query:29 sent:29 0.6( empty local-lis/les=5/6 n=0 ec=1 lis/c 5/5 les/c/f 6/6/0 20/20/5))=([5,19] intervals=([5,19] acting 0,4,5)) (query:29 sent:29 1.3( v 19'32 (0'0,19'32] local-lis/les=8/9 n=20 ec=8 lis/c 8/8 les/c/f 9/10/0 20/20/8))=([8,19] intervals=([8,19] acting 3,4,5)) epoch 31) v6 remote, 172.21.15.24:6810/1092, failed lossy con, dropping message 0x7fb9b7480780
2017-05-06 06:43:21.925242 7fb98c3bd700  1 -- 172.21.15.24:6806/6238 --> 172.21.15.24:6810/1092 -- pg_notify((query:31 sent:31 1.3( v 19'32 (0'0,19'32] local-lis/les=8/9 n=20 ec=8 lis/c 8/8 les/c/f 9/10/0 31/31/8))=([8,30] intervals=([20,30] acting 3,4)) (query:31 sent:31 0.6( empty local-lis/les=5/6 n=0 ec=1 lis/c 5/5 les/c/f 6/6/0 31/31/5))=([5,30] intervals=([20,30] acting 0,4)) epoch 31) v6 -- ?+0 0x7fb9b6433de0 con 0x7fb9b692c700
2017-05-06 06:43:21.925263 7fb98c3bd700  0 -- 172.21.15.24:6806/6238 submit_message pg_notify((query:31 sent:31 1.3( v 19'32 (0'0,19'32] local-lis/les=8/9 n=20 ec=8 lis/c 8/8 les/c/f 9/10/0 31/31/8))=([8,30] intervals=([20,30] acting 3,4)) (query:31 sent:31 0.6( empty local-lis/les=5/6 n=0 ec=1 lis/c 5/5 les/c/f 6/6/0 31/31/5))=([5,30] intervals=([20,30] acting 0,4)) epoch 31) v6 remote, 172.21.15.24:6810/1092, failed lossy con, dropping message 0x7fb9b6433de0
2017-05-06 06:43:21.926811 7fb98c3bd700  1 -- 172.21.15.24:6806/6238 --> 172.21.15.24:6810/1092 -- pg_notify((query:31 sent:31 0.6( empty local-lis/les=5/6 n=0 ec=1 lis/c 20/20 les/c/f 21/21/0 31/31/5))=([5,30] intervals=([20,30] acting 0,4)) (query:31 sent:31 1.3( v 19'32 (0'0,19'32] local-lis/les=8/9 n=20 ec=8 lis/c 20/20 les/c/f 21/21/0 31/31/8))=([8,30] intervals=([20,30] acting 3,4)) epoch 31) v6 -- ?+0 0x7fb9b66e9c00 con 0x7fb9b692c700
2017-05-06 06:43:21.926830 7fb98c3bd700  0 -- 172.21.15.24:6806/6238 submit_message pg_notify((query:31 sent:31 0.6( empty local-lis/les=5/6 n=0 ec=1 lis/c 20/20 les/c/f 21/21/0 31/31/5))=([5,30] intervals=([20,30] acting 0,4)) (query:31 sent:31 1.3( v 19'32 (0'0,19'32] local-lis/les=8/9 n=20 ec=8 lis/c 20/20 les/c/f 21/21/0 31/31/8))=([8,30] intervals=([20,30] acting 3,4)) epoch 31) v6 remote, 172.21.15.24:6810/1092, failed lossy con, dropping message 0x7fb9b66e9c00
...

that is from the osd.5 log.

the other end (osd.4) injected a single failure:

2017-05-06 06:43:21.886236 7f8a2e7ec700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- pg_query(1.3 epoch 31) v4 -- ?+0 0x7f8a5a5cc000 con 0x7f8a59ccbc00
2017-05-06 06:43:21.886905 7f8a2efed700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- pg_query(0.6 epoch 31) v4 -- ?+0 0x7f8a59b7bc00 con 0x7f8a59ccbc00
2017-05-06 06:43:21.887498 7f8a2079f700  0 -- 172.21.15.24:6810/1092 >> 172.21.15.24:6806/6238 pipe(0x7f8a5b1e2000 sd=152 :35128 s=1 pgs=0 cs=0 l=0 c=0x7f8a59ccbc00).injecting socket failure
2017-05-06 06:43:21.887541 7f8a2079f700  0 -- 172.21.15.24:6810/1092 >> 172.21.15.24:6806/6238 pipe(0x7f8a5b1e2000 sd=152 :35128 s=1 pgs=0 cs=0 l=0 c=0x7f8a59ccbc00).fault
2017-05-06 06:43:21.888751 7f8a2079f700 10 osd.4 31  new session (outgoing) 0x7f8a59b79800 con=0x7f8a59ccbc00 addr=172.21.15.24:6806/6238
2017-05-06 06:43:22.957287 7f8a2e7ec700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- pg_notify((query:32 sent:32 0.10( empty local-lis/les=20/21 n=0 ec=1 lis/c 20/20 les/c/f 21/21/0 31/32/32))=([20,31] intervals=([20,30] acting 0,4)) epoch 32) v6 -- ?+0 0x7f8a5a137c00 con 0x7f8a59ccbc00
2017-05-06 06:43:24.010697 7f8a2efed700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- pg_notify((query:33 sent:33 0.10( empty local-lis/les=20/21 n=0 ec=1 lis/c 20/20 les/c/f 21/21/0 31/32/32))=([20,31] intervals=([20,30] acting 0,4)) epoch 33) v6 -- ?+0 0x7f8a5a216760 con 0x7f8a59ccbc00
2017-05-06 06:43:26.562970 7f8a2efed700 10 osd.4 34 send_incremental_map 33 -> 34 to 0x7f8a59ccbc00 172.21.15.24:6806/6238
2017-05-06 06:43:26.562976 7f8a2efed700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- osd_map(34..34 src has 1..34) v3 -- ?+0 0x7f8a59c55840 con 0x7f8a59ccbc00
2017-05-06 06:43:26.562984 7f8a2efed700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- pg_notify((query:34 sent:34 0.10( empty local-lis/les=20/21 n=0 ec=1 lis/c 20/20 les/c/f 21/21/0 31/32/32))=([20,31] intervals=([20,30] acting 0,4)) epoch 34) v6 -- ?+0 0x7f8a59ae9680 con 0x7f8a59ccbc00
2017-05-06 06:43:27.655483 7f8a2e7ec700 10 osd.4 35 send_incremental_map 34 -> 35 to 0x7f8a59ccbc00 172.21.15.24:6806/6238
2017-05-06 06:43:27.655561 7f8a2e7ec700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- osd_map(35..35 src has 1..35) v3 -- ?+0 0x7f8a59ba6f40 con 0x7f8a59ccbc00
2017-05-06 06:43:27.655590 7f8a2e7ec700  1 -- 172.21.15.24:6810/1092 --> 172.21.15.24:6806/6238 -- pg_notify((query:35 sent:35 0.10( empty local-lis/les=20/21 n=0 ec=1 lis/c 20/20 les/c/f 21/21/0 31/32/32))=([20,31] intervals=([20,30] acting 0,4)) epoch 35) v6 -- ?+0 0x7f8a5a216b20 con 0x7f8a59ccbc00
...

/a/sage-2017-05-06_05:54:44-rados-wip-sage-testing2---basic-smithi/1108881

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Assignee set to Sage Weil

Aren't OSD<->OSD connections supposed to be lossy? Not sure what you think the issue is here, although (without having looked at the PR yet) it strikes me this could be fallout from the heartbeat change...

Actions #2

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Can't reproduce

osd<->osd connections are non-lossy. i think this was a problem with the hb change i was testing, though, closing!

Actions

Also available in: Atom PDF