Project

General

Profile

Actions

Bug #38247

closed

msg/async: v2 reconnect_seq assert due to interfering connections

Added by Sage Weil about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -1> 2019-02-08 20:36:09.240 7f974fe13700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3532-ga41eb63/rpm/el7/BUILD/ceph-14.0.1-3532-ga41eb63/src/msg/async/ProtocolV2.cc: In function 'Ct<ProtocolV2>
* ProtocolV2::handle_message_complete()' thread 7f974fe13700 time 2019-02-08 20:36:09.234149
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3532-ga41eb63/rpm/el7/BUILD/ceph-14.0.1-3532-ga41eb63/src/msg/async/ProtocolV2.cc: 1914: FAILED ceph_assert(0 == "old msgs despite reconnect_seq feature")

 ceph version 14.0.1-3532-ga41eb63 (a41eb634749f7f6ff52903dcc1051060d132d6f0) nautilus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55e97358c77e]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55e97358c94c]
 3: (ProtocolV2::handle_message_complete()+0x1664) [0x55e974018c84]
 4: (ProtocolV2::read_message_data()+0x1a7) [0x55e974018e67]

/a/sage-2019-02-08_15:26:54-rados-wip-sage-testing-2019-02-08-0710-distro-basic-smithi/3564801

osd.3 sees

2019-02-08 20:36:09.212 7f972e8f1700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_query(2.b epoch 30) v4 -- 0x55e97e5edc20 con 0x55e97e701400
2019-02-08 20:36:09.212 7f972a8e9700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_query(2.3 epoch 30) v4 -- 0x55e97e752000 con 0x55e97e701400
2019-02-08 20:36:09.221 7f972b8eb700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 -- 0x55e97e5ecb40 con 0x55e97e701400
2019-02-08 20:36:09.226 7f974fe13700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 1 ==== pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (868179085 0 0) 0x55e97e7532c0 con 0x55e97e701400
2019-02-08 20:36:09.227 7f974fe13700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 2 ==== pg_query(2.1 epoch 30) v4 ==== 162+0+0 (1286079452 0 0) 0x55e97e7532c0 con 0x55e97e701400
2019-02-08 20:36:09.227 7f972f8f3700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 -- 0x55e97e753680 con 0x55e97e701400
2019-02-08 20:36:09.227 7f974fe13700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 3 ==== pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (1773979790 0 0) 0x55e97e7532c0 con 0x55e97e701400

then crash.

interestingly, on osd.2, we have some weirdness too:

2019-02-08 20:36:09.200 7f7db54c2700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa041f680 con 0x55cfa0762000
2019-02-08 20:36:09.200 7f7dba4cc700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_query(2.1 epoch 30) v4 -- 0x55cfa058c000 con 0x55cfa0762000
2019-02-08 20:36:09.200 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058d2c0 con 0x55cfa0762000
2019-02-08 20:36:09.228 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 1 ==== pg_query(2.b epoch 30) v4 ==== 162+0+0 (4288771879 0 0) 0x55cfa058c960 con 0x55cfa0762000
2019-02-08 20:36:09.228 7f7db54c2700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 1 ==== pg_query(2.b epoch 30) v4 ==== 162+0+0 (4288771879 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058c1e0 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 2 ==== pg_query(2.3 epoch 30) v4 ==== 162+0+0 (3368727633 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058cb40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 3 ==== pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (4208337317 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.233 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 4 ==== pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (4208337317 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:10.205 7f7db54c2700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:31 sent:31 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 31) v6 -- 0x55cfa0723c20 con 0x55cfa0762000
2019-02-08 20:36:10.205 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:31 sent:31 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 31) v6 -- 0x55cfa058c960 con 0x55cfa0762000
2019-02-08 20:36:10.206 7f7db64c4700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_info((query:31 sent:31 2.1( v 19'34 (0'0,19'34] local-lis/les=30/31 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 31) v5 -- 0x55cfa058c3c0 con 0x55cfa0762000

...where the first message is delivered twice!

Actions

Also available in: Atom PDF