Bug #38247
closed
msg/async: v2 reconnect_seq assert due to interfering connections
Description
-1> 2019-02-08 20:36:09.240 7f974fe13700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3532-ga41eb63/rpm/el7/BUILD/ceph-14.0.1-3532-ga41eb63/src/msg/async/ProtocolV2.cc: In function 'Ct<ProtocolV2> * ProtocolV2::handle_message_complete()' thread 7f974fe13700 time 2019-02-08 20:36:09.234149
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3532-ga41eb63/rpm/el7/BUILD/ceph-14.0.1-3532-ga41eb63/src/msg/async/ProtocolV2.cc: 1914: FAILED ceph_assert(0 == "old msgs despite reconnect_seq feature")
ceph version 14.0.1-3532-ga41eb63 (a41eb634749f7f6ff52903dcc1051060d132d6f0) nautilus (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55e97358c77e]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55e97358c94c]
3: (ProtocolV2::handle_message_complete()+0x1664) [0x55e974018c84]
4: (ProtocolV2::read_message_data()+0x1a7) [0x55e974018e67]
/a/sage-2019-02-08_15:26:54-rados-wip-sage-testing-2019-02-08-0710-distro-basic-smithi/3564801
osd.3 sees
2019-02-08 20:36:09.212 7f972e8f1700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_query(2.b epoch 30) v4 -- 0x55e97e5edc20 con 0x55e97e701400
2019-02-08 20:36:09.212 7f972a8e9700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_query(2.3 epoch 30) v4 -- 0x55e97e752000 con 0x55e97e701400
2019-02-08 20:36:09.221 7f972b8eb700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 -- 0x55e97e5ecb40 con 0x55e97e701400
2019-02-08 20:36:09.226 7f974fe13700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 1 ==== pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (868179085 0 0) 0x55e97e7532c0 con 0x55e97e701400
2019-02-08 20:36:09.227 7f974fe13700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 2 ==== pg_query(2.1 epoch 30) v4 ==== 162+0+0 (1286079452 0 0) 0x55e97e7532c0 con 0x55e97e701400
2019-02-08 20:36:09.227 7f972f8f3700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 -- 0x55e97e753680 con 0x55e97e701400
2019-02-08 20:36:09.227 7f974fe13700 1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 3 ==== pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (1773979790 0 0) 0x55e97e7532c0 con 0x55e97e701400
then crash.
interestingly, on osd.2, we have some weirdness too:
2019-02-08 20:36:09.200 7f7db54c2700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa041f680 con 0x55cfa0762000
2019-02-08 20:36:09.200 7f7dba4cc700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_query(2.1 epoch 30) v4 -- 0x55cfa058c000 con 0x55cfa0762000
2019-02-08 20:36:09.200 7f7db94ca700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058d2c0 con 0x55cfa0762000
2019-02-08 20:36:09.228 7f7dda1eb700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 1 ==== pg_query(2.b epoch 30) v4 ==== 162+0+0 (4288771879 0 0) 0x55cfa058c960 con 0x55cfa0762000
2019-02-08 20:36:09.228 7f7db54c2700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 1 ==== pg_query(2.b epoch 30) v4 ==== 162+0+0 (4288771879 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7db94ca700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058c1e0 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 2 ==== pg_query(2.3 epoch 30) v4 ==== 162+0+0 (3368727633 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7db94ca700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058cb40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 3 ==== pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (4208337317 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.233 7f7dda1eb700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 4 ==== pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (4208337317 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:10.205 7f7db54c2700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:31 sent:31 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 31) v6 -- 0x55cfa0723c20 con 0x55cfa0762000
2019-02-08 20:36:10.205 7f7db94ca700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:31 sent:31 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 31) v6 -- 0x55cfa058c960 con 0x55cfa0762000
2019-02-08 20:36:10.206 7f7db64c4700 1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_info((query:31 sent:31 2.1( v 19'34 (0'0,19'34] local-lis/les=30/31 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 31) v5 -- 0x55cfa058c3c0 con 0x55cfa0762000
...where the first message is delivered twice!
Updated by Sage Weil about 5 years ago
- Subject changed from msg/async: v2 reconnect_seq assert to msg/async: v2 reconnect_seq assert due to interfering connections
- Status changed from New to 12
- Assignee set to Sage Weil
- Priority changed from Normal to Urgent
The problem: we send a client_ident identifying ourselves, but not who we are trying to connect to. On the server side, we respond to that client_ident by, potentially, affecting an existing session... but we don't even know whether the client is connecting to the right person.
In this case, osd.3 has an unrelated outgoing connection to v2:172.21.15.29:6820/35057
2019-02-08 20:36:09.228 7f974fe13700 5 --2- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] >> [v2:172.21.15.29:6820/35057,v1:172.21.15.29:6821/35057] conn(0x55e97e7a0000 0x55e97e75df80 :-1 s=CONNECTING pgs=0 cs=0 l=0).send_client_ident sending identification: addrs=[v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] gid=3 global_seq=3 features_supported=3ffddff8ffacffff features_required=2e01020042046201 flags=0
and on osd.2, we let it interfere with our other session,
2019-02-08 20:36:09.228 7f7ddb1ed700 5 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_ident received client identification: addrs=[v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] gid=3 global_seq=3 features_supported=3ffddff8ffacffff features_required=2e01020042046201 flags=0
2019-02-08 20:36:09.228 7f7ddb1ed700 10 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 msgr2=0x55cfa0771b00 :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._infer_target_addr [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -> v2:172.21.15.29:6828/114228
2019-02-08 20:36:09.228 7f7ddb1ed700 20 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).handle_existing_connection existing=0x55cfa0762000
2019-02-08 20:36:09.228 7f7ddb1ed700 1 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).handle_existing_connection found previous session existing=0x55cfa0762000, peer must have reseted.
2019-02-08 20:36:09.228 7f7ddb1ed700 20 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).reuse_connection existing=0x55cfa0762000 reconnect=0
2019-02-08 20:36:09.228 7f7ddb1ed700 5 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).reuse_connection stop myself to swap existing
2019-02-08 20:36:09.228 7f7ddb1ed700 1 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).stop
2019-02-08 20:36:09.228 7f7ddb1ed700 10 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).discard_out_queue started
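A hypothetical sketch of the accept path described above (the struct and function names are illustrative, not the real ProtocolV2 types): because client_ident carries only the client's addrs, the server can only key its existing-session lookup on those addrs, so a misdirected connection matches and stomps on an unrelated session.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// client_ident as it exists in this report: no target addrs -- the
// client never says who it thinks it is connecting to.
struct ClientIdent {
  std::string client_addrs;  // who the client is
};

struct Server {
  std::map<std::string, int> sessions;  // peer addrs -> session id

  // Returns the session id the ident gets attached to.  If the client
  // addrs match an established session, we "find previous session" and
  // interfere with it, even when this connection was aimed at a
  // different (e.g. dead) daemon that happened to share our address.
  int handle_client_ident(const ClientIdent& id, int new_session) {
    auto it = sessions.find(id.client_addrs);
    if (it != sessions.end())
      return it->second;  // stomps on the unrelated session
    sessions[id.client_addrs] = new_session;
    return new_session;
  }
};

int main() {
  Server osd2;
  // the real osd.3 <-> osd.2 session:
  osd2.sessions["v2:172.21.15.29:6828/114228"] = 1;
  // osd.3's unrelated connection, intended for the old daemon at
  // nonce 35057, lands on osd.2's listener and matches session 1:
  ClientIdent id{"v2:172.21.15.29:6828/114228"};
  assert(osd2.handle_client_ident(id, 2) == 1);
  std::printf("interfered with existing session\n");
  return 0;
}
```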
Updated by Sage Weil about 5 years ago
easy fix:
- change client_ident to include who we want to connect to, so the receiving end can reject a misdirected connection early.
harder (but possibly better?) fix:
- send server_ident immediately after auth_done, so that the client knows who it is talking to right away and can decide whether to continue. This will subtly reorder a lot of the protocol's behavior.
Updated by Sage Weil about 5 years ago
- Status changed from 12 to Fix Under Review
Updated by Sage Weil about 5 years ago
- Status changed from Fix Under Review to Resolved
Updated by Greg Farnum about 5 years ago
- Project changed from RADOS to Messengers