Bug #38247 (closed): msg/async: v2 reconnect_seq assert due to interfering connections

Added by Sage Weil about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sage Weil
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

    -1> 2019-02-08 20:36:09.240 7f974fe13700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3532-ga41eb63/rpm/el7/BUILD/ceph-14.0.1-3532-ga41eb63/src/msg/async/ProtocolV2.cc: In function 'Ct<ProtocolV2>
* ProtocolV2::handle_message_complete()' thread 7f974fe13700 time 2019-02-08 20:36:09.234149
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3532-ga41eb63/rpm/el7/BUILD/ceph-14.0.1-3532-ga41eb63/src/msg/async/ProtocolV2.cc: 1914: FAILED ceph_assert(0 == "old msgs despite reconnect_seq feature")

 ceph version 14.0.1-3532-ga41eb63 (a41eb634749f7f6ff52903dcc1051060d132d6f0) nautilus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55e97358c77e]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55e97358c94c]
 3: (ProtocolV2::handle_message_complete()+0x1664) [0x55e974018c84]
 4: (ProtocolV2::read_message_data()+0x1a7) [0x55e974018e67]

/a/sage-2019-02-08_15:26:54-rados-wip-sage-testing-2019-02-08-0710-distro-basic-smithi/3564801

osd.3 sees:

2019-02-08 20:36:09.212 7f972e8f1700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_query(2.b epoch 30) v4 -- 0x55e97e5edc20 con 0x55e97e701400
2019-02-08 20:36:09.212 7f972a8e9700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_query(2.3 epoch 30) v4 -- 0x55e97e752000 con 0x55e97e701400
2019-02-08 20:36:09.221 7f972b8eb700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 -- 0x55e97e5ecb40 con 0x55e97e701400
2019-02-08 20:36:09.226 7f974fe13700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 1 ==== pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (868179085 0 0) 0x55e97e7532c0 con 0x55e97e701400
2019-02-08 20:36:09.227 7f974fe13700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 2 ==== pg_query(2.1 epoch 30) v4 ==== 162+0+0 (1286079452 0 0) 0x55e97e7532c0 con 0x55e97e701400
2019-02-08 20:36:09.227 7f972f8f3700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] --> [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] -- pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 -- 0x55e97e753680 con 0x55e97e701400
2019-02-08 20:36:09.227 7f974fe13700  1 -- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] <== osd.2 v2:172.21.15.29:6820/114039 3 ==== pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (1773979790 0 0) 0x55e97e7532c0 con 0x55e97e701400

then it crashes.

Interestingly, on osd.2 we see some weirdness too:

2019-02-08 20:36:09.200 7f7db54c2700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa041f680 con 0x55cfa0762000
2019-02-08 20:36:09.200 7f7dba4cc700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_query(2.1 epoch 30) v4 -- 0x55cfa058c000 con 0x55cfa0762000
2019-02-08 20:36:09.200 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058d2c0 con 0x55cfa0762000
2019-02-08 20:36:09.228 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 1 ==== pg_query(2.b epoch 30) v4 ==== 162+0+0 (4288771879 0 0) 0x55cfa058c960 con 0x55cfa0762000
2019-02-08 20:36:09.228 7f7db54c2700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 1 ==== pg_query(2.b epoch 30) v4 ==== 162+0+0 (4288771879 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058c1e0 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 2 ==== pg_query(2.3 epoch 30) v4 ==== 162+0+0 (3368727633 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:30 sent:30 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 30) v6 -- 0x55cfa058cb40 con 0x55cfa0762000
2019-02-08 20:36:09.232 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 3 ==== pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (4208337317 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:09.233 7f7dda1eb700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] <== osd.3 v2:172.21.15.29:6828/114228 4 ==== pg_notify((query:30 sent:30 2.1( v 19'34 (0'0,19'34] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 30) v6 ==== 1023+0+0 (4208337317 0 0) 0x55cfa0723a40 con 0x55cfa0762000
2019-02-08 20:36:10.205 7f7db54c2700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:31 sent:31 2.3( empty local-lis/les=17/18 n=0 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 31) v6 -- 0x55cfa0723c20 con 0x55cfa0762000
2019-02-08 20:36:10.205 7f7db94ca700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_notify((query:31 sent:31 2.b( v 19'26 (0'0,19'26] local-lis/les=17/18 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/30))=([17,29] intervals=([17,25] acting 2,3)) epoch 31) v6 -- 0x55cfa058c960 con 0x55cfa0762000
2019-02-08 20:36:10.206 7f7db64c4700  1 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] --> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -- pg_info((query:31 sent:31 2.1( v 19'34 (0'0,19'34] local-lis/les=30/31 n=1 ec=17/17 lis/c 17/17 les/c/f 18/18/0 30/30/27))=([17,29] intervals=([27,28] acting 2,3)) epoch 31) v5 -- 0x55cfa058c3c0 con 0x55cfa0762000

...where the first message (pg_query(2.b epoch 30), seq 1) is delivered twice!
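
For context, the assert fires in a sequence check of roughly this shape. The following is a condensed sketch, not the actual ProtocolV2.cc code; SeqTracker and its members are hypothetical stand-ins for the messenger's per-connection state:

    #include <cassert>
    #include <cstdint>

    struct SeqTracker {
      uint64_t in_seq = 0;            // highest message seq delivered so far
      bool has_reconnect_seq = true;  // seq-exchange-on-reconnect negotiated

      // Returns true if the message should be delivered to the dispatcher.
      bool accept(uint64_t msg_seq) {
        if (msg_seq <= in_seq) {
          // With the reconnect_seq feature, peers exchange acked seqs on
          // reconnect and trim their resend queues, so a stale seq should
          // be impossible; treat it as fatal.
          assert(!has_reconnect_seq && "old msgs despite reconnect_seq feature");
          return false;               // legacy peer: silently drop the replay
        }
        in_seq = msg_seq;
        return true;
      }
    };

    int main() {
      SeqTracker t;
      t.accept(1);  // first delivery of seq 1
      t.accept(1);  // redelivery, as in the osd.2 log above: assert fires
    }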

Actions #1

Updated by Sage Weil about 5 years ago

  • Subject changed from msg/async: v2 reconnect_seq assert to msg/async: v2 reconnect_seq assert due to interfering connections
  • Status changed from New to 12
  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent

The problem: we send a client_ident identifying ourselves, but not who we are trying to connect to. On the server side, we respond to that client_ident by, potentially, affecting an existing session...but we don't even know that the client is connecting to the right peer.

In this case, osd.3 has an unrelated outgoing connection to v2:172.21.15.29:6820/35057 (the same ip:port osd.2 now listens on, but with a stale nonce):

2019-02-08 20:36:09.228 7f974fe13700  5 --2- [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] >> [v2:172.21.15.29:6820/35057,v1:172.21.15.29:6821/35057] conn(0x55e97e7a0000 0x55e97e75df80 :-1 s=CONNECTING pgs=0 cs=0 l=0).send_client_ident sending identification: addrs=[v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] gid=3 global_seq=3 features_supported=3ffddff8ffacffff features_required=2e01020042046201 flags=0

and on osd.2, we let it interfere with our other session:

2019-02-08 20:36:09.228 7f7ddb1ed700  5 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >>  conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_ident received client identification: addrs=[v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] gid=3 global_seq=3 features_supported=3ffddff8ffacffff features_required=2e01020042046201 flags=0
2019-02-08 20:36:09.228 7f7ddb1ed700 10 -- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 msgr2=0x55cfa0771b00 :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._infer_target_addr [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] -> v2:172.21.15.29:6828/114228
2019-02-08 20:36:09.228 7f7ddb1ed700 20 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).handle_existing_connection existing=0x55cfa0762000
2019-02-08 20:36:09.228 7f7ddb1ed700  1 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).handle_existing_connection found previous session existing=0x55cfa0762000, peer must have reseted.
2019-02-08 20:36:09.228 7f7ddb1ed700 20 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).reuse_connection existing=0x55cfa0762000 reconnect=0
2019-02-08 20:36:09.228 7f7ddb1ed700  5 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).reuse_connection stop myself to swap existing
2019-02-08 20:36:09.228 7f7ddb1ed700  1 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).stop
2019-02-08 20:36:09.228 7f7ddb1ed700 10 --2- [v2:172.21.15.29:6820/114039,v1:172.21.15.29:6821/114039] >> [v2:172.21.15.29:6828/114228,v1:172.21.15.29:6829/114228] conn(0x55cfa071d400 0x55cfa0771b00 :-1 s=ACCEPTING_SESSION pgs=3 cs=0 l=0).discard_out_queue started
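
Reduced to a sketch (illustrative types and names, not the real ProtocolV2 accept path), the flaw is that the existing-session lookup is keyed purely by the client's identity:

    #include <iostream>
    #include <map>
    #include <string>

    struct Session {
      bool healthy = true;
    };

    struct Server {
      std::string my_addrs;                     // this instance's real identity
      std::map<std::string, Session> sessions;  // keyed by client addrs only

      void handle_client_ident(const std::string& client_addrs) {
        // client_ident says who the client IS, but not who it DIALED, so a
        // connection aimed at a dead former occupant of our ip:port still
        // matches (and resets) the healthy session keyed by the client.
        auto it = sessions.find(client_addrs);
        if (it != sessions.end()) {
          it->second.healthy = false;  // "peer must have reset": session swapped
          std::cout << "clobbered existing session with " << client_addrs << "\n";
        } else {
          sessions[client_addrs] = Session{};
        }
      }
    };

    int main() {
      Server osd2{"v2:172.21.15.29:6820/114039", {}};
      // Healthy session with osd.3:
      osd2.sessions["v2:172.21.15.29:6828/114228"] = Session{};
      // osd.3 dials the stale v2:172.21.15.29:6820/35057; the TCP connection
      // lands on osd.2's listening socket, and the accept path cannot tell.
      osd2.handle_client_ident("v2:172.21.15.29:6828/114228");
    }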

Actions #2

Updated by Sage Weil about 5 years ago

easy fix:

- change client_ident to include who we want to connect to, so the receiving end can ignore it early (a sketch of this follows below).

harder (but possibly better?) fix:

- send server_ident immediately after auth_done, so that the client knows who they are talking to right away and can decide whether to continue. This will subtly reorder a lot of the protocol's behavior.
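
A minimal sketch of the easy fix, with hypothetical field and function names (the actual change is a matter of the msgr2 frame layout): the client states its intended target, and the server bails out before touching any existing session on a mismatch:

    #include <iostream>
    #include <string>

    struct ClientIdent {
      std::string client_addrs;  // who I am
      std::string target_addr;   // proposed addition: who I meant to dial
    };

    // Hypothetical server-side check, run before any existing-session logic.
    bool target_matches(const ClientIdent& ident, const std::string& my_addrs) {
      if (ident.target_addr != my_addrs) {
        // Client dialed a stale occupant of this ip:port (e.g. an old nonce);
        // reject early rather than resetting an unrelated healthy session.
        std::cout << "ident mismatch: client wanted " << ident.target_addr
                  << ", but I am " << my_addrs << "\n";
        return false;
      }
      return true;
    }

    int main() {
      ClientIdent ident{"v2:172.21.15.29:6828/114228",   // osd.3
                        "v2:172.21.15.29:6820/35057"};   // stale target
      target_matches(ident, "v2:172.21.15.29:6820/114039");  // osd.2 rejects
    }

The harder option would surface the same mismatch on the client instead: with server_ident arriving right after auth_done, the client can compare it against the address it dialed and disconnect on its own.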

Actions #3

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Fix Under Review

Actions #4

Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to Resolved

Actions #5

Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers