Bug #23957

msg/async: read connect reply failed, but not retry

Added by David Zafman about 6 years ago. Updated over 4 years ago.

Status: New
Priority: High
Assignee: -
Category: AsyncMessenger
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On the sending side:

2018-05-01 02:00:44.701 7f65dcb55700 10 osd.5 20 send_incremental_map 19 -> 20 to 0x55c34a5f2760 172.21.15.50:6806/12882
2018-05-01 02:00:44.701 7f65dcb55700  1 -- 172.21.15.166:6805/12966 --> 172.21.15.50:6806/12882 -- osd_map(20..20 src has 1..20) v4 -- ?+0 0x55c34a8fa840 con 0x55c34a5f2760
2018-05-01 02:00:44.701 7f65dcb55700  1 -- 172.21.15.166:6805/12966 --> 172.21.15.50:6806/12882 -- pg_query(2.2 epoch 20) v4 -- ?+0 0x55c34a6a9840 con 0x55c34a5f2760

On the accepting side:
2018-05-01 02:00:03.950 7f7f113bb700 10 osd.1 15  new session 0x555ea53d4780 con=0x555ea53cb800 addr=172.21.15.166:6805/12966
2018-05-01 02:00:03.950 7f7f113bb700 10 osd.1 15  session 0x555ea53d4780 osd.5 has caps osdcap[grant(*)] 'allow *'
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea53cb800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 69
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
...
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  new session 0x555ea2a6bc00 con=0x555ea53c9500 addr=172.21.15.166:6801/12910
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  session 0x555ea2a6bc00 osd.4 has caps osdcap[grant(*)] 'allow *'
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 68
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  new session 0x555ea53d4c80 con=0x555ea53c8e00 addr=172.21.15.50:6810/12951
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  session 0x555ea53d4c80 osd.2 has caps osdcap[grant(*)] 'allow *'
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 66
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 OSD::ms_get_authorizer type=osd
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 OSD::ms_get_authorizer type=osd
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_bulk peer close file descriptor 66
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._process_connection read connect reply failed
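To pick the relevant failures out of a longer OSD log, a small filter along these lines may help. It is only a sketch: the regular expression is inferred from the log lines quoted above, not from the Ceph source, and the names `CONN_RE` and `failed_connect_replies` are made up for illustration.

```python
import re

# Pattern inferred from the excerpt above (an assumption, not a Ceph-defined format):
#   -- <local addr> >> <peer addr> conn(<ptr> :<port> s=<STATE> ...).<event text>
CONN_RE = re.compile(
    r"-- (?P<local>\S+) >> (?P<peer>\S+) "
    r"conn\((?P<conn>0x[0-9a-f]+) :(?P<port>-?\d+) s=(?P<state>\w+)[^)]*\)\."
    r"(?P<event>.*)$"
)


def failed_connect_replies(lines):
    """Return (peer, conn pointer, state) for each 'read connect reply failed' line."""
    hits = []
    for line in lines:
        m = CONN_RE.search(line.strip())
        if m and m.group("event").endswith("read connect reply failed"):
            hits.append((m.group("peer"), m.group("conn"), m.group("state")))
    return hits
```

Passing an open log file (or any iterable of lines) to `failed_connect_replies` lists the outgoing connections that gave up while waiting for the connect reply, which can then be checked by hand for a later reconnect attempt to the same peer.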

/a/dzafman-2018-04-30_18:45:29-rados:thrash-wip-zafman-testing-distro-basic-smithi/2458469

http://pulpito.ceph.com/dzafman-2018-04-30_18:45:29-rados:thrash-wip-zafman-testing-distro-basic-smithi/2458469

rados:thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/default.yaml backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/upmap.yaml msgr-failures/fastclose.yaml msgr/random.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/none.yaml thrashosds-health.yaml workloads/radosbench.yaml}

radosbench times out with pgid 2.2 stuck in state "creating+peering".
