
Bug #23957

msg/async: read connect reply failed, but not retry

Added by David Zafman almost 6 years ago. Updated over 4 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
AsyncMessenger
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On the sending side:

2018-05-01 02:00:44.701 7f65dcb55700 10 osd.5 20 send_incremental_map 19 -> 20 to 0x55c34a5f2760 172.21.15.50:6806/12882
2018-05-01 02:00:44.701 7f65dcb55700  1 -- 172.21.15.166:6805/12966 --> 172.21.15.50:6806/12882 -- osd_map(20..20 src has 1..20) v4 -- ?+0 0x55c34a8fa840 con 0x55c34a5f2760
2018-05-01 02:00:44.701 7f65dcb55700  1 -- 172.21.15.166:6805/12966 --> 172.21.15.50:6806/12882 -- pg_query(2.2 epoch 20) v4 -- ?+0 0x55c34a6a9840 con 0x55c34a5f2760

On the accepting side:
2018-05-01 02:00:03.950 7f7f113bb700 10 osd.1 15  new session 0x555ea53d4780 con=0x555ea53cb800 addr=172.21.15.166:6805/12966
2018-05-01 02:00:03.950 7f7f113bb700 10 osd.1 15  session 0x555ea53d4780 osd.5 has caps osdcap[grant(*)] 'allow *'
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea53cb800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 69
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f113bb700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
...
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  new session 0x555ea2a6bc00 con=0x555ea53c9500 addr=172.21.15.166:6801/12910
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  session 0x555ea2a6bc00 osd.4 has caps osdcap[grant(*)] 'allow *'
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 68
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  new session 0x555ea53d4c80 con=0x555ea53c8e00 addr=172.21.15.50:6810/12951
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15  session 0x555ea53d4c80 osd.2 has caps osdcap[grant(*)] 'allow *'
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 66
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 OSD::ms_get_authorizer type=osd
2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 OSD::ms_get_authorizer type=osd
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_bulk peer close file descriptor 66
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until read failed
2018-05-01 02:00:03.950 7f7f11bbc700  1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._process_connection read connect reply failed

/a/dzafman-2018-04-30_18:45:29-rados:thrash-wip-zafman-testing-distro-basic-smithi/2458469

http://pulpito.ceph.com/dzafman-2018-04-30_18:45:29-rados:thrash-wip-zafman-testing-distro-basic-smithi/2458469

rados:thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/default.yaml backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/upmap.yaml msgr-failures/fastclose.yaml msgr/random.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/none.yaml thrashosds-health.yaml workloads/radosbench.yaml}

radosbench times out with pgid 2.2 in state "creating+peering"
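The subject line says the problem is that after "read connect reply failed" the connection does not retry, leaving the PG stuck in creating+peering. A minimal sketch of the intended retry behavior, using hypothetical names and states loosely mirroring the log's STATE_CONNECTING / STATE_CONNECTING_WAIT_CONNECT_REPLY (this is an illustration, not Ceph's actual AsyncConnection code):

```cpp
#include <cassert>

// Hypothetical, simplified connection states modeled on the log output.
enum class State { Connecting, WaitConnectReply, Open, Closed };

struct Conn {
    State state = State::WaitConnectReply;
    int reconnect_attempts = 0;
    static constexpr int kMaxRetries = 3;  // assumed retry budget

    // Called when reading the connect reply fails because the peer closed
    // the socket. The reported bug is that the connection stays dead here;
    // the sketched fix is to go back to Connecting and redo the handshake.
    void on_read_reply_failed() {
        if (++reconnect_attempts <= kMaxRetries)
            state = State::Connecting;   // retry the handshake
        else
            state = State::Closed;       // give up after a few attempts
    }
};
```

With this shape, a transient peer-side close during the handshake costs one reconnect round trip instead of wedging the connection until an outside event re-drives it.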

History

#1 Updated by Josh Durgin almost 6 years ago

  • Priority changed from Normal to High

#2 Updated by Sage Weil almost 6 years ago

  • Subject changed from PG stuck peering+creating to msg/async: read connect reply failed, but not retry
  • Description updated (diff)
  • Status changed from New to 12

#3 Updated by Sage Weil almost 6 years ago

Another instance. On the receiving end:

2018-06-26 17:43:25.316 7f6d463c5700  1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ea8000 :6813 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_STANDBY
2018-06-26 17:43:25.316 7f6d463c5700 10 osd.7 14  session 0x55ac52d9cc00 osd.3 has caps osdcap[grant(*)] 'allow *'
2018-06-26 17:43:25.316 7f6d463c5700  1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ea8000 :6813 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=0 existing_state=STATE_STANDBY
2018-06-26 17:43:25.316 7f6d463c5700  1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ea8000 :6813 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2018-06-26 17:43:25.316 7f6d358be700  2 osd.7 14 ms_handle_reset con 0x55ac52ea8000 session 0x55ac52d9cc00
2018-06-26 17:43:25.316 7f6d463c5700  0 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ee5200 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).read_until injecting socket failure
2018-06-26 17:43:25.316 7f6d463c5700 10 osd.7 14  new session 0x55ac52ef1600 con=0x55ac52ee5200 addr=172.21.15.104:6814/30853
2018-06-26 17:43:25.316 7f6d463c5700 10 osd.7 14  session 0x55ac52ef1600 osd.3 has caps osdcap[grant(*)] 'allow *'
2018-06-26 17:43:25.316 7f6d463c5700  1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ee5200 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=16 cs=2 l=0)._try_send send error: (32) Broken pipe
2018-06-26 17:43:25.316 7f6d358be700  2 osd.7 14 ms_handle_reset con 0x55ac52ee5200 session 0x55ac52ef1600

/a/sage-2018-06-26_15:58:10-rados-wip-sage3-testing-2018-06-26-0616-distro-basic-smithi/2706809
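The "accept connect_seq 1 vs existing csq=0 ... stop myself to swap existing" lines reflect the arbitration between an incoming connect and an existing racing connection: an incoming connect_seq ahead of the existing one wins, and the existing connection is torn down and replaced. A rough illustration of that comparison (hypothetical helper; the real handle_connect_msg also weighs peer addresses, global_seq, and lossy/lossless policy):

```cpp
#include <cassert>

// Hypothetical outcome of comparing an incoming connect_seq against the
// connect_seq of an existing connection to the same peer.
enum class Action { ReplaceExisting, AskPeerToRetry, KeepExisting };

Action arbitrate(unsigned incoming_cs, unsigned existing_cs) {
    if (incoming_cs > existing_cs)
        return Action::ReplaceExisting;  // "stop myself to swap existing"
    if (incoming_cs < existing_cs)
        return Action::AskPeerToRetry;   // peer is behind; reply RETRY_SESSION
    return Action::KeepExisting;         // equal seq: tie-break elsewhere
}
```

In the log above the incoming connect carried cs=1 against an existing csq=0, so the acceptor closed its own racing connection ("stop myself") to adopt the new one, which is when the injected socket failure landed.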

#4 Updated by Sage Weil almost 6 years ago

also /a/sage-2018-06-26_15:58:10-rados-wip-sage3-testing-2018-06-26-0616-distro-basic-smithi/2706939 (same test run!)

#5 Updated by Sage Weil over 5 years ago

  • Priority changed from High to Urgent

Another instance. On the receiving end:

teuthology:2910414  02:04 PM $ grep -a 172.21.15.48:6805/10418 remote/*/log/*osd.0*
2018-08-17 06:38:42.871 7f2a4407f700  0 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73bc300 legacy :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
2018-08-17 06:38:42.871 7f2a4407f700 10 osd.0 13  new session 0x55e8f73fe000 con=0x55e8f73bc300 addr=172.21.15.48:6805/10418
2018-08-17 06:38:42.875 7f2a4407f700  1 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73bc300 legacy :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
2018-08-17 06:38:42.875 7f2a4407f700  0 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73f2000 legacy :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).read_until injecting socket failure
2018-08-17 06:38:42.875 7f2a4407f700  1 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73f2000 legacy :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._try_send send error: (32) Broken pipe

/a/sage-2018-08-16_19:46:40-rados-wip-sage2-testing-2018-08-16-1313-distro-basic-smithi/2910414

#6 Updated by Neha Ojha over 5 years ago

  • Priority changed from Urgent to High

#7 Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers

#8 Updated by Greg Farnum about 5 years ago

  • Category set to AsyncMessenger

#9 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
