Bug #23957
msg/async: read connect reply failed, but not retry
Status:
New
Priority:
High
Assignee:
-
Category:
AsyncMessenger
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
on sending side,
2018-05-01 02:00:44.701 7f65dcb55700 10 osd.5 20 send_incremental_map 19 -> 20 to 0x55c34a5f2760 172.21.15.50:6806/12882 2018-05-01 02:00:44.701 7f65dcb55700 1 -- 172.21.15.166:6805/12966 --> 172.21.15.50:6806/12882 -- osd_map(20..20 src has 1..20) v4 -- ?+0 0x55c34a8fa840 con 0x55c34a5f2760 2018-05-01 02:00:44.701 7f65dcb55700 1 -- 172.21.15.166:6805/12966 --> 172.21.15.50:6806/12882 -- pg_query(2.2 epoch 20) v4 -- ?+0 0x55c34a6a9840 con 0x55c34a5f2760
on accepting side,
2018-05-01 02:00:03.950 7f7f113bb700 10 osd.1 15 new session 0x555ea53d4780 con=0x555ea53cb800 addr=172.21.15.166:6805/12966 2018-05-01 02:00:03.950 7f7f113bb700 10 osd.1 15 session 0x555ea53d4780 osd.5 has caps osdcap[grant(*)] 'allow *' 2018-05-01 02:00:03.950 7f7f113bb700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea53cb800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING 2018-05-01 02:00:03.950 7f7f113bb700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 69 2018-05-01 02:00:03.950 7f7f113bb700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed 2018-05-01 02:00:03.950 7f7f113bb700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6814/13049 conn(0x555ea53c9c00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed ... 2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 new session 0x555ea2a6bc00 con=0x555ea53c9500 addr=172.21.15.166:6801/12910 2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 session 0x555ea2a6bc00 osd.4 has caps osdcap[grant(*)] 'allow *' 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 68 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6801/12910 conn(0x555ea53c9500 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed 2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 new session 0x555ea53d4c80 con=0x555ea53c8e00 addr=172.21.15.50:6810/12951 2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 session 0x555ea53d4c80 osd.2 has caps osdcap[grant(*)] 'allow *' 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 66 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.50:6810/12951 conn(0x555ea53c8e00 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed 2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 OSD::ms_get_authorizer type=osd 2018-05-01 02:00:03.950 7f7f11bbc700 10 osd.1 15 OSD::ms_get_authorizer type=osd 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_bulk peer close file descriptor 66 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until read failed 2018-05-01 02:00:03.950 7f7f11bbc700 1 -- 172.21.15.50:6806/12882 >> 172.21.15.166:6805/12966 conn(0x555ea5420000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._process_connection read connect reply failed
/a/dzafman-2018-04-30_18:45:29-rados:thrash-wip-zafman-testing-distro-basic-smithi/2458469
rados:thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml 2-recovery-overrides/default.yaml backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/upmap.yaml msgr-failures/fastclose.yaml msgr/random.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/none.yaml thrashosds-health.yaml workloads/radosbench.yaml}
radosbench times out with pgid 2.2 in state "creating+peering"
History
#1 Updated by Josh Durgin almost 6 years ago
- Priority changed from Normal to High
#2 Updated by Sage Weil almost 6 years ago
- Subject changed from PG stuck peering+creating to msg/async: read connect reply failed, but not retry
- Description updated (diff)
- Status changed from New to 12
#3 Updated by Sage Weil almost 6 years ago
another instance. on receiving end,
2018-06-26 17:43:25.316 7f6d463c5700 1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ea8000 :6813 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_STANDBY 2018-06-26 17:43:25.316 7f6d463c5700 10 osd.7 14 session 0x55ac52d9cc00 osd.3 has caps osdcap[grant(*)] 'allow *' 2018-06-26 17:43:25.316 7f6d463c5700 1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ea8000 :6813 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=0 existing_state=STATE_STANDBY 2018-06-26 17:43:25.316 7f6d463c5700 1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ea8000 :6813 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing 2018-06-26 17:43:25.316 7f6d358be700 2 osd.7 14 ms_handle_reset con 0x55ac52ea8000 session 0x55ac52d9cc00 2018-06-26 17:43:25.316 7f6d463c5700 0 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ee5200 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).read_until injecting socket failure 2018-06-26 17:43:25.316 7f6d463c5700 10 osd.7 14 new session 0x55ac52ef1600 con=0x55ac52ee5200 addr=172.21.15.104:6814/30853 2018-06-26 17:43:25.316 7f6d463c5700 10 osd.7 14 session 0x55ac52ef1600 osd.3 has caps osdcap[grant(*)] 'allow *' 2018-06-26 17:43:25.316 7f6d463c5700 1 -- 172.21.15.134:6813/30909 >> 172.21.15.104:6814/30853 conn(0x55ac52ee5200 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=16 cs=2 l=0)._try_send send error: (32) Broken pipe 2018-06-26 17:43:25.316 7f6d358be700 2 osd.7 14 ms_handle_reset con 0x55ac52ee5200 session 0x55ac52ef1600
/a/sage-2018-06-26_15:58:10-rados-wip-sage3-testing-2018-06-26-0616-distro-basic-smithi/2706809
#4 Updated by Sage Weil almost 6 years ago
also /a/sage-2018-06-26_15:58:10-rados-wip-sage3-testing-2018-06-26-0616-distro-basic-smithi/2706939 (same test run!)
#5 Updated by Sage Weil over 5 years ago
- Priority changed from High to Urgent
Another instance. On receiving end,
teuthology:2910414 02:04 PM $ grep -a 172.21.15.48:6805/10418 remote/*/log/*osd.0* 2018-08-17 06:38:42.871 7f2a4407f700 0 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73bc300 legacy :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer 2018-08-17 06:38:42.871 7f2a4407f700 10 osd.0 13 new session 0x55e8f73fe000 con=0x55e8f73bc300 addr=172.21.15.48:6805/10418 2018-08-17 06:38:42.875 7f2a4407f700 1 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73bc300 legacy :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY 2018-08-17 06:38:42.875 7f2a4407f700 0 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73f2000 legacy :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).read_until injecting socket failure 2018-08-17 06:38:42.875 7f2a4407f700 1 -- 172.21.15.16:6806/10338 >> 172.21.15.48:6805/10418 conn(0x55e8f73f2000 legacy :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._try_send send error: (32) Broken pipe
/a/sage-2018-08-16_19:46:40-rados-wip-sage2-testing-2018-08-16-1313-distro-basic-smithi/2910414
#6 Updated by Neha Ojha over 5 years ago
- Priority changed from Urgent to High
#7 Updated by Greg Farnum about 5 years ago
- Project changed from RADOS to Messengers
#8 Updated by Greg Farnum about 5 years ago
- Category set to AsyncMessenger
#9 Updated by Patrick Donnelly over 4 years ago
- Status changed from 12 to New