Bug #36612

closed

msg/async: connection stall

Added by Sage Weil over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: AsyncMessenger
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-10-28 23:23:07.173 7ff73e5b8700  1 -- 172.21.15.118:6813/34020 <== osd.0 172.21.15.118:6801/97454 124 ==== rep_scrubmap(1.c e310 from shard 0) v2 ==== 40+0+42 (2203346794 0 3641534724) 0x561093d72d80 con 0x561092add800
2018-10-28 23:23:07.173 7ff71e4f0700 20 osd.3 969 share_map osd.0 172.21.15.118:6801/97454 310
2018-10-28 23:23:07.173 7ff71e4f0700 20 osd.3 969 should_share_map osd.0 172.21.15.118:6801/97454 310
2018-10-28 23:23:07.173 7ff7224f8700  1 -- 172.21.15.118:6813/34020 --> 172.21.15.118:6801/97454 -- MOSDScrubReserve(1.c RELEASE e969) v1 -- 0x56109353aa00 con 0
2018-10-28 23:23:07.173 7ff7224f8700  1 -- 172.21.15.118:6813/34020 --> 172.21.15.118:6801/97454 -- pg_info((query:969 sent:969 1.c( empty local-lis/les=310/311 n=0 ec=282/14 lis/c 310/310 les/c/f 311/311/0 310/310/14))=([0,0] intervals=) epoch 969) v5 -- 0x561092990f00 con 0
2018-10-28 23:23:54.655 7ff73e5b8700  1 -- 172.21.15.118:6813/34020 >> 172.21.15.118:6801/97454 conn(0x561092add800 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 46
2018-10-28 23:23:54.655 7ff73e5b8700  1 -- 172.21.15.118:6813/34020 >> 172.21.15.118:6801/97454 conn(0x561092add800 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2018-10-28 23:23:54.655 7ff73e5b8700  1 -- 172.21.15.118:6813/34020 >> 172.21.15.118:6801/97454 conn(0x561092add800 legacy :-1 s=OPENED pgs=3 cs=1 l=0).handle_message read tag failed

On the other end (osd.0):
2018-10-28 23:23:07.173 7fdedfddc700  1 -- 172.21.15.118:6801/97454 --> 172.21.15.118:6813/34020 -- rep_scrubmap(1.c e310 from shard 0) v2 -- 0x55c5621bb200 con 0
2018-10-28 23:23:07.174 7fdf03eac700  1 -- 172.21.15.118:6801/97454 <== osd.3 172.21.15.118:6813/34020 225 ==== MOSDScrubReserve(1.c RELEASE e969) v1 ==== 43+0+0 (1924339382 0 0) 0x55c55ea32000 con 0x55c55e29b000
2018-10-28 23:23:07.174 7fdf03eac700  1 -- 172.21.15.118:6801/97454 <== osd.3 172.21.15.118:6813/34020 226 ==== pg_info((query:969 sent:969 1.c( empty local-lis/les=310/311 n=0 ec=282/14 lis/c 310/310 les/c/f 311/311/0 310/310/14))=([0,0] intervals=) epoch 969) v5 ==== 961+0+0 (630120914 0 0) 0x55c56338b860 con 0x55c55e29b000
2018-10-28 23:23:07.174 7fdee3de4700 20 osd.0 969 share_map osd.3 172.21.15.118:6813/34020 969
2018-10-28 23:23:07.174 7fdee3de4700 20 osd.0 969 should_share_map osd.3 172.21.15.118:6813/34020 969

/a/sage-2018-10-28_18:53:56-rados-wip-sage2-testing-2018-10-28-0942-distro-basic-smithi/3197235
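A minimal Python sketch for measuring how long the osd.3 side of the connection sat idle in the excerpt. It parses the leading timestamps of the last logged outgoing message (the `pg_info` send) and the later `read_bulk peer close` line and prints the difference. The helper names `log_time` and `idle_gap_seconds` are hypothetical, and the calculation assumes the excerpt shows all logged traffic on that connection in between, which a trimmed log may not.

```python
from datetime import datetime

# Prefixes of two osd.3 log lines from the excerpt: the last logged send to
# osd.0 and the later read failure reported by read_bulk.
LAST_TRAFFIC = "2018-10-28 23:23:07.173 7ff7224f8700  1 -- 172.21.15.118:6813/34020 --> 172.21.15.118:6801/97454"
PEER_CLOSE = "2018-10-28 23:23:54.655 7ff73e5b8700  1 -- 172.21.15.118:6813/34020 >> 172.21.15.118:6801/97454"

# The log lines in this excerpt start with "YYYY-MM-DD HH:MM:SS.mmm";
# strptime's %f accepts 1 to 6 fractional digits.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def log_time(line: str) -> datetime:
    """Parse the leading date and time fields of a log line (hypothetical helper)."""
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", FMT)


def idle_gap_seconds(first: str, second: str) -> float:
    """Seconds elapsed between two log lines (hypothetical helper)."""
    return (log_time(second) - log_time(first)).total_seconds()


if __name__ == "__main__":
    gap = idle_gap_seconds(LAST_TRAFFIC, PEER_CLOSE)
    print(f"no logged traffic for {gap:.3f} s before the peer closed the socket")
```

For these two lines it reports a gap of 47.482 s.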

Related issues: 6 (0 open, 6 closed)

Related to RADOS - Bug #21143: bad RESETSESSION between OSDs? (Duplicate, Haomai Wang, 08/26/2017)
Has duplicate Messengers - Bug #36666: msg: rejoin message queued but not sent (Duplicate)
Has duplicate RADOS - Bug #42058: OSD reconnected across map epochs, inconsistent pg logs created (Duplicate, 09/26/2019, 10/31/2019)
Has duplicate CephFS - Bug #36507: client: connection failure during reconnect causes client to hang (Duplicate, Patrick Donnelly)
Copied to Messengers - Backport #37520: mimic: msg/async: connection stall (Rejected, Nathan Cutler)
Copied to Messengers - Backport #37521: luminous: msg/async: connection stall (Rejected, Nathan Cutler)
