Bug #23329

open

AsyncMessenger loses session during IO performance testing and does not recover until restart

Added by Yong Wang about 6 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: AsyncMessenger
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The server is listening on port 6830.
The client reports no reply from the server.
Checking with ss -np shows that the TCP connection does not exist.
Why can't the async messenger recover the connection until it is restarted?
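
One way to confirm the symptom from the client side is to test whether a new TCP handshake to the OSD port still completes. The following is only a minimal sketch: the helper name can_connect and the 3-second timeout are assumptions for illustration, not part of Ceph. It checks plain TCP reachability and says nothing about the messenger session state.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        # create_connection performs the full connect() and raises OSError
        # (refused, unreachable, timed out) if the handshake fails.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the setup in this report, the call would be can_connect("10.0.30.27", 6830). A True result with no replies at the messenger level would suggest the listener still accepts connections and the problem lies above TCP.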

ceph version 12.2.2

centos 7.3

default ceph.conf

ms_type = async+posix

2018-03-13 14:58:14.647389 7f9c9c55c700 1 -- 10.0.30.26:6848/1004050 --> 10.0.30.27:0/4036 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.642824) v4 -- 0x7f9cbdc68000 con 0
2018-03-13 14:58:14.647435 7f9c9cd5d700 1 -- 10.0.30.26:6846/1004050 <== osd.44 10.0.30.27:0/4036 26192 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.642824) v4 ==== 2004+0+0 (230542212 0 0) 0x7f9cca254200 con 0x7f9cba551800
2018-03-13 14:58:14.647473 7f9c9cd5d700 1 -- 10.0.30.26:6846/1004050 --> 10.0.30.27:0/4036 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.642824) v4 -- 0x7f9cd11f2c00 con 0
2018-03-13 14:58:14.713552 7f9c9c55c700 1 -- 10.0.30.26:6846/1004050 <== osd.19 10.0.30.25:0/3012 26102 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.707686) v4 ==== 2004+0+0 (2179933530 0 0) 0x7f9cd7427e00 con 0x7f9cbb2e4000
2018-03-13 14:58:14.713469 7f9c9cd5d700 1 -- 10.0.30.26:6848/1004050 <== osd.19 10.0.30.25:0/3012 26102 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.707686) v4 ==== 2004+0+0 (2179933530 0 0) 0x7f9cbdc6ba00 con 0x7f9cba458000
2018-03-13 14:58:14.713627 7f9c9c55c700 1 -- 10.0.30.26:6846/1004050 --> 10.0.30.25:0/3012 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.707686) v4 -- 0x7f9cbdc68000 con 0
2018-03-13 14:58:14.713786 7f9c9cd5d700 1 -- 10.0.30.26:6848/1004050 --> 10.0.30.25:0/3012 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.707686) v4 -- 0x7f9cd11f2c00 con 0
2018-03-13 14:58:14.714629 7f9c9d55e700 1 -- 10.0.30.26:6846/1004050 <== osd.0 10.0.30.25:0/3016 26185 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.708800) v4 ==== 2004+0+0 (3781320873 0 0) 0x7f9ce455da00 con 0x7f9cba561000
2018-03-13 14:58:14.714679 7f9c9d55e700 1 -- 10.0.30.26:6846/1004050 --> 10.0.30.25:0/3016 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.708800) v4 -- 0x7f9cc09afc00 con 0
2018-03-13 14:58:14.714821 7f9c9cd5d700 1 -- 10.0.30.26:6848/1004050 <== osd.0 10.0.30.25:0/3016 26185 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.708800) v4 ==== 2004+0+0 (3781320873 0 0) 0x7f9cca254200 con 0x7f9cba698800
2018-03-13 14:58:14.714863 7f9c9cd5d700 1 -- 10.0.30.26:6848/1004050 --> 10.0.30.25:0/3016 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.708800) v4 -- 0x7f9cd11f2c00 con 0
2018-03-13 14:58:14.766279 7f9c9d55e700 1 -- 10.0.30.26:6846/1004050 <== osd.46 10.0.30.27:0/4040 26123 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.761398) v4 ==== 2004+0+0 (3946795183 0 0) 0x7f9cc76eb400 con 0x7f9cba553000
2018-03-13 14:58:14.766317 7f9c9d55e700 1 -- 10.0.30.26:6846/1004050 --> 10.0.30.27:0/4040 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.761398) v4 -- 0x7f9cc09afc00 con 0
2018-03-13 14:58:14.766445 7f9c9c55c700 1 -- 10.0.30.26:6848/1004050 <== osd.46 10.0.30.27:0/4040 26123 ==== osd_ping(ping e339 stamp 2018-03-13 14:58:14.761398) v4 ==== 2004+0+0 (3946795183 0 0) 0x7f9cd9eef600 con 0x7f9cba8e9800
2018-03-13 14:58:14.766496 7f9c9c55c700 1 -- 10.0.30.26:6848/1004050 --> 10.0.30.27:0/4040 -- osd_ping(ping_reply e339 stamp 2018-03-13 14:58:14.761398) v4 -- 0x7f9cbdc68000 con 0
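
To check a longer log for heartbeats that were received but never answered, the osd_ping lines can be paired by local address, peer address, and stamp. This is a rough sketch that assumes the debug ms = 1 line layout shown in the excerpt above; the function name unanswered_pings is hypothetical.

```python
import re

# Matches both directions of osd_ping traffic as printed in the excerpt:
#   -- <local> <== osd.N <peer> <seq> ==== osd_ping(ping eNNN stamp <ts>) ...
#   -- <local> --> <peer> -- osd_ping(ping_reply eNNN stamp <ts>) ...
PING_RE = re.compile(
    r"-- (?P<local>\S+) (?P<dir><==|-->) (?:osd\.\d+ )?(?P<peer>\S+) .*?"
    r"osd_ping\((?P<kind>ping_reply|ping) e\d+ stamp (?P<stamp>[\d-]+ [\d:.]+)\)"
)


def unanswered_pings(lines):
    """Return sorted (local, peer, stamp) keys of pings with no logged reply."""
    pings, replies = set(), set()
    for line in lines:
        m = PING_RE.search(line)
        if m is None:
            continue  # not an osd_ping line
        key = (m["local"], m["peer"], m["stamp"])
        if m["dir"] == "<==" and m["kind"] == "ping":
            pings.add(key)
        elif m["dir"] == "-->" and m["kind"] == "ping_reply":
            replies.add(key)
    # Set difference tolerates log lines that are slightly out of order.
    return sorted(pings - replies)
```

Feeding the whole OSD log through this function, for example unanswered_pings(open(path)), would list the heartbeats that were received without a logged reply, which may help narrow down when the session stopped responding.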

netstat -tanp|grep 4026 |grep 6830
tcp 0 0 10.0.30.27:6830 0.0.0.0:* LISTEN 4026/ceph-osd
tcp 0 0 10.0.30.27:6830 10.0.30.25:55486 ESTABLISHED 4026/ceph-osd
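
The netstat output above lists one listener and one established peer on port 6830. As a small sketch for repeating this check across many ports or hosts, the following filters netstat -tanp style text for established connections on a given local port. The function name established_peers is hypothetical, and the column layout is assumed to match the two lines shown.

```python
def established_peers(netstat_output: str, local_port: int):
    """Return foreign addresses with an ESTABLISHED TCP connection to local_port."""
    peers = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and local.rsplit(":", 1)[-1] == str(local_port):
            peers.append(foreign)
    return peers
```

Comparing the returned peers with the clients that are expected to be connected would show which sessions are missing at the TCP level.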
Actions #1

Updated by Greg Farnum about 6 years ago

Can you explain the issue in more detail? Did the accepter thread break or something?

Actions #2

Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers
  • Category deleted (Administration/Usability)
Actions #3

Updated by Greg Farnum about 5 years ago

  • Category set to AsyncMessenger