Bug #42060

Slow ops seen when one ceph private interface is shut down

Added by Nokia ceph-users over 4 years ago. Updated over 4 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Environment:
5-node Nautilus cluster
67 OSDs per node - 4 TB HDD per OSD

We are testing a use case where we shut down the private (cluster) interface on one of the nodes and then check the status of the cluster. After the private interface is shut down and all OSDs belonging to that node are marked down, we constantly see 'slow ops' reported. We ran the same use case on Luminous as well, but we did not see this issue there. A rough sketch of the reproduction steps is given below.
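For reference, a minimal sketch of the reproduction, assuming the private/cluster network interface on CN1 is eth1 (the interface name here is illustrative, not the actual one from our setup):

# On CN1: take the private (cluster network) interface down
ip link set eth1 down

# On a monitor node: watch the cluster react; the node's 67 OSDs
# eventually get marked down and SLOW_OPS warnings start appearing
ceph -s
ceph health detail | grep -i slow

# Later, bring the interface back up
ip link set eth1 up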

The interface of CN1 was shut down at Mon Sep 23 16:58:38 UTC 2019.
At 2019-09-23 17:01:40 the cluster reported that 1 host was down (67 osds down).
However, until we brought the interface back up at Mon Sep 23 17:26:40 UTC 2019, we continuously saw these slow ops messages.

We also see OSDs being repeatedly reported failed and then having the failure report cancelled, as in the example below, even after the log reported that 1 host is down.

2019-09-23 17:10:05.713754 mon.cn1 (mon.0) 144660 : cluster [DBG] osd.257 reported failed by osd.42
2019-09-23 17:10:08.563028 mon.cn1 (mon.0) 144987 : cluster [DBG] osd.257 failure report canceled by osd.39
2019-09-23 17:10:09.812303 mon.cn1 (mon.0) 145132 : cluster [DBG] osd.257 failure report canceled by osd.42
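This flapping pattern can be pulled out of the attached cluster log with something like the following (osd.257 is just the example OSD from above):

# List failure reports and cancellations for osd.257 with timestamps
zgrep -E 'osd\.257 (reported failed|failure report canceled)' ceph.log-20190924-extract.gz

# Count how many such events occurred during the test window
zgrep -E 'osd\.257 (reported failed|failure report canceled)' ceph.log-20190924-extract.gz | wc -l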

Attaching the relevant ceph log extract, ceph.conf, and ceph osd tree output.


Files

ceph.log-20190924-extract.gz (626 KB) ceph.log-20190924-extract.gz Nokia ceph-users, 09/26/2019 07:21 AM
ceph.conf (2.25 KB) ceph.conf Nokia ceph-users, 09/26/2019 07:24 AM
ceph-osd-tree.txt (21.2 KB) ceph-osd-tree.txt Nokia ceph-users, 09/26/2019 10:25 AM