Bug #42060

Slow ops seen when one ceph private interface is shut down

Added by Nokia ceph-users over 4 years ago. Updated over 4 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Environment:
5-node Nautilus cluster
67 OSDs per node - 4 TB HDD per OSD

We are testing a use case where we shut down the private (cluster) interface on one of the nodes and then check the status of the cluster. After the private interface is shut down and all OSDs belonging to that node are marked down, we constantly see 'slow ops' reported. We ran the same use case on Luminous as well, but we did not see this issue there. A rough sketch of the reproduction steps is given below.
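For reference, a minimal sketch of the reproduction, assuming the private/cluster network interface on CN1 is eth1 (the interface name here is illustrative, not the actual one from our setup):

# On CN1: take the private (cluster network) interface down
ip link set eth1 down

# On a monitor node: watch the cluster react; the node's 67 OSDs
# eventually get marked down and SLOW_OPS warnings start appearing
ceph -s
ceph health detail | grep -i slow

# Later, bring the interface back up
ip link set eth1 up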

The interface of CN1 was shut down at Mon Sep 23 16:58:38 UTC 2019.
At 2019-09-23 17:01:40 the cluster reported that 1 host was down (67 osds down).
However, until we brought the interface back up at Mon Sep 23 17:26:40 UTC 2019, we continuously saw these slow ops messages.

We also see OSDs being repeatedly reported failed and then having the failure report cancelled, as in the example below, even after the log reported that 1 host is down.

2019-09-23 17:10:05.713754 mon.cn1 (mon.0) 144660 : cluster [DBG] osd.257 reported failed by osd.42
2019-09-23 17:10:08.563028 mon.cn1 (mon.0) 144987 : cluster [DBG] osd.257 failure report canceled by osd.39
2019-09-23 17:10:09.812303 mon.cn1 (mon.0) 145132 : cluster [DBG] osd.257 failure report canceled by osd.42
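This flapping pattern can be pulled out of the attached cluster log with something like the following (osd.257 is just the example OSD from above):

# List failure reports and cancellations for osd.257 with timestamps
zgrep -E 'osd\.257 (reported failed|failure report canceled)' ceph.log-20190924-extract.gz

# Count how many such events occurred during the test window
zgrep -E 'osd\.257 (reported failed|failure report canceled)' ceph.log-20190924-extract.gz | wc -l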

Attaching the relevant ceph log extract, ceph.conf, and ceph osd tree output.


Files

ceph.log-20190924-extract.gz (626 KB) ceph.log-20190924-extract.gz Nokia ceph-users, 09/26/2019 07:21 AM
ceph.conf (2.25 KB) ceph.conf Nokia ceph-users, 09/26/2019 07:24 AM
ceph-osd-tree.txt (21.2 KB) ceph-osd-tree.txt Nokia ceph-users, 09/26/2019 10:25 AM