Bug #42060 - Slow ops seen when one ceph private interface is shut down

Added by Nokia ceph-users over 4 years ago. Updated over 4 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Environment:
5-node Nautilus cluster
67 OSDs per node, one 4 TB HDD per OSD

We are trying a use case where we shut down the private interface on one of the nodes and check the status of the cluster. After the private interface is shut down and all OSDs belonging to that node are marked down, we constantly see 'slow ops' reported. We ran the same use case on Luminous as well, but we didn't see this issue there.

The interface on CN1 was shut down at Mon Sep 23 16:58:38 UTC 2019.
At 2019-09-23 17:01:40 the cluster reported that 1 host is down (67 osds down).
But until we brought the interface back up at Mon Sep 23 17:26:40 UTC 2019, we continuously saw these slow ops messages.

We also see OSDs being reported as failed and the failure reports then being canceled, as below, even after the log reported that 1 host is down.

2019-09-23 17:10:05.713754 mon.cn1 (mon.0) 144660 : cluster [DBG] osd.257 reported failed by osd.42
2019-09-23 17:10:08.563028 mon.cn1 (mon.0) 144987 : cluster [DBG] osd.257 failure report canceled by osd.39
2019-09-23 17:10:09.812303 mon.cn1 (mon.0) 145132 : cluster [DBG] osd.257 failure report canceled by osd.42

Attaching the respective ceph log, ceph.conf, and ceph osd tree output.


Files

ceph.log-20190924-extract.gz (626 KB), Nokia ceph-users, 09/26/2019 07:21 AM
ceph.conf (2.25 KB), Nokia ceph-users, 09/26/2019 07:24 AM
ceph-osd-tree.txt (21.2 KB), Nokia ceph-users, 09/26/2019 10:25 AM
#1 - Updated by Nokia ceph-users over 4 years ago

Hi,
When I say private network, I am referring to the cluster_network.
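
For reference, a minimal ceph.conf sketch of how the two networks are typically declared; the subnets below are placeholders for illustration and are not taken from the attached ceph.conf:

    [global]
    # public_network carries client and monitor traffic; cluster_network
    # (the "private" network referred to here) carries OSD replication.
    # OSD heartbeats are sent over both networks.
    public_network  = 192.168.1.0/24    # placeholder subnet
    cluster_network = 192.168.2.0/24    # placeholder subnet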

#2 - Updated by Greg Farnum over 4 years ago

  • Status changed from New to Need More Info

What workload are you running; does it have its own metrics? Is there evidence that Nautilus is slower or behaving worse than Luminous apart from the slow op reports? (i.e., could it be improved transparency rather than worse behavior?)

Did the OSDs on the host with a down cluster network stay down, or did they try to turn back on? OSD failure reports getting canceled can happen if the OSD is genuinely slow but staying alive.

#3 - Updated by Nokia ceph-users over 4 years ago

We monitor RADOS outage in both scenarios.
For Luminous, when the interface was shut down: ~60 seconds of RADOS outage.
For Nautilus: ~84 seconds of RADOS outage.

Yes, the OSDs on the respective host tried to turn back on. We came to know that the heartbeats for the OSDs happen via both the cluster network and the public network. Here we have stopped only the cluster network; the public network is still up, so the heartbeats will still succeed via the public network. Could it be that the OSD failure reports are getting canceled because of this?
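
For reference, these are the OSD-side heartbeat settings that determine how quickly a peer is reported as failed; the values shown are believed to be the Nautilus defaults, not values taken from this cluster's ceph.conf:

    [osd]
    # How often an OSD pings its heartbeat peers, in seconds (assumed default).
    osd_heartbeat_interval = 6
    # How long a peer may go unanswered (on both front and back networks)
    # before it is reported as failed to the monitors, in seconds (assumed default).
    osd_heartbeat_grace = 20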

#4 - Updated by Greg Farnum over 4 years ago

Do the OSDs ever stay down once their cluster network is disabled?

Generally speaking, if they only have the cluster network down, they can delay getting marked down but it should proceed eventually. It's possible there's a bug with the heartbeating when the network states disagree, though.
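
For context, a sketch of the monitor-side settings that govern how failure reports are counted before an OSD is marked down; the values shown are believed to be the defaults, not this cluster's configuration:

    [mon]
    # Number of distinct reporters required before the monitors mark an OSD down
    # (assumed default).
    mon_osd_min_down_reporters = 2
    # Reports are deduplicated at this CRUSH subtree level, so reporters must
    # come from different hosts (assumed default).
    mon_osd_reporter_subtree_level = host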

#5 - Updated by Nokia ceph-users over 4 years ago

Yes, ~3 minutes after disabling the network, the OSDs went down. I brought the network back up after 5 minutes, and until then the OSDs stayed down.

Before the respective OSDs are marked down, OSDs from other nodes get reported as failed and the reports then get canceled within seconds. This cycle repeats multiple times before the expected OSDs go down.
