Bug #46978

closed

OSD: shutdown of a OSD Host causes slow requests

Added by Manuel Lausch over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

When I stop all OSDs on a host, I get slow ops for a few seconds. Sometimes this doesn't happen, and stopping a single OSD mostly works as expected.

I found the osd_fast_shutdown parameter, which defaults to true. In this case the OSD doesn't announce its shutdown to the mons before it stops.

If I stop all OSDs on a host (24 OSDs), I see the following message several hundred times (with different OSD numbers, of course) in my ceph.log:
cluster [DBG] osd.317 reported immediately failed by osd.202
The first message like this appears about 5 to 7 seconds after stopping the OSDs. Once the OSDs are detected as down, the cluster starts peering, and after that everything is fine. But the long detection delay causes slow ops.

If I set the option osd_fast_shutdown to false, I see the following in ceph.log more or less immediately (one message per OSD):
cluster [INF] osd.837 marked itself down
In this case the whole process of detection and peering is much faster, and I don't get any slow ops.
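
To compare the two behaviors in a given ceph.log, the two message types quoted above can be counted per OSD. The following is only a minimal sketch in Python: the function name `summarize_down_detection` and the sample lines are made up for illustration (osd.203 is a hypothetical reporter), and the patterns match only the message text quoted in this report, without assuming anything about the rest of each log line.

```python
import re
from collections import Counter

# Patterns built from the two ceph.log messages quoted in this report.
# Timestamps and other prefixes of real log lines are not assumed here.
FAILED_RE = re.compile(r"osd\.(\d+) reported immediately failed by osd\.(\d+)")
SELF_DOWN_RE = re.compile(r"osd\.(\d+) marked itself down")


def summarize_down_detection(lines):
    """Return (failure_reports, self_marked).

    failure_reports: Counter mapping an OSD id to the number of
        "reported immediately failed" messages about it.
    self_marked: set of OSD ids that logged "marked itself down".
    """
    failure_reports = Counter()
    self_marked = set()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failure_reports[int(match.group(1))] += 1
            continue
        match = SELF_DOWN_RE.search(line)
        if match:
            self_marked.add(int(match.group(1)))
    return failure_reports, self_marked


# Illustrative sample input; the second line is invented for the example.
sample = [
    "cluster [DBG] osd.317 reported immediately failed by osd.202",
    "cluster [DBG] osd.317 reported immediately failed by osd.203",
    "cluster [INF] osd.837 marked itself down",
]

reports, self_marked = summarize_down_detection(sample)
print(dict(reports))
print(sorted(self_marked))
```

On a real log, the lines could be read from the file instead of the sample list. Many failure reports per OSD and an empty self-marked set would correspond to the osd_fast_shutdown = true case described above.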

I found these two PRs:
https://github.com/ceph/ceph/pull/31677
https://github.com/rook/rook/pull/4328

After reading the conversation, I don't understand why the fast shutdown should be faster and better (with regard to down detection) than the normal one. But I can understand that it is not necessary to stop all subsystems.
I wonder whether it would make sense to send the shutdown message to the mons before stopping the OSD, even in the fast shutdown process.

What do you think?

My cluster:
Nautilus (v14.2.11)
44 nodes
1056 OSDs
Each node is connected via a 2x10G LACP channel

Manuel


Related issues: 3 (0 open, 3 closed)

Copied to RADOS - Backport #49681: octopus: OSD: shutdown of a OSD Host causes slow requests (Resolved, Yuri Weinstein)
Copied to RADOS - Backport #49682: nautilus: OSD: shutdown of a OSD Host causes slow requests (Resolved)
Copied to RADOS - Backport #49683: pacific: OSD: shutdown of a OSD Host causes slow requests (Resolved)