Bug #46978
OSD: shutdown of an OSD host causes slow requests (status: closed)
Description
Hi,
While stopping all OSDs on a host I get slow ops for a few seconds. Sometimes this doesn't happen, and when stopping a single OSD it mostly works as expected.
I found the osd_fast_shutdown parameter, which defaults to true. With fast shutdown the OSD does not announce its shutdown to the mons before it stops.
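For reference, this is how I check the current value (the mon config database query and the admin-socket query should both work on Nautilus; osd.0 is just an example id):

# value from the mon config database
ceph config get osd osd_fast_shutdown

# runtime value of one daemon via its admin socket (run on that OSD's host)
ceph daemon osd.0 config get osd_fast_shutdown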
If I stop all OSDs on a host (24 OSDs), I see the following several hundred times (with different OSD numbers, of course) in my ceph.log:
cluster [DBG] osd.317 reported immediately failed by osd.202
The first message like this appears about 5 to 7 seconds after the OSDs are stopped. Once the OSDs are detected as down the cluster starts peering, and after that everything is fine. But the long detection time causes slow ops.
If I set osd_fast_shutdown to false, I see the following more or less immediately in the ceph.log (one message per OSD):
cluster [INF] osd.837 marked itself down
In this case the whole process of down detection and peering is much faster and I don't get any slow ops.
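For testing I switch the option like this (the injectargs call assumes the value is read again at shutdown time, so it takes effect without restarting the daemons; I have not verified that for every release):

# persist the setting in the mon config database
ceph config set osd osd_fast_shutdown false

# push it into the running OSDs as well
ceph tell osd.* injectargs '--osd_fast_shutdown=false'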
I found these two PRs:
https://github.com/ceph/ceph/pull/31677
https://github.com/rook/rook/pull/4328
After reading the conversation I don't understand why fast shutdown should be faster and better (with respect to down detection) than the normal one. I can understand, though, that it is not necessary to cleanly stop all subsystems.
I wonder if it would make sense to send the shutdown message to the mons before stopping the OSD, even in the fast shutdown process.
What do you think?
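As an operator-side interim workaround I could imagine something like the sketch below (untested; ceph-node-01 is a placeholder for the CRUSH host bucket name). The idea is to pre-announce the down state to the mons right before stopping the daemons, so the cluster does not have to wait for the failure reports:

# collect the OSD ids under the host's CRUSH bucket (placeholder host name)
HOST=ceph-node-01
ids=$(ceph osd ls-tree "$HOST")

# mark them down in the osdmap, then stop the daemons immediately
# (running OSDs may re-assert themselves as up, so the stop has to follow promptly)
ceph osd down $ids
systemctl stop ceph-osd.target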
My Cluster:
Nautilus (v14.2.11)
44 Nodes
1056 OSDs
Each node is connected via a 2x10G LACP channel
Manuel