
Bug #46978

OSD: shutdown of a OSD Host causes slow requests

Added by Manuel Lausch 10 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

While stopping all OSDs on a host, I get slow ops for some seconds. Sometimes this doesn't happen, and stopping a single OSD mostly works as expected.

I found the osd_fast_shutdown parameter, which defaults to true. In this case the OSD doesn't announce its shutdown to the mons before it stops.

If I stop all OSDs on a host (24 OSDs), I see the following several hundred times (of course with different OSD numbers) in my ceph.log:
cluster [DBG] osd.317 reported immediately failed by osd.202
The first message like this appears about 5 to 7 seconds after stopping the OSDs. After down detection the cluster starts peering, and after that all is good. But the long detection time causes slow ops.
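As a quick way to quantify such a burst, one can count the failure reports in the log. This is only a sketch: the sample lines below are made up for illustration, and on a real cluster you would point grep at the actual ceph.log instead.

```shell
# Count "reported immediately failed" events in a ceph.log excerpt.
# The sample lines are illustrative, not taken from the cluster above.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-08-17 10:00:05.123 cluster [DBG] osd.317 reported immediately failed by osd.202
2020-08-17 10:00:05.456 cluster [DBG] osd.317 reported immediately failed by osd.118
2020-08-17 10:00:06.001 cluster [DBG] osd.318 reported immediately failed by osd.202
EOF
count=$(grep -c 'reported immediately failed' "$log")
echo "$count"
```

On a host with many OSDs, hundreds of such lines in a few seconds indicate the peer-reported down detection path rather than a self-reported shutdown.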

If I set the option osd_fast_shutdown to false, I see more or less immediately the following in the ceph.log (one message per OSD):
cluster [INF] osd.837 marked itself down
In this case the whole process of detection and peering is much faster, and I don't get any slow ops.
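For reference, the workaround described above can be expressed as a ceph.conf fragment; this is a sketch using only the option name discussed in this report:

```ini
# ceph.conf fragment: disable fast shutdown so each OSD marks itself
# down at the mons before exiting, avoiding the peer-report delay
[osd]
osd_fast_shutdown = false
```

On releases with the centralized config store (Mimic and later, including the Nautilus cluster here), the same setting can also be applied at runtime with `ceph config set osd osd_fast_shutdown false`.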

I found these two PRs:
https://github.com/ceph/ceph/pull/31677
https://github.com/rook/rook/pull/4328

After reading the conversation, I don't understand why the fast shutdown should be faster and better (with respect to down detection) than the normal one. But I can understand that it is not necessary to stop all subsystems.
I wonder if it would make sense to send the shutdown message to the mons before stopping the OSD, even in the fast shutdown process.

What do you think?

My Cluster:
Nautilus (v14.2.11)
44 Nodes
1056 OSDs
each node is connected via 2x10G LACP Channel

Manuel


Related issues

Copied to RADOS - Backport #49681: octopus: OSD: shutdown of a OSD Host causes slow requests Resolved
Copied to RADOS - Backport #49682: nautilus: OSD: shutdown of a OSD Host causes slow requests Resolved
Copied to RADOS - Backport #49683: pacific: OSD: shutdown of a OSD Host causes slow requests Resolved

History

#1 Updated by Mauricio Oliveira 5 months ago

Hi Manuel,

Would you be able to test a patch for this issue?

If so, what OS and Ceph packages/version do you run?

cheers,
Mauricio

#2 Updated by Mauricio Oliveira 5 months ago

Test steps, plus testing with vstart, run-make-check, and qa/run-standalone look good.

https://github.com/ceph/ceph/pull/38909

#3 Updated by Kefu Chai 5 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 38909

#4 Updated by Mauricio Oliveira 5 months ago

If the PR is merged, can this be backported to Octopus/Nautilus? (I can't update the fields.) Thanks!

#5 Updated by Nathan Cutler 5 months ago

  • Subject changed from nautilus: OSD: shutdown of a OSD Host causes slow requests to OSD: shutdown of a OSD Host causes slow requests
  • Backport set to octopus, nautilus

#6 Updated by Dan Hill 5 months ago

  • Backport changed from octopus, nautilus to pacific, octopus, nautilus

#7 Updated by Neha Ojha 4 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#8 Updated by Mauricio Oliveira 3 months ago

The master PR has been merged.

Can someone update Status to Pending Backport, please?

Thanks!

#9 Updated by Igor Fedotov 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#10 Updated by Backport Bot 3 months ago

  • Copied to Backport #49681: octopus: OSD: shutdown of a OSD Host causes slow requests added

#11 Updated by Backport Bot 3 months ago

  • Copied to Backport #49682: nautilus: OSD: shutdown of a OSD Host causes slow requests added

#12 Updated by Backport Bot 3 months ago

  • Copied to Backport #49683: pacific: OSD: shutdown of a OSD Host causes slow requests added

#13 Updated by Mauricio Oliveira 3 months ago

Igor, thanks.

I'd like to / can work on submitting the backport PRs, if that's OK.

In the future, if I want to open backport tracker issues with the
script, is it possible to get access so as not to get ForbiddenError?
(I tried shortly before it was done, and hit this.)

INFO:root:Processing issue list ->46978<-
INFO:root:Processing 1 issues with status Pending Backport
Traceback (most recent call last):
  File "./src/script/backport-create-issue", line 343, in <module>
...
redminelib.exceptions.ForbiddenError: Requested resource is forbidden

#14 Updated by Konstantin Shalygin 3 months ago

Mauricio, just make a backport PR at GitHub, we'll attach it to tracker later.

#15 Updated by Loïc Dachary 3 months ago

Hi Mauricio,

You are welcome to join the Stable Release team on IRC at #ceph-backports to discuss and resolve the issue you have with the script.

Cheers

#16 Updated by Mauricio Oliveira 3 months ago

Hey Konstantin and Loïc,

Understood; thanks!

#17 Updated by Sage Weil 3 months ago

  • Status changed from Pending Backport to Resolved

#18 Updated by singuliere _ 3 months ago

Since this issue is resolved and only the pacific backport was done, I assume it means the octopus & nautilus backports are no longer needed and can be cancelled.

#19 Updated by singuliere _ 3 months ago

  • Backport changed from pacific, octopus, nautilus to pacific

#20 Updated by Mauricio Oliveira 3 months ago

Hi @singuliere _,

Could you please revert the backport field to include Octopus and Nautilus?

Such backports have been done (see 'Related issues'), but are awaiting review.
Only Pacific has been merged at this time, perhaps due to release timing/focus.

There's no comment or discussion in the backport bugs or PRs suggesting
these are no longer needed, so I'd like to ask for the backports to be kept.

Thank you!

#21 Updated by Konstantin Shalygin 3 months ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from pacific to pacific,octopus,nautilus
  • Affected Versions deleted (v14.2.11)

#22 Updated by Konstantin Shalygin 3 months ago

@Mauricio, I have updated the issue backports and status.

#23 Updated by Mauricio Oliveira 3 months ago

Thanks, Konstantin!

#24 Updated by Loïc Dachary 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
