Bug #46978

closed

OSD: shutdown of a OSD Host causes slow requests

Added by Manuel Lausch over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

While stopping all OSDs on a host, I get slow ops for a few seconds. Sometimes this doesn't happen, and stopping a single OSD mostly works as expected.

I found the osd_fast_shutdown parameter, which is true by default. In this case the OSD doesn't announce its shutdown to the mons before it stops.
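
For anyone who wants to check or change this option from a script, here is a minimal sketch. It assumes the `ceph` CLI is on the PATH and that the centralized `ceph config get` / `ceph config set` commands are usable on the cluster; the helper names (`config_get_cmd`, `config_set_cmd`, `run`) are hypothetical and only for illustration.

```python
import subprocess

# Option name as given in this report.
OPTION = "osd_fast_shutdown"


def config_get_cmd(who="osd"):
    """Build the argv that queries the option for the given config section."""
    return ["ceph", "config", "get", who, OPTION]


def config_set_cmd(enabled, who="osd"):
    """Build the argv that sets the option to true or false."""
    return ["ceph", "config", "set", who, OPTION, "true" if enabled else "false"]


def run(cmd):
    """Run a command and return its trimmed stdout (needs a reachable cluster)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the commands; pass them to run() on a real cluster.
    print(" ".join(config_get_cmd()))
    print(" ".join(config_set_cmd(False)))
```

Building the argv separately from running it makes it possible to review the commands before anything touches the cluster.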

If I stop all OSDs on a host (24 OSDs), I see the following several hundred times (with different OSD numbers, of course) in my ceph.log:
cluster [DBG] osd.317 reported immediately failed by osd.202
The first message like this appears about 5 to 7 seconds after stopping the OSD. After down detection, the cluster starts peering, and after that all is good. But the long detection time causes slow ops.

If I set the option osd_fast_shutdown to false, I see the following in ceph.log more or less immediately (one message per OSD):
cluster [INF] osd.837 marked itself down
In this case the whole process of detection and peering is much faster, and I don't get any slow ops.
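
To compare the two behaviours on another cluster, the two message forms quoted above can be counted per OSD. This is a minimal sketch: the regular expressions are derived only from the two sample lines in this report, the layout of anything before `cluster [` (timestamps etc.) is not assumed, and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Patterns based on the two sample lines quoted in this report.
# Everything before "cluster [" is ignored because its layout is not shown here.
FAILED_RE = re.compile(
    r"cluster \[DBG\] (osd\.\d+) reported immediately failed by (osd\.\d+)"
)
SELF_DOWN_RE = re.compile(r"cluster \[INF\] (osd\.\d+) marked itself down")


def summarize(lines):
    """Return (failure reports per OSD, set of OSDs that marked themselves down)."""
    failure_reports = Counter()
    self_marked = set()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failure_reports[match.group(1)] += 1
            continue
        match = SELF_DOWN_RE.search(line)
        if match:
            self_marked.add(match.group(1))
    return failure_reports, self_marked


if __name__ == "__main__":
    sample = [
        "cluster [DBG] osd.317 reported immediately failed by osd.202",
        "cluster [INF] osd.837 marked itself down",
    ]
    reports, marked = summarize(sample)
    print(dict(reports), sorted(marked))
```

On a real log, `summarize(open(path))` can be used with whatever path the cluster log has. OSDs that appear only in the failure reports were detected by their peers, while OSDs in the second set announced their own shutdown.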

I found these two PRs:
https://github.com/ceph/ceph/pull/31677
https://github.com/rook/rook/pull/4328

After reading the conversation, I don't understand why the fast shutdown should be faster and better (with respect to down detection) than the normal one. But I can understand that it is not necessary to stop all subsystems.
I wonder if it would make sense to send the shutdown message to the mons before stopping the OSD even in the fast shutdown process.

What do you think?

My Cluster:
Nautilus (v14.2.11)
44 Nodes
1056 OSDs
each node is connected via 2x10G LACP Channel

Manuel


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #49681: octopus: OSD: shutdown of a OSD Host causes slow requests (Resolved, Yuri Weinstein)
Copied to RADOS - Backport #49682: nautilus: OSD: shutdown of a OSD Host causes slow requests (Resolved)
Copied to RADOS - Backport #49683: pacific: OSD: shutdown of a OSD Host causes slow requests (Resolved)
Actions #1

Updated by Mauricio Oliveira over 3 years ago

Hi Manuel,

Would you be able to test a patch for this issue?

If so, which OS and Ceph packages/version do you run?

cheers,
Mauricio

Actions #2

Updated by Mauricio Oliveira over 3 years ago

Test steps, plus testing with vstart, run-make-check, and qa/run-standalone look good.

https://github.com/ceph/ceph/pull/38909

Actions #3

Updated by Kefu Chai over 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 38909
Actions #4

Updated by Mauricio Oliveira over 3 years ago

If the PR is merged, can this be backported to Octopus/Nautilus? (I can't update the fields.) Thanks!

Actions #5

Updated by Nathan Cutler over 3 years ago

  • Subject changed from nautilus: OSD: shutdown of a OSD Host causes slow requests to OSD: shutdown of a OSD Host causes slow requests
  • Backport set to octopus, nautilus
Actions #6

Updated by Dan Hill about 3 years ago

  • Backport changed from octopus, nautilus to pacific, octopus, nautilus
Actions #7

Updated by Neha Ojha about 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #8

Updated by Mauricio Oliveira about 3 years ago

The master PR has been merged.

Can someone update Status to Pending Backport, please?

Thanks!

Actions #9

Updated by Igor Fedotov about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49681: octopus: OSD: shutdown of a OSD Host causes slow requests added
Actions #11

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49682: nautilus: OSD: shutdown of a OSD Host causes slow requests added
Actions #12

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49683: pacific: OSD: shutdown of a OSD Host causes slow requests added
Actions #13

Updated by Mauricio Oliveira about 3 years ago

Igor, thanks.

I'd like to / can work on submitting the backport PRs, if that's OK.

In the future, if I want to open backport tracker issues with the
script, is it possible to get access so as not to get a ForbiddenError?
(I tried slightly before it was done, and hit this.)

INFO:root:Processing issue list ->46978<-
INFO:root:Processing 1 issues with status Pending Backport
Traceback (most recent call last):
  File "./src/script/backport-create-issue", line 343, in <module>
...
redminelib.exceptions.ForbiddenError: Requested resource is forbidden
Actions #14

Updated by Konstantin Shalygin about 3 years ago

Mauricio, just make a backport PR at GitHub, we'll attach it to tracker later.

Actions #15

Updated by Loïc Dachary about 3 years ago

Hi Mauricio,

You are welcome to join the Stable Release team on IRC at #ceph-backports to discuss and resolve the issue you have with the script.

Cheers

Actions #16

Updated by Mauricio Oliveira about 3 years ago

Hey Konstantin and Loïc,

Understood; thanks!

Actions #17

Updated by Sage Weil about 3 years ago

  • Status changed from Pending Backport to Resolved
Actions #18

Updated by singuliere _ about 3 years ago

Since this issue is resolved and only the pacific backport was done, I assume it means the octopus & nautilus backports are no longer needed and can be cancelled.

Actions #19

Updated by singuliere _ about 3 years ago

  • Backport changed from pacific, octopus, nautilus to pacific
Actions #20

Updated by Mauricio Oliveira about 3 years ago

Hi @singuliere _,

Could you please revert the backport field to include Octopus and Nautilus?

Those backports have been done (see 'Related issues'), but are awaiting review.
Only Pacific has been merged at this time, perhaps due to release timing/focus.

There's no other comment or discussion in the backport bugs or PRs suggesting
that these are no longer needed, so I'd like to ask for the backports to be kept.

Thank you!

Actions #21

Updated by Konstantin Shalygin about 3 years ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from pacific to pacific,octopus,nautilus
  • Affected Versions deleted (v14.2.11)
Actions #22

Updated by Konstantin Shalygin about 3 years ago

@Mauricio, I have updated the issue's backports and status.

Actions #23

Updated by Mauricio Oliveira about 3 years ago

Thanks, Konstantin!

Actions #24

Updated by Loïc Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
