Bug #45792

cephadm: zapped OSD gets re-added to the cluster.

Added by David Capone almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: cephadm
Target version: -
% Done: 0%
Source:
Tags: ux
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 35744
Crash signature (v1):
Crash signature (v2):

Description

Using version 15.2.1 (Octopus) on a cluster running CentOS 8.

When the cluster was initially deployed, OSDs were created using

ceph orch apply osd --all-available-devices

Two weeks later, any drive added to the server is immediately provisioned as an OSD by the cephadm process. The orchestrator apparently never stops looking for new drives to provision. This also presents a problem if you attempt to zap a drive after removing it from the cluster with the intent to physically remove it from the server: the moment the zap completes successfully, cephadm sees the drive as available and reprovisions it as a new OSD.

If this is by design, some documentation on how to stop this background process when a user no longer wants it to deploy new drives would be appreciated.
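
For reference, a minimal sketch of the sequence that triggers this for us (the OSD id, host name, and device path are placeholders; the commands themselves are the standard orchestrator commands):

# OSDs were originally created with the always-on spec:
ceph orch apply osd --all-available-devices

# Later, remove an OSD and zap its device in preparation for pulling the disk:
ceph orch osd rm 5
ceph orch device zap host1 /dev/sdb --force   # host1 and /dev/sdb are placeholders

# Shortly afterwards cephadm sees the zapped device as "available" again
# and redeploys an OSD on it; the OSD count goes back up.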


Related issues 3 (2 open, 1 closed)

Related to Dashboard - Bug #44808: mgr/dashboard: Allow users to specify an unmanaged ServiceSpec when creating OSDs (New)

Related to ceph-volume - Feature #45374: Add support for BLACKLISTED_DEVICES env var parsing (New)

Related to Orchestrator - Bug #45907: cephadm: daemon rm for managed services is completely broken (Resolved)

Actions #1

Updated by David Capone almost 4 years ago

One thing I didn't mention is that cephadm's attempts to apply OSDs to available devices SURVIVE disabling and re-enabling the cephadm module.

If you run ceph mgr module disable cephadm, the behavior stops WHILE cephadm remains disabled; however, the moment cephadm is re-enabled with ceph mgr module enable cephadm, it immediately starts looking for available devices to apply as OSDs.
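
The exact cycle we tried, for reference (only the ordering and the comments are mine; the commands are the ones mentioned above):

ceph mgr module disable cephadm   # auto-provisioning stops while the module is off
# ... remove / zap drives here ...
ceph mgr module enable cephadm    # auto-provisioning resumes almost immediately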

Actions #2

Updated by Kiefer Chang almost 4 years ago

  • Related to Bug #44808: mgr/dashboard: Allow users to specify an unmanaged ServiceSpec when creating OSDs added
Actions #3

Updated by Kiefer Chang almost 4 years ago

When using the command `ceph orch apply osd --all-available-devices` to create OSDs, an OSDSpec is created that applies all usable devices as OSDs.
The spec is "managed" within cephadm, which means cephadm re-applies it repeatedly (every 10 minutes by default).

A PR (https://github.com/ceph/ceph/pull/35084) was created to allow setting the "managed" state to false.
Unfortunately it has not been backported to v15 yet.
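
Once that change is available, the state should be settable at creation time; a rough sketch, assuming the flag lands as --unmanaged (as in current upstream docs) and noting it is not in 15.2.1:

# Create the spec but leave it unmanaged, so cephadm records it
# without continuously re-applying it to newly available devices:
ceph orch apply osd --all-available-devices --unmanaged=true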

Actions #4

Updated by Sebastian Wagner almost 4 years ago

  • Subject changed from cephadm: apply osd all available devices never stops running to cephadm: zapped OSD gets re-added to the cluster.
  • Description updated (diff)
  • Category changed from cephadm (binary) to cephadm
Actions #5

Updated by Sebastian Wagner almost 4 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by Sebastian Wagner almost 4 years ago

  • Priority changed from Urgent to High
Actions #7

Updated by Joshua Schmid almost 4 years ago

I'd say that this is the intended behavior. There is a tracking issue for adding a `blacklisting` command that allows you to 'lock' (or make unavailable) a certain disk.

See: https://tracker.ceph.com/issues/45374

Actions #8

Updated by Jan Fajerski almost 4 years ago

Hmm, there is something racy going on. I also have a cluster deployed with --all-available-devices. I remove an OSD via ceph orch osd rm 5, which works fine; the OSD count goes from 20 to 19. Then I zap that device with ceph orch device zap master /dev/vdb --force, and while that runs, ceph -s reports that I have 20 OSDs in but only 19 up, and it is rebalancing data. The zap comes back successful, but the cluster ends up with 20 OSDs, i.e. my removed and zapped OSD was resurrected.
So the apply seems to race with remove or zap? Certainly not what I expected.

OK, I should have thought for two minutes longer before commenting. I suppose that is the design. While I see (and like) the logic behind it, it is very surprising to unsuspecting users. Maybe blacklisting a device by default after zapping could make this more predictable?

Actions #9

Updated by David Capone almost 4 years ago

That is exactly the behavior we see.

I do not think that is intuitive or should be the expected behavior.

Maybe add an option like --once that can be used with --all-available-devices, which would apply the spec once and then remove it so it doesn't run forever.

Actions #10

Updated by Igor Fedotov almost 4 years ago

We ran into this too; IMO it is quite non-intuitive behavior. One removes an OSD and it reappears again shortly after...

Actions #11

Updated by David Capone almost 4 years ago

As an additional comment on this...

I also think similar logic should apply to any drivespec that is used to apply OSDs. There should be some time limit for which cephadm tries to match the drivespec, after which it should stop.

I foresee a situation where a particular drivespec is applied when a cluster is initially deployed. Six months down the road, new OSD nodes are added to the cluster, and at that point it might be more appropriate to provision the new OSDs manually. The admin does not recall that the initial OSD drivespec was used to configure the cluster and wonders why all of the new drives being added to the cluster have suddenly started autoprovisioning when that was never intended.

An expensive rebalance is then needed to correct the errant / unexpected OSD deployments.

Actions #12

Updated by Joshua Schmid almost 4 years ago

There is an `unmanaged` flag that can be set for any ServiceSpec:

service_type: osd
service_id: foo
...
unmanaged: True

That would prevent cephadm from automatically provisioning new or available disks (any new daemons really).
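
For an already-deployed spec, a possible sketch (assuming `ceph orch ls --export` and `ceph orch apply -i` are available in your release):

# Export the currently deployed OSD spec(s) to a file:
ceph orch ls osd --export > osd-spec.yaml

# Edit osd-spec.yaml and add "unmanaged: true", then re-apply it:
ceph orch apply -i osd-spec.yaml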

re:

I foresee a situation where a particular drivespec is applied when a cluster is initially deployed. Six months down the road, new OSD nodes are added to the cluster, and at that point it might be more appropriate to provision the new OSDs manually. The admin does not recall that the initial OSD drivespec was used to configure the cluster and wonders why all of the new drives being added to the cluster have suddenly started autoprovisioning when that was never intended.

That can also be prevented by setting the `placement` specification so that it explicitly matches the existing hosts (and not newly added ones), or by using `unmanaged: True`.
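
A sketch of such a pinned placement, in the same spec format as above (host names are placeholders):

service_type: osd
service_id: foo
placement:
  hosts:
    - host1   # explicitly list only the existing OSD hosts
    - host2
...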

However, with the increasing number of reports that we get regarding this specific behavior, we probably have to rethink the workflow.
It feels like the 'rolling' fashion in which cephadm acts is counter-intuitive to most users. I'm not sure whether this is due to missing documentation or whether the behavior is just genuinely counter-intuitive.

Actions #13

Updated by Sebastian Wagner almost 4 years ago

  • Related to Feature #45374: Add support for BLACKLISTED_DEVICES env var parsing. added
Actions #14

Updated by Sebastian Wagner almost 4 years ago

  • Tags set to ux
Actions #15

Updated by Sebastian Wagner almost 4 years ago

  • Related to Bug #45907: cephadm: daemon rm for managed services is completely broken added
Actions #16

Updated by Joshua Schmid almost 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35744

As this is the intended behavior, this needs to be documented. See https://github.com/ceph/ceph/pull/35744

Actions #17

Updated by David Capone almost 4 years ago

I am still providing additional input and requesting that, if this behavior remains as is, a way to stop the process be implemented.

In development we have run into two issues with this orchestration and the inability to stop individual orchestration events.

1. https://tracker.ceph.com/issues/44749 - because of that bug, new db drives and data drives cannot be added without first pausing the entire orchestration engine. It would be much better to be able to pause/stop just a drive spec (or any spec) deployment.

2. With orch osd rm ... there is a similar issue of being unable to delete queued orchestration events. I have run into a situation on 3-node clusters where the OSDs do not fully rebalance and the rm "hangs" because the rebalance cannot complete. As an administrator I want to be able to forcibly destroy that OSD, since I explicitly understand the risks of doing so. You can forcibly destroy it by stopping the container and marking it destroyed, but once a new drive is replaced, added, and available, it reuses the OSD id and is deployed, and cephadm then continues with the previously issued remove command on the empty, newly redeployed OSD.

Functionality and control would be greatly improved by implementing some way to stop individual orchestration events.
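
For what it's worth, the only controls I have found so far are cluster-wide; a sketch (these commands exist upstream, but whether they fully cover the cases above on 15.2.x is exactly the open question):

# Pause ALL background orchestrator activity (too coarse, but stops re-provisioning):
ceph orch pause
ceph orch status            # reports whether the orchestrator is paused

# Try to force-remove the stuck OSD and watch the removal queue:
ceph orch osd rm 5 --force
ceph orch osd rm status

# Resume background activity when done:
ceph orch resume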

Actions #18

Updated by Sebastian Wagner almost 4 years ago

  • Priority changed from High to Urgent
Actions #19

Updated by Sebastian Wagner over 3 years ago

  • Status changed from Fix Under Review to Resolved