Bug #50878

ceph-volume can purge OSD being re-created

Added by Alexandre Marangone over 1 year ago. Updated 10 months ago.

Status:
In Progress
Priority:
Urgent
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,pacific,octopus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph-volume issues a `ceph osd new <>` early in the deployment process. A failure after that point can purge the OSD.
We implemented a quick fix internally by not rolling back on failure when an OSD ID is specified in the args, but I'm wondering what a proper fix would be, since `ceph osd destroy` requires more permissions than bootstrap-osd provides. Any pointers? I don't mind taking ownership of the ticket.
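The quick fix described above can be sketched roughly as follows. This is a hypothetical illustration, not ceph-volume's actual code: the function name, parameters, and the decision to return `None` instead of running the purge are all assumptions made for the example.

```python
def build_rollback_command(osd_id, id_was_specified):
    """Return the `osd purge-new` command for a failed deployment,
    or None when the rollback must be skipped.

    Hypothetical sketch of the internal quick fix from the ticket:
    when the OSD ID was supplied explicitly (e.g. via --osd-ids), the
    OSD pre-exists and is being re-created, so purging it would destroy
    a real OSD. `ceph osd destroy` is not an option here because it
    needs more permissions than client.bootstrap-osd provides.
    """
    if osd_id is None:
        # Nothing was allocated; there is nothing to roll back.
        return None
    if id_was_specified:
        # Skip rollback: this ID belongs to an OSD being re-created.
        return None
    # Otherwise purge the freshly allocated, never-used ID, mirroring
    # the command visible in the log below.
    return [
        'ceph', '--cluster', 'ceph',
        '--name', 'client.bootstrap-osd',
        '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring',
        'osd', 'purge-new', 'osd.%s' % osd_id,
        '--yes-i-really-mean-it',
    ]
```

With this guard, the failing run in the log below (`--osd-ids 4`) would have left osd.4 intact instead of purging it.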

Example of a wrongful purge. This is on Nautilus, but as far as I can tell any version above 14.2.0 can trigger this:

2021/05/13 17:40:39 Running: [ceph-volume --cluster ceph lvm batch --yes --bluestore --osds-per-device=1 --osd-ids 4 --dmcrypt /dev/nvme2n1 --no-systemd]
2021/05/13 17:40:45 result: --> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 1.0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 629c1c07-22e5-4637-afa5-4ebe179869e5 4
Running command: /usr/sbin/vgcreate --force --yes ceph-0d45500e-5f6f-4b87-956f-4849c5108002 /dev/nvme2n1
 stderr: Failed to find PV /dev/nvme2n1
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.4 --yes-i-really-mean-it
 stderr: purged osd.4
-->  RuntimeError: command returned non-zero exit status: 5
2021/05/13 17:40:45 nvme2n1: failed to recreate OSD (error=exit status 1)

History

#1 Updated by Guillaume Abrioux 10 months ago

  • Status changed from New to In Progress
  • Assignee set to Guillaume Abrioux
  • Backport changed from nautilus, octopus, pacific to quincy,pacific,octopus
