Bug #50878
ceph-volume can purge an OSD being re-created
% Done: 0%
Source:
Tags:
Backport: quincy,pacific,octopus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ceph-volume issues a `ceph osd new <>` early in the deployment process. Any failure after that point triggers a rollback that can purge the OSD.
We implemented a quick fix internally by not rolling back on failure when an OSD ID is specified in the args, but I'm wondering what a proper fix would be, since `ceph osd destroy` requires more permissions than bootstrap-osd provides. Any pointers? I don't mind taking ownership of the ticket.
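A minimal sketch of the quick fix described above, assuming a hypothetical helper name (`should_rollback`) and simplified arguments; this is not the actual ceph-volume code, just the decision it needs to make before running `ceph osd purge-new`:

```python
# Hypothetical guard (not ceph-volume's real implementation): only purge an
# OSD ID that this run freshly allocated via `ceph osd new`. An ID explicitly
# supplied by the operator (e.g. --osd-ids 4) belongs to an OSD being
# re-created and must not be purged on rollback.

def should_rollback(osd_id_arg, deploy_failed):
    """Decide whether a failed deployment may run `ceph osd purge-new`.

    osd_id_arg: the OSD ID passed on the command line, or None when
                ceph-volume allocated a fresh ID itself.
    deploy_failed: True when a later step (e.g. vgcreate) failed.
    """
    if not deploy_failed:
        return False
    # Fresh IDs created by this run are safe to purge; pre-existing,
    # operator-supplied IDs are left alone.
    return osd_id_arg is None
```

In the log below, `--osd-ids 4` was passed, so a guard like this would have skipped the `osd purge-new osd.4` call instead of destroying the OSD being re-created.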
Example of a wrongful purge. This is on Nautilus, but as far as I can tell any version above 14.2.0 could trigger it:
2021/05/13 17:40:39 Running: [ceph-volume --cluster ceph lvm batch --yes --bluestore --osds-per-device=1 --osd-ids 4 --dmcrypt /dev/nvme2n1 --no-systemd]
2021/05/13 17:40:45 result:
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 1.0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 629c1c07-22e5-4637-afa5-4ebe179869e5 4
Running command: /usr/sbin/vgcreate --force --yes ceph-0d45500e-5f6f-4b87-956f-4849c5108002 /dev/nvme2n1
 stderr: Failed to find PV /dev/nvme2n1
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.4 --yes-i-really-mean-it
 stderr: purged osd.4
--> RuntimeError: command returned non-zero exit status: 5
2021/05/13 17:40:45 nvme2n1: failed to recreate OSD (error=exit status 1)
Updated by Guillaume Abrioux about 2 years ago
- Status changed from New to In Progress
- Assignee set to Guillaume Abrioux
- Backport changed from nautilus, octopus, pacific to quincy,pacific,octopus