Bug #52919


ceph orch device zap validation can result in osd issues and problematic error messages

Added by Paul Cuzner over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
cephadm
Target version:
% Done:

100%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A fat-fingered hostname or device path can cause zap to do things it shouldn't.

For example:
1. bogus host name
[ceph: root@f34cluster /]# ceph orch device zap orac disk --force
Error EINVAL: host address is empty

---> the hostname was clearly not empty; the message should say that the host is not a member of the cluster

2. host in maintenance is not checked for. If the host is in maintenance, we should not attempt any actions against it.
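
The two host-level checks above could be combined into a single validation step. A minimal sketch, assuming a simple hostname-to-status inventory (the function name and data structure are illustrative, not cephadm's actual API):

```python
def validate_zap_host(hostname: str, hosts: dict) -> None:
    """Reject zap requests for unknown or in-maintenance hosts.

    `hosts` maps hostname -> status, e.g. 'online' or 'maintenance'
    (an assumed structure, standing in for cephadm's host inventory).
    """
    if hostname not in hosts:
        # Case 1: report cluster membership, not "host address is empty"
        raise ValueError(
            f"host '{hostname}' is not a member of the cluster; "
            f"known hosts: {sorted(hosts)}"
        )
    if hosts[hostname] == "maintenance":
        # Case 2: refuse all actions against a host in maintenance
        raise ValueError(
            f"host '{hostname}' is in maintenance mode; "
            "no actions will be attempted against it"
        )

# Reproduce case 1 from the report: a bogus host name
hosts = {"f34cluster": "online"}
try:
    validate_zap_host("orac", hosts)
except ValueError as err:
    print(err)
```

With a check like this, the error in case 1 names the offending host instead of claiming an empty address.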

3. valid host, but a bogus device
[ceph: root@f34cluster /]# ceph orch device zap f34cluster disk --force
Error EINVAL: Zap failed: ceph-volume lvm list d
i
s
k

---> not a very clear error message for the admin
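
Case 3 suggests validating the device name against the host's known devices before shelling out to ceph-volume, so the admin gets a clear "not found" message. A hedged sketch, with an assumed hostname-to-device-list mapping standing in for the orchestrator's device inventory:

```python
def validate_zap_device(hostname: str, device: str, host_devices: dict) -> None:
    """Fail early, with a readable message, if `device` is unknown on `hostname`.

    `host_devices` maps hostname -> list of device paths (illustrative data,
    not cephadm's real inventory structure).
    """
    known = host_devices.get(hostname, [])
    if device not in known:
        raise ValueError(
            f"device '{device}' not found on host '{hostname}'; "
            f"known devices: {', '.join(known)}"
        )

# Reproduce case 3 from the report: a valid host, but a bogus device
host_devices = {"f34cluster": ["/dev/sdb", "/dev/sdc"]}
try:
    validate_zap_device("f34cluster", "disk", host_devices)
except ValueError as err:
    print(err)
```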

4. osd stopped, but still valid. Because the OSD is down, it passes the "is-active" check, allowing the zap to proceed against a valid osd that just happened to be down at that point. The result looks like this (acting against osd.3):
[ceph: root@f34cluster /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         4.00000  root default
-3         4.00000      host f34cluster
 0  hdd    1.00000          osd.0        up      1.00000   1.00000
 1  hdd    1.00000          osd.1        up      1.00000   1.00000
 2  hdd    1.00000          osd.2        up      1.00000   1.00000
 3  hdd    1.00000          osd.3        down    1.00000   1.00000

---> osd.3 was zapped, but to Ceph it still exists, and the host still has an entry for it in systemd. ceph orch osd rm should have been run first
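
The fix case 4 implies is to check whether the device still backs a registered OSD at all, regardless of whether the daemon is currently active, and refuse to zap until the OSD has been removed. A minimal sketch, assuming a (hostname, device) to OSD-id mapping that is hypothetical, not cephadm's real data model:

```python
def validate_zap_osd(hostname: str, device: str, osd_map: dict) -> None:
    """Refuse to zap a device that still backs a registered OSD.

    `osd_map` maps (hostname, device) -> osd id for devices that back an
    OSD known to the cluster (an assumed structure for illustration).
    A down-but-registered OSD must be removed before its device is zapped.
    """
    osd_id = osd_map.get((hostname, device))
    if osd_id is not None:
        raise RuntimeError(
            f"device '{device}' on host '{hostname}' still backs osd.{osd_id} "
            f"(up or down); run 'ceph orch osd rm {osd_id}' first"
        )

# Reproduce case 4: osd.3 is down but still registered with the cluster
osd_map = {("f34cluster", "/dev/sdd"): 3}
try:
    validate_zap_osd("f34cluster", "/dev/sdd", osd_map)
except RuntimeError as err:
    print(err)
```

Unlike the "is-active" check, this guard also catches OSDs that are stopped but not yet removed, which is exactly the scenario in the osd tree above.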


Related issues: 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51028: device zap doesn't perform any checks (Closed, Paul Cuzner)
