Bug #52919
ceph orch device zap validation can result in osd issues and problematic error messages
Status: Closed (100% done)
Description
Any fat-fingered moment with a hostname or device path can cause zap to do things it shouldn't.
For example:
1. bogus host name
[ceph: root@f34cluster /]# ceph orch device zap orac disk --force
Error EINVAL: host address is empty
---> the hostname was clearly not empty; the message should say that the host is not a member of the cluster
2. host in maintenance is not checked for. If the host is in maintenance mode, we should not attempt any actions against it.
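
A rough sketch of the kind of pre-flight host validation cases 1 and 2 are asking for. The names validate_zap_host, cluster_hosts, maintenance_hosts and ZapError are illustrative stand-ins, not cephadm's actual internals:

class ZapError(Exception):
    pass

def validate_zap_host(hostname, cluster_hosts, maintenance_hosts):
    if not hostname:
        raise ZapError("host name is empty")
    if hostname not in cluster_hosts:
        # case 1: name the real problem instead of "host address is empty"
        raise ZapError(f"host '{hostname}' is not a member of the cluster")
    if hostname in maintenance_hosts:
        # case 2: refuse destructive actions while the host is in maintenance
        raise ZapError(f"host '{hostname}' is in maintenance mode; "
                       "exit maintenance before zapping devices")

# 'orac' is the typo from case 1; only 'f34cluster' is in the cluster
try:
    validate_zap_host("orac", {"f34cluster"}, set())
except ZapError as e:
    print(f"Error EINVAL: {e}")
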
3. valid host, but a bogus device
[ceph: root@f34cluster /]# ceph orch device zap f34cluster disk --force
Error EINVAL: Zap failed: ceph-volume lvm list d
i
s
k
---> not a very clear error message for the admin; the device name is echoed one character per line, which suggests a string is being iterated where a list of devices was expected
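
For case 3, the one-character-per-line output above is the classic symptom of joining a string where a list of strings was expected; a minimal illustration of that, plus a sketch of a friendlier device check (host_devices is a hypothetical stand-in for the orchestrator's device inventory, not a real cephadm structure):

# joining a string iterates its characters; joining a one-element list does not
device = "disk"
print("\n".join(device))    # prints d, i, s, k on separate lines
print("\n".join([device]))  # prints disk

def validate_zap_device(path, host_devices):
    if path not in host_devices:
        raise ValueError(f"device '{path}' not found on host; "
                         f"known devices: {', '.join(host_devices)}")

try:
    validate_zap_device("disk", ["/dev/sdb", "/dev/sdc"])
except ValueError as e:
    print(f"Error EINVAL: {e}")
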
4. osd stopped, but still valid... a down OSD passes the "is-active" check (since it is not active), allowing the zap to proceed against a valid osd that just happened to be down at that point. The result looks like this (acting against osd.3):
[ceph: root@f34cluster /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         4.00000  root default
-3         4.00000      host f34cluster
 0    hdd  1.00000          osd.0            up   1.00000  1.00000
 1    hdd  1.00000          osd.1            up   1.00000  1.00000
 2    hdd  1.00000          osd.2            up   1.00000  1.00000
 3    hdd  1.00000          osd.3          down   1.00000  1.00000
---> osd.3 was zapped, but to Ceph it still exists and obviously the host still has it as an entry in systemd. "ceph orch osd rm" should have been run first
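
A sketch of the stronger guard case 4 calls for: refuse to zap while the OSD is still registered in the cluster map, even if its daemon is down. "ceph osd ls -f json" is a real command that returns the registered OSD ids; guard_zap and the wiring around it are hypothetical:

import json
import subprocess

def osd_still_in_cluster(osd_id):
    out = subprocess.run(["ceph", "osd", "ls", "-f", "json"],
                         capture_output=True, text=True, check=True).stdout
    return osd_id in json.loads(out)

def guard_zap(osd_id):
    # a down-but-registered OSD (like osd.3 above) should block the zap
    if osd_still_in_cluster(osd_id):
        raise RuntimeError(f"osd.{osd_id} is still registered in the cluster; "
                           f"run 'ceph orch osd rm {osd_id}' first")

Run against the tree above, guard_zap(3) would refuse and point the admin at "ceph orch osd rm 3" instead of zapping a device the cluster still knows about.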