Bug #52919
Status: Closed
ceph orch device zap validation can result in osd issues and problematic error messages
% Done: 100%
Description
Any fat-fingered moment with hostname or device path can cause zap to do things it shouldn't.
For example:
1. bogus host name
[ceph: root@f34cluster /]# ceph orch device zap orac disk --force
Error EINVAL: host address is empty
---> Clearly the hostname was not empty; the message should state that the host is not a member of the cluster.
2. Host in maintenance: not checked for. If the host is in maintenance, we should not attempt any actions against it.
3. Valid host, but a bogus device
[ceph: root@f34cluster /]# ceph orch device zap f34cluster disk --force
Error EINVAL: Zap failed: ceph-volume lvm list d
i
s
k
---> Not a very clear error message for the admin; note the device argument has been split into individual characters.
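The per-character output in example 3 is consistent with a common Python pitfall: a string being iterated where a list of arguments was expected. Whether this is the actual cephadm code path is an assumption; the snippet below only illustrates the mechanism:

```python
# Illustrative only: extending a command list with a bare string instead of
# a one-element list iterates the string character by character.
cmd = ["ceph-volume", "lvm", "list"]
device = "disk"        # intended: ["disk"]
cmd.extend(device)     # extend() treats the string as an iterable of chars
print(" ".join(cmd))   # -> ceph-volume lvm list d i s k
```

Wrapping the argument as `cmd.extend([device])` (or `cmd.append(device)`) avoids the mangled command line.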
4. OSD stopped, but still valid: a down OSD fails the "is-active" check, allowing the zap to proceed against a valid OSD that just happened to be down at that point. The result looks like this (acting against osd.3):
[ceph: root@f34cluster /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         4.00000  root default
-3         4.00000      host f34cluster
 0    hdd  1.00000          osd.0            up   1.00000  1.00000
 1    hdd  1.00000          osd.1            up   1.00000  1.00000
 2    hdd  1.00000          osd.2            up   1.00000  1.00000
 3    hdd  1.00000          osd.3          down   1.00000  1.00000
---> osd.3 was zapped, but to Ceph it still exists, and the host still has an entry for it in systemd. 'ceph orch osd rm' should have been run first.
Updated by Paul Cuzner over 2 years ago
- % Done changed from 0 to 70
- Pull request ID set to 43560
The PR is currently in draft for comment.
The patch provides the following checks and interactions with the admin:
[ceph: root@f34cluster /]# ceph orch device zap bogus device --force
Error EINVAL: Host 'bogus' is not a member of the cluster
[ceph: root@f34cluster /]# ceph orch device zap f34cluster device --force
Error EINVAL: Device path 'device' not found on host 'f34cluster'
[ceph: root@f34cluster /]# ceph orch device zap f34cluster /dev/sdd --force
Error EINVAL: Unable to zap: device '/dev/sdd' on f34cluster has 1 active OSD (osd.2). Use 'ceph orch osd rm' first.
[ceph: root@f34cluster /]# ceph orch host maintenance enter f34cluster2
Daemons for Ceph cluster b01f4b1c-2d35-11ec-89bb-005056833d58 stopped on host f34cluster2.
Host f34cluster2 moved to maintenance mode
[ceph: root@f34cluster /]# ceph orch host ls
HOST         ADDR          LABELS  STATUS
f34cluster   10.70.39.226  _admin
f34cluster2  10.70.39.212          Maintenance
[ceph: root@f34cluster /]# ceph orch device zap f34cluster2 bogus --force
Error EINVAL: Host 'f34cluster2' is in maintenance mode, which prevents any actions against it.
[ceph: root@f34cluster /]# ceph orch host maintenance exit f34cluster2
Ceph cluster b01f4b1c-2d35-11ec-89bb-005056833d58 on f34cluster2 has exited maintenance mode
[ceph: root@f34cluster /]# ceph orch device zap f34cluster2 bogus --force
Error EINVAL: Device path 'bogus' not found on host 'f34cluster2'
[ceph: root@f34cluster /]# ceph orch device zap f34cluster /dev/sdd --force
Error EINVAL: Unable to zap: device '/dev/sdd' on f34cluster has 1 active OSD (osd.2). Use 'ceph orch osd rm' first.
[ceph: root@f34cluster /]# ceph orch osd rm 2
Scheduled OSD(s) for removal
[ceph: root@f34cluster /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         2.00000  root default
-3         2.00000      host f34cluster
 0    hdd  1.00000          osd.0            up   1.00000  1.00000
 1    hdd  1.00000          osd.1            up   1.00000  1.00000
[ceph: root@f34cluster /]# ceph orch device zap f34cluster /dev/sdd --force
zap successful for /dev/sdd on f34cluster
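The validation order demonstrated above can be sketched in Python. This is a minimal illustration of the four checks, not cephadm's actual implementation; the `Host` dataclass and `validate_zap` helper are hypothetical names:

```python
# Hypothetical sketch of the pre-zap validation order: host membership,
# maintenance mode, device existence, then OSDs on the device (up or down).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    name: str
    in_maintenance: bool = False
    # device path -> OSD ids deployed on that device
    devices: Dict[str, List[str]] = field(default_factory=dict)


def validate_zap(hosts: Dict[str, "Host"], hostname: str, path: str) -> str:
    # 1. The host must be a member of the cluster.
    if hostname not in hosts:
        return f"Host '{hostname}' is not a member of the cluster"
    host = hosts[hostname]
    # 2. Hosts in maintenance must not be acted against.
    if host.in_maintenance:
        return (f"Host '{hostname}' is in maintenance mode, "
                "which prevents any actions against it.")
    # 3. The device path must exist in the host's inventory.
    if path not in host.devices:
        return f"Device path '{path}' not found on host '{hostname}'"
    # 4. Any OSD on the device blocks the zap, whether up or down.
    osds = host.devices[path]
    if osds:
        return (f"Unable to zap: device '{path}' on {hostname} has "
                f"{len(osds)} active OSD ({', '.join(osds)}). "
                "Use 'ceph orch osd rm' first.")
    return "ok"
```

Checking for any deployed OSD, rather than only an *active* one, is what closes the gap in example 4, where a stopped osd.3 slipped past the "is-active" test.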
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #51028: device zap doesn't perform any checks added
Updated by Paul Cuzner over 2 years ago
- Status changed from New to Pending Backport
Updated by Sebastian Wagner over 2 years ago
- Status changed from Pending Backport to Resolved