Bug #52919

ceph orch device zap validation can result in osd issues and problematic error messages

Added by Paul Cuzner over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Category: cephadm
% Done: 100%
Backport: pacific
Regression: No
Severity: 3 - minor
Pull request ID: 43560

Description

Any fat-fingered hostname or device path can cause zap to do things it shouldn't.

For example:
1. bogus host name
[ceph: root@f34cluster /]# ceph orch device zap orac disk --force
Error EINVAL: host address is empty

---> the hostname was clearly not empty; the message should state that the host is not a member of the cluster

2. host in maintenance mode is not checked for. If the host is in maintenance, we should not attempt any actions against it.

3. valid host, but a bogus device
[ceph: root@f34cluster /]# ceph orch device zap f34cluster disk --force
Error EINVAL: Zap failed: ceph-volume lvm list d
i
s
k

---> not a very clear error message for the admin (the one-character-per-line output suggests the device argument was iterated as a string where a list of command arguments was expected)

4. osd stopped, but still valid. A stopped osd fails the "is-active" check, so the zap proceeds against a valid osd that just happened to be down at that point. The result looks like this (acting against osd.3):
[ceph: root@f34cluster /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         4.00000  root default
-3         4.00000      host f34cluster
 0    hdd  1.00000          osd.0            up   1.00000  1.00000
 1    hdd  1.00000          osd.1            up   1.00000  1.00000
 2    hdd  1.00000          osd.2            up   1.00000  1.00000
 3    hdd  1.00000          osd.3          down   1.00000  1.00000

---> osd.3 was zapped, but to ceph it still exists and the host still has a systemd unit for it. 'ceph orch osd rm' should have been run first (a sketch of the safer check follows below).
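The fourth case is the dangerous one: gating the zap on whether the OSD daemon is currently running misses OSDs that are down but still registered in the OSD map. Below is a minimal sketch of the safer check, in Python since cephadm's orchestrator module is Python; the helper name, the map layout, and the device used are hypothetical illustrations, not the actual cephadm code.

def osds_blocking_zap(hostname, path, osd_map):
    """Return the OSD ids that still claim `path` on `hostname`.

    A *down* OSD still exists in the OSD map and still owns its data
    device, so `systemctl is-active` is the wrong gate: the OSD must be
    gone from the map (via `ceph orch osd rm`) before a zap is safe.
    """
    blocking = []
    for osd_id, meta in osd_map.items():
        if meta["host"] == hostname and meta["device"] == path:
            # Deliberately no check of meta["up"]: up or down, the OSD
            # owns the device until it is purged from the cluster.
            blocking.append(osd_id)
    return blocking

# Hypothetical map entry for a down-but-present OSD, like osd.3 above:
# the zap must still be refused.
osd_map = {3: {"host": "f34cluster", "device": "/dev/sde", "up": False}}
assert osds_blocking_zap("f34cluster", "/dev/sde", osd_map) == [3]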


Related issues: 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51028: device zap doesn't perform any checks (Closed, Paul Cuzner)

#1 - Updated by Paul Cuzner over 2 years ago

  • % Done changed from 0 to 70
  • Pull request ID set to 43560

The PR is currently in draft for comment.

The patch provides the following checks and interactions with the admin:

[ceph: root@f34cluster /]# ceph orch device zap bogus device --force 
Error EINVAL: Host 'bogus' is not a member of the cluster

[ceph: root@f34cluster /]# ceph orch device zap f34cluster device --force 
Error EINVAL: Device path 'device' not found on host 'f34cluster'

[ceph: root@f34cluster /]# ceph orch device zap f34cluster /dev/sdd --force 
Error EINVAL: Unable to zap: device '/dev/sdd' on f34cluster has 1 active OSD (osd.2). Use 'ceph orch osd rm' first.

[ceph: root@f34cluster /]# ceph orch host maintenance enter f34cluster2
Daemons for Ceph cluster b01f4b1c-2d35-11ec-89bb-005056833d58 stopped on host f34cluster2. Host f34cluster2 moved to maintenance mode
[ceph: root@f34cluster /]# ceph orch host ls 
HOST         ADDR          LABELS  STATUS       
f34cluster   10.70.39.226  _admin               
f34cluster2  10.70.39.212          Maintenance  
[ceph: root@f34cluster /]# ceph orch device zap f34cluster2 bogus --force
Error EINVAL: Host 'f34cluster2' is in maintenance mode, which prevents any actions against it.

[ceph: root@f34cluster /]# ceph orch host maintenance exit f34cluster2
Ceph cluster b01f4b1c-2d35-11ec-89bb-005056833d58 on f34cluster2 has exited maintenance mode

[ceph: root@f34cluster /]# ceph orch device zap f34cluster2 bogus --force
Error EINVAL: Device path 'bogus' not found on host 'f34cluster2'

[ceph: root@f34cluster /]# ceph orch device zap f34cluster /dev/sdd --force
Error EINVAL: Unable to zap: device '/dev/sdd' on f34cluster has 1 active OSD (osd.2). Use 'ceph orch osd rm' first.
[ceph: root@f34cluster /]# ceph orch osd rm 2
Scheduled OSD(s) for removal
[ceph: root@f34cluster /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         2.00000  root default                                  
-3         2.00000      host f34cluster                           
 0    hdd  1.00000          osd.0            up   1.00000  1.00000
 1    hdd  1.00000          osd.1            up   1.00000  1.00000
[ceph: root@f34cluster /]# ceph orch device zap f34cluster /dev/sdd --force
zap successful for /dev/sdd on f34cluster
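
Taken together, the transcript above shows a fixed validation order: host membership, maintenance mode, device path, then in-use OSDs, each failing fast with a specific EINVAL message. Below is a minimal sketch of that ordering; `validate_zap`, the `hosts` structure, and `ZapError` are hypothetical stand-ins for the cephadm internals, with only the error texts mirroring the transcript.

class ZapError(Exception):
    """Surfaces to the CLI as an Error EINVAL return."""

def validate_zap(hosts, host, path):
    """hosts: {name: {"maintenance": bool, "devices": {path: [osd ids]}}}"""
    if host not in hosts:
        raise ZapError(f"Host '{host}' is not a member of the cluster")
    if hosts[host]["maintenance"]:
        raise ZapError(f"Host '{host}' is in maintenance mode, "
                       "which prevents any actions against it.")
    if path not in hosts[host]["devices"]:
        raise ZapError(f"Device path '{path}' not found on host '{host}'")
    osds = hosts[host]["devices"][path]
    if osds:
        names = ", ".join(f"osd.{i}" for i in osds)
        raise ZapError(f"Unable to zap: device '{path}' on {host} has "
                       f"{len(osds)} active OSD ({names}). "
                       "Use 'ceph orch osd rm' first.")

# Example mirroring the session above: /dev/sdd still hosts osd.2.
hosts = {"f34cluster": {"maintenance": False,
                        "devices": {"/dev/sdd": [2]}}}
try:
    validate_zap(hosts, "f34cluster", "/dev/sdd")
except ZapError as e:
    print(f"Error EINVAL: {e}")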

#2 - Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #51028: device zap doesn't perform any checks added
#3 - Updated by Paul Cuzner over 2 years ago

  • Status changed from New to Pending Backport
#4 - Updated by Paul Cuzner over 2 years ago

  • % Done changed from 70 to 100

PR merged Oct 26

#5 - Updated by Sebastian Wagner over 2 years ago

  • Status changed from Pending Backport to Resolved