Bug #54142
quincy: cephadm-purge-cluster needs work (closed)
Description
For the sake of tracking ...
The purge process in quincy is not yet ready for prime time at this early stage. The preflight & purge playbooks were used, but ultimately I fell back to the manual steps I've used previously when the process fails somewhere ...
# Clean all hosts excluding bootstrap
cephadm_in_host=$(ls /var/lib/ceph/$fsid/cephadm*)
python3 $cephadm_in_host rm-cluster --fsid $fsid --force
systemctl stop ceph.target
systemctl disable ceph.target
rm -f /etc/systemd/system/ceph.target
systemctl daemon-reload
systemctl reset-failed
rm -rf /var/log/ceph/*
rm -rf /var/lib/ceph/*

# clean bootstrap
cephadm_in_host=$(ls /var/lib/ceph/$fsid/cephadm*)
python3 $cephadm_in_host rm-cluster --fsid $fsid --force
#cephadm rm-cluster --fsid $fsid --force
systemctl stop ceph.target
systemctl disable ceph.target
rm -f /etc/systemd/system/ceph.target
systemctl daemon-reload
systemctl reset-failed
rm -rf /etc/ceph/*
rm -rf /var/log/ceph/*
rm -rf /var/lib/ceph/*

# on OSD nodes
declare -a devList=("/dev/nvme0n1" "/dev/nvme1n1" "/dev/sdc" "/dev/sdd" "/dev/sde" "/dev/sdf" "/dev/sdg" "/dev/sdh"
                    "/dev/sdi" "/dev/sdj" "/dev/sdk" "/dev/sdl" "/dev/sdm" "/dev/sdn" "/dev/sdo" "/dev/sdp"
                    "/dev/sdq" "/dev/sdr" "/dev/sds" "/dev/sdt" "/dev/sdu" "/dev/sdv" "/dev/sdw" "/dev/sdx"
                    "/dev/sdy" "/dev/sdz" "/dev/sdaa" "/dev/sdab" "/dev/sdac" "/dev/sdad" "/dev/sdae" "/dev/sdaf"
                    "/dev/sdag" "/dev/sdah" "/dev/sdai" "/dev/sdaj" "/dev/sdak" "/dev/sdal")
for device in "${devList[@]}"; do
  echo $device
  sgdisk --zap-all $device
done

for fsid in `systemctl list-units ceph*.target | grep target | grep -v services | awk '{print $NF}'`; do
  echo $fsid
  /perf1/tim/tools/svc-clean.sh $fsid
done

for fsid in `ls /etc/systemd/system/ceph-*.target | cut -c 26- | cut -d. -f1`; do
  echo $fsid
  /perf1/tim/tools/svc-clean.sh $fsid
done

for i in `lsblk -ro NAME | grep ceph`; do
  echo $i
  dmsetup remove -f $i
done
... but that was insufficient. Subsequent Pacific deployments would fail due to remnant pods still running and holding onto ports, etc. Those had to be searched out and stopped. A couple of purge output examples are included FWIW.
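For anyone hitting the same thing, a minimal sketch of how such remnant containers can be hunted down and stopped; the fsid value and the use of podman here are assumptions for illustration, not part of the original procedure:

# Assumed example: stop and remove leftover containers belonging to an old cluster fsid
fsid=aa8ec022-ca1c-11ec-a5a0-000af7995d6c   # placeholder old-cluster fsid
for ctr in $(podman ps --format '{{.Names}}' | grep "ceph-${fsid}"); do
  echo "stopping ${ctr}"
  podman stop "${ctr}"
  podman rm -f "${ctr}"
done
# confirm nothing is still holding the daemon ports before redeploying
ss -tlnp | grep -E '3300|6789' || echo "ports free"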
Updated by Vikhyat Umrao over 2 years ago
- Category changed from orchestrator to cephadm
Updated by Redouane Kachach Elhichou about 2 years ago
- Related to Bug #54018: Suspicious behavior when deleting a cluster (by running cephadm rm-cluster) added
Updated by Redouane Kachach Elhichou about 2 years ago
- Related to Feature #53815: cephadm rm-cluster should delete log files added
Updated by Redouane Kachach Elhichou about 2 years ago
- Related to Bug #53010: cephadm rm-cluster does not clean up /var/run/ceph added
Updated by Redouane Kachach Elhichou about 2 years ago
The log files and the other issues were fixed as part of the related bugs, but I'm not sure about the OSD cleanup part.
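For the OSD side, a minimal sketch of how previously used devices can be cleared before redeploying; the host and device names are placeholders, and whether this step is still needed after the related fixes is exactly the open question:

# Assumed example: wipe a previously used OSD device before redeployment
ceph orch device zap f22-h01-000-6048r /dev/sdc --force   # with an active cluster
# or, from the node itself with no cluster running:
sgdisk --zap-all /dev/sdc
wipefs -a /dev/sdc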
Updated by Redouane Kachach Elhichou about 2 years ago
- Assignee set to Redouane Kachach Elhichou
Updated by Redouane Kachach Elhichou about 2 years ago
- Status changed from New to In Progress
Updated by Tim Wilkinson almost 2 years ago
I was able to return to quincy deployments and purges using cephadm-17.2.0-0.el8.noarch and have had no problems running the preflight/purge/preflight/bootstrap procedure. There was no need to manually prepare any previously used devices or search & destroy remnant pods.
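For reference, the sequence referred to here looks roughly like the following; the inventory file, fsid, and mon IP are placeholders, and the playbook names are assumed to be the ones shipped with cephadm-ansible:

# Assumed preflight/purge/preflight/bootstrap sequence
ansible-playbook -i hosts cephadm-preflight.yml
ansible-playbook -i hosts cephadm-purge-cluster.yml -e fsid=c56d7946-d1f2-11ec-8d0b-000af7995d6c
ansible-playbook -i hosts cephadm-preflight.yml
cephadm bootstrap --mon-ip <mon-ip>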
My only comment would be that /var/run/ceph is not wiped, so older cluster fsids remain ...
root@f22-h01-000-6048r:~ # ll /var/{run,lib,log}/ceph
/var/lib/ceph:
total 8.0K
drwxr-x---   3 ceph ceph   50 Apr 18 22:36 .
drwxr-xr-x. 37 root root 4.0K May 12 13:09 ..
drwx------  30 ceph ceph 4.0K May 12 13:08 c56d7946-d1f2-11ec-8d0b-000af7995d6c

/var/log/ceph:
total 4.0M
drwxrws--T   3 ceph ceph   69 Apr 18 22:36 .
drwxr-xr-x. 11 root root 4.0K May 12 12:54 ..
drwxrwx---   2 ceph ceph 4.0K May 12 13:08 c56d7946-d1f2-11ec-8d0b-000af7995d6c
-rw-r--r--   1 root ceph 4.0M May 12 19:36 cephadm.log

/var/run/ceph:
total 0
drwxrwx---  4 root root   80 May 12 13:00 .
drwxr-xr-x 38 root root 2.2K May 12 13:08 ..
drwxrwx---  2 ceph ceph  540 May 11 23:00 aa8ec022-ca1c-11ec-a5a0-000af7995d6c   # old deployment
drwxrwx---  2 ceph ceph  540 May 12 13:08 c56d7946-d1f2-11ec-8d0b-000af7995d6c

root@f22-h01-000-6048r:~ # ll /var/run/ceph/aa8ec022-ca1c-11ec-a5a0-000af7995d6c
total 0
drwxrwx--- 2 ceph ceph 540 May 11 23:00 .
drwxrwx--- 4 root root  80 May 12 13:00 ..
srwxr-xr-x 1 ceph ceph   0 May 11 23:00 ceph-client.rgw.rgws.f22-h01-000-6048r.nbegha.7.94591718512160.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.107.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.114.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.122.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.12.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.130.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.137.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:45 ceph-osd.145.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.153.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.161.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.169.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.177.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.185.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.20.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:46 ceph-osd.28.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.35.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.42.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.49.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.57.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.5.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.66.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:47 ceph-osd.74.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:48 ceph-osd.82.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:48 ceph-osd.90.asok
srwxr-xr-x 1 ceph ceph   0 May  2 13:48 ceph-osd.97.asok
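A minimal sketch of the corresponding manual cleanup, using the old fsid shown above; this assumes the old cluster's daemons are already gone and only stale admin sockets remain:

# Assumed manual cleanup of the stale runtime directory left by the old cluster
old_fsid=aa8ec022-ca1c-11ec-a5a0-000af7995d6c
ls /var/run/ceph/${old_fsid}      # verify it only holds dead .asok sockets
rm -rf /var/run/ceph/${old_fsid}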
Updated by Redouane Kachach Elhichou over 1 year ago
- Status changed from In Progress to Resolved
I'm not able to reproduce these issues with the code on the main branch anymore. Please feel free to re-open if you think the related bug is still valid.