Bug #54142
closed
quincy cephadm-purge-cluster needs work
Added by Tim Wilkinson over 2 years ago.
Updated over 1 year ago.
Description
For the sake of tracking ...
The purge process in quincy is not yet ready for prime time at this early stage. The preflight & purge playbooks were used, but ultimately I fell back on the manual steps I've used previously when the process fails somewhere ...
# Clean all hosts excluding bootstrap ($fsid must be set beforehand)
cephadm_in_host=$(ls /var/lib/ceph/"$fsid"/cephadm*)
python3 "$cephadm_in_host" rm-cluster --fsid "$fsid" --force
systemctl stop ceph.target
systemctl disable ceph.target
rm -f /etc/systemd/system/ceph.target
systemctl daemon-reload
systemctl reset-failed
rm -rf /var/log/ceph/*
rm -rf /var/lib/ceph/*
# Clean the bootstrap host (also removes /etc/ceph)
cephadm_in_host=$(ls /var/lib/ceph/"$fsid"/cephadm*)
python3 "$cephadm_in_host" rm-cluster --fsid "$fsid" --force
#cephadm rm-cluster --fsid $fsid --force
systemctl stop ceph.target
systemctl disable ceph.target
rm -f /etc/systemd/system/ceph.target
systemctl daemon-reload
systemctl reset-failed
rm -rf /etc/ceph/*
rm -rf /var/log/ceph/*
rm -rf /var/lib/ceph/*
# on OSD nodes
declare -a devList=("/dev/nvme0n1" "/dev/nvme1n1" "/dev/sdc" "/dev/sdd" "/dev/sde" "/dev/sdf" "/dev/sdg" "/dev/sdh" "/dev/sdi" "/dev/sdj" "/dev/sdk" "/dev/sdl" "/dev/sdm" "/dev/sdn" "/dev/sdo" "/dev/sdp" "/dev/sdq" "/dev/sdr" "/dev/sds" "/dev/sdt" "/dev/sdu" "/dev/sdv" "/dev/sdw" "/dev/sdx" "/dev/sdy" "/dev/sdz" "/dev/sdaa" "/dev/sdab" "/dev/sdac" "/dev/sdad" "/dev/sdae" "/dev/sdaf" "/dev/sdag" "/dev/sdah" "/dev/sdai" "/dev/sdaj" "/dev/sdak" "/dev/sdal")
for device in "${devList[@]}"; do
  echo "$device"
  sgdisk --zap-all "$device"
done
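As an aside, a device list like the one above could also be derived from lsblk instead of being hard-coded. This is a sketch that was not part of the original procedure; treating "sda" as the OS disk to exclude is an assumption and must be adjusted per host:

```shell
# Sketch (not from the report): enumerate whole disks via lsblk and zap
# everything except the assumed OS disk. Verify the exclusion before running.
os_disk="sda"
for device in $(lsblk -dno NAME,TYPE | awk -v os="$os_disk" '$2 == "disk" && $1 != os {print "/dev/"$1}'); do
  echo "$device"
  sgdisk --zap-all "$device"
done
```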
# Derive fsids from loaded ceph targets and run the per-fsid cleanup
for fsid in $(systemctl list-units 'ceph*.target' | grep target | grep -v services | awk '{print $NF}'); do
  echo "$fsid"
  /perf1/tim/tools/svc-clean.sh "$fsid"
done
# Same, from unit files on disk (cut -c 26- strips the /etc/systemd/system/ceph- prefix)
for fsid in $(ls /etc/systemd/system/ceph-*.target | cut -c 26- | cut -d. -f1); do
  echo "$fsid"
  /perf1/tim/tools/svc-clean.sh "$fsid"
done
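The svc-clean.sh helper above is a site-local script that was not posted. A hypothetical reconstruction of what such a per-fsid systemd cleanup might look like, based on cephadm's ceph-&lt;fsid&gt;@&lt;daemon&gt;.service unit naming convention:

```shell
# Hypothetical sketch only -- the actual svc-clean.sh may differ.
# Stops and removes the systemd units belonging to one cluster fsid.
fsid="$1"
systemctl stop "ceph-$fsid.target" 2>/dev/null
for unit in $(systemctl list-units --all --plain --no-legend "ceph-$fsid@*" | awk '{print $1}'); do
  systemctl stop "$unit"
  systemctl disable "$unit"
done
rm -f "/etc/systemd/system/ceph-$fsid.target"
rm -rf "/etc/systemd/system/ceph-$fsid.target.wants"
systemctl daemon-reload
systemctl reset-failed
```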
# Remove leftover ceph device-mapper mappings
for i in $(lsblk -rno NAME | grep ceph); do
  echo "$i"
  dmsetup remove -f "$i"
done
... but that was insufficient. Subsequent Pacific deployments would fail because remnant pods were still running and holding onto ports, etc.; those had to be searched out and stopped. A couple of purge output examples are included FWIW.
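The search-and-stop step for those remnant pods could be sketched as follows, assuming podman (the cephadm container runtime on EL8) and relying on cephadm's ceph-&lt;fsid&gt;-&lt;daemon&gt; container naming; the port check at the end is illustrative:

```shell
# Sketch: find and force-remove leftover ceph containers that keep
# ports bound after a purge, then confirm the mon ports are free.
for ctr in $(podman ps -a --format '{{.Names}}' | grep '^ceph-'); do
  echo "removing leftover container: $ctr"
  podman rm -f "$ctr"
done
# mon ports shown; OSDs use the 6800-7300 range
ss -tlnp | grep -E ':(3300|6789)' || echo "no ceph mon listeners left"
```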
- Category changed from orchestrator to cephadm
- Related to Bug #54018: Suspicious behavior when deleting a cluster (by running cephadm rm-cluster) added
- Related to Feature #53815: cephadm rm-cluster should delete log files added
- Related to Bug #53010: cephadm rm-cluster does not clean up /var/run/ceph added
Logs and other issues were fixed as part of the related bugs, but I'm not sure about the OSDs part.
- Assignee set to Redouane Kachach Elhichou
- Status changed from New to In Progress
I was able to return to quincy deployments and purges using cephadm-17.2.0-0.el8.noarch and have had no problems running the preflight/purge/preflight/bootstrap procedure. There was no need to manually prepare any previously used devices or search & destroy remnant pods.
My only comment would be that /var/run/ceph is not wiped, so older cluster fsids remain ...
root@f22-h01-000-6048r:~
# ll /var/{run,lib,log}/ceph
/var/lib/ceph:
total 8.0K
drwxr-x--- 3 ceph ceph 50 Apr 18 22:36 .
drwxr-xr-x. 37 root root 4.0K May 12 13:09 ..
drwx------ 30 ceph ceph 4.0K May 12 13:08 c56d7946-d1f2-11ec-8d0b-000af7995d6c
/var/log/ceph:
total 4.0M
drwxrws--T 3 ceph ceph 69 Apr 18 22:36 .
drwxr-xr-x. 11 root root 4.0K May 12 12:54 ..
drwxrwx--- 2 ceph ceph 4.0K May 12 13:08 c56d7946-d1f2-11ec-8d0b-000af7995d6c
-rw-r--r-- 1 root ceph 4.0M May 12 19:36 cephadm.log
/var/run/ceph:
total 0
drwxrwx--- 4 root root 80 May 12 13:00 .
drwxr-xr-x 38 root root 2.2K May 12 13:08 ..
drwxrwx--- 2 ceph ceph 540 May 11 23:00 aa8ec022-ca1c-11ec-a5a0-000af7995d6c # old deployment
drwxrwx--- 2 ceph ceph 540 May 12 13:08 c56d7946-d1f2-11ec-8d0b-000af7995d6c
root@f22-h01-000-6048r:~
# ll /var/run/ceph/aa8ec022-ca1c-11ec-a5a0-000af7995d6c
total 0
drwxrwx--- 2 ceph ceph 540 May 11 23:00 .
drwxrwx--- 4 root root 80 May 12 13:00 ..
srwxr-xr-x 1 ceph ceph 0 May 11 23:00 ceph-client.rgw.rgws.f22-h01-000-6048r.nbegha.7.94591718512160.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.107.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.114.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.122.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.12.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.130.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.137.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:45 ceph-osd.145.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.153.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.161.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.169.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.177.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.185.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.20.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:46 ceph-osd.28.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.35.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.42.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.49.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.57.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.5.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.66.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:47 ceph-osd.74.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:48 ceph-osd.82.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:48 ceph-osd.90.asok
srwxr-xr-x 1 ceph ceph 0 May 2 13:48 ceph-osd.97.asok
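Given the listing above, one possible workaround sketch (not an official cephadm step) is to prune /var/run/ceph entries whose fsid no longer has a matching /var/lib/ceph directory. Note that /var/run is normally tmpfs, so these stale socket directories also disappear on reboot:

```shell
# Sketch: remove runtime dirs for fsids that no longer exist under
# /var/lib/ceph (i.e. clusters that have already been purged).
for d in /var/run/ceph/*/; do
  fsid=$(basename "$d")
  if [ ! -d "/var/lib/ceph/$fsid" ]; then
    echo "pruning stale runtime dir for $fsid"
    rm -rf "$d"
  fi
done
```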
- Status changed from In Progress to Resolved
I'm no longer able to reproduce these issues with the code on the main branch. Please feel free to re-open if you think the related bug is still valid.