Bug #23645
Closed
hot plug disk might not work ceph-volume
Description
From the list:
SATA HDDs; this happens on a running server, without a reboot.
Due to a hardware problem, vibration, human error, or anything else, the SATA host loses its connection to the drive and /dev/sda disappears. Then the operator unplugs and replugs it. Without lvm, the disk can reappear as the same node /dev/sda or as a different one; it does not matter, the OSD will be started.
But with lvm, /dev/dm-0 holds the lvm objects and the sda node, so the disk gets the next letter (/dev/sdt, for example), but lvm can't create the lv with the same uid, so lsblk does not see the logical volume on this disk.
Yes, everything will be fixed after a reboot, but I don't think that is a solution.
Workaround as done by user:
I have to perform a series of manual steps to start the OSD:
- remove device mapper device:
sudo dmsetup remove /dev/dm-8
- delete the new block device and rescan SCSI to make the lvm volume appear:
echo 1 | sudo tee /sys/block/sdb/device/delete
echo "- - -" | sudo tee /sys/class/scsi_host/host0/scan
- maybe unmount the OSD directory (I'm not sure if it is required):
sudo umount /var/lib/ceph/osd/ceph-12
- list osd disks to get lv name (osd fsid):
sudo ceph-volume lvm list
- And finally start osd:
sudo ceph-volume lvm trigger 12-92b66a98-1c35-40a8-bf5b-ac123c366166
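The manual steps above can be collected into one helper. This is only a sketch of the reporter's workaround, not a ceph-volume feature: the function names `run` and `recover_osd`, the `RUN` switch, and the argument order are hypothetical. By default it only prints the commands; nothing is executed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: print (or, with RUN=1, execute) the manual recovery steps from the report.
# All function and variable names here are hypothetical.

run() {
    # Always show the command; execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

recover_osd() {
    dm_dev="$1"      # stale device-mapper node, e.g. dm-8
    new_disk="$2"    # node the disk reappeared as, e.g. sdb
    scsi_host="$3"   # SCSI host to rescan, e.g. host0
    osd_id="$4"      # OSD id, e.g. 12
    osd_fsid="$5"    # OSD fsid as shown by `ceph-volume lvm list`

    run sudo dmsetup remove "/dev/${dm_dev}"
    run sudo sh -c "echo 1 > /sys/block/${new_disk}/device/delete"
    run sudo sh -c "echo '- - -' > /sys/class/scsi_host/${scsi_host}/scan"
    # The report is unsure whether the umount is required, so a failure is tolerated.
    run sudo umount "/var/lib/ceph/osd/ceph-${osd_id}" || true
    run sudo ceph-volume lvm list
    run sudo ceph-volume lvm trigger "${osd_id}-${osd_fsid}"
}
```

Calling `recover_osd dm-8 sdb host0 12 92b66a98-1c35-40a8-bf5b-ac123c366166` prints the six commands from the report so they can be reviewed before running them for real with `RUN=1`.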
Updated by Alfredo Deza about 6 years ago
- Status changed from New to Need More Info
This is still not very clear to me. You mention a plug/unplug of disks
that makes the device path change, but then that "lvm can't create lv with same uid".
So is this before the OSD is running? Where exactly in the process does this happen?
In any case, you could just refresh LVM's cache by running: vgscan
The docs explain this better:
LVM runs the vgscan command automatically at system startup and at other times during LVM operation, such as when you execute a vgcreate command or when LVM detects an inconsistency. You may need to run the vgscan command manually when you change your hardware configuration, causing new devices to be visible to the system that were not present at system bootup.
This may be necessary, for example, when you add new disks to the system on a SAN or hotplug a new disk that has been labeled as a physical volume.
If you run that, do you have issues still?
If the problem goes away with vgscan, we could just add it to the unit
that activates/starts the OSD. But without some confirmation on your end, we can't really try to fix this.
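One way to confirm whether vgscan helps would be to rescan and then check that the OSD's logical volume is listed again. The sketch below assumes the LV is addressed by its /dev/VG/LV path; the function names `lv_listed` and `rescan_and_check` are hypothetical and not part of ceph-volume or LVM.

```shell
#!/bin/sh
# Sketch: rescan LVM metadata, then check whether a given LV path is visible.
# Function names are hypothetical.

lv_listed() {
    # Reads `lvs --noheadings -o lv_path` output on stdin.
    # Succeeds if the LV path given as $1 appears on a line of its own.
    want="$1"
    while read -r path || [ -n "$path" ]; do
        if [ "$path" = "$want" ]; then return 0; fi
    done
    return 1
}

rescan_and_check() {
    # $1: LV path, e.g. /dev/<vg name>/<lv name>
    sudo vgscan >/dev/null
    sudo lvs --noheadings -o lv_path | lv_listed "$1"
}
```

If `rescan_and_check` fails for the OSD's block LV after the disk has been replugged, vgscan alone did not bring the volume back.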
Updated by Alfredo Deza almost 6 years ago
- Status changed from Need More Info to Can't reproduce
Updated by Aleksei Gutikov almost 6 years ago
Please find attached the output of some useful commands (lsblk, ceph-volume, vgscan, etc.).
What happened (in my understanding):
1) The OSD was running on /dev/dm-2 on /dev/sdk with serial WD-WMC6M0D3AZMS
1.1) The OSD crashed after an IO error.
2) /dev/sdk disappeared and this disk appeared as /dev/sdl (see output of sudo ls /sys/block/dm-2/slaves/ -lah)
2.1) logical volume /dev/ceph-fda7bc3b-0047-45c8-8f16-5a8764664a9f/osd-block-52032eb1-698d-409c-94ee-385c76825638 was created on /dev/sdl
2.2) /dev/dm-2 now uses /dev/sdl
2.3) The OSD did not start within the timeout (ceph osd metadata not updated)
3) /dev/sdl disappeared and this disk appeared as /dev/sdj
3.1) The logical volume was not created; /dev/dm-2 still holds /dev/sdl
3.2) lvm does not see the logical volume on /dev/sdj
3.3) The OSD tries to use /dev/dm-2 but gets an IO error (see dmesg and OSD log)
4) Can be fixed by a reboot or with the sequence of commands I listed in my first email (also see the attachment)
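To see which disk a device-mapper node is currently holding, as in step 2 and step 3.1, the slaves directory in sysfs can be listed. This is a small sketch; the function name `dm_slaves` and the optional sysfs-root argument are hypothetical and only added for illustration.

```shell
#!/bin/sh
# Sketch: print the block devices that back a device-mapper node.
# The function name and the second argument are hypothetical.

dm_slaves() {
    dm="$1"              # e.g. dm-2
    sysfs="${2:-/sys}"   # sysfs root; overridable for illustration
    for s in "$sysfs/block/$dm/slaves"/*; do
        # Entries are normally symlinks; skip the unexpanded glob if there are none.
        [ -e "$s" ] || [ -L "$s" ] || continue
        basename "$s"
    done
}
```

Comparing the output of `dm_slaves dm-2` with the node the disk currently has in lsblk shows the mismatch described above: the mapping names /dev/sdl while the disk is now /dev/sdj.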
Updated by Alfredo Deza almost 6 years ago
- Status changed from Can't reproduce to New
Updated by Alfredo Deza almost 6 years ago
Thanks Aleksei for providing extra information. Would it be possible for you to try another thing here?
sudo lvchange --refresh vg/lv
According to the LVM docs, it should help update the device mapper to point to the right device.
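In that command, vg/lv stands for the volume group and logical volume names. As a sketch, they can be taken from an LV path of the form /dev/VG/LV, such as the one listed in step 2.1 above. The helper name `refresh_cmd` is hypothetical, and it only prints the command rather than running it. Whether the refresh resolves the stale mapping was not confirmed in this thread.

```shell
#!/bin/sh
# Sketch: build the lvchange invocation for an LV given as /dev/VG/LV.
# The function name is hypothetical; paths under /dev/mapper are not handled.

refresh_cmd() {
    lv_path="$1"
    vg_lv="${lv_path#/dev/}"   # strip the leading /dev/ to get VG/LV
    echo "sudo lvchange --refresh $vg_lv"
}
```

For example, `refresh_cmd /dev/ceph-fda7bc3b-0047-45c8-8f16-5a8764664a9f/osd-block-52032eb1-698d-409c-94ee-385c76825638` prints the full command for the volume from the report.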
Updated by Aleksei Gutikov almost 6 years ago
Sure, I'll reproduce it again and try this command.
Updated by Alfredo Deza almost 6 years ago
- Status changed from New to Need More Info
Aleksei, any luck?
Updated by Alfredo Deza almost 6 years ago
- Status changed from Need More Info to Closed