Bug #47502
ceph-volume lvm batch race condition (Closed)
Description
When using ceph-volume to create OSDs, some SSDs fail to report the correct rotational state. After some debugging with @Guillaume Abrioux, we are pretty sure it's a race condition.

I added some prints in that function and ran the report a few times:
https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/util/device.py#L295-L302

Rotational state of the devices across those runs:

  9496  sys_api: 0
   104  sys_api: 1
  9580  disk_api: 0
    20  disk_api: 1

After some more digging, I see that udevadm monitor shows change events being triggered on every run of ceph-volume.
Related ceph-ansible issue:
https://github.com/ceph/ceph-ansible/issues/5727
# ceph-volume --cluster ceph lvm batch --bluestore --yes --dmcrypt /dev/sdf /dev/sdo /dev/sdw /dev/sdd /dev/sdm /dev/sdu /dev/sdb /dev/sdk /dev/sds /dev/sdi /dev/sdq /dev/sdg /dev/sdx /dev/sde /dev/sdn /dev/sdv /dev/sdc /dev/sdl /dev/sdt /dev/sda /dev/sdj /dev/sdr /dev/sdh /dev/sdp --report

Total OSDs: 24

  Type        Path          LV Size     % of device
  ---------------------------------------------------
  [data]      /dev/sdf      3.49 TB     100.0%
  [data]      /dev/sdo      3.49 TB     100.0%
  [data]      /dev/sdw      3.49 TB     100.0%
  [data]      /dev/sdd      3.49 TB     100.0%
  [data]      /dev/sdm      3.49 TB     100.0%
  [data]      /dev/sdu      3.49 TB     100.0%
  [data]      /dev/sdb      3.49 TB     100.0%
  [data]      /dev/sdk      3.49 TB     100.0%
  [data]      /dev/sds      3.49 TB     100.0%
  [data]      /dev/sdi      3.49 TB     100.0%
  [data]      /dev/sdq      3.49 TB     100.0%
  [data]      /dev/sdg      3.49 TB     100.0%
  [data]      /dev/sdx      3.49 TB     100.0%
  [data]      /dev/sde      3.49 TB     100.0%
  [data]      /dev/sdn      3.49 TB     100.0%
  [data]      /dev/sdv      3.49 TB     100.0%
  [data]      /dev/sdc      3.49 TB     100.0%
  [data]      /dev/sdl      3.49 TB     100.0%
  [data]      /dev/sdt      3.49 TB     100.0%
  [data]      /dev/sda      3.49 TB     100.0%
  [data]      /dev/sdj      3.49 TB     100.0%
  [data]      /dev/sdr      3.49 TB     100.0%
  [data]      /dev/sdh      3.49 TB     100.0%
  [data]      /dev/sdp      3.49 TB     100.0%
# ceph-volume --cluster ceph lvm batch --bluestore --yes --dmcrypt /dev/sdf /dev/sdo /dev/sdw /dev/sdd /dev/sdm /dev/sdu /dev/sdb /dev/sdk /dev/sds /dev/sdi /dev/sdq /dev/sdg /dev/sdx /dev/sde /dev/sdn /dev/sdv /dev/sdc /dev/sdl /dev/sdt /dev/sda /dev/sdj /dev/sdr /dev/sdh /dev/sdp --report

Total OSDs: 1

Solid State VG:
  Targets:     block.db
  Total size:  80.30 TB
  Total LVs:   1
  Size per LV: 80.30 TB
  Devices:     /dev/sdo, /dev/sdw, /dev/sdd, /dev/sdm, /dev/sdu, /dev/sdb, /dev/sdk, /dev/sds, /dev/sdi, /dev/sdq, /dev/sdg, /dev/sdx, /dev/sde, /dev/sdn, /dev/sdv, /dev/sdc, /dev/sdl, /dev/sdt, /dev/sda, /dev/sdj, /dev/sdr, /dev/sdh, /dev/sdp

  Type          Path           LV Size      % of device
  ------------------------------------------------------
  [data]        /dev/sdf       3.49 TB      100.0%
  [block.db]    vg: vg/lv      80.30 TB     100%
Updated by Jan Fajerski over 3 years ago
Interesting. Can you let us know the Ceph version, please (ceph version or ceph versions output)?
ceph-volume gets this flag from /sys/block/<device name>/queue/rotational (preferred) and falls back to what lsblk -o ROTA reports.

Any chance you could check whether the kernel and lsblk report consistently? Something like:

  for i in {1..1000}; do cat /sys/block/sda/queue/rotational >> rot; done; cat rot | sort -u
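That lookup order can be sketched roughly as follows (a hypothetical helper, not the actual ceph-volume code; the `sysfs_root` parameter is added here only to make the fallback easy to exercise):

```python
import subprocess

def rotational(dev_name, sysfs_root="/sys/block"):
    """Return '1' (spinning) or '0' (solid state) for e.g. 'sda'.

    Prefers the kernel's sysfs value and falls back to the ROTA
    column reported by lsblk, mirroring the order described above.
    """
    path = "{}/{}/queue/rotational".format(sysfs_root, dev_name)
    try:
        # preferred source: ask the kernel directly via sysfs
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # fallback: lsblk's ROTA column for the bare device node
        out = subprocess.check_output(
            ["lsblk", "--nodeps", "-no", "ROTA", "/dev/" + dev_name])
        return out.decode().strip()
```

If both sources are racing against udev-triggered re-probes of the device, either one can momentarily return the wrong value, which is what the sys_api/disk_api counts above show.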
Updated by Dimitri Savineau over 3 years ago
You can find more information at [1] for Octopus
But this has also been reported with Nautilus on CentOS 7 [2]
And even more on Luminous and RHEL 7 [3] (private)
[1] https://github.com/ceph/ceph-ansible/issues/5727
[2] https://github.com/ceph/ceph-ansible/issues/5412
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1878500
Updated by Marc Gariépy over 3 years ago
# ceph versions
{
    "mon": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 324
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 330
    }
}

For some reason I couldn't find the root cause, but running ceph-volume does trigger a udev change. Running udevadm monitor and then ceph-volume lvm batch, we see change events on every device:
- One weird behavior I found: sda and sdb get the change event once, but all the other devices get it twice. (I'm not sure it's relevant here; it's just an odd inconsistency.)
UDEV  [428293.401463] change   /devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:22/end_device-0:0:22/target0:0:22/0:0:22:0/block/sdw (block)
ACTION=change
DEVPATH=/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:22/end_device-0:0:22/target0:0:22/0:0:22:0/block/sdw
SUBSYSTEM=block
SYNTH_UUID=0
DEVNAME=/dev/sdw
DEVTYPE=disk
SEQNUM=67032
USEC_INITIALIZED=5112551
SCSI_TPGS=0
SCSI_TYPE=disk
SCSI_VENDOR=ATA
SCSI_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
SCSI_MODEL=HFS3T8G32FEH-741
SCSI_MODEL_ENC=HFS3T8G32FEH-741
SCSI_REVISION=DE03
ID_SCSI=1
ID_SCSI_INQUIRY=1
ID_VENDOR=ATA
ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
ID_MODEL=HFS3T8G32FEH-741
ID_MODEL_ENC=HFS3T8G32FEH-741
ID_REVISION=DE03
ID_TYPE=disk
SCSI_IDENT_SERIAL=FJ03N6706I0403I6F
SCSI_IDENT_LUN_NAA_REG=5ace42e025100e0f
SCSI_IDENT_PORT_NAA_REG=500056b3ba7069d6
SCSI_IDENT_PORT_RELATIVE=0
ID_WWN=0x5ace42e025100e0f
ID_WWN_WITH_EXTENSION=0x5ace42e025100e0f
ID_BUS=scsi
ID_SERIAL=35ace42e025100e0f
ID_SERIAL_SHORT=5ace42e025100e0f
MPATH_SBIN_PATH=/sbin
.SAVED_FM_WAIT_UNTIL=
DM_MULTIPATH_DEVICE_PATH=0
ID_PATH=pci-0000:66:00.0-sas-exp0x500056b3ba7069ff-phy22-lun-0
ID_PATH_TAG=pci-0000_66_00_0-sas-exp0x500056b3ba7069ff-phy22-lun-0
.ID_FS_TYPE_NEW=
ID_FS_TYPE=
MAJOR=65
MINOR=96
DEVLINKS=/dev/disk/by-id/wwn-0x5ace42e025100e0f /dev/disk/by-id/scsi-SATA_HFS3T8G32FEH-741_FJ03N6706I0403I6F /dev/disk/by-path/pci-0000:66:00.0-sas-exp0x500056b3ba7069ff-phy22-lun-0 /dev/disk/by-id/scsi-35ace42e025100e0f
TAGS=:systemd:
I tried to reproduce the udev change by running all the commands that ceph-volume runs, but none of them triggered the change event.

Then I ran inotifywait on the device's sysfs directory to see what got written to it:
# inotifywait -m /sys/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:20/end_device-0:0:20/target0:0:20/0:0:20:0/block/sdu >> /tmp/test.log
### run ceph-volume command
# grep CLOSE_WRITE /tmp/test.log
/sys/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:20/end_device-0:0:20/target0:0:20/0:0:20:0/block/sdu/ CLOSE_WRITE,CLOSE uevent
/sys/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:20/end_device-0:0:20/target0:0:20/0:0:20:0/block/sdu/ CLOSE_WRITE,CLOSE uevent
I'm available on IRC today if you want to chat.
Updated by Marc Gariépy over 3 years ago
When the device does not get a udev change event, reading rotational always seems to return the correct value:

# for i in {1..1000}; do cat /sys/block/sda/queue/rotational >> rot; done; cat rot | sort -u
0
Updated by Marc Gariépy over 3 years ago
Another debug session with @Guillaume Abrioux
@Guillaume Abrioux found that the udev event is triggered by:
https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/util/disk.py#L795
Running a loop of:

    import time

    import ceph_volume.util.disk

    for i in range(1, 1000):
        ceph_volume.util.disk.is_locked_raw_device("/dev/sdu")
        time.sleep(.02)
while running a cat of /sys/block/sdu/queue/rotational in a loop gets the same inconsistency in the results (a few 1s in the output). If I patch the function to return 0 at that line, I cannot reproduce the issue, since udev doesn't get a change event.

Maybe an ugly hack would be a simple sleep to let the system settle a bit.
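To frame the mechanism: an exclusive-open probe of the kind disk.py#L795 appears to implement can be sketched like this (hypothetical code, not the upstream function). The open/close cycle on the device node is plausibly what systemd-udevd's inotify watch reacts to by re-synthesizing a "change" uevent, which would match both the SYNTH_UUID=0 in the udevadm output and the CLOSE_WRITE hits on uevent seen with inotifywait above:

```python
import errno
import os

def probe_locked(dev_path):
    """Return True if dev_path is exclusively held by another process.

    Sketch of an O_EXCL open-probe: on Linux, O_EXCL without O_CREAT
    on a block device means "fail with EBUSY if someone else holds an
    exclusive reference". The probe itself opens and closes the node,
    which is the event udevd's watch can turn into a change uevent.
    """
    try:
        fd = os.open(dev_path, os.O_RDWR | os.O_EXCL)
    except OSError as e:
        if e.errno == errno.EBUSY:
            return True  # exclusively held elsewhere
        raise
    os.close(fd)  # this close is the side effect the watcher sees
    return False
```

As a stopgap, waiting for the event queue to drain (e.g. `udevadm settle`) would be less fragile than a fixed sleep, but not probing the device at all while rotational is being read is the cleaner fix.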
Updated by Jan Fajerski over 3 years ago
- Has duplicate Bug #47553: OSDs being deployed bluestore mixed type for SSD disks added
Updated by Marc Gariépy over 3 years ago
Yet another debugging session with @Guillaume Abrioux.

https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L146 << this line is causing the race condition
https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L270 << this is the help function
https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L266 << this is called by the argparse instance

In effect it takes a second exclusive lock on each device just to probe whether the device is locked. By removing the function and putting a plain string in the description, it works correctly every time.
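The shape of the bug can be shown with a toy parser (hypothetical names; the real helper is the one linked above at batch.py#L270). Because argparse descriptions are plain strings, a helper used to build one runs at parser-construction time, so every invocation touches all devices before any subcommand logic runs:

```python
import argparse

PROBED = []  # records which devices the "report" helper touched

def device_report(paths):
    # stand-in for the report helper: imagine each path being
    # exclusively opened here, the way the lock probe did
    PROBED.extend(paths)
    return "Devices: " + ", ".join(paths)

def build_parser(paths):
    # the report is evaluated eagerly, merely to render --help text,
    # so the side effect fires on every run of the command
    return argparse.ArgumentParser(description=device_report(paths))

parser = build_parser(["/dev/sda", "/dev/sdb"])
# at this point the devices have already been probed, even though
# no arguments were parsed and no subcommand ran
```

Replacing the helper call with a static description string, as described above, removes the side effect entirely.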
Updated by Jan Fajerski over 3 years ago
- Related to Bug #47585: cache disk.get_devices return value added
Updated by Jan Fajerski over 3 years ago
- Status changed from New to Pending Backport
- Backport set to octopus,nautilus
Updated by Jan Fajerski over 3 years ago
- Copied to Backport #47649: octopus: ceph-volume lvm batch race condition added
Updated by Jan Fajerski over 3 years ago
- Copied to Backport #47650: nautilus: ceph-volume lvm batch race condition added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".