Bug #47502


ceph-volume lvm batch race condition

Added by Marc Gariépy over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When using ceph-volume to create the OSDs, some SSDs fail to report the correct rotational state.

After some debugging with @Guillaume Abrioux, we are pretty sure it's a race condition.

I added some prints in that function and ran the report a few times:
https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/util/device.py#L295-L302

 
Rotational state reported for the devices across those runs:

   9496 sys_api: 0
    104 sys_api: 1
   9580 disk_api: 0
     20 disk_api: 1
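The lookup in the linked util/device.py code consults a cached sys_api value and a disk_api value, which is what the two counters above are printing. A minimal sketch of that preference order (the dict names are taken from the counters above; the exact upstream logic may differ):

```python
def is_rotational(sys_api, disk_api):
    """Prefer the cached sys_api value; fall back to the disk_api value.
    Sketch only, assuming both arguments are dicts like the ones printed
    in the debug counts above."""
    rotational = sys_api.get('rotational')
    if rotational is None:
        rotational = disk_api.get('rotational')
    return str(rotational) == '1'
```
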

After some more digging, I see that udevadm monitor shows change events triggered at every run of ceph-volume.

Related ceph-ansible issue:
https://github.com/ceph/ceph-ansible/issues/5727

# ceph-volume --cluster ceph lvm batch --bluestore --yes --dmcrypt /dev/sdf /dev/sdo /dev/sdw /dev/sdd /dev/sdm /dev/sdu /dev/sdb /dev/sdk /dev/sds /dev/sdi /dev/sdq /dev/sdg /dev/sdx /dev/sde /dev/sdn /dev/sdv /dev/sdc /dev/sdl /dev/sdt /dev/sda /dev/sdj /dev/sdr /dev/sdh /dev/sdp --report

Total OSDs: 24

  Type            Path                                                    LV Size         % of device
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdf                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdo                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdw                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdd                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdm                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdu                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdb                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdk                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sds                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdi                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdq                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdg                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdx                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sde                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdn                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdv                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdc                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdl                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdt                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sda                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdj                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdr                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdh                                                3.49 TB         100.0%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdp                                                3.49 TB         100.0%

The exact same command run again reports a completely different layout (the devices are now misdetected):

# ceph-volume --cluster ceph lvm batch --bluestore --yes --dmcrypt /dev/sdf /dev/sdo /dev/sdw /dev/sdd /dev/sdm /dev/sdu /dev/sdb /dev/sdk /dev/sds /dev/sdi /dev/sdq /dev/sdg /dev/sdx /dev/sde /dev/sdn /dev/sdv /dev/sdc /dev/sdl /dev/sdt /dev/sda /dev/sdj /dev/sdr /dev/sdh /dev/sdp --report

Total OSDs: 1

Solid State VG:
  Targets:   block.db                  Total size: 80.30 TB                 
  Total LVs: 1                         Size per LV: 80.30 TB                 
  Devices:   /dev/sdo, /dev/sdw, /dev/sdd, /dev/sdm, /dev/sdu, /dev/sdb, /dev/sdk, /dev/sds, /dev/sdi, /dev/sdq, /dev/sdg, /dev/sdx, /dev/sde, /dev/sdn, /dev/sdv, /dev/sdc, /dev/sdl, /dev/sdt, /dev/sda, /dev/sdj, /dev/sdr, /dev/sdh, /dev/sdp

  Type            Path                                                    LV Size         % of device
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdf                                                3.49 TB         100.0%
  [block.db]      vg: vg/lv                                               80.30 TB        100%

Related issues (4): 1 open, 3 closed

Related to ceph-volume - Bug #47585: cache disk.get_devices return value (New)
Has duplicate ceph-volume - Bug #47553: OSDs being deployed bluestore mixed type for SSD disks (Duplicate)
Copied to ceph-volume - Backport #47649: octopus: ceph-volume lvm batch race condition (Resolved, Jan Fajerski)
Copied to ceph-volume - Backport #47650: nautilus: ceph-volume lvm batch race condition (Resolved, Jan Fajerski)
Actions #1

Updated by Jan Fajerski over 3 years ago

Interesting. Can you let us know the Ceph version please (ceph version or ceph versions output)?

ceph-volume gets this flag from /sys/block/<device name>/queue/rotational (preferred), falling back to what lsblk -o ROTA reports.
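That fallback order can be sketched roughly as follows (a hypothetical helper, not the upstream code; the sysfs root is parameterized here purely so the logic is easy to exercise):

```python
import subprocess

def get_rotational(dev_name, sysfs_root="/sys/block"):
    """Return the rotational flag for a bare device name like 'sda':
    the kernel's sysfs attribute when readable, otherwise lsblk's
    ROTA column for the whole device."""
    try:
        with open(f"{sysfs_root}/{dev_name}/queue/rotational") as f:
            return f.read().strip()
    except OSError:
        out = subprocess.check_output(
            ["lsblk", "-d", "-n", "-o", "ROTA", f"/dev/{dev_name}"])
        return out.decode().strip()
```
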

Any chance you could check whether the kernel and lsblk report consistently?
Something like: for i in {1..1000}; do cat /sys/block/sda/queue/rotational >> rot; done; cat rot | sort -u

Actions #2

Updated by Dimitri Savineau over 3 years ago

You can find more information at [1] for Octopus

But this has also been reported with Nautilus on CentOS 7 [2]

And even more on Luminous and RHEL 7 [3] (private)

[1] https://github.com/ceph/ceph-ansible/issues/5727
[2] https://github.com/ceph/ceph-ansible/issues/5412
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1878500

Actions #3

Updated by Marc Gariépy over 3 years ago

# ceph versions
{
    "mon": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 324
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 330
    }
}
For some reason I couldn't find the root cause, but running ceph-volume does trigger a change in udev: running udevadm monitor while running ceph-volume lvm batch, we see there are change events on every device.
  • One weird behavior I found is that sda and sdb get changed once, but all the other devices get the event twice. (I'm not sure it's relevant here, it's just an odd inconsistency.)
UDEV  [428293.401463] change   /devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:22/end_device-0:0:22/target0:0:22/0:0:22:0/block/sdw (block)
ACTION=change
DEVPATH=/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:22/end_device-0:0:22/target0:0:22/0:0:22:0/block/sdw
SUBSYSTEM=block
SYNTH_UUID=0
DEVNAME=/dev/sdw
DEVTYPE=disk
SEQNUM=67032
USEC_INITIALIZED=5112551
SCSI_TPGS=0
SCSI_TYPE=disk
SCSI_VENDOR=ATA
SCSI_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
SCSI_MODEL=HFS3T8G32FEH-741
SCSI_MODEL_ENC=HFS3T8G32FEH-741
SCSI_REVISION=DE03
ID_SCSI=1
ID_SCSI_INQUIRY=1
ID_VENDOR=ATA
ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
ID_MODEL=HFS3T8G32FEH-741
ID_MODEL_ENC=HFS3T8G32FEH-741
ID_REVISION=DE03
ID_TYPE=disk
SCSI_IDENT_SERIAL=FJ03N6706I0403I6F
SCSI_IDENT_LUN_NAA_REG=5ace42e025100e0f
SCSI_IDENT_PORT_NAA_REG=500056b3ba7069d6
SCSI_IDENT_PORT_RELATIVE=0
ID_WWN=0x5ace42e025100e0f
ID_WWN_WITH_EXTENSION=0x5ace42e025100e0f
ID_BUS=scsi
ID_SERIAL=35ace42e025100e0f
ID_SERIAL_SHORT=5ace42e025100e0f
MPATH_SBIN_PATH=/sbin
.SAVED_FM_WAIT_UNTIL=
DM_MULTIPATH_DEVICE_PATH=0
ID_PATH=pci-0000:66:00.0-sas-exp0x500056b3ba7069ff-phy22-lun-0
ID_PATH_TAG=pci-0000_66_00_0-sas-exp0x500056b3ba7069ff-phy22-lun-0
.ID_FS_TYPE_NEW=
ID_FS_TYPE=
MAJOR=65
MINOR=96
DEVLINKS=/dev/disk/by-id/wwn-0x5ace42e025100e0f /dev/disk/by-id/scsi-SATA_HFS3T8G32FEH-741_FJ03N6706I0403I6F /dev/disk/by-path/pci-0000:66:00.0-sas-exp0x500056b3ba7069ff-phy22-lun-0 /dev/disk/by-id/scsi-35ace42e025100e0f
TAGS=:systemd:

I tried to reproduce the udev change by running all the commands that ceph-volume runs, but none of them triggered the change.

Then I ran inotifywait on the device folder to see what got written to it:

# inotifywait -m /sys/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:20/end_device-0:0:20/target0:0:20/0:0:20:0/block/sdu >> /tmp/test.log
### run ceph-volume command
# grep CLOSE_WRITE /tmp/test.log
/sys/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:20/end_device-0:0:20/target0:0:20/0:0:20:0/block/sdu/ CLOSE_WRITE,CLOSE uevent
/sys/devices/pci0000:64/0000:64:02.0/0000:66:00.0/host0/port-0:0/expander-0:0/port-0:0:20/end_device-0:0:20/target0:0:20/0:0:20:0/block/sdu/ CLOSE_WRITE,CLOSE uevent
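The CLOSE_WRITE on the device's uevent file is telling: writing an action name to a sysfs uevent file is how userspace asks the kernel to emit a synthetic event, which matches the SYNTH_UUID=0 in the udevadm monitor output above. A sketch of that mechanism (the path is parameterized purely for illustration; on a real device it would be something like /sys/block/sdu/uevent and require root):

```python
def request_change_uevent(uevent_path):
    """Request a synthetic 'change' uevent by writing the action name
    to a device's sysfs uevent file."""
    with open(uevent_path, "w") as f:
        f.write("change\n")
```
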

I'm available on IRC today if you want to chat.

Actions #4

Updated by Marc Gariépy over 3 years ago

Reading the rotational flag while the udev device is not being changed always seems to return the correct value:

# for i in {1..1000}; do cat /sys/block/sda/queue/rotational >> rot; done; cat rot | sort -u
0
Actions #5

Updated by Marc Gariépy over 3 years ago

Another debug session with @Guillaume Abrioux

@Guillaume Abrioux found that the udev event is triggered by :
- https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/util/disk.py#L795

running a loop of:

import ceph_volume.util.disk
import time

for i in range(1, 1000):
    ceph_volume.util.disk.is_locked_raw_device("/dev/sdu")
    time.sleep(.02)

while doing the cat of /sys/block/sdu/queue/rotational in a loop produces the same inconsistency in the result (a few 1s in the output).

If I patch the function to return 0 on this line:

I cannot reproduce the issue, since udev doesn't get a change event.

Maybe an ugly hack would be a simple sleep to let the system settle a bit.
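A slightly more targeted variant of that hack would be to re-read the flag until two consecutive reads agree, rather than sleeping a fixed time. A hypothetical sketch (not the fix that was eventually merged):

```python
import time

def read_rotational_settled(read_fn, retries=5, delay=0.1):
    """Re-read the rotational flag until two consecutive reads agree,
    as a crude way to ride out an in-flight udev change event.
    'read_fn' is any zero-argument callable returning the current value,
    e.g. lambda: open('/sys/block/sdu/queue/rotational').read().strip().
    Returns the last value seen if it never stabilizes."""
    last = read_fn()
    for _ in range(retries):
        time.sleep(delay)
        current = read_fn()
        if current == last:
            return current
        last = current
    return last
```
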

Actions #6

Updated by Jan Fajerski over 3 years ago

  • Has duplicate Bug #47553: OSDs being deployed bluestore mixed type for SSD disks added
Actions #7

Updated by Marc Gariépy over 3 years ago

Yet another debugging session with @Guillaume Abrioux.

https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L146 << this line is causing the race condition

https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L270 << this is the helper function

https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L266 << this is called by the argparse instance.

Together these mostly do a double open of the device to probe whether it's locked.

By removing the function and using a plain string in the description, it works correctly every time.
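For context, the locked-device probe works roughly by attempting an exclusive open of the raw device and checking whether it fails. A sketch along the lines of disk.py's is_locked_raw_device (the exact upstream code may differ); the thread above identifies this per-device open/close, run during batch argument parsing, as what ends up generating the udev change events:

```python
import os

def is_locked_raw_device(disk_path):
    """Return 1 if an exclusive open of the device fails (something else
    holds it, or it doesn't exist), 0 otherwise."""
    try:
        fd = os.open(disk_path, os.O_RDONLY | os.O_EXCL)
    except OSError:
        return 1
    try:
        os.close(fd)
    except OSError:
        return 1
    return 0
```
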

Actions #8

Updated by Jan Fajerski over 3 years ago

  • Related to Bug #47585: cache disk.get_devices return value added
Actions #9

Updated by Jan Fajerski over 3 years ago

  • Status changed from New to Pending Backport
  • Backport set to octopus,nautilus
Actions #10

Updated by Jan Fajerski over 3 years ago

  • Pull request ID set to 37311
Actions #11

Updated by Jan Fajerski over 3 years ago

  • Copied to Backport #47649: octopus: ceph-volume lvm batch race condition added
Actions #12

Updated by Jan Fajerski over 3 years ago

  • Copied to Backport #47650: nautilus: ceph-volume lvm batch race condition added
Actions #13

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
