Bug #61597
ceph-volume lvm batch fails in a container with environment variable `DM_DISABLE_UDEV=1`
Description
2023-06-06 10:17:42.630239 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --crush-device-class hdd /dev/vdb --db-devices /dev/rookvg0/metadata0 --report
2023-06-06 10:17:44.384011 D | exec: --> passed data devices: 1 physical, 0 LVM
2023-06-06 10:17:44.384100 D | exec: --> relative data size: 1.0
2023-06-06 10:17:44.384110 D | exec: --> passed block_db devices: 0 physical, 1 LVM
2023-06-06 10:17:44.390411 D | exec: Traceback (most recent call last):
2023-06-06 10:17:44.390462 D | exec: File "/usr/sbin/ceph-volume", line 11, in <module>
2023-06-06 10:17:44.390483 D | exec: load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2023-06-06 10:17:44.390496 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
2023-06-06 10:17:44.390504 D | exec: self.main(self.argv)
2023-06-06 10:17:44.390512 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2023-06-06 10:17:44.390519 D | exec: return f(*a, **kw)
2023-06-06 10:17:44.390526 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
2023-06-06 10:17:44.390534 D | exec: terminal.dispatch(self.mapper, subcommand_args)
2023-06-06 10:17:44.390542 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2023-06-06 10:17:44.390549 D | exec: instance.main()
2023-06-06 10:17:44.390557 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
2023-06-06 10:17:44.390565 D | exec: terminal.dispatch(self.mapper, self.argv)
2023-06-06 10:17:44.390573 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2023-06-06 10:17:44.390581 D | exec: instance.main()
2023-06-06 10:17:44.390589 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
2023-06-06 10:17:44.390597 D | exec: return func(*a, **kw)
2023-06-06 10:17:44.390605 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 428, in main
2023-06-06 10:17:44.390613 D | exec: plan = self.get_plan(self.args)
2023-06-06 10:17:44.390622 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 466, in get_plan
2023-06-06 10:17:44.390718 D | exec: args.wal_devices)
2023-06-06 10:17:44.390730 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 498, in get_deployment_layout
2023-06-06 10:17:44.390739 D | exec: fast_type)
2023-06-06 10:17:44.390756 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 535, in fast_allocations
2023-06-06 10:17:44.390772 D | exec: ret.extend(get_lvm_fast_allocs(lvm_devs))
2023-06-06 10:17:44.390781 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 164, in get_lvm_fast_allocs
2023-06-06 10:17:44.390790 D | exec: disk.Size(b=int(d.lvs[0].lv_size)), 1) for d in lvs if not
2023-06-06 10:17:44.390917 D | exec: File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 165, in <listcomp>
2023-06-06 10:17:44.390930 D | exec: d.used_by_ceph]
2023-06-06 10:17:44.390938 D | exec: IndexError: list index out of range
2023-06-06 10:17:44.539864 C | rookcmd: failed to configure devices: failed to initialize osd: failed ceph-volume report: exit status 1
Updated by Jerry Pu 12 months ago
[background]
We had a Ceph cluster running on a k3s (lightweight Kubernetes) cluster. The Ceph OSDs were deployed by one job container per host, and each container had the environment variable DM_DISABLE_UDEV=1 set. The job container on each host created some Logical Volumes (LVs) to be used as the OSDs' DB devices and then ran the ceph-volume tool in lvm mode to deploy the OSDs.
The environment variable DM_DISABLE_UDEV determines which real path an LV's symlink path resolves to.
When DM_DISABLE_UDEV=1, the mapping from an LV's symlink path to its real path looks like: /dev/vg/lv --> /dev/mapper/vg-lv
When DM_DISABLE_UDEV=0, the mapping from an LV's symlink path to its real path looks like: /dev/vg/lv --> /dev/dm-x
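The two mappings can be imitated without LVM by building a fake /dev tree out of ordinary files and symlinks. The sketch below is only an illustration of how os.path.realpath behaves in each layout: the helper names build_fake_dev and resolved_name are made up, the layout is modeled on the description above, and it assumes a POSIX system where symlinks can be created.

```python
import os
import tempfile


def build_fake_dev(root, udev_disabled):
    """Create a fake /dev tree under `root`; return the <vg>/<lv> symlink path."""
    os.makedirs(os.path.join(root, "testvg"))
    lv_link = os.path.join(root, "testvg", "testlv")
    if udev_disabled:
        # DM_DISABLE_UDEV=1 layout: the symlink targets /dev/mapper/<vg>-<lv>.
        os.makedirs(os.path.join(root, "mapper"))
        node = os.path.join(root, "mapper", "testvg-testlv")
    else:
        # DM_DISABLE_UDEV=0 layout: the symlink targets /dev/dm-x.
        node = os.path.join(root, "dm-0")
    open(node, "w").close()  # a regular file stands in for the device node
    os.symlink(node, lv_link)
    return lv_link


def resolved_name(root, udev_disabled):
    """Return the real path of the fake LV, relative to the fake /dev root."""
    root = os.path.realpath(root)  # the temp dir itself may sit behind a symlink
    lv_link = build_fake_dev(root, udev_disabled)
    return os.path.relpath(os.path.realpath(lv_link), root)


if __name__ == "__main__":
    for flag in (True, False):
        with tempfile.TemporaryDirectory() as root:
            print("DM_DISABLE_UDEV=%d:" % flag, resolved_name(root, flag))
```

In the first layout the resolved path contains no "dm-" substring, which is what matters for the check in ceph-volume discussed in this ticket.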
[The problem]
Because the job containers have the environment variable DM_DISABLE_UDEV=1, the symlink path to real path mapping of the LVs that will be used as the OSDs' DB devices looks like: /dev/vg/lv --> /dev/mapper/vg-lv. The problem occurs in the following ceph-volume code snippets, when ceph-volume calls get_single_lv(...) to get the LV info. An example LV "/dev/testvg/testlv" is used to illustrate the problem.
# ceph/src/ceph-volume/ceph_volume/util/device.py
<---snippets--->
    def __init__(self, path, with_lsm=False, lvs=None, lsblk_all=None, all_devices_vgs=None):
        self.path = path  # self.path -> "/dev/testvg/testlv"
        # LVs can have a vg/lv path, while disks will have /dev/sda
        self.symlink = None
        # check if we are a symlink
        if os.path.islink(self.path):
            self.symlink = self.path  # self.symlink -> "/dev/testvg/testlv"
            real_path = os.path.realpath(self.path)  # real_path -> "/dev/mapper/testvg-testlv"
            # check if we are not a device mapper
            if "dm-" not in real_path:
                self.path = real_path  # self.path -> "/dev/mapper/testvg-testlv"
<---snippets--->
        if self.path[0] == '/':  # self.path -> "/dev/mapper/testvg-testlv"
            lv = lvm.get_single_lv(filters={'lv_path': self.path})  # the returned value would be "None"
        else:
            vgname, lvname = self.path.split('/')
            lv = lvm.get_single_lv(filters={'lv_name': lvname,
                                            'vg_name': vgname})
<---snippets--->
The only LV path format that the filter key 'lv_path' accepts is /dev/<vg-name>/<lv-name>; neither /dev/mapper/<vg-name>-<lv-name> nor /dev/dm-x is valid. Therefore, in the above example, get_single_lv(...) returns valid LV info if the value of self.path is still "/dev/testvg/testlv".
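The three path shapes mentioned above can be told apart with a purely syntactic check. The following is a minimal sketch; is_vg_lv_path is a hypothetical helper, not part of ceph-volume, and it only looks at the shape of the string (other two-component paths under /dev would also match), so it does not prove that the path is an LV.

```python
import re

# Matches /dev/<vg-name>/<lv-name> but rejects anything under /dev/mapper/.
_VG_LV_PATH = re.compile(r"^/dev/(?!mapper/)[^/]+/[^/]+$")


def is_vg_lv_path(path):
    """Return True if `path` has the /dev/<vg-name>/<lv-name> shape."""
    return bool(_VG_LV_PATH.match(path))


if __name__ == "__main__":
    for candidate in ("/dev/testvg/testlv",
                      "/dev/mapper/testvg-testlv",
                      "/dev/dm-3"):
        print(candidate, is_vg_lv_path(candidate))
```

Only the first of the three candidates has the shape that the 'lv_path' filter can match.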
Updated by Jerry Pu 12 months ago
A possible fix is to not convert an LV's symlink to its real path. First check whether the given path is an LV. If the path represents an LV, do not convert it to its real path.
See the PR for more details. https://github.com/ceph/ceph/pull/51954
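The idea can be sketched as follows. This is not the code from the PR: choose_device_path is a hypothetical function, and is_lv is a caller-supplied predicate standing in for whatever LVM query decides that a path is an LV. The "dm-" guard is kept as in the snippet quoted earlier in this ticket.

```python
import os


def choose_device_path(path, is_lv):
    """Return the path to keep for later lookups.

    `is_lv` is a hypothetical callable taking a path and returning True
    when that path is a Logical Volume.
    """
    if not os.path.islink(path):
        return path
    if is_lv(path):
        # Keep /dev/<vg-name>/<lv-name> so a filter on lv_path can match it.
        return path
    real_path = os.path.realpath(path)
    # Same guard as the quoted snippet: only switch to non-dm real paths.
    if "dm-" not in real_path:
        return real_path
    return path
```

With a predicate that reports the example LV as an LV, the function returns "/dev/testvg/testlv" unchanged, regardless of whether the symlink resolves to /dev/mapper/testvg-testlv or to /dev/dm-x.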
Updated by Guillaume Abrioux 11 months ago
- Status changed from New to Fix Under Review
- Assignee set to Jerry Pu
- Backport set to reef,quincy,pacific