Bug #61597

ceph-volume lvm batch fails in a container with environment variable `DM_DISABLE_UDEV=1`

Added by Jerry Pu 11 months ago. Updated 8 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Jerry Pu
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
51954
Crash signature (v1):
Crash signature (v2):

Description

2023-06-06 10:17:42.630239 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --crush-device-class hdd /dev/vdb --db-devices /dev/rookvg0/metadata0 --report
2023-06-06 10:17:44.384011 D | exec: --> passed data devices: 1 physical, 0 LVM
2023-06-06 10:17:44.384100 D | exec: --> relative data size: 1.0
2023-06-06 10:17:44.384110 D | exec: --> passed block_db devices: 0 physical, 1 LVM
2023-06-06 10:17:44.390411 D | exec: Traceback (most recent call last):
2023-06-06 10:17:44.390462 D | exec:   File "/usr/sbin/ceph-volume", line 11, in <module>
2023-06-06 10:17:44.390483 D | exec:     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2023-06-06 10:17:44.390496 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
2023-06-06 10:17:44.390504 D | exec:     self.main(self.argv)
2023-06-06 10:17:44.390512 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2023-06-06 10:17:44.390519 D | exec:     return f(*a, **kw)
2023-06-06 10:17:44.390526 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
2023-06-06 10:17:44.390534 D | exec:     terminal.dispatch(self.mapper, subcommand_args)
2023-06-06 10:17:44.390542 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2023-06-06 10:17:44.390549 D | exec:     instance.main()
2023-06-06 10:17:44.390557 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
2023-06-06 10:17:44.390565 D | exec:     terminal.dispatch(self.mapper, self.argv)
2023-06-06 10:17:44.390573 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2023-06-06 10:17:44.390581 D | exec:     instance.main()
2023-06-06 10:17:44.390589 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
2023-06-06 10:17:44.390597 D | exec:     return func(*a, **kw)
2023-06-06 10:17:44.390605 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 428, in main
2023-06-06 10:17:44.390613 D | exec:     plan = self.get_plan(self.args)
2023-06-06 10:17:44.390622 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 466, in get_plan
2023-06-06 10:17:44.390718 D | exec:     args.wal_devices)
2023-06-06 10:17:44.390730 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 498, in get_deployment_layout
2023-06-06 10:17:44.390739 D | exec:     fast_type)
2023-06-06 10:17:44.390756 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 535, in fast_allocations
2023-06-06 10:17:44.390772 D | exec:     ret.extend(get_lvm_fast_allocs(lvm_devs))
2023-06-06 10:17:44.390781 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 164, in get_lvm_fast_allocs
2023-06-06 10:17:44.390790 D | exec:     disk.Size(b=int(d.lvs[0].lv_size)), 1) for d in lvs if not
2023-06-06 10:17:44.390917 D | exec:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 165, in <listcomp>
2023-06-06 10:17:44.390930 D | exec:     d.used_by_ceph]
2023-06-06 10:17:44.390938 D | exec: IndexError: list index out of range
2023-06-06 10:17:44.539864 C | rookcmd: failed to configure devices: failed to initialize osd: failed ceph-volume report: exit status 1
Actions #1

Updated by Jerry Pu 11 months ago

I am working to resolve the issue and will send a PR in a few days.

Actions #2

Updated by Jerry Pu 11 months ago

[background]

We had a Ceph cluster running on a k3s (lightweight Kubernetes) cluster. The Ceph OSDs were deployed by one job container per host, and the container had the environment variable DM_DISABLE_UDEV=1. The job containers on each host created some Logical Volumes (LVs) to be used as the OSDs' DB devices and then ran the ceph-volume tool in lvm mode to deploy the OSDs.

The environment variable DM_DISABLE_UDEV influences how an LV's symlink path resolves to its real path:
When DM_DISABLE_UDEV=1, an LV's symlink-to-real-path mapping looks like: /dev/vg/lv --> /dev/mapper/vg-lv
When DM_DISABLE_UDEV=0, an LV's symlink-to-real-path mapping looks like: /dev/vg/lv --> /dev/dm-x
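
For illustration, the difference shows up through os.path.realpath(), which is exactly the call ceph-volume makes (a minimal sketch; the device names are the example ones from this report, and the dm number is hypothetical):

import os

lv_symlink = '/dev/testvg/testlv'        # example LV from this report
real_path = os.path.realpath(lv_symlink)
# With DM_DISABLE_UDEV=0 (udev active):   real_path is a dm node, e.g. /dev/dm-0
# With DM_DISABLE_UDEV=1 (udev bypassed): real_path is /dev/mapper/testvg-testlv
print(real_path)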

[The problem]

Because the job containers have the environment variable DM_DISABLE_UDEV=1, the symlink-to-real-path mapping of the LVs that will be used as the OSDs' DB devices looks like: /dev/vg/lv --> /dev/mapper/vg-lv. The problem occurs in the following ceph-volume code snippets when ceph-volume calls get_single_lv(...) to get the LV info. An example LV "/dev/testvg/testlv" is used to illustrate the problem.

# ceph/src/ceph-volume/ceph_volume/util/device.py

<---snippets--->
102     def __init__(self, path, with_lsm=False, lvs=None, lsblk_all=None, all_devices_vgs=None):
103         self.path = path                                                 # self.path -> "/dev/testvg/testlv" 
104         # LVs can have a vg/lv path, while disks will have /dev/sda
105         self.symlink = None
106         # check if we are a symlink
107         if os.path.islink(self.path):
108             self.symlink = self.path                                     # self.symlink -> "/dev/testvg/testlv" 
109             real_path = os.path.realpath(self.path)                      # real_path -> "/dev/mapper/testvg-testlv" 
110             # check if we are not a device mapper
111             if "dm-" not in real_path:
112                 self.path = real_path                                    # self.path -> "/dev/mapper/testvg-testlv" 
<---snippets--->
204             if self.path[0] == '/':                                      # self.path -> "/dev/mapper/testvg-testlv" 
205                 lv = lvm.get_single_lv(filters={'lv_path': self.path})   # the returned value would be "None" 
206             else:
207                 vgname, lvname = self.path.split('/')
208                 lv = lvm.get_single_lv(filters={'lv_name': lvname,
209                                                 'vg_name': vgname})
<---snippets--->

The only LV path format that matches the filter key 'lv_path' is /dev/<vg-name>/<lv-name>, not /dev/mapper/<vg-name>-<lv-name> or /dev/dm-x. Therefore, in the example above, get_single_lv(...) only returns valid LV info if the value of self.path is "/dev/testvg/testlv". Because it returns None here, the Device's lvs list stays empty, and get_lvm_fast_allocs(...) later fails with the IndexError on d.lvs[0] shown in the traceback.
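
This can be checked against LVM directly (a minimal sketch, independent of ceph-volume, just to document LVM's reporting format):

import subprocess

# LVM always reports lv_path as /dev/<vg-name>/<lv-name>, never as a
# /dev/mapper/... or /dev/dm-* path.
out = subprocess.check_output(
    ['lvs', '--noheadings', '-o', 'lv_path,vg_name,lv_name'],
    universal_newlines=True)
for line in out.splitlines():
    if not line.strip():
        continue
    lv_path, vg_name, lv_name = line.split()
    assert lv_path == '/dev/{}/{}'.format(vg_name, lv_name)
    # So filters={'lv_path': '/dev/mapper/testvg-testlv'} matches nothing,
    # and get_single_lv(...) returns None.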

Actions #3

Updated by Jerry Pu 11 months ago

A possible fix is to not convert an LV's symlink to its real path. First check whether the given path is an LV; if it is, do not convert the path to its real path.

See the PR for more details: https://github.com/ceph/ceph/pull/51954
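
A rough sketch of the approach in Device.__init__ (an illustration only, not the exact diff; see the PR for the real change):

# ceph/src/ceph-volume/ceph_volume/util/device.py (sketch)

if os.path.islink(self.path):
    self.symlink = self.path
    # New: ask LVM first whether this path is an LV. The 'lv_path' filter
    # only matches the /dev/<vg>/<lv> form, which is what we have here.
    if lvm.get_single_lv(filters={'lv_path': self.path}) is None:
        # Not an LV, so resolving the symlink is safe.
        real_path = os.path.realpath(self.path)
        # check if we are not a device mapper
        if "dm-" not in real_path:
            self.path = real_path
    # If the path is an LV, keep self.path as /dev/<vg>/<lv> so the later
    # get_single_lv(filters={'lv_path': self.path}) lookup succeeds.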

Actions #4

Updated by Guillaume Abrioux 11 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Jerry Pu
  • Backport set to reef,quincy,pacific
Actions #5

Updated by Guillaume Abrioux 11 months ago

  • Pull request ID set to 51954
Actions #6

Updated by Jerry Pu 8 months ago

I've fixed this bug and updated the PR (51954).
