Bug #53846
ceph-volume should ignore /dev/rbd* devices
Status: Closed
Description
If RBD devices are mapped on ceph cluster nodes (as they may be if you're running an iSCSI gateway, for example), then ceph-volume inventory will list those RBD devices, and quite possibly report them as "available". This causes a couple of problems:
1) Because /dev/rbd0 appears in the list of available devices, the orchestrator will actually try to deploy OSDs on top of those RBD devices. Luckily, this will fail, because the various LVM invocations will die with "Device /dev/rbd0 excluded by a filter", but really we shouldn't even be trying to do this in the first place. Let's not rely on luck ;-)
2) It's possible for /dev/rbd* devices to be locked or stuck in such a way that when ceph-volume invokes blkid, it hangs indefinitely (the process ends up in D-state). This can block the entire orchestrator, because the orchestrator periodically calls out to cephadm to inventory devices, and cephadm tries to acquire a lock that it can't get because a prior invocation is still stuck running ceph-volume inventory.
I suggest we make ceph-volume completely ignore /dev/rbd* when doing a device inventory. I know we had a similar discussion on dev@ceph.io regarding ceph-volume listing, or not listing, GPT devices (see https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/N3TK4IO2QYHXIZMQTZ4AMPU5BE56J5MP/#T7UM53WCW2MDD62DDH6KLI4EZXKBXZBY) but the difference here is that mapped RBD volumes really aren't part of the host inventory, so IMO should be excluded.
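The suggested behaviour can be sketched as a simple path filter applied before any probing happens. This is a hypothetical illustration, not the actual ceph-volume implementation or the code from the linked pull request; the function names and the regex are assumptions.

```python
# Hypothetical sketch: exclude mapped RBD devices (e.g. /dev/rbd0, /dev/rbd1)
# from the device inventory before blkid/LVM ever touch them. This is NOT the
# real ceph-volume code, just an illustration of the proposed behaviour.
import re

# Matches /dev/rbd0, /dev/rbd12, and (as a prefix) partitions like /dev/rbd0p1.
RBD_DEVICE_RE = re.compile(r'^/dev/rbd\d+')

def is_rbd_device(path: str) -> bool:
    """Return True for mapped RBD block devices such as /dev/rbd0."""
    return bool(RBD_DEVICE_RE.match(path))

def filter_inventory(device_paths):
    """Drop /dev/rbd* entries so they are never probed or offered as OSDs."""
    return [p for p in device_paths if not is_rbd_device(p)]
```

Filtering on the device path this early means the orchestrator never sees RBD devices as deployment candidates, and ceph-volume never runs blkid against one, which also avoids the D-state hang described below.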
Updated by Michael Fritch over 2 years ago
- Status changed from New to Fix Under Review
- Assignee set to Michael Fritch
- Pull request ID set to 44604
Updated by Michael Fritch over 2 years ago
Attempting to open, run blkid against, etc. a stale RBD device (one that is gone but has not been unmapped) will cause c-v to hang in an uninterruptible "D" state.
Updated by Michael Fritch over 2 years ago
This can occur during inventory:
root     114583  0.0  0.3 144488 29188 ?  Ss   2021   0:00  \_ /usr/bin/python3.6 /usr/sbin/ceph-volume inventory --format=json --filter-for-batch
root     114616  0.0  0.0  11988  1036 ?  D    2021   0:00      \_ /usr/sbin/blkid -p /dev/rbd0
and during lvm batch:
root      48440  0.0  0.3 144492 29504 ?  Ss   Jan10  0:00  \_ /usr/bin/python3.6 /usr/sbin/ceph-volume lvm batch --no-auto /dev/rbd0 /dev/vdb /dev/vdc /dev/vdd /dev/vde /dev/vdf --yes --no-systemd
root      48477  0.0  0.0  11988   936 ?  D    Jan10  0:00      \_ /usr/sbin/blkid -p /dev/rbd0
which results in a cephadm instance never releasing its global lock, which in turn causes stuck/stale operations in the ceph orchestrator.
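A complementary mitigation (a sketch only, not the fix that was actually merged) would be to bound the external blkid call with a timeout, so a device stuck in D-state cannot hang the caller forever. A child already in uninterruptible sleep cannot be killed, so the stuck blkid process may linger, but the parent regains control and can release its lock. The helper name and timeout value here are assumptions.

```python
# Hypothetical mitigation sketch: run an external probe such as
# `blkid -p <dev>` with a timeout so the parent process is never blocked
# indefinitely by a device stuck in D-state. The stuck child may survive
# (D-state ignores signals), but the caller can move on.
import subprocess

def run_with_timeout(cmd, timeout=10.0):
    """Run `cmd`; return its stdout, or None if it exceeds `timeout` seconds."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None

# Example (hypothetical usage):
#   out = run_with_timeout(['blkid', '-p', '/dev/rbd0'])
#   if out is None: treat the device as unprobeable and skip it
```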
Updated by Guillaume Abrioux over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Guillaume Abrioux over 2 years ago
- Copied to Backport #53961: octopus: ceph-volume should ignore /dev/rbd* devices added
Updated by Guillaume Abrioux over 2 years ago
- Copied to Backport #53962: pacific: ceph-volume should ignore /dev/rbd* devices added
Updated by Guillaume Abrioux over 2 years ago
- Status changed from Pending Backport to Resolved