Bug #43701
systemd units for mon, mgr and mds fail to start with ms_type=async+rdma
Status: open
Description
At some point, the ceph systemd units ceph-mon@.service, ceph-mgr@.service and ceph-mds@.service adopted 'PrivateDevices=yes' in their [Service] stanza. This causes the services to fail with the following error when cluster messaging is switched to RDMA with ms_type=async+rdma:
DeviceList failed to get rdma device list. (19) No such device
/build/ceph-14.2.6/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(CephContext*)'
/build/ceph-14.2.6/src/msg/async/rdma/Infiniband.h: 106: ceph_abort_msg("abort() called")
Stracing the mon process with a modified systemd unit reveals the following:
2885913 stat("/sys/class/infiniband_verbs/abi_version", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
2885913 stat("/sys/class/infiniband_verbs/uverbs0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
2885913 openat(AT_FDCWD, "/sys/class/infiniband_verbs/uverbs0/ibdev", O_RDONLY|O_CLOEXEC) = 24
2885913 read(24, "mlx4_0\n", 64) = 7
2885913 close(24) = 0
2885913 stat("/sys/class/infiniband/mlx4_0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
2885913 stat("/dev/infiniband/uverbs0", 0x7faefcd479a0) = -1 ENOENT (No such file or directory)
The service crashes right after this ENOENT. However, /dev/infiniband/uverbs0 does exist on the host; it is only missing from the /dev that the service sees. From the PrivateDevices description (https://www.freedesktop.org/software/systemd/man/systemd.exec.html#PrivateDevices=):
"If true, sets up a new /dev mount for the executed processes and only adds API pseudo devices such as /dev/null, /dev/zero or /dev/random (as well as the pseudo TTY subsystem) to it, but no physical devices"
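To confirm that an installed unit file actually enables this setting, something like the following could be used. This is only a minimal sketch: the function name unit_sets_private_devices is hypothetical, the unit file path has to be supplied by the caller because it differs between distributions, and the check ignores drop-ins and repeated assignments.

```shell
#!/bin/sh
# Sketch: report whether a unit file contains a line enabling PrivateDevices.
# unit_sets_private_devices is a hypothetical helper, not part of ceph or systemd.
# It accepts the boolean spellings systemd treats as true (1, yes, true, on)
# and only inspects the single file passed as the first argument.
unit_sets_private_devices() {
    grep -Eiq '^[[:space:]]*PrivateDevices[[:space:]]*=[[:space:]]*(1|yes|true|on)[[:space:]]*$' "$1"
}

# Example (the path is an assumption, adjust it to where the unit is installed):
#   unit_sets_private_devices /lib/systemd/system/ceph-mon@.service && echo "enabled"
```

For a loaded unit, 'systemctl show -p PrivateDevices <unit>' reports the effective value, including any drop-ins, and is probably the more reliable check.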
It appears that for async+rdma messaging to work, 'PrivateDevices=yes' has to be removed from these systemd units or some other workaround has to be devised. In the meantime, a simple workaround is to add a systemd override for each of the affected services (with 'systemctl edit ceph-[mon|mgr|mds]@.service'):
[Service]
PrivateDevices=false
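The same override can also be written without the interactive editor, which may be convenient when several hosts are affected. The following is a minimal sketch rather than a tested deployment procedure: the function name write_rdma_overrides and its optional directory argument are hypothetical, while <unit>.d/override.conf is the drop-in location that 'systemctl edit' itself uses.

```shell
#!/bin/sh
# Sketch: write a PrivateDevices=false drop-in for each affected template unit.
# write_rdma_overrides is a hypothetical helper; its optional first argument is
# the systemd configuration directory (defaults to /etc/systemd/system).
write_rdma_overrides() {
    unit_dir="${1:-/etc/systemd/system}"
    for daemon in mon mgr mds; do
        dropin_dir="${unit_dir}/ceph-${daemon}@.service.d"
        mkdir -p "${dropin_dir}" || return 1
        printf '[Service]\nPrivateDevices=false\n' > "${dropin_dir}/override.conf" || return 1
    done
}

# Example (as root), followed by a reload so systemd picks up the drop-ins:
#   write_rdma_overrides
#   systemctl daemon-reload
```

The affected daemons still need to be restarted afterwards for the change to take effect.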
Please note that simply stracing the affected binary outside of systemd will not reveal the bug, because without systemd's "protection" the service starts without issue.