Bug #17889
ceph-disk: ceph-disk@.service races with ceph-osd@.service (Closed)
Description
I don't think there is a safeguard against the following scenario:
a) /dev/sda has lvm partitions for / and /var/lib
b) / is mounted on /dev/mapper/rootvg-vol_root
c) /dev/sdb has an OSD partition
d) udev fires an event on the OSD partition and the OSD fails because /var/lib is not mounted yet and /var/lib/ceph/osd is not found
e) /var/lib is mounted on /dev/mapper/rootvg-vol_lib
The OSD is not mounted and since no other udev event will be fired for /dev/sdb it stays down. Running partprobe /dev/sdb manually will bring the OSD up.
Updated by Loïc Dachary over 7 years ago
- Subject changed from udev OSD events may race with lvm at boot time to OSD udev / systemd may race with lvm at boot time
Updated by Loïc Dachary over 7 years ago
Creating a similar environment on a bare metal running CentOS 7.2:
sudo sgdisk --zap-all /dev/sdb
sudo fdisk -l /dev/sdb
sudo pvcreate /dev/sdb
sudo vgcreate all /dev/sdb
echo y | sudo lvcreate --name ceph --size 100G all
sudo mkfs.ext4 /dev/all/ceph
echo /dev/all/ceph /var/lib/ceph ext4 defaults 1 1 | sudo tee -a /etc/fstab
sudo mkdir /var/lib/ceph
sudo mount /var/lib/ceph
sudo yum install -y yum-utils && \
  sudo yum-config-manager --add-repo https://dl.fedoraproject.org/pub/epel/7/x86_64/ && \
  sudo yum install --nogpgcheck -y epel-release && \
  sudo rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 && \
  sudo rm /etc/yum.repos.d/dl.fedoraproject.org*
cat <<EOF | sudo tee -a /etc/yum.repos.d/ceph-deploy.repo
[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-jewel/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
EOF
sudo yum update -y
sudo yum install -y ceph-deploy
sudo ceph-deploy new $(hostname -s)
sudo ceph-deploy install --release=jewel $(hostname -s)
sudo ceph-deploy mon create-initial
sudo ceph-disk zap /dev/sdc
sudo ceph-disk prepare /dev/sdc
Updated by Loïc Dachary over 7 years ago
The /lib/udev/rules.d/95-dm-notify.rules
# These rules are responsible for sending a notification to a process
# waiting for completion of udev rules. The process is identified by
# a cookie value sent within "change" and "remove" events (the cookie
# value is set before by that process for every action requested).
runs after /lib/udev/rules.d/95-ceph-osd.rules. If mounting waits for completion of udev rules, that could explain the race. However, removing /lib/udev/rules.d/95-dm-notify.rules and rebooting works, meaning mounting the file system does not wait on dmsetup udevcomplete.
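The ordering between the two rules files follows from udev's lexical processing order: rules files are applied in sorted filename order, and with an identical "95-" prefix the rest of the name decides. A minimal illustration:

```shell
# udev processes rules files in lexical (sorted) filename order;
# 'c' sorts before 'd', so 95-ceph-osd.rules runs first:
printf '%s\n' 95-dm-notify.rules 95-ceph-osd.rules | sort
# -> 95-ceph-osd.rules
# -> 95-dm-notify.rules
```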
Updated by Loïc Dachary over 7 years ago
- Related to Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs added
Updated by Loïc Dachary over 7 years ago
- Related to deleted (Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs)
Updated by Loïc Dachary over 7 years ago
- Has duplicate Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs added
Updated by Loïc Dachary over 7 years ago
- Has duplicate Bug #15559: osds do not start after boot added
Updated by Loïc Dachary over 7 years ago
Updated /etc/udev/udev.conf with udev_log="debug" and rebooted for the fourth time (the first three times the OSD went back up as expected).
Updated by Loïc Dachary over 7 years ago
- Status changed from New to In Progress
- Assignee set to Loïc Dachary
- Priority changed from Normal to Urgent
Updated by Loïc Dachary over 7 years ago
- Related to Bug #17813: ceph-disk: udev permission race with dm added
Updated by Loïc Dachary over 7 years ago
- Status changed from In Progress to Fix Under Review
- Backport set to jewel
Updated by Kefu Chai over 7 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Loïc Dachary over 7 years ago
- Copied to Backport #18007: jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to In Progress
Updated by Loïc Dachary over 7 years ago
- Status changed from In Progress to Pending Backport
Updated by Loïc Dachary over 7 years ago
It turns out this may not be fixed after all: https://github.com/ceph/ceph/pull/12136#issuecomment-263007276
Maybe we need something like https://www.freedesktop.org/software/systemd/man/systemd.mount.html#fstab ?
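One thing the systemd.mount fstab options could buy us is an explicit ordering on the LVM unit. A sketch only: x-systemd.requires= needs a newer systemd than the 219 shipped in CentOS 7.2, so the entry below is illustrative rather than a drop-in fix for this environment:

```
# /etc/fstab entry forcing /var/lib/ceph to wait for lvm2-monitor.service (illustrative)
/dev/all/ceph  /var/lib/ceph  ext4  defaults,x-systemd.requires=lvm2-monitor.service  0  2
```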
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to In Progress
Updated by Loïc Dachary over 7 years ago
After setting up the environment as described at http://tracker.ceph.com/issues/17889#note-2, the dependencies are automatically what they should be to ensure /var/lib/ceph is mounted before ceph-disk@.service runs. Namely: ceph-disk@dev-sdc1.service is ordered after basic.target, which is after sysinit.target, which is after local-fs.target, which is after var-lib-ceph.mount.
systemctl --no-pager list-dependencies --after ceph-disk@dev-sdc1.service
ceph-disk@dev-sdc1.service
● ├─system-ceph\x2ddisk.slice
● ├─systemd-journald.socket
● └─basic.target
●   ├─rhel-import-state.service
●   ├─systemd-ask-password-plymouth.path
●   ├─paths.target
●   │ ├─brandbot.path
●   │ ├─systemd-ask-password-console.path
●   │ └─systemd-ask-password-wall.path
●   ├─slices.target
●   │ ├─-.slice
●   │ ├─system.slice
●   │ └─user.slice
●   ├─sockets.target
●   │ ├─dbus.socket
●   │ ├─rpcbind.socket
●   │ ├─sshd.socket
●   │ ├─syslog.socket
●   │ ├─systemd-initctl.socket
●   │ ├─systemd-journald.socket
●   │ ├─systemd-shutdownd.socket
●   │ ├─systemd-udevd-control.socket
●   │ └─systemd-udevd-kernel.socket
●   └─sysinit.target
●     ├─auditd.service
●     ├─dev-hugepages.mount
●     ├─dev-mqueue.mount
●     ├─emergency.service
●     ├─kmod-static-nodes.service
●     ├─plymouth-read-write.service
●     ├─proc-sys-fs-binfmt_misc.automount
●     ├─rhel-autorelabel-mark.service
●     ├─rhel-autorelabel.service
●     ├─rhel-loadmodules.service
●     ├─sys-fs-fuse-connections.mount
●     ├─sys-kernel-config.mount
●     ├─sys-kernel-debug.mount
●     ├─systemd-binfmt.service
●     ├─systemd-firstboot.service
●     ├─systemd-hwdb-update.service
●     ├─systemd-journal-catalog-update.service
●     ├─systemd-journald.service
●     ├─systemd-machine-id-commit.service
●     ├─systemd-modules-load.service
●     ├─systemd-random-seed.service
●     ├─systemd-readahead-collect.service
●     ├─systemd-readahead-replay.service
●     ├─systemd-sysctl.service
●     ├─systemd-tmpfiles-setup-dev.service
●     ├─systemd-tmpfiles-setup.service
●     ├─systemd-udev-settle.service
●     ├─systemd-udev-trigger.service
●     ├─systemd-udevd.service
●     ├─systemd-update-done.service
●     ├─systemd-update-utmp.service
●     ├─systemd-vconsole-setup.service
●     ├─cryptsetup.target
●     │ └─dmraid-activation.service
●     ├─emergency.target
●     │ ├─emergency.service
●     │ ├─rhel-import-state.service
●     │ └─rhel-readonly.service
●     ├─local-fs.target
●     │ ├─-.mount
●     │ ├─dm-event.service
●     │ ├─dmraid-activation.service
●     │ ├─lvm2-monitor.service
●     │ ├─rhel-readonly.service
●     │ ├─run-user-1000.mount
●     │ ├─run-user-991.mount
●     │ ├─systemd-fsck-root.service
●     │ ├─systemd-remount-fs.service
●     │ ├─tmp.mount
●     │ ├─var-lib-ceph-osd-ceph\x2d0.mount
●     │ ├─var-lib-ceph.mount
●     │ └─local-fs-pre.target
●     │   ├─systemd-remount-fs.service
●     │   └─systemd-tmpfiles-setup-dev.service
●     └─swap.target
Updated by Loïc Dachary over 7 years ago
It looks like var-lib-ceph.mount does not depend on lvm2-monitor.service, even indirectly, although it should. Not sure if I'm missing something.
systemctl --no-pager list-dependencies --after var-lib-ceph.mount
var-lib-ceph.mount
● ├─-.mount
● ├─dev-all-ceph.device
● ├─system.slice
● ├─systemd-fsck@dev-all-ceph.service
● ├─systemd-journald.socket
● └─local-fs-pre.target
●   ├─systemd-remount-fs.service
●   └─systemd-tmpfiles-setup-dev.service
Updated by Loïc Dachary over 7 years ago
ceph-disk@.service should be ordered Before= ceph-osd@.service, but this relationship does not exist. It usually is not a problem because ceph-osd@.service will retry, which gives ceph-disk@.service time to complete. But ceph-osd@.service may give up entirely if ceph-disk@.service is delayed because it has to wait for LVM to be ready and for the local file system to be mounted. That would explain why this problem is associated with LVM.
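For reference, if the per-instance mapping were known, the ordering could be expressed with a drop-in such as the one below. The path and instance names are hypothetical, and the sdb1 <-> 32 mapping is precisely what systemd cannot derive on its own:

```ini
# /etc/systemd/system/ceph-osd@32.service.d/order.conf (hypothetical drop-in)
[Unit]
# only start osd.32 once the matching ceph-disk activation has finished
After=ceph-disk@dev-sdb1.service
Wants=ceph-disk@dev-sdb1.service
```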
Updated by Loïc Dachary over 7 years ago
The logs of osd.32 show 3 failed attempts to start at 13:18:15
2016-10-28 13:18:15.063934 7f2e590f1800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:18:15.063966 7f2e590f1800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 4076
2016-10-28 13:18:15.064168 7f2e590f1800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-32: (2) No such file or directory
2016-10-28 13:18:15.454721 7f3ee7b5a800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:18:15.454777 7f3ee7b5a800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 4491
2016-10-28 13:18:15.454972 7f3ee7b5a800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-32: (2) No such file or directory
2016-10-28 13:18:15.788618 7f9b327e5800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:18:15.788656 7f9b327e5800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 4915
2016-10-28 13:18:15.788810 7f9b327e5800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-32: (2) No such file or directory
which are explained by the fact that the activation that mounts /var/lib/ceph/osd/ceph-lc-32 only happens about 14 minutes later, at 13:32:37
Oct 28 13:32:37 lcpcephosd1n5.cmmint.net sh[7439]: activate: OSD uuid is c3aa9e0f-4fcc-488e-8b5c-1b717984a6b5
Oct 28 13:32:37 lcpcephosd1n5.cmmint.net sh[7439]: activate: OSD id is 32
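The gap between the last failed start and the activation can be computed from the two timestamps (GNU date assumed):

```shell
start=$(date -u -d '2016-10-28 13:18:15' +%s)
activate=$(date -u -d '2016-10-28 13:32:37' +%s)
echo "$(( activate - start ))s"   # -> 862s, roughly 14 minutes
```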
It only tries three times because /etc/systemd/system/ceph-osd.target.wants/ceph-osd@32.service has
StartLimitBurst=3
but it would not be very different if that were unset, because the default is 5 and would also be reached quickly.
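For context, a sketch of the relevant rate-limit settings in the unit file. Only StartLimitBurst=3 is quoted from the unit above; the other values are illustrative, and the 10s window is the systemd default start-limit interval:

```ini
[Service]
Restart=on-failure
# give up after 3 failed starts within the start-limit interval (default 10s);
# the RestartSec value below is illustrative
StartLimitBurst=3
RestartSec=100ms
```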
Updated by Loïc Dachary over 7 years ago
- Status changed from In Progress to Fix Under Review
Working on a fix at https://github.com/ceph/ceph/pull/12241, assuming ceph-disk@.service must be Before= ceph-osd@.service. Two open questions:
- how to express that ceph-disk@dev-sdb1.service is Before= ceph-osd@32.service?
- with such a dependency in place, it would block a sysadmin trying to systemctl start ceph-osd@32.service after manually creating the OSD
Updated by Loïc Dachary over 7 years ago
The problem is exactly the opposite of what I thought: since ceph-disk@.service already starts the matching ceph-osd@.service, ceph-osd@.service must not be enabled at all, to prevent the race. The dependency cannot be expressed in systemd, but that does not matter because it is already implemented by ceph.
https://github.com/ceph/ceph/pull/12241/commits/589d89715cb9dc605a5a94517965c7f035ad95d8
However, not enabling ceph-osd@.service introduces a backward-incompatible change: systemctl start/stop ceph will no longer start/stop ceph-osd@.service. When enabling ceph-osd@3.service, the
[Install]
WantedBy=ceph-osd.target
section creates a static dependency for systemctl start/stop ceph to use. This is not used at boot time but may be used by sysadmins or scripts.
Updated by Loïc Dachary over 7 years ago
Instead of removing the enable/disable from ceph-disk, it is enough to enable the unit with systemctl enable --runtime so that it does not start at boot time: the enablement symlink then lands under /run/systemd/system, a tmpfs that is cleared at each boot.
Updated https://github.com/ceph/ceph/pull/12241/files accordingly
Updated by Loïc Dachary over 7 years ago
For the record, the pull requests that need backporting are:
Updated by Loïc Dachary over 7 years ago
- Subject changed from OSD udev / systemd may race with lvm at boot time to build/ops: ceph-disk@.service races with ceph-osd@.service
Updated by Loïc Dachary over 7 years ago
- Subject changed from build/ops: ceph-disk@.service races with ceph-osd@.service to ceph-disk: ceph-disk@.service races with ceph-osd@.service
Updated by Loïc Dachary over 7 years ago
- Status changed from Fix Under Review to Pending Backport
Wait two weeks before merging the backport.
Updated by Wido den Hollander over 7 years ago
I see that commit b3887379d6dde3b5a44f2e84cf917f4f0a0cb120 changed the systemd service file for ceph-osd with improved values for StartLimitBurst and RestartSec.
Can we expect a backport to Jewel for these changes?
Updated by Vikhyat Umrao over 7 years ago
Wido den Hollander wrote:
I see that commit b3887379d6dde3b5a44f2e84cf917f4f0a0cb120 changed the systemd service file for ceph-osd with improved values for StartLimitBurst and RestartSec.
Can we expect a backport to Jewel for these changes?
Yes, backport is in progress:
Backport tracker: http://tracker.ceph.com/issues/18007
Backport PR: https://github.com/ceph/ceph/pull/12147
Updated by Nathan Cutler over 7 years ago
- Status changed from Pending Backport to Resolved