
Bug #17889

ceph-disk: ceph-disk@.service races with ceph-osd@.service

Added by Loic Dachary 7 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
11/14/2016
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

I don't think there is a safeguard against the following scenario:

a) /dev/sda has lvm partitions for / and /var/lib
b) / is mounted on /dev/mapper/rootvg-vol_root
c) /dev/sdb has an OSD partition
d) udev fires an event on the OSD partition and the OSD fails because /var/lib is not mounted yet and /var/lib/ceph/osd is not found
e) /var/lib is mounted on /dev/mapper/rootvg-vol_lib

The OSD is not mounted, and since no other udev event will be fired for /dev/sdb, it stays down. Running partprobe /dev/sdb manually brings the OSD up.
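A sketch of the manual recovery described above (device name taken from the scenario; the privileged commands are shown commented out, since they rewrite kernel partition state):

```shell
# Re-reading the partition table fires fresh udev "add" events for every
# partition on the disk, which re-runs the ceph activation rules now that
# /var/lib is finally mounted.
DEV=/dev/sdb
msg="re-probing $DEV"
echo "$msg"
# sudo partprobe "$DEV"
# Alternative that replays the udev events without touching the disk:
# sudo udevadm trigger --action=add --sysname-match='sdb*'
```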


Related issues

Related to Ceph - Bug #17813: ceph-disk: udev permission race with dm Resolved 11/07/2016
Duplicated by Ceph - Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs Duplicate 08/19/2016
Duplicated by Ceph - Bug #15559: osds do not start after boot Duplicate 04/21/2016
Copied to Ceph - Backport #18007: jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service Resolved

History

#1 Updated by Loic Dachary 7 months ago

  • Subject changed from udev OSD events may race with lvm at boot time to OSD udev / systemd may race with lvm at boot time

#2 Updated by Loic Dachary 7 months ago

Creating a similar environment on a bare metal machine running CentOS 7.2:

sudo sgdisk --zap-all /dev/sdb
sudo fdisk -l /dev/sdb
sudo pvcreate /dev/sdb
sudo vgcreate all /dev/sdb
echo y | sudo lvcreate --name ceph --size 100G all 
sudo mkfs.ext4 /dev/all/ceph
echo /dev/all/ceph /var/lib/ceph ext4 defaults 1 1 | sudo tee -a /etc/fstab
sudo mkdir /var/lib/ceph
sudo mount /var/lib/ceph

sudo yum install -y yum-utils && sudo yum-config-manager --add-repo https://dl.fedoraproject.org/pub/epel/7/x86_64/ && sudo yum install --nogpgcheck -y epel-release && sudo rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 && sudo rm /etc/yum.repos.d/dl.fedoraproject.org*
cat <<EOF | sudo tee -a /etc/yum.repos.d/ceph-deploy.repo 
[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-jewel/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
EOF
sudo yum update -y
sudo yum install -y ceph-deploy
sudo ceph-deploy new $(hostname -s)
sudo ceph-deploy install --release=jewel $(hostname -s)
sudo ceph-deploy mon create-initial
sudo ceph-disk zap /dev/sdc
sudo ceph-disk prepare /dev/sdc

#3 Updated by Loic Dachary 7 months ago

The /lib/udev/rules.d/95-dm-notify.rules

# These rules are responsible for sending a notification to a process
# waiting for completion of udev rules. The process is identified by
# a cookie value sent within "change" and "remove" events (the cookie
# value is set before by that process for every action requested).

runs after /lib/udev/rules.d/95-ceph-osd.rules. If mounting waits for completion of udev rules, that could explain the race. However, removing /lib/udev/rules.d/95-dm-notify.rules and rebooting works, meaning mounting the file system does not wait on dmsetup udevcomplete.
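udev processes rules files in lexical filename order, which is why 95-ceph-osd.rules runs before 95-dm-notify.rules even though both share the 95- prefix; the ordering is just a string sort:

```shell
# Within the same "95-" prefix, "ceph..." sorts before "dm...", so the
# ceph rules file is applied first.
order="$(printf '%s\n' 95-dm-notify.rules 95-ceph-osd.rules | sort | head -n 1)"
echo "$order"
```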

#4 Updated by Loic Dachary 7 months ago

  • Related to Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs added

#5 Updated by Loic Dachary 7 months ago

  • Related to deleted (Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs)

#6 Updated by Loic Dachary 7 months ago

  • Duplicated by Bug #17077: Fail to automount osd after reboot when the /var Partition is ext4 but success automount when /var Partition is xfs added

#7 Updated by Loic Dachary 7 months ago

  • Duplicated by Bug #15559: osds do not start after boot added

#8 Updated by Loic Dachary 7 months ago

Updated /etc/udev/udev.conf with udev_log="debug" and rebooted for the fourth time (the first three times the OSD went back up as expected).

#9 Updated by Loic Dachary 7 months ago

  • Status changed from New to In Progress
  • Assignee set to Loic Dachary
  • Priority changed from Normal to Urgent

#10 Updated by Loic Dachary 7 months ago

  • Related to Bug #17813: ceph-disk: udev permission race with dm added

#11 Updated by Loic Dachary 7 months ago

  • Status changed from In Progress to Need Review
  • Backport set to jewel

#12 Updated by Kefu Chai 7 months ago

  • Status changed from Need Review to Pending Backport

#13 Updated by Loic Dachary 7 months ago

  • Copied to Backport #18007: jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service added

#14 Updated by Loic Dachary 7 months ago

  • Status changed from Pending Backport to In Progress

#15 Updated by Loic Dachary 7 months ago

  • Status changed from In Progress to Pending Backport

#17 Updated by Loic Dachary 7 months ago

  • Status changed from Pending Backport to In Progress

#18 Updated by Loic Dachary 7 months ago

After setting up the environment as described at http://tracker.ceph.com/issues/17889#note-2, the dependencies are automatically what they should be to ensure /var/lib/ceph is mounted before ceph-disk@.service runs. Namely: ceph-disk@.service is ordered after basic.target, which is after sysinit.target, which is after local-fs.target, which is after var-lib-ceph.mount.

systemctl --no-pager list-dependencies --after ceph-disk@dev-sdc1.service
ceph-disk@dev-sdc1.service
● ├─system-ceph\x2ddisk.slice
● ├─systemd-journald.socket
● └─basic.target
●   ├─rhel-import-state.service
●   ├─systemd-ask-password-plymouth.path
●   ├─paths.target
●   │ ├─brandbot.path
●   │ ├─systemd-ask-password-console.path
●   │ └─systemd-ask-password-wall.path
●   ├─slices.target
●   │ ├─-.slice
●   │ ├─system.slice
●   │ └─user.slice
●   ├─sockets.target
●   │ ├─dbus.socket
●   │ ├─rpcbind.socket
●   │ ├─sshd.socket
●   │ ├─syslog.socket
●   │ ├─systemd-initctl.socket
●   │ ├─systemd-journald.socket
●   │ ├─systemd-shutdownd.socket
●   │ ├─systemd-udevd-control.socket
●   │ └─systemd-udevd-kernel.socket
●   └─sysinit.target
●     ├─auditd.service
●     ├─dev-hugepages.mount
●     ├─dev-mqueue.mount
●     ├─emergency.service
●     ├─kmod-static-nodes.service
●     ├─plymouth-read-write.service
●     ├─proc-sys-fs-binfmt_misc.automount
●     ├─rhel-autorelabel-mark.service
●     ├─rhel-autorelabel.service
●     ├─rhel-loadmodules.service
●     ├─sys-fs-fuse-connections.mount
●     ├─sys-kernel-config.mount
●     ├─sys-kernel-debug.mount
●     ├─systemd-binfmt.service
●     ├─systemd-firstboot.service
●     ├─systemd-hwdb-update.service
●     ├─systemd-journal-catalog-update.service
●     ├─systemd-journald.service
●     ├─systemd-machine-id-commit.service
●     ├─systemd-modules-load.service
●     ├─systemd-random-seed.service
●     ├─systemd-readahead-collect.service
●     ├─systemd-readahead-replay.service
●     ├─systemd-sysctl.service
●     ├─systemd-tmpfiles-setup-dev.service
●     ├─systemd-tmpfiles-setup.service
●     ├─systemd-udev-settle.service
●     ├─systemd-udev-trigger.service
●     ├─systemd-udevd.service
●     ├─systemd-update-done.service
●     ├─systemd-update-utmp.service
●     ├─systemd-vconsole-setup.service
●     ├─cryptsetup.target
●     │ └─dmraid-activation.service
●     ├─emergency.target
●     │ ├─emergency.service
●     │ ├─rhel-import-state.service
●     │ └─rhel-readonly.service
●     ├─local-fs.target
●     │ ├─-.mount
●     │ ├─dm-event.service
●     │ ├─dmraid-activation.service
●     │ ├─lvm2-monitor.service
●     │ ├─rhel-readonly.service
●     │ ├─run-user-1000.mount
●     │ ├─run-user-991.mount
●     │ ├─systemd-fsck-root.service
●     │ ├─systemd-remount-fs.service
●     │ ├─tmp.mount
●     │ ├─var-lib-ceph-osd-ceph\x2d0.mount
●     │ ├─var-lib-ceph.mount
●     │ └─local-fs-pre.target
●     │   ├─systemd-remount-fs.service
●     │   └─systemd-tmpfiles-setup-dev.service
●     └─swap.target

#19 Updated by Loic Dachary 7 months ago

It looks like var-lib-ceph.mount does not depend on lvm2-monitor.service, even indirectly, although it should. Not sure if I'm missing something.

systemctl --no-pager list-dependencies --after var-lib-ceph.mount
var-lib-ceph.mount
● ├─-.mount
● ├─dev-all-ceph.device
● ├─system.slice
● ├─systemd-fsck@dev-all-ceph.service
● ├─systemd-journald.socket
● └─local-fs-pre.target
●   ├─systemd-remount-fs.service
●   └─systemd-tmpfiles-setup-dev.service

#20 Updated by Loic Dachary 7 months ago

ceph-disk@.service should be ordered Before= ceph-osd@.service, but this relationship does not exist. It usually is not a problem because ceph-osd@.service will retry, and that gives ceph-disk@.service time to complete. But ceph-osd@.service may give up if ceph-disk@.service is delayed because it has to wait for lvm to be ready and for the local file system to be mounted. That would explain why this problem is associated with lvm.
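As a sketch of what expressing that ordering would look like, a hypothetical per-instance drop-in (path and instance names assumed):

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/ceph-osd@0.service.d/order.conf
# (path and names assumed). It orders this one OSD after the matching disk
# activation. Note the instance names differ (osd id "0" vs device "dev-sdc1"),
# which is why the relationship cannot be expressed generically in the template.
[Unit]
After=ceph-disk@dev-sdc1.service
```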

#21 Updated by Loic Dachary 7 months ago

The logs of osd.32 show 3 failed attempts to start at 13:18:15

2016-10-28 13:18:15.063934 7f2e590f1800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:18:15.063966 7f2e590f1800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 4076
2016-10-28 13:18:15.064168 7f2e590f1800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-32: (2) No such file or directory
2016-10-28 13:18:15.454721 7f3ee7b5a800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:18:15.454777 7f3ee7b5a800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 4491
2016-10-28 13:18:15.454972 7f3ee7b5a800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-32: (2) No such file or directory
2016-10-28 13:18:15.788618 7f9b327e5800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:18:15.788656 7f9b327e5800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 4915
2016-10-28 13:18:15.788810 7f9b327e5800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-32: (2) No such file or directory

These failures are explained by the fact that the activation that mounts /var/lib/ceph/osd/ceph-lc-32 only happens about fourteen minutes later, at 13:32:37:
Oct 28 13:32:37 lcpcephosd1n5.cmmint.net sh[7439]: activate: OSD uuid is c3aa9e0f-4fcc-488e-8b5c-1b717984a6b5
Oct 28 13:32:37 lcpcephosd1n5.cmmint.net sh[7439]: activate: OSD id is 32

It only tries three times because ceph-osd@.service has

StartLimitBurst=3

but the behavior would not be very different if it were unset, because the default is 5 and would be reached just as quickly.
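For illustration, the relevant knobs in ceph-osd@.service look roughly like this (values here are illustrative, not the shipped ones):

```ini
# Illustrative [Service] fragment (values assumed, not the shipped defaults).
# With Restart=on-failure, systemd gives up once StartLimitBurst failures
# occur within StartLimitInterval, which is what happened above: three
# attempts within one second, long before /var/lib/ceph/osd/... was mounted.
[Service]
Restart=on-failure
RestartSec=20s
StartLimitBurst=3
StartLimitInterval=30min
```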

#22 Updated by Loic Dachary 7 months ago

  • Status changed from In Progress to Need Review

Working on a fix assuming ceph-disk@.service must be Before= ceph-osd@.service at https://github.com/ceph/ceph/pull/12241

#23 Updated by Loic Dachary 7 months ago

The problem is exactly the opposite of what I thought: since ceph-disk@.service already starts the matching ceph-osd@.service, ceph-osd@.service must not be enabled at all, to prevent the race. The ordering dependency cannot be expressed in systemd, but that does not matter because it is already implemented by ceph.

https://github.com/ceph/ceph/pull/12241/commits/589d89715cb9dc605a5a94517965c7f035ad95d8
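For context, the jewel-era ceph-disk@.service looks roughly like this (a sketch; the exact ExecStart line may differ from what ships):

```ini
# Sketch of the jewel-era ceph-disk@.service (exact command line assumed).
# "ceph-disk trigger" activates the partition and starts the matching
# ceph-osd@<id>.service itself, which is why also enabling ceph-osd@
# creates the race described above.
[Unit]
Description=Ceph disk activation: %f

[Service]
Type=oneshot
ExecStart=/usr/sbin/ceph-disk --log-stdout trigger --sync %f
```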

However, not enabling ceph-osd@.service introduces a backward incompatible change: systemctl start/stop ceph will no longer start/stop the ceph-osd@.service. When the unit is enabled, its

[Install]
WantedBy=ceph-osd.target

section creates a static dependency for systemctl start/stop ceph to use. This dependency is not used at boot time but may be used by sysadmins or scripts.

#24 Updated by Loic Dachary 7 months ago

Instead of removing the enable/disable from ceph-disk, it is enough to enable with --runtime so that the unit does not start at boot time.

Updated https://github.com/ceph/ceph/pull/12241/files accordingly
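A sketch of what --runtime does (unit name hypothetical; the privileged command is shown commented out):

```shell
# "systemctl enable --runtime" creates the wants symlink under
# /run/systemd/system rather than /etc/systemd/system; /run is a tmpfs,
# so the link, and the boot-time start it would trigger, does not survive
# a reboot.
unit="ceph-osd@0.service"
runtime_link="/run/systemd/system/ceph-osd.target.wants/$unit"
echo "$runtime_link"
# sudo systemctl enable --runtime "$unit"   # would create the link above
```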

#25 Updated by Loic Dachary 7 months ago

For the record, the pull requests that need backporting are:

#26 Updated by Loic Dachary 7 months ago

  • Subject changed from OSD udev / systemd may race with lvm at boot time to build/ops: ceph-disk@.service races with ceph-osd@.service

#27 Updated by Loic Dachary 7 months ago

  • Subject changed from build/ops: ceph-disk@.service races with ceph-osd@.service to ceph-disk: ceph-disk@.service races with ceph-osd@.service

#28 Updated by Loic Dachary 7 months ago

  • Status changed from Need Review to Pending Backport

wait two weeks before merging the backport

#29 Updated by Wido den Hollander 5 months ago

I see that commit b3887379d6dde3b5a44f2e84cf917f4f0a0cb120 changed the systemd service file for ceph-osd with improved values for StartLimitBurst and RestartSec.

Can we expect a backport to Jewel for these changes?

#30 Updated by Vikhyat Umrao 5 months ago

Wido den Hollander wrote:

I see that commit b3887379d6dde3b5a44f2e84cf917f4f0a0cb120 changed the systemd service file for ceph-osd with improved values for StartLimitBurst and RestartSec.

Can we expect a backport to Jewel for these changes?

Yes, backport is in progress:

Backport tracker: http://tracker.ceph.com/issues/18007
Backport PR: https://github.com/ceph/ceph/pull/12147

#31 Updated by Nathan Cutler 5 months ago

  • Status changed from Pending Backport to Resolved
