Bug #10146
ceph-disk: sometimes the journal symlink is not created
Description
Hi,
We observed in practice that sometimes the journal symlink is not created during a ceph-disk prepare run.
- Scientific Linux 6.6
- ceph-disk from master branch
- /dev/sdo is a new empty spinning disk (for the OSD)
- /dev/sdc is an SSD with 5 journal partitions
- /dev/sdc1 is not currently used by any OSD
- ceph-disk --verbose prepare /dev/sdo /dev/sdc1
- sdo becomes an OSD with sdc1 as the journal. The /var/lib/ceph/osd/ceph-X/journal should be soft-linked to /dev/disk/by-partuuid/<uuid of sdc1>, which is a softlink to /dev/sdc1
- /var/lib/ceph/osd/ceph-X/journal is softlinked to /dev/disk/by-partuuid/<uuid of sdc1>, but /dev/disk/by-partuuid/<uuid of sdc1> is a plain empty file, not a softlink to /dev/sdc1
- In the function prepare_journal_dev, sgdisk is called to change the partition guid, then partx -a is called to reload the partition table, then udevadm settle is called to let udev finish handling the new ptable. It is expected that either sgdisk or partx triggers udev to add the new /dev/disk/by-partuuid/ symlink to /dev/sdc1, but in practice (on a busy server) the new symlink is not created. By "busy", we mean that /dev/sdc is seeing around 100 writes per second.
- Since the by-partuuid symlink doesn't exist, when ceph-disk later creates the symlink from /var/lib/ceph/osd/ceph-X/journal to /dev/disk/by-partuuid/<journal_uuid>, an empty file ends up being created at the link target, and afterwards the OSD cannot start.
- We have found that by retriggering the udev block subsystem the symlink is always created. See the patch here: https://github.com/ceph/ceph/pull/2955
- Another possible solution would be to not change the partition guid when re-using a journal partition. The previous /dev/disk/by-partuuid/ link would already exist and could be used by the new OSD.
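The empty file at the link target can be reproduced outside of ceph-disk. The following is a minimal sketch of one plausible mechanism, which the report itself does not confirm: if the journal symlink points at a path that udev never created, anything that opens the journal path for writing follows the dangling link and creates a regular empty file at the target. The paths and the function name demo_dangling_symlink are hypothetical stand-ins, not ceph-disk code.

```python
import os
import tempfile


def demo_dangling_symlink(workdir):
    """Create a dangling symlink, then open it for writing.

    `target` stands in for /dev/disk/by-partuuid/<uuid> and `link` for
    /var/lib/ceph/osd/ceph-X/journal. Both names are illustrative only.
    """
    target = os.path.join(workdir, "by-partuuid-entry")
    link = os.path.join(workdir, "journal")

    # The target does not exist yet, so this link is dangling.
    os.symlink(target, link)

    # Opening the link for writing follows it and creates the target
    # as a regular, empty file.
    with open(link, "w"):
        pass

    return target, link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target, link = demo_dangling_symlink(workdir)
        print("journal is a symlink:", os.path.islink(link))
        print("target is a symlink:", os.path.islink(target))
        print("target is a regular file:", os.path.isfile(target))
        print("target size:", os.path.getsize(target))
```

On a real system the target would then be a plain file where a symlink to the journal block device was expected, which matches the symptom described above.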
Associated revisions
ceph-disk: don't change the journal partition uuid
We observe that the new /dev/disk/by-partuuid/<journal_uuid>
symlink is not always created by udev when reusing a journal
partition. Fix by not changing the uuid of a journal partition
in this case -- instead we can reuse the existing uuid (and
journal_symlink) instead. We also now assert that the symlink
exists before further preparing the OSD.
Fixes: #10146
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
Tested-by: Dan van der Ster <daniel.vanderster@cern.ch>
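The check that the symlink exists before continuing could look roughly like the sketch below. This is not the code from the commit: check_journal_symlink is a hypothetical helper, and the assumption that the link must resolve to a block device comes from the bug description rather than from the patch.

```python
import os
import stat


def check_journal_symlink(path):
    """Return None if `path` is a symlink that resolves to a block device.

    Otherwise return a short string describing what is wrong.
    """
    if not os.path.islink(path):
        if os.path.exists(path):
            # This is the failure mode from the report: a plain file
            # where a symlink was expected.
            return "exists but is not a symlink"
        return "does not exist"

    try:
        mode = os.stat(path).st_mode  # os.stat follows the symlink
    except OSError:
        return "dangling symlink"

    if not stat.S_ISBLK(mode):
        return "symlink does not resolve to a block device"
    return None


if __name__ == "__main__":
    # Illustrative path only; substitute a real by-partuuid entry.
    problem = check_journal_symlink("/dev/disk/by-partuuid/example-uuid")
    print(problem or "journal symlink looks fine")
```

A caller would abort the prepare step whenever the function returns a non-None value, rather than going on to link the OSD journal to a path that is not a usable device.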
ceph-disk: test re-using an existing journal partition
Add a ceph-disk test to first setup an OSD with a separate journal
block device, then tear down the OSD (simulating a failure) and create
a new OSD which re-uses the same journal device.
Add create_dev / destroy_dev helpers that encapsulate the operations
that ensure the partition table is up to date in the kernel and the
symlinks are created as expected. In particular it makes sure the kernel
is aware that the partition table of a newly created device is
empty. If the device previously existed and the kernel was not informed
of the latest partition table updates via partprobe / partx, it may
have cached an old partition table which can create all sorts of
unexpected behaviors such as a failure to create the by-partuuid
symbolic links as described in http://tracker.ceph.com/issues/10146
Refs: #10146
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
Signed-off-by: Loic Dachary <ldachary@redhat.com>
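As a rough illustration of the "tell the kernel, then wait for udev" sequence that these helpers encapsulate, here is a sketch in Python. It is not the create_dev / destroy_dev code from the test script. The function names, the default timeout, and the optional udevadm trigger step (modelled on the "retrigger the udev block subsystem" workaround from the description) are assumptions; whether the referenced pull request uses this exact invocation is not stated in this report.

```python
import subprocess


def refresh_commands(device, retrigger=False, settle_timeout=30):
    """Build the commands that refresh the kernel's view of `device`.

    Returns a list of argv lists; nothing is executed here.
    """
    cmds = [["partprobe", device]]  # ask the kernel to re-read the table
    if retrigger:
        # Assumed form of the workaround: replay "add" events for all
        # block devices so udev recreates its /dev/disk/by-* symlinks.
        cmds.append(
            ["udevadm", "trigger", "--subsystem-match=block", "--action=add"]
        )
    # Wait until udev has processed the queued events.
    cmds.append(["udevadm", "settle", "--timeout=%d" % settle_timeout])
    return cmds


def refresh_partition_table(device, retrigger=False, dry_run=True):
    """Run the refresh commands, or only return them when dry_run is True."""
    cmds = refresh_commands(device, retrigger=retrigger)
    if not dry_run:
        for cmd in cmds:
            subprocess.check_call(cmd)  # raises if a command fails
    return cmds


if __name__ == "__main__":
    # Dry run: print what would be executed for a placeholder device.
    for cmd in refresh_partition_table("/dev/sdX", retrigger=True):
        print(" ".join(cmd))
```

The dry-run default is deliberate: these commands need root and touch real devices, so the sketch only prints the plan unless dry_run=False is passed.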
ceph-disk: don't change the journal partition uuid
We observe that the new /dev/disk/by-partuuid/<journal_uuid>
symlink is not always created by udev when reusing a journal
partition. Fix by not changing the uuid of a journal partition
in this case -- instead we can reuse the existing uuid (and
journal_symlink) instead. We also now assert that the symlink
exists before further preparing the OSD.
Fixes: #10146
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
Tested-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 29eb1350b4acaeabfe1d2b19efedbce22641d8cc)
History
#1 Updated by Loïc Dachary over 8 years ago
- Status changed from New to In Progress
- Assignee set to Dan van der Ster
I like the idea of not changing the uuid
#2 Updated by Dan van der Ster over 8 years ago
I've pushed the alternative fix in the same pull req.
#3 Updated by Sage Weil over 8 years ago
- Status changed from In Progress to Resolved
#4 Updated by Loïc Dachary over 8 years ago
- Status changed from Resolved to In Progress
Still open, needs tests.
#5 Updated by Loïc Dachary over 8 years ago
I'm able to reproduce that frequently by running
sudo test/ceph-disk.sh test_activate_journal_dev
on my laptop at the moment. I'm taking that opportunity to find the cause.
#6 Updated by Loïc Dachary over 8 years ago
I think what happens is the following sequence
- partition 1 is created
- partprobe called so the kernel notices (the symlink shows)
- partition table is zapped and symlink removed
- partprobe is not called
- partition 1 is created
- partprobe is called, but the kernel thinks the partition already exists and does not trigger a udev event, so the symlink is not created
The only way to reset the kernel's idea of a given device is to zap the partition table and then run partprobe. I think symlinks are created reliably and that this does not depend on the machine load.
#7 Updated by Dan van der Ster over 8 years ago
Hi Loic,
In this case, ceph-disk zap doesn't apply. The use case is that you have, say, 4-5 partitions on a shared journal SSD, and you want to re-use only one of those partitions as the journal for a new OSD. So we don't (and mustn't!) call ceph-disk zap on the SSD device in this case.
Instead we tried changing the guid of the journal partition, but that doesn't trigger udev reliably. So the only reliable method is not to change the guid, as in the current pull req.
Cheers, Dan
#8 Updated by Loïc Dachary over 8 years ago
Dan van der Ster wrote:
Instead we tried changing the guid of the journal partition, but that doesn't trigger udev reliably. So the only reliable method is not to change the guid, as in the current pull req.
I'm under the impression (although I've never actually tried to prove it) that the udev event will never be called if the guid is modified, even though it makes the by-partuuid symlink obsolete. Not changing the guid (which is what you implemented at https://github.com/ceph/ceph/commit/29eb1350b4acaeabfe1d2b19efedbce22641d8cc ) works around the problem.
This should probably be a bug report against udev ?
#9 Updated by Dan van der Ster over 8 years ago
Loic Dachary wrote:
I'm under the impression (although I've never actually tried to prove it) that the udev event will never be called if the guid is modified, even though it makes the by-partuuid symlink obsolete.
On our test cluster changing the guid did trigger udev and make the correct by-partuuid link (and also left behind the old link -- no big deal). On our prod cluster changing the guid did not trigger udev. Both are CentOS 6.6. The only difference is the activity on the journal devs.
#10 Updated by Loïc Dachary over 8 years ago
Ok. Have you ever seen a problem under load where udev would fail to notice the creation / removal of a partition although partprobe / partx is called consistently (i.e. after each creation / removal ) ?
#11 Updated by Dan van der Ster over 8 years ago
For OSD devices (one dev -- one OSD), I haven't observed any problems, regardless of load. (In this case, the OSD process is not running, so no processes have the device open when the partition is removed or created).
For journal devices, it really depends. Removing 1 out of 5 journal partitions (when the other 4 are still used by active OSDs) is not really doable, in my experience with CentOS 6. I never found the combination of partprobe / partx that made the OSD realize the new ptable. The only reliable method to remove/recreate partitions on a shared journal dev was to stop all OSDs using that dev, then adjust the ptable, then restart the OSDs.
#12 Updated by Loïc Dachary over 8 years ago
Thanks for explaining, that makes me slightly less worried about the reliability of the partition/udev notification couple in the most common case :-)
#13 Updated by Loïc Dachary about 8 years ago
- Status changed from In Progress to Resolved
#14 Updated by Loïc Dachary over 7 years ago
- Status changed from Resolved to Pending Backport
- Backport set to firefly
- Regression set to No
#15 Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved