Bug #10146

closed

ceph-disk: sometimes the journal symlink is not created

Added by Dan van der Ster over 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
We observed in practice that the journal symlink is sometimes not created during a ceph-disk prepare run.

Environment:
  • Scientific Linux 6.6
  • ceph-disk from master branch
  • /dev/sdo is a new empty spinning disk (for the OSD)
  • /dev/sdc is an SSD with 5 journal partitions
  • /dev/sdc1 is not currently used by any OSD
To reproduce:
  • ceph-disk --verbose prepare /dev/sdo /dev/sdc1
Expected result:
  • sdo becomes an OSD with sdc1 as the journal. /var/lib/ceph/osd/ceph-X/journal should be soft-linked to /dev/disk/by-partuuid/<uuid of sdc1>, which is itself a softlink to /dev/sdc1
Actual result:
  • /var/lib/ceph/osd/ceph-X/journal is softlinked to /dev/disk/by-partuuid/<uuid of sdc1>, but /dev/disk/by-partuuid/<uuid of sdc1> is a plain empty file, not a softlink to /dev/sdc1
Explanation:
  • In the function prepare_journal_dev, sgdisk is called to change the partition guid, then partx -a is called to reload the partition table, then udevadm settle is called to let udev finish handling the new ptable. It is expected that either sgdisk or partx triggers udev to add the new /dev/disk/by-partuuid/ symlink to /dev/sdc1, but in practice (with a busy server) the new symlink is not created. By "busy", we mean that /dev/sdc is seeing around 100 writes / second.
  • Since the by-partuuid symlink doesn't exist, when ceph-disk later makes the symlink from /var/lib/ceph/osd/ceph-X/journal to /dev/disk/by-partuuid/<journal_uuid>, an empty file is created at the link target, and afterwards the OSD cannot start.
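
The failure state described above can be checked for directly. A minimal sketch, assuming Python; the helper name classify_link is hypothetical and not part of ceph-disk. It tells a real by-partuuid symlink apart from the plain file that gets left behind:

```python
import os


def classify_link(path):
    """Report what sits at `path` without following symlinks.

    Returns 'symlink', 'plain-file', 'missing', or 'other'.
    (Hypothetical helper for illustration, not ceph-disk code.)
    """
    if os.path.islink(path):
        return 'symlink'
    if not os.path.lexists(path):
        return 'missing'
    if os.path.isfile(path):
        return 'plain-file'
    return 'other'
```

On an affected host, calling this on /dev/disk/by-partuuid/<uuid of sdc1> would be expected to report 'plain-file' rather than 'symlink'.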
Solutions:
  • We have found that by retriggering the udev block subsystem the symlink is always created. See the patch here: https://github.com/ceph/ceph/pull/2955
  • Another possible solution would be to not change the partition guid when re-using a journal partition. The previous /dev/disk/by-partuuid/ link would already exist and could be used by the new OSD.
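
For illustration, the first solution above could look roughly like the sketch below. This is not the code from the pull request: the function name ensure_partuuid_link, its parameters, and the exact udevadm arguments are assumptions. The run parameter is only there so the commands can be replaced by a stub when testing without root.

```python
import os
import subprocess


def ensure_partuuid_link(part_uuid, by_partuuid_dir='/dev/disk/by-partuuid',
                         run=subprocess.check_call):
    """Return True once by-partuuid/<part_uuid> is a symlink.

    If the link is missing, ask udev to replay 'add' events for block
    devices, wait for the event queue to drain, then check again.
    (Hypothetical helper; the pull request may do this differently.)
    """
    link = os.path.join(by_partuuid_dir, part_uuid)
    if os.path.islink(link):
        return True
    # Assumed invocation for "retriggering the udev block subsystem".
    run(['udevadm', 'trigger', '--subsystem-match=block', '--action=add'])
    run(['udevadm', 'settle'])
    return os.path.islink(link)
```

A caller would refuse to create the journal symlink when this returns False, rather than silently creating an empty file at the link target.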

Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #12418: ceph-disk: sometimes the journal symlink is not created (Resolved, Loïc Dachary, 11/20/2014)
Actions #1

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to In Progress
  • Assignee set to Dan van der Ster

I like the idea of not changing the uuid

Actions #2

Updated by Dan van der Ster over 9 years ago

I've pushed the alternative fix in the same pull req.

Actions #3

Updated by Sage Weil over 9 years ago

  • Status changed from In Progress to Resolved
Actions #4

Updated by Loïc Dachary over 9 years ago

  • Status changed from Resolved to In Progress

Still open, needs tests.

Actions #5

Updated by Loïc Dachary over 9 years ago

I'm able to reproduce that frequently by running

sudo test/ceph-disk.sh test_activate_journal_dev 
on my laptop at the moment. I'm taking that opportunity to find the cause.

Actions #6

Updated by Loïc Dachary over 9 years ago

I think what happens is the following sequence:

  • partition 1 is created
  • partprobe is called so the kernel notices (the symlink shows up)
  • the partition table is zapped and the symlink is removed
  • partprobe is not called
  • partition 1 is created
  • partprobe is called, but the kernel thinks the partition already exists and does not trigger a udev event, so the symlink is not created

The only way to reset the kernel's idea of a given device is to zap the partition table and run partprobe. I think symlinks are created reliably and that this does not depend on the machine load.

Actions #7

Updated by Dan van der Ster over 9 years ago

Hi Loic,
In this case, ceph-disk zap doesn't apply. The use-case is that you have, say, 4-5 partitions on a shared journal SSD, and you want to re-use only one of those partitions as the journal for a new OSD. So we don't (and mustn't!) call ceph-disk zap on the SSD device in this case.

Instead we tried changing the guid of the journal partition, but that doesn't trigger udev reliably. So the only reliable method is not to change the guid, as in the current pull req.
Cheers, Dan
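
To make the "don't change the guid" approach concrete, here is a rough sketch of reading the existing partition GUID and deriving the link target from it. It is not the code from the pull request: the function names are hypothetical, the input is assumed to be the text printed by sgdisk --info=<partnum> <device>, and lowercasing the GUID assumes udev names its by-partuuid links in lowercase.

```python
import re


def partition_guid(sgdisk_info):
    """Extract the partition's unique GUID from `sgdisk --info` output.

    Returns the GUID lowercased, or None if no such line is present.
    (Hypothetical helper for illustration.)
    """
    match = re.search(r'^Partition unique GUID:\s*([0-9A-Fa-f-]{36})\s*$',
                      sgdisk_info, re.MULTILINE)
    return match.group(1).lower() if match else None


def journal_link_target(sgdisk_info, by_partuuid_dir='/dev/disk/by-partuuid'):
    """Build the by-partuuid path for the existing GUID (no GUID change)."""
    guid = partition_guid(sgdisk_info)
    if guid is None:
        raise ValueError('no partition GUID found in sgdisk output')
    return by_partuuid_dir + '/' + guid
```

Because the GUID is left untouched, the link under /dev/disk/by-partuuid/ already exists from when the partition was first created, so nothing depends on udev reacting to a change.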

Actions #8

Updated by Loïc Dachary over 9 years ago

Dan van der Ster wrote:

Instead we tried changing the guid of the journal partition, but that doesn't trigger udev reliably. So the only reliable method is not to change the guid, as in the current pull req.

I'm under the impression (although I've never actually tried to prove it) that the udev event will never be called if the guid is modified, even though it makes the by-partuuid symlink obsolete. Not changing the guid (which is what you implemented at https://github.com/ceph/ceph/commit/29eb1350b4acaeabfe1d2b19efedbce22641d8cc ) works around the problem.

This should probably be a bug report against udev?

Actions #9

Updated by Dan van der Ster over 9 years ago

Loic Dachary wrote:

I'm under the impression (although I've never actually tried to prove it) that the udev event will never be called if the guid is modified, even though it makes the by-partuuid symlink obsolete.

On our test cluster changing the guid did trigger udev and make the correct by-partuuid link (and also left behind the old link -- no big deal). On our prod cluster changing the guid did not trigger udev. Both are CentOS 6.6. The only difference is the activity on the journal devs.

Actions #10

Updated by Loïc Dachary over 9 years ago

Ok. Have you ever seen a problem under load where udev would fail to notice the creation / removal of a partition although partprobe / partx is called consistently (i.e. after each creation / removal)?

Actions #11

Updated by Dan van der Ster over 9 years ago

For OSD devices (one dev -- one OSD), I haven't observed any problems, regardless of load. (In this case, the OSD process is not running, so no processes have the device open when the partition is removed or created).

For journal devices, it really depends. Removing 1 out of 5 journal partitions (when the other 4 are still used by active OSDs) is not really doable, in my experience with CentOS 6. I never found the combination of partprobe / partx that made the OSD realize the new ptable. The only reliable method to remove/recreate partitions on a shared journal dev was to stop all OSDs using that dev, then adjust the ptable, then restart the OSDs.

Actions #12

Updated by Loïc Dachary over 9 years ago

Thanks for explaining, that makes me slightly less worried about the reliability of the partition/udev notification couple in the most common case :-)

Actions #13

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to firefly
  • Regression set to No
Actions #15

Updated by Loïc Dachary over 8 years ago

  • Status changed from Pending Backport to Resolved