Bug #10987 (closed)

ceph-disk broken for RHEL 7 due to usage of partx

Added by Itamar Landsman about 9 years ago. Updated about 9 years ago.

Status:
Can't reproduce
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
other

Description

ceph-disk fails on RHEL 7 when trying to create partitions. This also breaks ceph-deploy when deploying new OSDs.
The reason is that ceph-disk uses partx with old parameters on any Red Hat-based machine; this crashes with an error on RHEL 7 and its derivatives.

The fix would be to instruct ceph-disk to use partprobe in the case of RHEL 7.
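
For illustration only, here is a minimal sketch of the kind of conditional this fix implies (not the actual ceph-disk code; the update_partition_table name and the distro/release arguments stand in for whatever detection and command wrapping ceph-disk really does):

import subprocess

def update_partition_table(dev, distro, release):
    """Ask the kernel to re-read the partition table of dev (sketch only)."""
    redhat_based = 'red hat' in distro.lower() or 'centos' in distro.lower()
    if redhat_based and not release.startswith('7'):
        # Existing behaviour on older Red Hat-based releases: the old partx
        # invocation, whose exit status is not treated as fatal.
        subprocess.call(['partx', '-a', dev])
    else:
        # Proposed fix: on RHEL/CentOS 7 call partprobe instead of the old
        # partx invocation that errors out there.
        subprocess.check_call(['partprobe', dev])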


Files

out.txt (56 KB), RHEL 7.0 test output, Loïc Dachary, 03/07/2015 05:45 PM

Related issues: 1 (0 open, 1 closed)

Related to devops - Bug #7334: ceph-disk: cannot run partprobe on used devices with EL6 (Resolved, Alfredo Deza, 02/04/2014)

Actions #1

Updated by Itamar Landsman about 9 years ago

Added a pull request against main: https://github.com/ceph/ceph/pull/3838
It can be applied to any ceph-disk from firefly and up.

Actions #2

Updated by Loïc Dachary about 9 years ago

Could you please include the ceph-deploy command you used as well as its output? It will help people facing the same problem: searching for strings from the error message will then find this issue ;-)

Actions #3

Updated by Alfredo Deza about 9 years ago

  • Tracker changed from Bug to Fix
  • Project changed from 18 to Ceph
Actions #4

Updated by Loïc Dachary about 9 years ago

  • Tracker changed from Fix to Bug
  • Project changed from Ceph to 18
Actions #5

Updated by Loïc Dachary about 9 years ago

  • Project changed from 18 to Ceph
Actions #6

Updated by Loïc Dachary about 9 years ago

  • Status changed from New to Need More Info

I don't think this is an actual bug. Can you show the error you get? partx displays what looks like an error but actually does the right thing (i.e. refreshes the kernel view of the partition table), and the activate that follows the prepare should work. If it does not work for you, please include a sequence of steps to reproduce the problem as well as the output you get.
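
To illustrate the point (a sketch, not part of ceph-disk; the device path is just the one used in this report), one can run partx, ignore its exit status, and then confirm from /proc/partitions that the kernel did pick up the partitions despite the message:

import subprocess

dev = '/dev/vdb'               # example: the OSD disk from this report
name = dev.rsplit('/', 1)[-1]  # "vdb"

subprocess.call(['partx', '-a', dev])  # may print "error adding partition ..."

# If the kernel view was refreshed, the partitions are listed here regardless
# of what partx printed above.
with open('/proc/partitions') as f:
    seen = [l.split()[-1] for l in f if l.split() and l.split()[-1].startswith(name)]
print('kernel sees:', seen)  # e.g. ['vdb', 'vdb1', 'vdb2']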

Actions #7

Updated by Alfredo Deza about 9 years ago

  • Status changed from Need More Info to 12

I say this is a bug. Here is the Bugzilla ticket that has better/more information:

https://bugzilla.redhat.com/show_bug.cgi?id=1195694

Actions #8

Updated by Loïc Dachary about 9 years ago

The relevant information from https://bugzilla.redhat.com/show_bug.cgi?id=1195694, which is not publicly accessible, is as follows:

What happens when `ceph-disk` is run manually on that server (vs. doing it with ceph-deploy)?

After removing the /dev/vdb1 partition
[root@ceph ~]# ceph-disk prepare /dev/vdb

*******************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.

The operation has completed successfully.
partx: /dev/vdb: error adding partition 2
The operation has completed successfully.
partx: /dev/vdb: error adding partitions 1-2
meta-data=/dev/vdb1              isize=2048   agcount=4, agsize=262079 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=1048315, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The operation has completed successfully.
partx: /dev/vdb: error adding partitions 1-2

What does `ceph-disk list` say?

Before removing vdb1 and vdc1 partitions
[root@ceph ~]# ceph-disk list
/dev/vda :
/dev/vda1 other, xfs, mounted on /
/dev/vdb :
/dev/vdb1 other
/dev/vdc :
/dev/vdc1 other

Can you tell if the udev rules are working correctly? (see if there is anything worth noting in udev logs)

Feb 12 21:07:09 localhost systemd: Starting udev Kernel Socket.
Feb 12 21:07:09 localhost systemd: Listening on udev Kernel Socket.
Feb 12 21:07:09 localhost systemd: Starting udev Control Socket.
Feb 12 21:07:09 localhost systemd: Listening on udev Control Socket.
Feb 12 21:07:09 localhost systemd: Starting dracut pre-udev hook...
Feb 12 21:07:09 localhost systemd: Started dracut pre-udev hook.
Feb 12 21:07:09 localhost systemd: Starting udev Kernel Device Manager...
Feb 12 21:07:09 localhost systemd-udevd[205]: starting version 208
Feb 12 21:07:09 localhost systemd: Started udev Kernel Device Manager.
Feb 12 21:07:09 localhost systemd: Starting udev Coldplug all Devices...
Feb 12 21:07:09 localhost systemd: Started udev Coldplug all Devices.
Feb 12 21:07:11 localhost systemd: Stopping udev Coldplug all Devices...
Feb 12 21:07:11 localhost systemd: Stopped udev Coldplug all Devices.
Feb 12 21:07:11 localhost systemd: Stopping udev Kernel Device Manager...
Feb 12 21:07:11 localhost systemd: Stopped udev Kernel Device Manager.
Feb 12 21:07:11 localhost systemd: Stopping dracut pre-udev hook...
Feb 12 21:07:11 localhost systemd: Stopped dracut pre-udev hook.
Feb 12 21:07:11 localhost systemd: Stopping udev Kernel Socket.
Feb 12 21:07:11 localhost systemd: Closed udev Kernel Socket.
Feb 12 21:07:11 localhost systemd: Stopping udev Control Socket.
Feb 12 21:07:11 localhost systemd: Closed udev Control Socket.
Feb 12 21:07:11 localhost systemd: Starting Cleanup udevd DB...
Feb 12 21:07:11 localhost systemd: Started Cleanup udevd DB.
Feb 12 21:07:11 localhost systemd: Started udev Coldplug all Devices.
Feb 12 21:07:11 localhost systemd: Starting udev Kernel Device Manager...
Feb 12 21:07:11 localhost systemd: Started udev Kernel Device Manager.
Feb 12 21:07:11 localhost systemd-udevd[366]: starting version 208
Feb 12 21:07:17 localhost dracut: * Including module: udev-rules *
Feb 12 21:07:17 localhost dracut: Skipping udev rule: 91-permissions.rules
Feb 25 03:03:27 localhost systemd: Starting udev Kernel Socket.
Feb 25 03:03:27 localhost systemd: Listening on udev Kernel Socket.
Feb 25 03:03:27 localhost systemd: Starting udev Control Socket.
Feb 25 03:03:27 localhost systemd: Listening on udev Control Socket.
Feb 25 03:03:27 localhost systemd: Starting dracut pre-udev hook...
Feb 25 03:03:27 localhost systemd: Started dracut pre-udev hook.
Feb 25 03:03:27 localhost systemd: Starting udev Kernel Device Manager...
Feb 25 03:03:27 localhost systemd-udevd[206]: starting version 208
Feb 25 03:03:27 localhost systemd: Started udev Kernel Device Manager.
Feb 25 03:03:27 localhost systemd: Starting udev Coldplug all Devices...
Feb 25 03:03:27 localhost systemd: Started udev Coldplug all Devices.
Feb 25 03:03:28 localhost systemd: Stopping udev Coldplug all Devices...
Feb 25 03:03:28 localhost systemd: Stopped udev Coldplug all Devices.
Feb 25 03:03:28 localhost systemd: Stopping udev Kernel Device Manager...
Feb 25 03:03:28 localhost systemd: Stopped udev Kernel Device Manager.
Feb 25 03:03:28 localhost systemd: Stopping dracut pre-udev hook...
Feb 25 03:03:28 localhost systemd: Stopped dracut pre-udev hook.
Feb 25 03:03:28 localhost systemd: Stopping udev Kernel Socket.
Feb 25 03:03:28 localhost systemd: Closed udev Kernel Socket.
Feb 25 03:03:28 localhost systemd: Stopping udev Control Socket.
Feb 25 03:03:28 localhost systemd: Closed udev Control Socket.
Feb 25 03:03:28 localhost systemd: Starting Cleanup udevd DB...
Feb 25 03:03:28 localhost systemd: Started Cleanup udevd DB.
Feb 25 03:03:29 localhost systemd: Started udev Coldplug all Devices.
Feb 25 03:03:29 localhost systemd: Starting udev Kernel Device Manager...
Feb 25 03:03:29 localhost systemd: Starting udev Wait for Complete Device Initialization...
Feb 25 03:03:29 localhost systemd-udevd[371]: starting version 208
Feb 25 03:03:29 localhost systemd: Started udev Kernel Device Manager.
Feb 25 03:03:29 localhost systemd: Started udev Wait for Complete Device Initialization.

Travis,

The ceph-deploy zap seems to work fine (the disk is actually converted from MBR to GPT); the issue comes with ceph-deploy prepare, which is not able to create the required partitions on the GPT disk.

[root@ceph ~]# parted -l
Model: Virtio Block Device (virtblk)
Disk /dev/vda: 10.7GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number Start End Size Type File system Flags
1 1049kB 10.7GB 10.7GB primary xfs boot

Model: Virtio Block Device (virtblk)
Disk /dev/vdb: 5369MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags

Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 5369MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags

Actions #9

Updated by Loïc Dachary about 9 years ago

For the purpose of using partprobe instead of partx for RHEL 7, is it right to assume that would also be true for CentOS 7? I assume CentOS 7 closely follows RHEL 7, but that's not something I know for sure. If RHEL 7 is significantly different, testing gets complicated.

Actions #10

Updated by Ken Dreyer about 9 years ago

CentOS aims to be functionally compatible with RHEL (bug-for-bug compatibility). So partprobe should work the same way on CentOS as it does on RHEL.

Actions #11

Updated by Loïc Dachary about 9 years ago

The ceph-deploy zap seems to work fine (the disk is actually converted from MBR to GPT); the issue comes with ceph-deploy prepare, which is not able to create the required partitions on the GPT disk.

It would help a great deal to have a copy of the exact zap + prepare commands that were run and failed, with all the output. I'm having trouble reproducing the problem, and zap + prepare is tested to work consistently on CentOS 7 with partx. I'm missing something.

Actions #12

Updated by Loïc Dachary about 9 years ago

I tried to reproduce the problem on RHEL 7 as follows. It succeeds, so I must be missing a critical step that makes it fail.

[root@vpm031 src]# sgdisk --print /dev/vdb
Creating new GPT entries.
Disk /dev/vdb: 419430400 sectors, 200.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): EFD967D6-C338-4F9F-8FFA-CDE2AD20CB08
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 419430366
Partitions will be aligned on 2048-sector boundaries
Total free space is 419430333 sectors (200.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
[root@vpm031 src]# ceph-disk prepare /dev/vdb
Creating new GPT entries.
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
partx: /dev/vdb: error adding partition 2
Information: Moved requested sector from 204801 to 206848 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
partx: /dev/vdb: error adding partitions 1-2
meta-data=/dev/vdb1              isize=2048   agcount=4, agsize=13100735 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=52402939, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=25587, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The operation has completed successfully.
partx: /dev/vdb: error adding partitions 1-2
[root@vpm031 src]# sgdisk --print /dev/vdb
Disk /dev/vdb: 419430400 sectors, 200.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B0EBBB15-5FBA-4084-8739-4093E148344B
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 419430366
Partitions will be aligned on 2048-sector boundaries
Total free space is 4061 sectors (2.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1          206848       419430366   199.9 GiB   FFFF  ceph data
   2            2048          204800   99.0 MiB    FFFF  ceph journal
[root@vpm031 src]# sgdisk --delete=1 /dev/vdb
The operation has completed successfully.
[root@vpm031 src]# sgdisk --delete=2 /dev/vdb
The operation has completed successfully.
[root@vpm031 src]# sgdisk --print /dev/vdb
Disk /dev/vdb: 419430400 sectors, 200.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B0EBBB15-5FBA-4084-8739-4093E148344B
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 419430366
Partitions will be aligned on 2048-sector boundaries
Total free space is 419430333 sectors (200.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
[root@vpm031 src]# ceph-disk prepare /dev/vdb
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
partx: /dev/vdb: error adding partition 2
Information: Moved requested sector from 204801 to 206848 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
partx: /dev/vdb: error adding partitions 1-2
meta-data=/dev/vdb1              isize=2048   agcount=4, agsize=13100735 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=52402939, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=25587, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The operation has completed successfully.
partx: /dev/vdb: error adding partitions 1-2
[root@vpm031 src]# sgdisk --print /dev/vdb
Disk /dev/vdb: 419430400 sectors, 200.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B0EBBB15-5FBA-4084-8739-4093E148344B
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 419430366
Partitions will be aligned on 2048-sector boundaries
Total free space is 4061 sectors (2.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1          206848       419430366   199.9 GiB   FFFF  ceph data
   2            2048          204800   99.0 MiB    FFFF  ceph journal
[root@vpm031 src]# ceph-disk list 
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sr0 other, iso9660
/dev/vda :
 /dev/vda1 other, xfs, mounted on /
/dev/vdb :
 /dev/vdb1 ceph data, prepared, unknown cluster 8830ef1c-e971-4e0c-ba7a-75ca89595144, journal /dev/vdb2
 /dev/vdb2 ceph journal, for /dev/vdb1
/dev/vdc other, unknown
/dev/vdd other, unknown
[root@vpm031 src]# ceph --version
ceph version 0.93-215-gd44c245 (d44c24517cd98e18c86f7a351de129159ca8cb2b)
[root@vpm031 src]# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:    RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.0 (Maipo)
Release:    7.0
Codename:    Maipo
[root@vpm031 src]# 

Actions #13

Updated by Loïc Dachary about 9 years ago

  • Status changed from 12 to Need More Info
Actions #14

Updated by Loïc Dachary about 9 years ago

  • Project changed from Ceph to devops
Actions #15

Updated by Loïc Dachary about 9 years ago

There are test cases at https://github.com/ceph/ceph/pull/3872 for creating a journal partition on a disk where there already is a journal partition.

Both succeeded on RHEL 7.0, see out.txt. What is the use case that fails?

Actions #16

Updated by Loïc Dachary about 9 years ago

sudo test/ceph-disk.sh test_activate_dev

as found in https://github.com/ceph/ceph/pull/3872 succeeds on RHEL 6.5 with the following changes:
  • /dev/loop devices are replaced with /dev/vdb devices, because the default 2.6 kernel does not support partitions on /dev/loop
  • the 60-ceph-partuuid-workaround.rules udev rule is installed, otherwise /dev/disk/by-partuuid is not populated as expected
Actions #17

Updated by Loïc Dachary about 9 years ago

Here is a use case that can lead to a confusing situation

  • ceph-disk prepare /dev/sdc /dev/sdd # adds /dev/sdd1
  • /var/lib/ceph/osd/osd-0/journal is a symlink to /dev/disk/by-partuuid/XXXX (no check for existence because it may not yet exist)
  • ceph-disk calls partprobe which notices the new partition
  • /dev/disk/by-partuuid is updated and XXXX exists
  • manually removing partition /dev/sdd1 and not calling partx or partprobe
  • /dev/disk/by-partuuid is not updated and now has a stale uuid XXXX
  • ceph-disk prepare /dev/sdc /dev/sdd # adds /dev/sdd1
  • /var/lib/ceph/osd/osd-1/journal is a symlink to /dev/disk/by-partuuid/YYYY (no check for existence because it may not yet exist)
  • ceph-disk calls partprobe which does not notice the new partition, because it did not notice its removal and does not check more than existence / absence (i.e. it fails to notice it has a different uuid etc.)
  • /dev/disk/by-partuuid is not updated and YYYY does not exist, i.e. the journal symlink is stale

The solution is to call partprobe / partx after manually removing / adding a partition, as in the sketch below.
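
A minimal sketch of that remedy (device name and partition number are just the ones from the scenario above; the commands are run via subprocess only for illustration, typing them in a shell works equally well):

import subprocess

dev = '/dev/sdd'  # the journal disk from the scenario above

# Manually delete the partition ...
subprocess.check_call(['sgdisk', '--delete=1', dev])
# ... and immediately tell the kernel, so udev can drop the stale
# /dev/disk/by-partuuid symlink before the next ceph-disk prepare.
subprocess.check_call(['partprobe', dev])
# alternatively: subprocess.call(['partx', '-d', '--nr', '1', dev])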

Actions #18

Updated by Loïc Dachary about 9 years ago

  • Status changed from Need More Info to Can't reproduce
Actions #19

Updated by Loïc Dachary about 9 years ago

After working more with ceph-disk / udev and debugging issues with it in the past few weeks, I think it works as expected. It is however quite easy to create a confusing situation (as described at http://tracker.ceph.com/issues/10987#note-17) which has nothing to do with Ceph and can be mistaken for a bug in how Ceph handles disks. I'm not sure how to make the partition table updates more robust or the udev event chain easier to debug. But I also don't think this is a problem that should be addressed in Ceph itself.

I also think partx / partprobe should not be used by ceph-disk the way they are. It works reliably as it is, but the mixture of partx / partprobe could probably be replaced by a single call to partprobe and no call to partx at all. In order to fix that we first need a robust test environment that not only includes tests on /dev/loop devices (which allow verifying that partition tables are updated when they should be) but also tests for all udev events. In other words, the container-based tests need to be complemented with virtual-machine-based tests that boot an actual virtual machine to exercise the various ceph-disk udev scenarios. When this is covered and we have tests showing it works as it should on all supported platforms (as opposed to testing them manually as we currently do), we will be able to refactor the usage of partx / partprobe with limited risk of regressions.
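
As a sketch of what that simplification might look like (an assumption about a possible refactor, not existing ceph-disk code; the refresh_partition_table name is hypothetical), a single helper that only ever calls partprobe, with a short retry loop for devices still held open:

import subprocess
import time

def refresh_partition_table(dev, attempts=5, delay=1.0):
    """Ask the kernel to re-read dev's partition table, retrying briefly."""
    for _ in range(attempts):
        if subprocess.call(['partprobe', dev]) == 0:
            return True
        time.sleep(delay)  # device may still be held open (e.g. by a udev worker)
    return False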
