Bug #16451

Using ceph-deploy with --zap-disk and --dmcrypt fails

Added by Brian Andrus almost 4 years ago. Updated about 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature:

Description

Description of problem:
When ceph-deploy is run with both the --zap-disk and --dmcrypt options, it seems to call the zap function in ceph-disk without first unmounting the lockbox partition that is mounted under /var/lib/ceph/osd-lockbox. The zap then times out (5 timeouts of 60 seconds each) and the OSD creation fails.

Version-Release number of selected component (if applicable):
tested:
ceph-deploy: 1.5.24, 1.5.30, 1.5.34
ceph-disk: v10.2.0, v10.2.1, v10.2.2

How reproducible:
100%

Steps to Reproduce:
1. ceph-deploy osd create --zap-disk --dmcrypt host:sd{a..b}

Actual results:
The OSD creation times out while waiting on udevadm. Note that the osd-lockbox is not unmounted, which may or may not be by design. Also note that the sgdisk zap is run against the drive while the lockbox partition is still mounted (which fails).
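To make the "partition is still mounted" condition easy to confirm before a zap, here is a minimal sketch of a check against /proc/mounts. It is only an illustration, not code from ceph-disk or ceph-deploy; the function name mounted_partitions and the matching rules (a plain numeric suffix for names like /dev/sda3, a "p" separator for devices whose name ends in a digit, such as /dev/nvme0n1p3) are assumptions.

```python
def mounted_partitions(device, mounts_text):
    """Return (partition, mountpoint) pairs for mounted partitions of `device`.

    `mounts_text` is the content of /proc/mounts. Hypothetical helper,
    not part of ceph-disk.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        source, mountpoint = fields[0], fields[1]
        if not source.startswith(device):
            continue
        suffix = source[len(device):]
        # Devices whose name ends in a digit (e.g. /dev/nvme0n1) separate
        # the partition number with "p"; others (e.g. /dev/sda) do not.
        if device[-1].isdigit():
            if not suffix.startswith("p"):
                continue
            suffix = suffix[1:]
        if suffix.isdigit():
            found.append((source, mountpoint))
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        for partition, mountpoint in mounted_partitions("/dev/sda", handle.read()):
            print("%s is still mounted on %s" % (partition, mountpoint))
```

If this reports /dev/sda3 mounted under /var/lib/ceph/osd-lockbox, the kernel will most likely refuse to re-read the partition table after the zap, which matches the partprobe errors in the log below.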

Expected results:
The OSD creation should succeed.

Additional info:

[redacted-host][WARNIN] populate: Mounting lockbox mount -t ext4 /dev/sda3 /var/lib/ceph/osd-lockbox/redacted
[redacted-host][WARNIN] command_check_call: Running command: /bin/mount -t ext4 /dev/sda3 /var/lib/ceph/osd-lockbox/redacted
[redacted-host][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd-lockbox/redacted/osd-uuid.3089.tmp
[redacted-host][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[redacted-host][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[redacted-host][WARNIN] command_check_call: Running command: /usr/bin/ceph config-key put dm-crypt/osd/redacted/luks iE0N25NKgkvOlZxfnN9IEJBfBwO6HCcM0oZeIuowpgFuxsn/yLxz8hDmXZzesQY3MKI1wPWkyzETpV+dw0yBECX/TbAldHqTxYj/W+d6zbKkVe61TABZfIYxjdS+KFu80QaFGlHqBnY5Gj3rXalHE/qquS81XUvsXfafAFTqY8E=
[redacted-host][WARNIN] value stored
[redacted-host][WARNIN] command: Running command: /usr/bin/ceph auth get-or-create client.osd-lockbox.redacted mon allow command "config-key get" with key="dm-crypt/osd/redacted/luks" 
[redacted-host][WARNIN] create_key: stderr 
[redacted-host][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd-lockbox/redacted/key-management-mode.3089.tmp
[redacted-host][WARNIN] adjust_symlink: Creating symlink /var/lib/ceph/osd-lockbox/8dc95d04-65a7-4dee-97d4-6b5ff1117f0d -> /var/lib/ceph/osd-lockbox/redacted
[redacted-host][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd-lockbox/redacted/journal-uuid.3089.tmp
[redacted-host][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd-lockbox/redacted/magic.3089.tmp
[redacted-host][WARNIN] command_check_call: Running command: /sbin/sgdisk --typecode=3:fb3aabf9-d25f-47cc-bf5e-721d1816496b -- /dev/sda
[redacted-host][DEBUG ] Warning: The kernel is still using the old partition table.
[redacted-host][DEBUG ] The new table will be used at the next reboot.
[redacted-host][DEBUG ] The operation has completed successfully.
[redacted-host][WARNIN] get_dm_uuid: get_dm_uuid /dev/sda uuid path is /sys/dev/block/8:0/dm/uuid
[redacted-host][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[redacted-host][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[redacted-host][WARNIN] get_dm_uuid: get_dm_uuid /dev/sda uuid path is /sys/dev/block/8:0/dm/uuid
[redacted-host][WARNIN] zap: Zapping partition table on /dev/sda
[redacted-host][WARNIN] command_check_call: Running command: /sbin/sgdisk --zap-all -- /dev/sda
[redacted-host][WARNIN] Caution: invalid backup GPT header, but valid main header; regenerating
[redacted-host][WARNIN] backup header from main header.
[redacted-host][WARNIN] 
[redacted-host][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
[redacted-host][WARNIN] on the recovery & transformation menu to examine the two tables.
[redacted-host][WARNIN] 
[redacted-host][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[redacted-host][WARNIN] 
[redacted-host][DEBUG ] ****************************************************************************
[redacted-host][DEBUG ] Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
[redacted-host][DEBUG ] verification and recovery are STRONGLY recommended.
[redacted-host][DEBUG ] ****************************************************************************
[redacted-host][DEBUG ] Warning: The kernel is still using the old partition table.
[redacted-host][DEBUG ] The new table will be used at the next reboot.
[redacted-host][DEBUG ] GPT data structures destroyed! You may now partition the disk using fdisk or
[redacted-host][DEBUG ] other utilities.
[redacted-host][WARNIN] command_check_call: Running command: /sbin/sgdisk --clear --mbrtogpt -- /dev/sda
[redacted-host][DEBUG ] Creating new GPT entries.
[redacted-host][DEBUG ] Warning: The kernel is still using the old partition table.
[redacted-host][DEBUG ] The new table will be used at the next reboot.
[redacted-host][DEBUG ] The operation has completed successfully.
[redacted-host][WARNIN] update_partition: Calling partprobe on zapped device /dev/sda
[redacted-host][WARNIN] command_check_call: Running command: /sbin/udevadm settle --timeout=600
[redacted-host][WARNIN] command: Running command: /sbin/partprobe /dev/sda
[redacted-host][WARNIN] update_partition: partprobe /dev/sda failed : Error: Partition(s) 3 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
[redacted-host][WARNIN]  (ignored, waiting 60s)
[redacted-host][WARNIN] command_check_call: Running command: /sbin/udevadm settle --timeout=600
[redacted-host][WARNIN] command: Running command: /sbin/partprobe /dev/sda
[redacted-host][WARNIN] update_partition: partprobe /dev/sda failed : Error: Partition(s) 3 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
[redacted-host][WARNIN]  (ignored, waiting 60s)
[redacted-host][WARNIN] command_check_call: Running command: /sbin/udevadm settle --timeout=600
[redacted-host][WARNIN] command: Running command: /sbin/partprobe /dev/sda
[redacted-host][WARNIN] update_partition: partprobe /dev/sda failed : Error: Partition(s) 3 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
[redacted-host][WARNIN]  (ignored, waiting 60s)

Eventually times out.
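For anyone triaging a similar log, the following sketch pulls out which partitions partprobe reported as in use. It is a hypothetical helper, not part of any Ceph tool; the regular expression is written against the error lines above, and the assumption that several partitions would be listed as comma-separated numbers is untested.

```python
import re

# Matches lines such as:
#   update_partition: partprobe /dev/sda failed : Error: Partition(s) 3 on /dev/sda ...
BUSY_RE = re.compile(
    r"partprobe (\S+) failed : Error: Partition\(s\) ([\d, ]+) on "
)


def busy_partitions(log_text):
    """Map each device to the sorted partition numbers reported as in use."""
    busy = {}
    for match in BUSY_RE.finditer(log_text):
        device = match.group(1)
        numbers = {int(n) for n in re.findall(r"\d+", match.group(2))}
        busy.setdefault(device, set()).update(numbers)
    return {device: sorted(numbers) for device, numbers in busy.items()}
```

Run against the log in this report, it should point at partition 3 of /dev/sda, which is the lockbox partition mounted earlier in the same log.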


Related issues

Blocked by Ceph-deploy - Bug #14099: do not call partx / partprobe when zapping a device Resolved 12/17/2015

History

#1 Updated by Loic Dachary almost 4 years ago

  • Description updated (diff)

#2 Updated by Loic Dachary almost 4 years ago

  • Assignee deleted (Loic Dachary)

#3 Updated by Loic Dachary almost 4 years ago

  • Blocked by Bug #14099: do not call partx / partprobe when zapping a device added

#4 Updated by Loic Dachary almost 4 years ago

Could you please confirm that the same happens even when https://github.com/ceph/ceph-deploy/pull/400/files is manually applied to ceph-deploy ?

#5 Updated by Alfredo Deza almost 4 years ago

@loic ceph-deploy v1.5.34 includes the fix to not call partx/partprobe when zapping, and the description says this was also tried with v1.5.34.

http://docs.ceph.com/ceph-deploy/docs/changelog.html#id2

I don't see how ceph-deploy can be blocking this, since it is not calling partx/partprobe at all.

#6 Updated by Alfredo Deza almost 4 years ago

  • Blocked by deleted (Bug #14099: do not call partx / partprobe when zapping a device)

#7 Updated by Loic Dachary almost 4 years ago

  • Blocked by Bug #14099: do not call partx / partprobe when zapping a device added

#8 Updated by Loic Dachary almost 4 years ago

@Brian since Alfredo claims this behavior is entirely unrelated to ceph-deploy, it would be more relevant to create an issue in http://tracker.ceph.com/projects/ceph/issues/new with a ceph-disk reproducer that does not involve ceph-deploy. What do you think ?

#9 Updated by Loic Dachary almost 4 years ago

@Alfredo Brian tried with ceph-deploy 1.5.24, 1.5.30 and 1.5.34, and since only 1.5.34 does not call partprobe, there still seems to be a possibility that the ceph-deploy partprobe races with the ceph-disk partprobe for 1.5.24 and 1.5.30.

#10 Updated by Brian Andrus almost 4 years ago

Sounds fair, I'll see what I can do to reproduce without ceph-deploy, though our current project has now moved past this so it might be next week.

#11 Updated by Alfredo Deza about 2 years ago

  • Status changed from New to Closed

No longer a problem in ceph-deploy, since we've dropped ceph-disk as a backend (as of version 2.0.0).
