Bug #57976

ceph-volume lvm activate removes /var/lib/ceph/osd/ceph-XXX folder and then chokes on itself

Added by Janek Bevendorff 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I create a new OSD and try to activate it, the activation step removes the /var/lib/ceph/osd/ceph-XXX mount folder (that was previously created by the prepare command) and then fails to activate the OSD.

This issue occurs when I try to add new (SSD) OSDs to my cluster.

Steps to reproduce:

ceph-volume lvm zap /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0
    --> Zapping: /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0
    --> --destroy was not specified, but zapping a whole device will remove the partition table
    Running command: /usr/bin/dd if=/dev/zero of=/dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0 bs=1M count=10 conv=fsync
     stderr: 10+0 records in
    10+0 records out
    10485760 bytes (10 MB, 10 MiB) copied, 0.0216828 s, 484 MB/s
    --> Zapping successful for: <Raw Device: /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0>

ceph-volume lvm prepare --data /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0
    Running command: /usr/bin/ceph-authtool --gen-print-key
    Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 4f054fea-ea30-4966-a886-801cb14b62f9
    Running command: vgcreate --force --yes ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185 /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0
     stdout: Physical volume "/dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0" successfully created.
     stdout: Volume group "ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185" successfully created
    Running command: lvcreate --yes -l 228928 -n osd-block-4f054fea-ea30-4966-a886-801cb14b62f9 ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185
     stdout: Logical volume "osd-block-4f054fea-ea30-4966-a886-801cb14b62f9" created.
    Running command: /usr/bin/ceph-authtool --gen-print-key
    Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1255
    --> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185/osd-block-4f054fea-ea30-4966-a886-801cb14b62f9
    Running command: /usr/bin/chown -R ceph:ceph /dev/dm-16
    Running command: /usr/bin/ln -s /dev/ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185/osd-block-4f054fea-ea30-4966-a886-801cb14b62f9 /var/lib/ceph/osd/ceph-1255/block
    Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1255/activate.monmap
     stderr: got monmap epoch 29
    Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-1255/keyring --create-keyring --name osd.1255 --add-key XXXX
     stdout: creating /var/lib/ceph/osd/ceph-1255/keyring
    added entity osd.1255 auth(key=XXXX)
    Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1255/keyring
    Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1255/
    Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1255 --monmap /var/lib/ceph/osd/ceph-1255/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-1255/ --osd-uuid 4f054fea-ea30-4966-a886-801cb14b62f9 --setuser ceph --setgroup ceph
    --> ceph-volume lvm prepare successful for: /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:12:0

At this point, the OSD tmpfs is mounted and populated:

ls /var/lib/ceph/osd/ceph-1255
    activate.monmap
    bfm_blocks
    bfm_blocks_per_key
    bfm_bytes_per_block
    bfm_size
    block
    bluefs
    ceph_fsid
    fsid
    keyring
    kv_backend
    magic
    mkfs_done
    osd_key
    ready
    type
    whoami

But running ceph-volume lvm activate removes the folder and then fails to start the new OSD:

ceph-volume lvm activate --all
    --> OSD ID 763 FSID 93cd5072-8097-4c32-b843-762e6c9bd46f process is active. Skipping activation
    --> OSD ID 3 FSID ccf0a00b-b0e7-4963-b6f6-0e34342abb0a process is active. Skipping activation
    --> OSD ID 613 FSID 81dea1c7-0ae8-4091-8339-024f8723dd29 process is active. Skipping activation
    --> OSD ID 78 FSID 521fa96f-d0af-4106-b6d3-5e3048b8199e process is active. Skipping activation
    --> OSD ID 1002 FSID e42e678d-334d-4337-8b7e-0c9fd9eaca6e process is active. Skipping activation
    --> OSD ID 1169 FSID cfae82e0-8fc7-4ee1-965b-d85d3360ef7a process is active. Skipping activation
    --> OSD ID 687 FSID 49442d45-75e5-47fc-bf26-2ccdef5fbaef process is active. Skipping activation
    --> OSD ID 157 FSID 749d55ad-6315-4fdf-952f-25984be7552a process is active. Skipping activation
    --> OSD ID 461 FSID 412673e6-21a1-44d8-8f86-9a4e7d53c206 process is active. Skipping activation
    --> OSD ID 384 FSID b9b7ac14-c9e9-4bc7-b00b-6984adce0753 process is active. Skipping activation
    --> OSD ID 841 FSID 6657df87-86e5-4d2c-962b-fafd5a1f9d30 process is active. Skipping activation
    --> OSD ID 921 FSID 9627acbd-c73d-405b-8485-c25dbaa91c99 process is active. Skipping activation
    --> OSD ID 1086 FSID 0151a4b8-de7b-401c-af25-5fb167875363 process is active. Skipping activation
    --> OSD ID 539 FSID 5b94f35d-d09d-4a02-909c-6857684d1e41 process is active. Skipping activation
    --> OSD ID 240 FSID ba3c9f42-45d4-4338-bd2b-b56dd8ea2ddb process is active. Skipping activation
    --> Activating OSD ID 1255 FSID 4f054fea-ea30-4966-a886-801cb14b62f9
    Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1255
    Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185/osd-block-4f054fea-ea30-4966-a886-801cb14b62f9 --path /var/lib/ceph/osd/ceph-1255 --no-mon-config
    Running command: /usr/bin/ln -snf /dev/ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185/osd-block-4f054fea-ea30-4966-a886-801cb14b62f9 /var/lib/ceph/osd/ceph-1255/block
    Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1255/block
    Running command: /usr/bin/chown -R ceph:ceph /dev/dm-16
    Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1255
    Running command: /usr/bin/systemctl enable ceph-volume@lvm-1255-4f054fea-ea30-4966-a886-801cb14b62f9
     stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1255-4f054fea-ea30-4966-a886-801cb14b62f9.service → /lib/systemd/system/ceph-volume@.service.
    Running command: /usr/bin/systemctl enable --runtime ceph-osd@1255
     stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@1255.service → /lib/systemd/system/ceph-osd@.service.
    Running command: /usr/bin/systemctl start ceph-osd@1255
     stderr: Job for ceph-osd@1255.service failed because the control process exited with error code.
    See "systemctl status ceph-osd@1255.service" and "journalctl -xe" for details.
    -->  RuntimeError: command returned non-zero exit status: 1

Afterwards, the folder is gone:

ls /var/lib/ceph/osd/ceph-1255
    ls: cannot access '/var/lib/ceph/osd/ceph-1255': No such file or directory
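
To tell apart the possible states of the data directory at this point (path missing, plain directory, or mounted tmpfs), a small check along these lines may help. This is only a sketch: osd_dir_state is a hypothetical helper name, not part of ceph-volume, and it assumes the mountpoint utility from util-linux is installed.

```shell
# Hypothetical helper (not part of ceph-volume): report the state of an OSD data directory.
# Usage: osd_dir_state <osd-id> [base-dir]
osd_dir_state() {
    osd_id="$1"
    base="${2:-/var/lib/ceph/osd}"   # default location seen in this report
    dir="$base/ceph-$osd_id"

    if [ ! -e "$dir" ]; then
        # The path does not exist at all (the state after the failed activation).
        echo "missing"
    elif mountpoint -q "$dir" 2>/dev/null; then
        # Something (e.g. the tmpfs from the prepare step) is mounted here.
        echo "mounted"
    else
        # Path exists but nothing is mounted on it.
        # This branch is also reached if the mountpoint utility is unavailable.
        echo "not-mounted"
    fi
}
```

For the OSD in this report, osd_dir_state 1255 would be expected to print "mounted" right after the prepare step and "missing" after the failed activation.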

Creating the folder manually fixes the problem:

mkdir /var/lib/ceph/osd/ceph-1255

ceph-volume lvm activate --all
    --> OSD ID 763 FSID 93cd5072-8097-4c32-b843-762e6c9bd46f process is active. Skipping activation
    --> OSD ID 3 FSID ccf0a00b-b0e7-4963-b6f6-0e34342abb0a process is active. Skipping activation
    --> OSD ID 613 FSID 81dea1c7-0ae8-4091-8339-024f8723dd29 process is active. Skipping activation
    --> OSD ID 78 FSID 521fa96f-d0af-4106-b6d3-5e3048b8199e process is active. Skipping activation
    --> OSD ID 1002 FSID e42e678d-334d-4337-8b7e-0c9fd9eaca6e process is active. Skipping activation
    --> OSD ID 1169 FSID cfae82e0-8fc7-4ee1-965b-d85d3360ef7a process is active. Skipping activation
    --> OSD ID 687 FSID 49442d45-75e5-47fc-bf26-2ccdef5fbaef process is active. Skipping activation
    --> OSD ID 157 FSID 749d55ad-6315-4fdf-952f-25984be7552a process is active. Skipping activation
    --> OSD ID 461 FSID 412673e6-21a1-44d8-8f86-9a4e7d53c206 process is active. Skipping activation
    --> OSD ID 384 FSID b9b7ac14-c9e9-4bc7-b00b-6984adce0753 process is active. Skipping activation
    --> OSD ID 841 FSID 6657df87-86e5-4d2c-962b-fafd5a1f9d30 process is active. Skipping activation
    --> OSD ID 921 FSID 9627acbd-c73d-405b-8485-c25dbaa91c99 process is active. Skipping activation
    --> OSD ID 1086 FSID 0151a4b8-de7b-401c-af25-5fb167875363 process is active. Skipping activation
    --> OSD ID 539 FSID 5b94f35d-d09d-4a02-909c-6857684d1e41 process is active. Skipping activation
    --> OSD ID 240 FSID ba3c9f42-45d4-4338-bd2b-b56dd8ea2ddb process is active. Skipping activation
    --> Activating OSD ID 1255 FSID 4f054fea-ea30-4966-a886-801cb14b62f9
    Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1255
    Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185/osd-block-4f054fea-ea30-4966-a886-801cb14b62f9 --path /var/lib/ceph/osd/ceph-1255 --no-mon-config
    Running command: /usr/bin/ln -snf /dev/ceph-e7ca3aa6-4794-459c-9dcd-b8973f53b185/osd-block-4f054fea-ea30-4966-a886-801cb14b62f9 /var/lib/ceph/osd/ceph-1255/block
    Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1255/block
    Running command: /usr/bin/chown -R ceph:ceph /dev/dm-16
    Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1255
    Running command: /usr/bin/systemctl enable ceph-volume@lvm-1255-4f054fea-ea30-4966-a886-801cb14b62f9
    Running command: /usr/bin/systemctl enable --runtime ceph-osd@1255
    Running command: /usr/bin/systemctl start ceph-osd@1255
    --> ceph-volume lvm activate successful for osd ID: 1255
    --> OSD ID 314 FSID 7d72f481-bda9-4c5f-b9a8-36220ac8a461 process is active. Skipping activation

ls /var/lib/ceph/osd/ceph-1255
    block
    ceph_fsid
    fsid
    keyring
    ready
    require_osd_release
    type
    whoami
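
The manual workaround above can be wrapped in a small helper that recreates the directory only if it is missing before activation is retried. This is a sketch under assumptions, not a fix for the underlying problem: ensure_osd_dir is a hypothetical name, and the default base path is simply the one seen in the logs above.

```shell
# Hypothetical helper (not part of ceph-volume): recreate a missing OSD data directory.
# Usage: ensure_osd_dir <osd-id> [base-dir]
ensure_osd_dir() {
    osd_id="$1"
    base="${2:-/var/lib/ceph/osd}"   # default location seen in this report
    dir="$base/ceph-$osd_id"

    if [ -d "$dir" ]; then
        # Nothing to do; leave an existing directory (or mount) untouched.
        echo "exists $dir"
    else
        # Same effect as the manual mkdir in the report; -p also creates the base dir.
        mkdir -p "$dir" || return 1
        echo "created $dir"
    fi
}
```

For example, running ensure_osd_dir 1255 followed by ceph-volume lvm activate --all mirrors the steps shown above. Ownership is deliberately left alone, since the activation log shows chown -R ceph:ceph being run on the directory anyway.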

ceph-volume lvm create, which bundles the two steps (prepare and activate), suffers from the same issue.

History

#1 Updated by Janek Bevendorff 3 months ago

Looks like the problem is gone after a full reboot. No idea what was going on, but it was reproducible on all nodes.
