Bug #37486

closed

tmpfs in /var/lib/ceph/osd/X sometimes created with wrong permissions

Added by Paul Emmerich over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I encountered a strange issue on a customer's system today: some OSDs didn't come up after a reboot, and which ones failed was somewhat random. I tracked it down to the ownership of /var/lib/ceph/osd/ceph-*.
The tmpfs was created and mounted properly, and all files existed and seemed correct (except the block.db symlink for some reason; it probably failed before getting that far?). However, most files had the wrong owner:

  • /var/lib/ceph/osd/ceph-X was owned by root:root
  • /var/lib/ceph/osd/ceph-X/block was owned by ceph:ceph
  • other files in the directory were owned by root:root

This wouldn't be a problem if the block symlink were also owned by root.

But it causes ceph-volume to fail because of fs.protected_symlinks, which prevents dereferencing the symlink in this case (world-writable tmpfs mountpoint owned by root, symlink owned by ceph, target of the symlink owned by root).
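The rule behind this can be sketched in a few lines. This is only a model of the check described in the kernel's sysctl documentation (a symlink in a sticky, world-writable directory is followed only if the follower owns the symlink or the directory owner matches the symlink owner), not the kernel code itself. The uid for ceph and the 1777 directory mode are assumptions for illustration.

```python
import stat

def symlink_follow_blocked(dir_mode, dir_uid, link_uid, follower_uid):
    """Model of fs.protected_symlinks=1: True if following the link is refused."""
    sticky = bool(dir_mode & stat.S_ISVTX)
    world_writable = bool(dir_mode & stat.S_IWOTH)
    if not (sticky and world_writable):
        return False  # restriction only applies in sticky world-writable dirs
    if link_uid == follower_uid:
        return False  # the follower owns the symlink
    if dir_uid == link_uid:
        return False  # the directory owner owns the symlink
    return True

ROOT_UID = 0
CEPH_UID = 64045                      # hypothetical uid for the ceph user
TMPFS_MODE = stat.S_IFDIR | 0o1777    # assumed mode of the tmpfs mountpoint

# The situation in this report: directory root, symlink ceph, follower root.
broken = symlink_follow_blocked(TMPFS_MODE, ROOT_UID, CEPH_UID, ROOT_UID)
# After chowning the directory to ceph, owner and symlink owner match again.
fixed = symlink_follow_blocked(TMPFS_MODE, CEPH_UID, CEPH_UID, ROOT_UID)
```

Under this model, `broken` is true and `fixed` is false, which matches both the failure and the chown work-around.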

Work-around: disable fs.protected_symlinks or chown the directory to ceph:ceph.
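To see whether the sysctl is in play on a given host before choosing a work-around, its current value can be read from procfs. A minimal sketch; the helper name is made up here, and it returns None where the file does not exist (for example on kernels without this feature).

```python
from pathlib import Path

def protected_symlinks_enabled(proc_file="/proc/sys/fs/protected_symlinks"):
    """Return True/False for the sysctl value, or None if it cannot be read."""
    try:
        return Path(proc_file).read_text().strip() == "1"
    except OSError:
        return None  # file missing or unreadable

if __name__ == "__main__":
    print("fs.protected_symlinks enabled:", protected_symlinks_enabled())
```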

I have no idea how it got into this state in the first place. The weird thing is that this only happens on one specific hardware platform, and only with kernel 4.18. It works fine with kernel 4.9, and it works fine with 4.18 on all other hardware we have ever encountered. We boot the exact same image on a lot of servers.

Our image is Debian, but there's nothing special in our boot routine that could cause this; we leave that part to ceph-volume.
Anyway, I think what causes these permissions in the first place is beside the point. Should we try to make ceph-volume more robust so that it handles this case?

Steps to kind of reproduce:

root@x /var/lib/ceph/osd/ceph-14 $ systemctl stop ceph-osd@14
root@x /var/lib/ceph/osd/ceph-14 $ chown root:root .
root@x /var/lib/ceph/osd/ceph-14 $ ceph-volume lvm activate --all
--> Activating OSD ID 14 FSID 2f8651bb-d404-44bf-b4d2-67c1aa3d5be1
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-20802f35-15ad-438c-aab1-df5b72325ce1/osd-block-2f8651bb-d404-44bf-b4d2-67c1aa3d5be1 --path /var/lib/ceph/osd/ceph-14 --no-mon-config
 stderr: error symlinking /var/lib/ceph/osd/ceph-14/block: (13) Permission denied
-->  RuntimeError: command returned non-zero exit status: 1

I think it should automatically fix the ownership here or at least show a better error message. It took me an hour or so to track this down to fs.protected_symlinks.
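As a rough illustration of what a better error message could look like, the check below compares the owner of the OSD directory with the expected one and returns a hint that names fs.protected_symlinks. This is a sketch, not ceph-volume's actual code: the function name is hypothetical, and the caller is assumed to pass the uid of the ceph user.

```python
import os

def osd_dir_owner_hint(osd_dir, expected_uid):
    """Return a diagnostic string if osd_dir has an unexpected owner, else None."""
    actual_uid = os.stat(osd_dir).st_uid
    if actual_uid == expected_uid:
        return None
    return (
        f"{osd_dir} is owned by uid {actual_uid}, expected uid {expected_uid}. "
        "With fs.protected_symlinks=1 this can make symlinks inside it fail "
        "with 'Permission denied'; chown the directory to the expected owner."
    )
```

An activation step could call this before running prime-osd-dir, for example with `pwd.getpwnam("ceph").pw_uid` as the expected uid, and either print the hint or chown the directory itself.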
