Bug #37486

closed

tmpfs in /var/lib/ceph/osd/X sometimes created with wrong permissions

Added by Paul Emmerich over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I encountered a strange issue on a customer's system today: some OSDs didn't come up after a reboot, and which ones failed was somewhat random. I tracked it down to the permissions of /var/lib/ceph/osd/ceph-*.
The tmpfs was created and mounted properly, and all files existed (except the block.db symlink for some reason; it probably failed before that point?) and seemed correct. However, most files had the wrong owner:

  • /var/lib/ceph/osd/ceph-X was owned by root:root
  • /var/lib/ceph/osd/ceph-X/block was owned by ceph:ceph
  • other files in the directory were owned by root:root

This wouldn't be a problem if the block symlink was also owned by root.

But it causes ceph-volume to fail because of fs.protected_symlinks, which prevents dereferencing the symlink in this case (sticky, world-writable tmpfs mountpoint owned by root, symlink owned by ceph, and the process following the symlink running as root).
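
For reference, the fs.protected_symlinks rule can be modelled in a few lines of Python. This is a simplified sketch based on the kernel's sysctl documentation, not kernel code; the function name and the uid values in the usage note are illustrative assumptions.

```python
import stat


def may_follow_symlink(dir_mode, dir_uid, link_uid, follower_uid, protected=True):
    """Simplified model of the fs.protected_symlinks check.

    dir_mode/dir_uid describe the directory containing the symlink,
    link_uid is the symlink's owner, follower_uid is the uid of the
    process trying to follow it.
    """
    if not protected:
        # fs.protected_symlinks=0: no restriction at all.
        return True
    sticky_world_writable = bool(dir_mode & stat.S_ISVTX) and bool(dir_mode & stat.S_IWOTH)
    if not sticky_world_writable:
        # The restriction only applies inside sticky world-writable directories.
        return True
    if follower_uid == link_uid:
        # A process may always follow its own symlinks.
        return True
    if dir_uid == link_uid:
        # Symlinks owned by the directory owner are trusted.
        return True
    return False
```

With mode 1777, a root-owned directory, a symlink owned by some other uid and a follower running as root, the function returns False, which matches the failing state described above. It returns True once the directory owner equals the symlink owner or the sysctl is disabled.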

Workaround: disable fs.protected_symlinks or chown the directory to ceph:ceph.
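
To check whether an OSD directory is in this state before applying either workaround, something like the following could be used. It is a minimal sketch: `mismatched_symlinks` is a hypothetical helper, not part of ceph-volume. It only compares owners and does not change anything.

```python
import os
import stat


def mismatched_symlinks(osd_dir):
    """List symlinks in osd_dir whose owner differs from the directory owner.

    Only sticky, world-writable directories are considered, because
    that is the only case in which fs.protected_symlinks applies.
    """
    dir_st = os.lstat(osd_dir)
    sticky_ww = stat.S_ISVTX | stat.S_IWOTH
    if (dir_st.st_mode & sticky_ww) != sticky_ww:
        return []
    flagged = []
    for name in sorted(os.listdir(osd_dir)):
        path = os.path.join(osd_dir, name)
        st = os.lstat(path)  # lstat: inspect the link itself, not its target
        if stat.S_ISLNK(st.st_mode) and st.st_uid != dir_st.st_uid:
            flagged.append(path)
    return flagged
```

Calling `mismatched_symlinks("/var/lib/ceph/osd/ceph-14")` on an affected host should list the `block` symlink. An empty list means the directory and its symlinks share one owner, or the directory is not sticky and world-writable.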

I have no idea how it got into this situation in the first place. The weird thing is that this only happens on one specific hardware configuration and only with kernel 4.18. It works fine with kernel 4.9, and it works fine with 4.18 on all other hardware that we have encountered. We boot the exact same image on a lot of servers.

Our image is Debian, but there's nothing special in our boot routine that could cause this; we leave that part to ceph-volume.
Anyway, I think what causes these permissions in the first place is beside the point. Should we make ceph-volume more robust so that it handles this case?

Steps to kind of reproduce:

root@x /var/lib/ceph/osd/ceph-14 $ systemctl stop ceph-osd@14
root@x /var/lib/ceph/osd/ceph-14 $ chown root:root .
root@x /var/lib/ceph/osd/ceph-14 $ ceph-volume lvm activate --all
--> Activating OSD ID 14 FSID 2f8651bb-d404-44bf-b4d2-67c1aa3d5be1
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-20802f35-15ad-438c-aab1-df5b72325ce1/osd-block-2f8651bb-d404-44bf-b4d2-67c1aa3d5be1 --path /var/lib/ceph/osd/ceph-14 --no-mon-config
 stderr: error symlinking /var/lib/ceph/osd/ceph-14/block: (13) Permission denied
-->  RuntimeError: command returned non-zero exit status: 1

I think it should automatically fix the ownership here or at least show a better error message. It took me an hour or so to track this down to fs.protected_symlinks.
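
As an illustration of what a friendlier error might look like, here is a hypothetical helper (not existing ceph-volume code) that recognizes the stderr shown above and appends a hint. The matched strings are taken from the output in this report; the wording of the hint is an assumption.

```python
def explain_prime_failure(stderr, osd_path):
    """Return stderr, plus a hint when it looks like the symlink EACCES case."""
    if "error symlinking" in stderr and "Permission denied" in stderr:
        hint = (
            "hint: check the ownership of %s and of the symlinks inside it; "
            "with fs.protected_symlinks=1 a root-owned sticky directory "
            "containing a ceph-owned symlink triggers this error" % osd_path
        )
        return stderr.rstrip() + "\n" + hint
    # Any other failure is passed through unchanged.
    return stderr
```

A caller would pass the captured stderr of the prime-osd-dir command together with the OSD path and print the result in place of the raw stderr.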

Actions #1

Updated by Alfredo Deza over 5 years ago

  • Status changed from New to 12
  • Assignee set to Alfredo Deza
Actions #2

Updated by Alfredo Deza over 5 years ago

What version of Ceph was this? There was a problem with permissions a while ago, see issue: http://tracker.ceph.com/issues/24661

Specifically, this change helped: https://github.com/ceph/ceph/pull/22462/files#diff-d96a27274e642aeac837f66e2d406dc5R103

Using 13.2.0 I can't see the problem:

root@node9:/var/lib/ceph/osd/ceph-13# ls -alh
total 28K
drwxrwxrwt  2 ceph ceph  200 Dec  7 23:38 .
drwxr-xr-x 55 ceph ceph 4.0K Dec  5 17:04 ..
lrwxrwxrwx  1 ceph ceph   99 Dec  7 23:38 block -> /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62
lrwxrwxrwx  1 root root  106 Dec  7 23:38 block.db -> /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214
-rw-------  1 ceph ceph   37 Dec  7 23:38 ceph_fsid
-rw-------  1 ceph ceph   37 Dec  7 23:38 fsid
-rw-------  1 ceph ceph   56 Dec  7 23:38 keyring
-rw-------  1 ceph ceph    6 Dec  7 23:38 ready
-rw-------  1 ceph ceph   10 Dec  7 23:38 type
-rw-------  1 ceph ceph    3 Dec  7 23:38 whoami

root@node9:/var/lib/ceph/osd/ceph-13# systemctl stop ceph-osd@13

root@node9:/var/lib/ceph/osd/ceph-13# chown -R root:root .
root@node9:/var/lib/ceph/osd/ceph-13# ls -alh
total 28K
drwxrwxrwt  2 root root  200 Dec  7 23:38 .
drwxr-xr-x 55 ceph ceph 4.0K Dec  5 17:04 ..
lrwxrwxrwx  1 root root   99 Dec  7 23:38 block -> /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62
lrwxrwxrwx  1 root root  106 Dec  7 23:38 block.db -> /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214
-rw-------  1 root root   37 Dec  7 23:38 ceph_fsid
-rw-------  1 root root   37 Dec  7 23:38 fsid
-rw-------  1 root root   56 Dec  7 23:38 keyring
-rw-------  1 root root    6 Dec  7 23:38 ready
-rw-------  1 root root   10 Dec  7 23:38 type
-rw-------  1 root root    3 Dec  7 23:38 whoami

root@node9:/var/lib/ceph/osd# ceph --version
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)

root@node9:/var/lib/ceph/osd/ceph-13# cat fsid
059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
root@node9:/var/lib/ceph/osd/ceph-13# ceph-volume lvm activate 13 059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62 --path /var/lib/ceph/osd/ceph-13
Running command: /bin/ln -snf /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62 /var/lib/ceph/osd/ceph-13/block
Running command: /bin/chown -R ceph:ceph /dev/dm-8
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-13
Running command: /bin/ln -snf /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214 /var/lib/ceph/osd/ceph-13/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-5
Running command: /bin/systemctl enable ceph-volume@lvm-13-059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
Running command: /bin/systemctl start ceph-osd@13
--> ceph-volume lvm activate successful for osd ID: 13
root@node9:/var/lib/ceph/osd/ceph-13# ls -alh
total 28K
drwxrwxrwt  2 ceph ceph  200 Dec  7 23:41 .
drwxr-xr-x 55 ceph ceph 4.0K Dec  5 17:04 ..
lrwxrwxrwx  1 ceph ceph   99 Dec  7 23:41 block -> /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62
lrwxrwxrwx  1 root root  106 Dec  7 23:41 block.db -> /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214
-rw-------  1 ceph ceph   37 Dec  7 23:41 ceph_fsid
-rw-------  1 ceph ceph   37 Dec  7 23:41 fsid
-rw-------  1 ceph ceph   56 Dec  7 23:41 keyring
-rw-------  1 ceph ceph    6 Dec  7 23:41 ready
-rw-------  1 ceph ceph   10 Dec  7 23:41 type
-rw-------  1 ceph ceph    3 Dec  7 23:41 whoami

# cat /etc/os-release
NAME="Ubuntu" 
VERSION="16.04 LTS (Xenial Xerus)" 
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04 LTS" 
VERSION_ID="16.04" 
HOME_URL="http://www.ubuntu.com/" 
SUPPORT_URL="http://help.ubuntu.com/" 
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" 
UBUNTU_CODENAME=xenial
Actions #3

Updated by Paul Emmerich over 5 years ago

13.2.2 on Debian Stretch.

The problem was that root owns . while ceph owns the symlink. In your example root owns everything, which is fine. Run chown without -R to reproduce.

Actions #4

Updated by Alfredo Deza over 5 years ago

  • Status changed from 12 to In Progress
root@node9:/var/lib/ceph/osd/ceph-13# ls -alh
total 28K
drwxrwxrwt  2 root root  200 Dec 10 19:23 .
drwxr-xr-x 55 ceph ceph 4.0K Dec  5 17:04 ..
lrwxrwxrwx  1 ceph ceph   99 Dec 10 19:23 block -> /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62
lrwxrwxrwx  1 root root  106 Dec 10 19:23 block.db -> /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214
-rw-------  1 root root   37 Dec 10 19:23 ceph_fsid
-rw-------  1 root root   37 Dec 10 19:23 fsid
-rw-------  1 root root   56 Dec 10 19:23 keyring
-rw-------  1 root root    6 Dec 10 19:23 ready
-rw-------  1 root root   10 Dec 10 19:23 type
-rw-------  1 root root    3 Dec 10 19:23 whoami
root@node9:/var/lib/ceph/osd/ceph-13# ceph-volume lvm activate 13 059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62 --path /var/lib/ceph/osd/ceph-13
 stderr: error symlinking
 stderr: /var/lib/ceph/osd/ceph-13/block: (13) Permission denied
-->  RuntimeError: command returned non-zero exit status: 1
root@node9:/var/lib/ceph/osd/ceph-13# CEPH_VOLUME_DEBUG=1 ceph-volume lvm activate 13 059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62 --path /var/lib/ceph/osd/ceph-13
 stderr: error symlinking /var/lib/ceph/osd/ceph-13/block: (13) Permission denied
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    main.Volume()
  File "/usr/lib/python2.7/dist-packages/ceph_volume/main.py", line 37, in __init__
    self.main(self.argv)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, in dispatch
    instance.main()
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/main.py", line 38, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, in dispatch
    instance.main()
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 318, in main
    self.activate(args)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 242, in activate
    activate_bluestore(lvs, no_systemd=args.no_systemd)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 154, in activate_bluestore
    '--path', osd_path])
  File "/usr/lib/python2.7/dist-packages/ceph_volume/process.py", line 149, in run
    raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 1
Actions #5

Updated by Alfredo Deza over 5 years ago

Potential working fix:

diff --git a/src/ceph-volume/ceph_volume/devices/lvm/activate.py b/src/ceph-volume/ceph_volume/devices/lvm/activate.py
index acebfe123b..d13de5d9cc 100644
--- a/src/ceph-volume/ceph_volume/devices/lvm/activate.py
+++ b/src/ceph-volume/ceph_volume/devices/lvm/activate.py
@@ -152,6 +152,7 @@ def activate_bluestore(lvs, no_systemd=False):
     wal_device_path = get_osd_device_path(osd_lv, lvs, 'wal', dmcrypt_secret=dmcrypt_secret)

     # Once symlinks are removed, the osd dir can be 'primed again.
+    system.chown(osd_path)
     prime_command = [
         'ceph-bluestore-tool', '--cluster=%s' % conf.cluster,
         'prime-osd-dir', '--dev', osd_lv_path,
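
The point of the change is ordering: the OSD directory is chowned before it is primed. The sketch below restates that ordering as plain command lists, matching the "Running command" lines in the activation log that follows. `activation_commands` is a hypothetical name used only for illustration; it is not a ceph-volume function.

```python
def activation_commands(osd_path, osd_lv_path, cluster="ceph"):
    """Return the first two activation commands in the order the patch implies."""
    # Fix ownership of the directory (and its contents) first ...
    chown_cmd = ["chown", "-R", "ceph:ceph", osd_path]
    # ... so that priming no longer trips over a ceph-owned symlink
    # inside a root-owned sticky directory.
    prime_cmd = [
        "ceph-bluestore-tool", "--cluster=%s" % cluster,
        "prime-osd-dir", "--dev", osd_lv_path,
        "--path", osd_path, "--no-mon-config",
    ]
    return [chown_cmd, prime_cmd]
```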

Was able to activate correctly:

(tmp) root@node9:/var/lib/ceph/osd/ceph-13# chown -h ceph:ceph block
(tmp) root@node9:/var/lib/ceph/osd/ceph-13# ls -alh
total 28K
drwxrwxrwt  2 root root  200 Dec 10 21:05 .
drwxr-xr-x 55 ceph ceph 4.0K Dec  5 17:04 ..
lrwxrwxrwx  1 ceph ceph   99 Dec 10 21:05 block -> /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62
lrwxrwxrwx  1 root root  106 Dec 10 21:05 block.db -> /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214
-rw-------  1 root root   37 Dec 10 21:05 ceph_fsid
-rw-------  1 root root   37 Dec 10 21:05 fsid
-rw-------  1 root root   56 Dec 10 21:05 keyring
-rw-------  1 root root    6 Dec 10 21:05 ready
-rw-------  1 root root   10 Dec 10 21:05 type
-rw-------  1 root root    3 Dec 10 21:05 whoami
(tmp) root@node9:/var/lib/ceph/osd/ceph-13# cd ../
(tmp) root@node9:/var/lib/ceph/osd# systemctl stop ceph-osd@13
(tmp) root@node9:/var/lib/ceph/osd# ceph-volume lvm activate 13 059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-13
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62 --path /var/lib/ceph/osd/ceph-13 --no-mon-config
Running command: /bin/ln -snf /dev/ceph-block-c5019075-26a5-4ccc-af1e-006598f7ee64/osd-block-d64b56fc-db27-4510-81ee-7a224ffdbb62 /var/lib/ceph/osd/ceph-13/block
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-13/block
Running command: /bin/chown -R ceph:ceph /dev/dm-8
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-13
Running command: /bin/ln -snf /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214 /var/lib/ceph/osd/ceph-13/block.db
Running command: /bin/chown -h ceph:ceph /dev/ceph-block-dbs-843d6828-7c62-4545-91bb-5a185c0dd829/osd-block-db-d5d5f1ae-6bbf-41da-86ea-4eeeb7be9214
Running command: /bin/chown -R ceph:ceph /dev/dm-5
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-13/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-5
Running command: /bin/systemctl enable ceph-volume@lvm-13-059d5aa6-bbfa-42b2-9c52-fe0bf23d9647
Running command: /bin/systemctl enable --runtime ceph-osd@13
Running command: /bin/systemctl start ceph-osd@13
--> ceph-volume lvm activate successful for osd ID: 13
(tmp) root@node9:/var/lib/ceph/osd# echo $?
0

Will follow up with functional tests.

Actions #8

Updated by Alfredo Deza over 5 years ago

  • Status changed from In Progress to Resolved