Bug #15553

closed

/var/run/ceph permissions are borked after every reboot

Added by Heath Jepson almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/var/run/ceph permissions are borked after every reboot. I have to run chmod -R 770 /var/run/ceph && chown -R ceph:ceph /var/run/ceph before any ceph daemons will start.

Permission is denied when the daemons attempt to write the pid file.

I'm running on debian jessie and the latest jewel release candidate. I also had this issue with infernalis, but used a dirty hack in the init scripts as a temporary fix.


Files

ceph.conf (4.92 KB) - Heath Jepson, 04/22/2016 09:59 AM

Related issues 1 (1 open, 0 closed)

Related to devops - Bug #15583: systemd warns about TasksMax= setting on older distros (New, 04/23/2016)

Actions #1

Updated by Nathan Cutler almost 8 years ago

  • Project changed from Ceph to devops
Actions #2

Updated by Nathan Cutler almost 8 years ago

On RPM-based systems, we are installing a file /usr/lib/tmpfiles.d/ceph-common.conf which, when systemd-tmpfiles --create is run, causes /run/ceph to be created with permissions 0770 and ownership ceph.ceph.

On these systems, /var/run is a symlink to /run.

I'm wondering if, possibly, this file /usr/lib/tmpfiles.d/ceph-common.conf is not getting created on Jessie?
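
For what it's worth, a quick way to exercise the tmpfiles mechanism by hand (assuming the file really is shipped at the path above) would be something like:

ls -l /usr/lib/tmpfiles.d/ceph-common.conf
systemd-tmpfiles --create /usr/lib/tmpfiles.d/ceph-common.conf
ls -ld /run/ceph

If the last command then shows drwxrwx--- with ceph:ceph ownership, the tmpfiles entry itself is fine and the question becomes why it isn't being applied at boot.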

Actions #3

Updated by Nathan Cutler almost 8 years ago

I just looked, and it seems to me that /usr/lib/tmpfiles.d/ceph.conf is getting created properly, and systemd-tmpfiles --create should be run by the ceph-common postinst script.

Can you show us the output of the following commands on your system (after reboot)?

ls -ld /var/run
ls -ld /run/ceph
ls -l /run/ceph
ls -l /usr/lib/tmpfiles.d/ceph.conf
cat /usr/lib/tmpfiles.d/ceph.conf

It could also be a misconfiguration on your system - maybe the script that is supposed to run systemd-tmpfiles --create at boot time is not getting run, or is somehow disabled?
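
On a systemd-based system the boot-time pass is normally done by systemd-tmpfiles-setup.service, so a rough way to check whether that ran and succeeded is:

systemctl status systemd-tmpfiles-setup.service
journalctl -b -u systemd-tmpfiles-setup.service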

Actions #4

Updated by Heath Jepson almost 8 years ago

Here is what I get after rebooting. None of the ceph processes start on their own (checked with ps aux | grep ceph).

root@elara:~# ls -ld /var/run
lrwxrwxrwx 1 root root 4 Mar 5 18:51 /var/run -> /run
root@elara:~# ls -ld /run/ceph
drwxrwx--- 11 ceph ceph 220 Apr 21 19:27 /run/ceph
root@elara:~# ls -l /run/ceph
total 0
drwxr-xr-x 2 root root 40 Apr 21 19:24 mds.fs1
drwxr-xr-x 2 root root 40 Apr 21 19:24 mon.elara
drwxr-xr-x 2 root root 40 Apr 21 19:24 osd.0
drwxr-xr-x 2 root root 40 Apr 21 19:24 osd.1
drwxr-xr-x 2 root root 40 Apr 21 19:25 osd.2
drwxr-xr-x 2 root root 40 Apr 21 19:25 osd.3
drwxr-xr-x 2 root root 40 Apr 21 19:26 osd.4
drwxr-xr-x 2 root root 40 Apr 21 19:26 osd.5
drwxr-xr-x 2 root root 40 Apr 21 19:27 osd.6
root@elara:~# ls -l /usr/lib/tmpfiles.d/ceph.conf
-rw-r--r-- 1 root root 29 Apr 12 13:11 /usr/lib/tmpfiles.d/ceph.conf
root@elara:~# cat /usr/lib/tmpfiles.d/ceph.conf
d /run/ceph 0770 ceph ceph -

Actions #5

Updated by Nathan Cutler almost 8 years ago

That sheds light - thanks. We see that /run/ceph has the correct permissions, so systemd-tmpfiles --create must be getting run. The problem is that the directories inside /run/ceph are created with root.root ownership.

Actions #6

Updated by Nathan Cutler almost 8 years ago

Did you do anything special to make ceph daemons create subdirectories under /var/run/ceph? Can you post your ceph.conf?

Actions #7

Updated by Heath Jepson almost 8 years ago

You may have nailed it. Here are the first 3 lines of my config. I've been using this config in various iterations since firefly.

[global]
log file = /var/log/ceph/$type.$id/log
pid file = /var/run/ceph/$type.$id/pid

In a separate report, http://tracker.ceph.com/issues/15554, I've been having trouble with my custom osd directories since infernalis. I'm not very impressed with systemd; upstart seemed to work so much better.

I have a small production firefly cluster that I'd like to upgrade to jewel, but it's running a near-identical config to this test cluster.

Actions #8

Updated by Heath Jepson almost 8 years ago

Attached my full config. Don't know if it'll be useful though.

Thanks!

Actions #9

Updated by Anonymous almost 8 years ago

Heath Jepson wrote:

You may have nailed it. Here are the first 3 lines of my config. I've been using this config in various iterations since firefly.
[global]
log file = /var/log/ceph/$type.$id/log
pid file = /var/run/ceph/$type.$id/pid

On my setup here, this fails horribly:

failed to open pid file '/var/run/ceph/mon.ceph-foo/pid': (2) No such file or directory.

As far as I can tell from the global_init() code, this kind of directory would only be created by a ceph daemon if you override the default run_dir in your ceph.conf, which you clearly do not. That leads me to believe there is some outside script or tool creating these directories for you as root. Do you have some relic tooling that's somehow being invoked?

To get ceph to create these directories on your behalf, try adding the following to the global section of your ceph.conf:

run dir = /var/run/ceph/$type.$id/

Please let us know if this helps.
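
For illustration, with that addition the start of the [global] section would look roughly like this (the paths simply mirror the ones already quoted above):

[global]
log file = /var/log/ceph/$type.$id/log
pid file = /var/run/ceph/$type.$id/pid
run dir = /var/run/ceph/$type.$id/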

Actions #10

Updated by Heath Jepson almost 8 years ago

The only relic tooling would be whatever a clean install of debian jessie and ceph infernalis might have left behind.

Adding that line fixed the pidfile problems, yay! But I upgraded from release candidate 10.1.2 to release 10.2.0 and all hell broke loose with systemd. It doesn't even mount the data dirs anymore, and no ceph daemons can be started via systemd.

The only way I can start my cluster is to manually mount my XFS filesystems then run:

node 1:
ceph-mon --id elara --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 5 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph-mds -i fs1 --setuser ceph --setgroup ceph

node 2:
ceph-osd --cluster ceph --id 7 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 8 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 9 --setuser ceph --setgroup ceph
ceph-osd --cluster ceph --id 10 --setuser ceph --setgroup ceph

Yuck. I haven't even been mucking around with systemd and it messes up this badly.

Here's all I get out of systemd after a reboot plus manually trying to restart one OSD:
root@elara:~# systemctl status ceph
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
Loaded: loaded (/etc/init.d/ceph)
Active: active (exited) since Fri 2016-04-22 17:55:53 MDT; 23min ago
Process: 2083 ExecStop=/etc/init.d/ceph stop (code=exited, status=0/SUCCESS)
Process: 2090 ExecStart=/etc/init.d/ceph start (code=exited, status=0/SUCCESS)
root@elara:~# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled)
Active: inactive (dead)

Apr 22 17:55:38 elara systemd[1]: [/lib/systemd/system/ceph-osd@.service:18] Unknown lvalue 'TasksMax' in section 'Service'
root@elara:~# systemctl restart ceph-osd@0
root@elara:~# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled)
Active: failed (Result: start-limit) since Fri 2016-04-22 17:55:46 MDT; 243ms ago
Process: 2073 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Process: 2034 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
Main PID: 2073 (code=exited, status=1/FAILURE)

Apr 22 17:55:46 elara systemd[1]: Unit entered failed state.
Apr 22 17:55:46 elara systemd[1]: start request repeated too quickly, refusing to start.
Apr 22 17:55:46 elara systemd[1]: Failed to start Ceph object storage daemon.
Apr 22 17:55:46 elara systemd[1]: Unit entered failed state.

Actions #11

Updated by Nathan Cutler almost 8 years ago

TasksMax was introduced in systemd 227. Which version of systemd are you running?
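
If it helps, the installed version can be checked with either of:

systemctl --version
dpkg -s systemd | grep '^Version'

(For comparison, Debian Jessie shipped systemd 215, well below 227.)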

Actions #12

Updated by Nathan Cutler almost 8 years ago

  • Related to Bug #15583: systemd warns about TasksMax= setting on older distros added
Actions #13

Updated by Nathan Cutler almost 8 years ago

Two more notes. First, I opened #15583 for the unsupported TasksMax= issue you just tripped over. Second, though TasksMax= was introduced in upstream systemd 227, your distro likely supports it even though the version is less than 227.

In short, please update your system and try again.

Actions #14

Updated by Nathan Cutler almost 8 years ago

As the systemd maintainers just told me, the "Unknown lvalue 'TasksMax' in section 'Service'" message is benign: the presence of an unsupported/unrecognized "TasksMax=" setting in the unit file does not prevent the unit from starting.
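
As a quick sanity check that the warning is indeed harmless, the unit's state can be inspected directly, e.g.:

systemctl show -p LoadState,ActiveState,SubState ceph-osd@0

A LoadState of "loaded" despite the warning confirms the unit file is still being parsed and used.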

This issue was originally about the pidfile permissions problem - is that now resolved?

(It looks like ceph-disk is not mounting the OSDs properly - if that problem persists, please open a different issue for it.)

Actions #15

Updated by Heath Jepson almost 8 years ago

Pidfile problem is resolved, thanks Nathan! You rock!

I created http://tracker.ceph.com/issues/15596 for the ceph-disk issue.

Actions #16

Updated by Nathan Cutler almost 8 years ago

  • Status changed from New to Resolved