Bug #15907

10.2.1 more pidfile permission problems

Added by Heath Jepson almost 8 years ago. Updated almost 8 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I previously had pidfile permission problems because my config was lacking the following (resolved via http://tracker.ceph.com/issues/15553):

run dir = /var/run/ceph/$type.$id/

After upgrading from 10.2.0 to 10.2.1, I'm now having more pidfile problems and none of the daemons are starting.

Running

chown -R ceph:ceph /var/run/ceph
systemctl start ceph

hangs for a very long time and no daemons start. I must resort to starting the daemons manually:

ceph-mon -i <mon id> --setuser ceph --setgroup ceph
ceph-osd -i <osd id> --setuser ceph --setgroup ceph
ceph-mds -i <mds id> --setuser ceph --setgroup ceph

I get this when checking to see if there are any signs of life with ps aux | grep ceph:

root 4422 0.0 0.0 22488 2612 pts/0 S+ 20:27 0:00 systemctl start ceph
root 4424 0.0 0.0 4468 1732 ? Ss 20:27 0:00 /bin/sh /etc/init.d/ceph start
root 5579 0.0 0.0 10076 840 ? S 20:29 0:00 timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.2 --keyring=/srv/ceph/osd/osd.2/keyring osd crush create-or-move -- 2 0.9092 host=elara root=default
root 5580 0.4 0.0 669948 22364 ? Sl 20:29 0:00 /usr/bin/python /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.2 --keyring=/srv/ceph/osd/osd.2/keyring osd crush create-or-move -- 2 0.9092 host=elara root=default
root 5614 0.0 0.0 12732 2300 pts/1 S+ 20:30 0:00 grep ceph

My config is attached.

Thanks in advance for your help!

ceph.conf (5.37 KB) Heath Jepson, 05/17/2016 04:44 PM

History

#1 Updated by Nathan Cutler almost 8 years ago

  • Status changed from New to Need More Info

Please provide: your full ceph.conf, platform (distro/version), and output of the following commands:

systemctl is-enabled ceph.target
systemctl is-enabled ceph-osd.target
systemctl is-enabled ceph-mon.target
systemctl is-enabled ceph-mds.target
ls -ld /var/lib/ceph
ls -l /var/lib/ceph

#2 Updated by Nathan Cutler almost 8 years ago

BTW this line is disturbing:

root 4424 0.0 0.0 4468 1732 ? Ss 20:27 0:00 /bin/sh /etc/init.d/ceph start

That script should be completely disabled. You should not be running it at all.
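
For reference, getting it out of the picture on a Debian system might look roughly like the sketch below; the exact cleanup steps are an assumption on my part, not an official procedure, so adapt them to your packaging.

# Drop the old sysvinit boot symlinks so /etc/init.d/ceph is never run at boot,
# then have systemd re-read its unit configuration.
update-rc.d -f ceph remove
systemctl daemon-reload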

#3 Updated by Heath Jepson almost 8 years ago

Could you elaborate on what is disturbing about /etc/init.d/ceph? All I did after upgrading to 10.2.1 was notice that none of the ceph daemons would start, so I tried running systemctl start ceph; they also could not be started via systemd back on 10.2.0.

I'm running Debian jessie; the output of uname -r is 4.4.0-0.bpo.1-amd64.

I have another node that was a fresh install (originally on jessie with 10.2.0), and it exhibits all the same problems as the other 2 nodes.

I attached my config originally, but now I don't see it, so I attached it again.

systemctl is-enabled ceph.target
disabled

systemctl is-enabled ceph-osd.target
Failed to get unit file state for ceph-osd.target: No such file or directory

systemctl is-enabled ceph-mon.target
Failed to get unit file state for ceph-mon.target: No such file or directory

systemctl is-enabled ceph-mds.target
Failed to get unit file state for ceph-mds.target: No such file or directory

ls -ld /var/lib/ceph
drwxr-x--- 1 ceph ceph 102 Apr 22 18:47 /var/lib/ceph

ls -l /var/lib/ceph
total 0
drwxr-xr-x 1 ceph ceph 0 Apr 20 13:33 bootstrap-mds
drwxr-xr-x 1 ceph ceph 0 Apr 20 13:33 bootstrap-osd
drwxr-xr-x 1 ceph ceph 0 Apr 20 13:33 bootstrap-rgw
drwxr-xr-x 1 ceph ceph 0 Apr 20 13:33 mds
drwxr-xr-x 1 ceph ceph 0 Apr 20 13:33 mon
drwxr-xr-x 1 ceph ceph 0 Apr 20 13:33 osd
drwxr-xr-x 1 ceph ceph 46 Apr 22 17:43 tmp

#4 Updated by Nathan Cutler almost 8 years ago

  • Status changed from Need More Info to Rejected
  • Assignee set to Nathan Cutler

I'm rejecting this because it's not a bug. Please read the 10.2.0 release notes, especially this section: http://docs.ceph.com/docs/master/release-notes/#major-changes-from-hammer

As of Jewel, sysvinit is no longer supported on Debian. You seem to have deleted the systemd unit files, or moved them somewhere where systemd can't find them. Without the unit files, you will not be able to start/stop the daemons in the usual way.

The command to start/stop all the daemons at once is systemctl start ceph.target - the .target part is important. But you'll have to fix your installation first, so systemd can find the unit files.
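
For clarity, once the unit files are back in place, the day-to-day commands look roughly like this (the sanity checks at the end are just suggestions, and the unit file path may vary by distro):

# Start or stop every Ceph daemon on this host at once; note the .target suffix.
systemctl start ceph.target
systemctl stop ceph.target

# Sanity check: systemd should be able to list the installed Ceph unit files.
systemctl list-unit-files 'ceph*'
ls /lib/systemd/system/ceph*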

#5 Updated by Heath Jepson almost 8 years ago

I read the release notes regularly and in detail; I follow ceph VERY closely and have a couple of successful installations that I've been maintaining for the last 3 years.

I'm keenly aware that you switched to systemd with infernalis and jewel.

I did not delete anything. If apt-get install ceph does not install all the needed files, what should I be doing?

This behavior is being exhibited with a clean installation of jewel 10.2.0 on a clean installation of Debian jessie. Shouldn't upgrading from 10.2.0 to 10.2.1 replace any missing files? Are you sure that http://tracker.ceph.com/issues/14941 was handled properly for upgrades, or does it only work when doing a clean installation of 10.2.1?

I used the exact config I'm using now on infernalis (except for run dir =) and it worked fine, but with jewel it's falling apart. It doesn't make sense.

#6 Updated by Nathan Cutler almost 8 years ago

  • Status changed from Rejected to Duplicate

Sorry, I wasn't aware of #15573 - you're right; this really is a bug. Changing status from "Rejected" to "Duplicate".

The fix is https://github.com/ceph/ceph/pull/8815

#7 Updated by Heath Jepson almost 8 years ago

Glad I'm not losing my mind here. I thought this fix was going to be in 10.2.1, but apparently it isn't?

Fighting this in my home lab has been preventing me from upgrading another production firefly cluster for weeks now. If this fix works, I may actually be able to upgrade that production cluster over Memorial Day weekend. Crossing my fingers!

Will we see the fix included with 10.2.2?

#8 Updated by Heath Jepson almost 8 years ago

I copied the missing files from GitHub and still got nothing after enabling only the target files. I have no clue if this was the right thing to do, but I just broke down and enabled everything, and for the first time since infernalis it finally works! (A condensed sketch follows the per-node lists below.)

Yeah... the OSD IDs are wacky, it's a test cluster. Don't judge ;)

on titan:

systemctl enable ceph.target
systemctl enable ceph-mon.target
systemctl enable ceph-osd.target
systemctl enable ceph-mds.target
systemctl enable ceph-mon@titan
systemctl enable ceph-osd@10
systemctl enable ceph-osd@8
systemctl enable ceph-osd@9
systemctl enable ceph-osd@16
systemctl enable ceph-osd@17
systemctl enable ceph-osd@18
systemctl enable ceph-osd@7
systemctl enable ceph-mds@fs3

on elara:

systemctl enable ceph.target
systemctl enable ceph-mon.target
systemctl enable ceph-osd.target
systemctl enable ceph-mds.target
systemctl enable ceph-mon@elara
systemctl enable ceph-osd@0
systemctl enable ceph-osd@1
systemctl enable ceph-osd@2
systemctl enable ceph-osd@3
systemctl enable ceph-osd@4
systemctl enable ceph-osd@5
systemctl enable ceph-osd@6
systemctl enable ceph-osd@13
systemctl enable ceph-osd@14
systemctl enable ceph-osd@15
systemctl enable ceph-mds@fs1

on athena:

systemctl enable ceph.target
systemctl enable ceph-mon.target
systemctl enable ceph-osd.target
systemctl enable ceph-mds.target
systemctl enable ceph-mon@titan
systemctl enable ceph-osd@10
systemctl enable ceph-osd@8
systemctl enable ceph-osd@9
systemctl enable ceph-osd@16
systemctl enable ceph-osd@17
systemctl enable ceph-osd@18
systemctl enable ceph-osd@7
systemctl enable ceph-mds@fs2
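
For anyone hitting the same thing, the per-node lists above boil down to something like this sketch; the hostname lookup and the daemon IDs here are placeholders, so substitute whatever mon/osd/mds instances actually live on the node.

#!/bin/sh
# Enable the umbrella targets first, then every instantiated unit on this node.
systemctl enable ceph.target ceph-mon.target ceph-osd.target ceph-mds.target
systemctl enable "ceph-mon@$(hostname -s)"   # assumes the mon id matches the short hostname
for id in 0 1 2; do                          # placeholder OSD ids
    systemctl enable "ceph-osd@$id"
done
systemctl enable "ceph-mds@fs1"              # placeholder MDS name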

#9 Updated by Heath Jepson almost 8 years ago

Argh, I can't edit my comments. That last one was for node athena, not titan, but you get the idea.

I'm going for a walk; I need to de-stress. I've been pulling my hair out over this problem.

Thanks for your help, Nathan!

#10 Updated by Nathan Cutler almost 8 years ago

What you did (enabling "everything") is the right way to go.

If an instantiated service is disabled (e.g. due to a packaging/deployment glitch or because you did systemctl disable ceph-mds@fs2.service), that service will no longer be started automatically - neither at boot time nor manually using systemctl start ceph-mds.target or systemctl start ceph.target. You would have to start it explicitly, i.e. systemctl start ceph-mds@fs2.service.
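
In other words, something like the following, using ceph-mds@fs2 from above as the example instance:

# Re-enable the instantiated unit so the targets pull it in again at boot
# and on "systemctl start ceph.target" ...
systemctl enable ceph-mds@fs2.service
# ... and start it explicitly this one time.
systemctl start ceph-mds@fs2.service
# Check whether a given instance is enabled.
systemctl is-enabled ceph-mds@fs2.service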
