Bug #6043

closed

upstart does not reflect running ceph-osd daemons (ubuntu 13.04 only)

Added by Zoltan Arnold Nagy over 10 years ago. Updated almost 10 years ago.

Status:
Won't Fix
Priority:
High
Category:
-
Target version:
-
% Done:
100%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Workaround

Replace reload with restart in the logrotate configuration. The daemons are then restarted on log rotation instead of being sent a signal that makes them reopen their logs gracefully.

perl -pi -e 's/reload/restart/' /etc/logrotate.d/ceph

Original description

Ubuntu 13.04, Ceph from the ceph.com repository (0.67.1-1raring).

According to the documentation (http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), stopping an OSD as shown below should work.

root@zc2store:~# ps faux | grep ceph-osd
root     21656  0.0  0.0   9436   960 pts/2    S+   12:52   0:00          \_ grep --color=auto ceph-osd
root     21475  0.0  0.0   4440   628 ?        Ss   12:47   0:00 /bin/sh -e -c /usr/bin/ceph-osd --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh
root     21476  0.3  0.0 438156 24556 ?        Sl   12:47   0:00  \_ /usr/bin/ceph-osd --cluster=ceph -i 0 -f
root@zc2store:~# /etc/init.d/ceph stop osd.0
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
root@zc2store:~# mount | grep ceph
/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
root@zc2store:~# 

Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #7188: Admin socket files are lost on log rotation calling initctl reload (ubuntu 13.04 only) - Won't Fix - Loïc Dachary - 01/20/2014

Actions #1

Updated by Sage Weil over 10 years ago

  • Status changed from New to Rejected

stop ceph-osd id=0
or
stop ceph-osd-all

Actions #2

Updated by Zoltan Arnold Nagy over 10 years ago

well...

root@signina:~# service ceph-osd id=11
ceph-osd: unrecognized service
root@signina:~# service ceph osd id=11
/etc/init.d/ceph: id=.11 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
root@signina:~# service ceph osd 11
/etc/init.d/ceph: 11. not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

but more importantly:

root@signina:~# ps faux | grep ceph | grep '\-i 11'
root      3047  0.3  0.0 707088 172644 ?       Sl   Aug17  10:31 /usr/bin/ceph-osd --cluster=ceph -i 11 -f
root@signina:~# service ceph osd.11
root@signina:~# ps faux | grep ceph | grep '\-i 11'
root      3047  0.3  0.0 707088 172644 ?       Sl   Aug17  10:31 /usr/bin/ceph-osd --cluster=ceph -i 11 -f
root@signina:~# 
Actions #3

Updated by Sage Weil over 10 years ago

  • Status changed from Rejected to In Progress

Is the ceph package still installed? Some older versions didn't stop the jobs before they were uninstalled, which might explain this.

initctl list | grep ceph

Actions #4

Updated by Zoltan Arnold Nagy over 10 years ago

Getting somewhere, but upstart still can't find the instance.

root@signina:~# ps faux | grep ceph
root      3028  0.3  0.1 786888 267732 ?       Ssl  Aug17  11:54 /usr/bin/ceph-osd --cluster=ceph -i 6 -f
root      3032  0.3  0.0 699544 184632 ?       Sl   Aug17  10:35 /usr/bin/ceph-osd --cluster=ceph -i 8 -f
root      3039  0.3  0.1 765468 250896 ?       Sl   Aug17  12:10 /usr/bin/ceph-osd --cluster=ceph -i 10 -f
root      3043  0.3  0.0 653020 131000 ?       Sl   Aug17  10:54 /usr/bin/ceph-osd --cluster=ceph -i 7 -f
root      3049  0.3  0.1 718976 205100 ?       Sl   Aug17  10:42 /usr/bin/ceph-osd --cluster=ceph -i 9 -f
root      3078  0.3  0.1 767140 252344 ?       Sl   Aug17  13:03 /usr/bin/ceph-osd --cluster=ceph -i 5 -f
root      5076  0.0  0.0   9436   956 pts/4    S+   22:02   0:00          \_ grep --color=auto ceph
root@signina:~# initctl list | grep ceph
ceph-mds-all-starter stop/waiting
ceph-mds-all start/running
ceph-osd-all start/running
ceph-osd-all-starter stop/waiting
ceph-all start/running
ceph-mon-all start/running
ceph-mon-all-starter stop/waiting
ceph-mon stop/waiting
ceph-create-keys stop/waiting
ceph-osd (ceph/6) start/running, process 3028
ceph-mds stop/waiting
root@signina:~# stop ceph-osd id=10
stop: Unknown instance: ceph/10
root@signina:~# 
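The mismatch in the transcript above (six ceph-osd processes, one upstart instance) can be checked mechanically. A minimal sketch, assuming the ps and initctl list output formats shown in this ticket; the function name untracked_osds is hypothetical:

```shell
# untracked_osds PS_OUTPUT INITCTL_OUTPUT
# Print the id of every ceph-osd process found in PS_OUTPUT that has no
# "ceph-osd (<cluster>/<id>)" instance line in INITCTL_OUTPUT.
untracked_osds() {
    # Ids upstart knows about, taken from lines like:
    #   ceph-osd (ceph/6) start/running, process 3028
    tracked=$(printf '%s\n' "$2" | awk '
        $1 == "ceph-osd" && $2 ~ /^\(.+\/.+\)$/ {
            id = $2
            sub(/^\(.*\//, "", id)   # drop "(<cluster>/"
            sub(/\)$/, "", id)       # drop trailing ")"
            print id
        }')

    # Ids of running daemons, taken from the numeric argument after "-i".
    printf '%s\n' "$1" | awk '
        /ceph-osd/ {
            for (i = 1; i < NF; i++)
                if ($i == "-i" && $(i + 1) ~ /^[0-9]+$/) print $(i + 1)
        }' |
    while read -r id; do
        printf '%s\n' "$tracked" | grep -qx "$id" || echo "$id"
    done
}

# On a live host (not run here):
#   untracked_osds "$(ps ax -o args=)" "$(initctl list)"
```

Fed the output quoted above, this reports every OSD except 6, the only instance upstart still lists.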
Actions #5

Updated by Sage Weil over 10 years ago

  • Subject changed from stopping an osd doesn't work on ubuntu to upstart does not reflect running ceph-osd daemons
  • Status changed from In Progress to Need More Info

Is this still a problem? Unless we can figure out the sequence to reproduce this, I'm not sure what to do here. Upstart is generally quite reliable about tracking the running processes. Was there an upgrade involved?

Actions #6

Updated by Hunter Nield over 10 years ago

I seem to be running into a similar issue. I'm running 13.04 with 0.67.2, but it was also happening with 0.61.7. It seems that the admin sockets (for OSDs and mons) get deleted at some point. I've not been able to determine when or how this happens. I'll keep investigating.

Actions #7

Updated by Hunter Nield over 10 years ago

Has anyone else experienced this issue? It seems to be affecting a few others - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/003402.html

Actions #8

Updated by Hunter Nield over 10 years ago

I've done some further investigation on this. It seems that logrotate is the culprit.

After executing the following lines in /etc/logrotate.d/ceph

initctl list \
        | sed -n 's/^\(ceph-\(mon\|osd\|mds\)\+\)[ \t]\+(\([^ \/]\+\)\/\([^ \/]\+\))[ \t]\+start\/.*$/\1 cluster=\3 id=\4/p' \
        | while read l; do
        initctl reload -- $l 2>/dev/null || :
        done

Then the admin sockets are missing from /var/run/ceph, and the parent processes for each mon/osd (e.g. /bin/sh -e -c /usr/bin/ceph-osd --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh) are gone.

If I run initctl reload -- ceph-osd cluster=ceph id=X against a healthy OSD, I can trigger the issue.
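To see which instances the logrotate loop would actually reload, the sed expression quoted above can be run on its own against initctl list output. A minimal sketch, assuming GNU sed (the expression relies on the \| and \+ extensions); the function name running_instances is hypothetical:

```shell
# running_instances
# Read `initctl list` output on stdin and print one
# "<job> cluster=<cluster> id=<id>" line per running mon/osd/mds instance,
# using the same sed expression as /etc/logrotate.d/ceph.
running_instances() {
    sed -n 's/^\(ceph-\(mon\|osd\|mds\)\+\)[ \t]\+(\([^ \/]\+\)\/\([^ \/]\+\))[ \t]\+start\/.*$/\1 cluster=\3 id=\4/p'
}

# On a live host (not run here):
#   initctl list | running_instances
```

Each printed line is exactly the argument string that the loop passes to initctl reload, so jobs without an instance in parentheses (such as ceph-osd-all) never appear in the list.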

Actions #9

Updated by Hunter Nield over 10 years ago

I should have done a little more digging before my previous update. It looks more likely that sending a HUP is where the problem lies: kill -s HUP <process> will trigger the issue.
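When trying to reproduce this, it helps to check for the admin sockets before and after sending the signal. A minimal sketch; the function name missing_admin_sockets is hypothetical, and the socket file name pattern ceph-osd.<id>.asok is an assumption about the configured admin socket path, so adjust it if the cluster uses a different one:

```shell
# missing_admin_sockets DIR ID...
# Print each OSD id for which DIR/ceph-osd.<id>.asok does not exist.
# The file name pattern is an assumption; it is not quoted in this ticket.
missing_admin_sockets() {
    dir=$1
    shift
    for id in "$@"; do
        [ -e "$dir/ceph-osd.$id.asok" ] || echo "$id"
    done
}

# On a live host (not run here), before and after the HUP:
#   missing_admin_sockets /var/run/ceph 5 6 7 8 9 10
```

An id that is absent from the output before the HUP and present afterwards points at the signal as the trigger.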

Actions #10

Updated by Samuel Just over 10 years ago

  • Status changed from Need More Info to 12
  • Assignee set to Tamilarasi muthamizhan

Need to reproduce.

Actions #11

Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Need More Info
  • Priority changed from Normal to High

Tamil, can you try to reproduce this? (Look at the last two comments: sending HUP or issuing the reload seems to break things.)

Actions #12

Updated by Tamilarasi muthamizhan over 10 years ago

  • Status changed from Need More Info to Can't reproduce

I am not able to reproduce this issue on raring with the latest stable dumpling branch [v0.67.4].

Test setup tried: vpm018. It is still in the same state, if anyone is interested.

Actions #13

Updated by Samuel Just over 10 years ago

Hunter, what are the osd data directories named?

Actions #14

Updated by Hunter Nield over 10 years ago

We're using Chef to install our nodes (which uses ceph-disk tools to manage the disks) so:

/var/lib/ceph/osd/ceph-<id>/
Actions #15

Updated by Loïc Dachary almost 10 years ago

  • Subject changed from upstart does not reflect running ceph-osd daemons to upstart does not reflect running ceph-osd daemons (ubuntu 13.04 only)
  • Description updated (diff)
  • Status changed from Can't reproduce to Won't Fix
  • % Done changed from 0 to 100