Bug #6043
closed
upstart does not reflect running ceph-osd daemons (ubuntu 13.04 only)
Added by Zoltan Arnold Nagy over 10 years ago.
Updated almost 10 years ago.
Description
Workaround
Using restart instead of reload in /etc/logrotate.d/ceph makes logrotate restart the daemons rather than send them the signal that is supposed to reopen the log gracefully.
perl -pi -e 's/reload/restart/' /etc/logrotate.d/ceph
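Before editing the live file, it may help to preview what the substitution would change. The following is only a sketch: the function name preview_restart_swap is made up for illustration, and it uses sed with the same expression as the perl one-liner so that the file itself is left untouched.

```shell
# Hypothetical helper: show what the workaround's substitution would change
# in a logrotate file, without modifying the file itself.
preview_restart_swap() {
    file=$1
    # Same s/reload/restart/ expression as the workaround, written to stdout
    # and compared against the original. diff exits 1 when the inputs differ,
    # so "|| true" keeps the helper from reporting that as a failure.
    sed 's/reload/restart/' "$file" | diff -u "$file" - || true
}

# Usage (path taken from the workaround above):
# preview_restart_swap /etc/logrotate.d/ceph
```

Lines prefixed with - and + in the output are the ones the in-place edit would rewrite; no output means the file contains no reload to replace.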
Original description
Ubuntu 13.04, ceph from the ceph.com repository (0.67.1-1raring).
According to the documentation (http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), this should work.
root@zc2store:~# ps faux | grep ceph-osd
root 21656 0.0 0.0 9436 960 pts/2 S+ 12:52 0:00 \_ grep --color=auto ceph-osd
root 21475 0.0 0.0 4440 628 ? Ss 12:47 0:00 /bin/sh -e -c /usr/bin/ceph-osd --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh
root 21476 0.3 0.0 438156 24556 ? Sl 12:47 0:00 \_ /usr/bin/ceph-osd --cluster=ceph -i 0 -f
root@zc2store:~# /etc/init.d/ceph stop osd.0
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
root@zc2store:~# mount | grep ceph
/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
root@zc2store:~#
- Status changed from New to Rejected
stop ceph-osd id=0
or
stop ceph-osd-all
well...
root@signina:~# service ceph-osd id=11
ceph-osd: unrecognized service
root@signina:~# service ceph osd id=11
/etc/init.d/ceph: id=.11 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
root@signina:~# service ceph osd 11
/etc/init.d/ceph: 11. not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
but more importantly:
root@signina:~# ps faux | grep ceph | grep '\-i 11'
root 3047 0.3 0.0 707088 172644 ? Sl Aug17 10:31 /usr/bin/ceph-osd --cluster=ceph -i 11 -f
root@signina:~# service ceph osd.11
root@signina:~# ps faux | grep ceph | grep '\-i 11'
root 3047 0.3 0.0 707088 172644 ? Sl Aug17 10:31 /usr/bin/ceph-osd --cluster=ceph -i 11 -f
root@signina:~#
- Status changed from Rejected to In Progress
Is the ceph package still installed? Some older versions didn't stop the jobs before they were uninstalled, which might explain this.
initctl list | grep ceph
Getting somewhere, but it still can't find it.
root@signina:~# ps faux | grep ceph
root 3028 0.3 0.1 786888 267732 ? Ssl Aug17 11:54 /usr/bin/ceph-osd --cluster=ceph -i 6 -f
root 3032 0.3 0.0 699544 184632 ? Sl Aug17 10:35 /usr/bin/ceph-osd --cluster=ceph -i 8 -f
root 3039 0.3 0.1 765468 250896 ? Sl Aug17 12:10 /usr/bin/ceph-osd --cluster=ceph -i 10 -f
root 3043 0.3 0.0 653020 131000 ? Sl Aug17 10:54 /usr/bin/ceph-osd --cluster=ceph -i 7 -f
root 3049 0.3 0.1 718976 205100 ? Sl Aug17 10:42 /usr/bin/ceph-osd --cluster=ceph -i 9 -f
root 3078 0.3 0.1 767140 252344 ? Sl Aug17 13:03 /usr/bin/ceph-osd --cluster=ceph -i 5 -f
root 5076 0.0 0.0 9436 956 pts/4 S+ 22:02 0:00 \_ grep --color=auto ceph
root@signina:~# initctl list | grep ceph
ceph-mds-all-starter stop/waiting
ceph-mds-all start/running
ceph-osd-all start/running
ceph-osd-all-starter stop/waiting
ceph-all start/running
ceph-mon-all start/running
ceph-mon-all-starter stop/waiting
ceph-mon stop/waiting
ceph-create-keys stop/waiting
ceph-osd (ceph/6) start/running, process 3028
ceph-mds stop/waiting
root@signina:~# stop ceph-osd id=10
stop: Unknown instance: ceph/10
root@signina:~#
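The mismatch in the transcript above (six ceph-osd processes, one upstart instance) can be checked mechanically. This is a rough sketch that assumes the ps faux and initctl list output formats shown in this ticket. All three function names are hypothetical, and the input is read from saved files so nothing on the host is touched.

```shell
# Print the ids of ceph-osd instances that upstart reports as started.
# Reads `initctl list` output on stdin, e.g. "ceph-osd (ceph/6) start/running".
upstart_osd_ids() {
    sed -n 's|^ceph-osd[ ][ ]*([^ /]*/\([^ /)]*\))[ ][ ]*start/.*$|\1|p' \
        | LC_ALL=C sort -u
}

# Print the ids of ceph-osd processes found in `ps` output on stdin,
# based on the numeric "-i <id>" argument in the command line.
running_osd_ids() {
    sed -n 's|^.*/usr/bin/ceph-osd .* -i \([0-9][0-9]*\).*$|\1|p' \
        | LC_ALL=C sort -u
}

# Print ids that have a process but no started upstart instance.
# $1: file with `ps faux` output, $2: file with `initctl list` output.
untracked_osd_ids() {
    ps_ids=$(mktemp) up_ids=$(mktemp)
    running_osd_ids < "$1" > "$ps_ids"
    upstart_osd_ids < "$2" > "$up_ids"
    # comm -23 keeps lines that appear only in the first (process) list.
    LC_ALL=C comm -23 "$ps_ids" "$up_ids"
    rm -f "$ps_ids" "$up_ids"
}

# Usage (file names are placeholders):
# ps faux > /tmp/ps.txt; initctl list > /tmp/initctl.txt
# untracked_osd_ids /tmp/ps.txt /tmp/initctl.txt
```

The ids are sorted as strings under LC_ALL=C because comm requires both inputs in the same collation order, so 10 is listed before 5.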
- Subject changed from stopping an osd doesn't work on ubuntu to upstart does not reflect running ceph-osd daemons
- Status changed from In Progress to Need More Info
Is this still a problem? Unless we can figure out the sequence to reproduce this, I'm not sure what to do here. Upstart is generally quite reliable about tracking the running processes. Was there an upgrade involved?
I seem to be running into a similar issue on 13.04 with 0.67.2, and it was also happening with 0.61.7. It seems that the admin sockets (for OSDs and mons) get deleted at some point. I have not been able to determine when or how this happens. I'll keep investigating.
I've done some further investigation on this. It seems that logrotate is the culprit.
After executing the following lines in /etc/logrotate.d/ceph
initctl list \
| sed -n 's/^\(ceph-\(mon\|osd\|mds\)\+\)[ \t]\+(\([^ \/]\+\)\/\([^ \/]\+\))[ \t]\+start\/.*$/\1 cluster=\3 id=\4/p' \
| while read l; do
initctl reload -- $l 2>/dev/null || :
done
Then the admin sockets are missing from /var/run/ceph, and the parent processes for each mon/osd (e.g. /bin/sh -e -c /usr/bin/ceph-osd --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh) are missing.
If I run initctl reload -- ceph-osd cluster=ceph id=X for a healthy OSD, then I can trigger the issue.
I should have done a little more digging in my previous update, but it looks more likely that sending a HUP is where the problem lies: kill -s HUP <process> will trigger the issue.
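To spot the symptom described above after a reload or HUP, one could check which OSDs have lost their admin socket. This is a minimal sketch, assuming the sockets follow the default naming <cluster>-osd.<id>.asok under /var/run/ceph. The function name missing_admin_sockets is hypothetical.

```shell
# Hypothetical helper: print the OSD ids whose admin socket is absent.
# $1: socket directory, $2: cluster name, remaining args: OSD ids.
missing_admin_sockets() {
    dir=$1 cluster=$2
    shift 2
    for id in "$@"; do
        # Assumed default naming: <dir>/<cluster>-osd.<id>.asok
        # -e is used rather than -S so the check also works on plain test files.
        [ -e "$dir/$cluster-osd.$id.asok" ] || echo "$id"
    done
}

# Usage (ids are examples):
# missing_admin_sockets /var/run/ceph ceph 5 6 7
```

Running it before and after kill -s HUP on a healthy OSD would show whether that OSD's socket disappears.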
- Status changed from Need More Info to 12
- Assignee set to Tamilarasi muthamizhan
- Status changed from 12 to Need More Info
- Priority changed from Normal to High
Tamil, can you try to reproduce this? (Look at the last two comments: sending HUP or issuing the reload seems to break things.)
- Status changed from Need More Info to Can't reproduce
I am not able to reproduce this issue on raring with the latest stable dumpling branch (v0.67.4).
Test setup tried: vpm018. It is still in the same state, if anyone is interested.
Hunter, what are the osd data directories named?
We're using Chef to install our nodes (which uses ceph-disk tools to manage the disks) so:
/var/lib/ceph/osd/ceph-<id>/
- Subject changed from upstart does not reflect running ceph-osd daemons to upstart does not reflect running ceph-osd daemons (ubuntu 13.04 only)
- Description updated (diff)
- Status changed from Can't reproduce to Won't Fix
- % Done changed from 0 to 100