Bug #7188
Updated by Loïc Dachary almost 10 years ago
h3. Description
Steps to reproduce, using a fresh install of Ubuntu raring (the problem does not show on Ubuntu trusty) and firefly.
<pre>
root@raring:/etc/ceph# start ceph-mon id=raring
ceph-mon (ceph/raring) start/running, process 6488
root@raring:/etc/ceph# ps -fauwwx | grep ceph-
warning: bad ps syntax, perhaps a bogus '-'?
See http://gitorious.org/procps/procps/blobs/master/Documentation/FAQ
root 6506 0.0 0.0 6560 640 pts/1 S+ 16:17 0:00 \_ grep --color=auto ceph-
root 6488 0.0 0.0 4444 624 ? Ss 16:17 0:00 /bin/sh -e -c /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh
root 6489 1.0 0.8 134772 9116 ? Sl 16:17 0:00 \_ /usr/bin/ceph-mon --cluster=ceph -i raring -f
root@raring:/etc/ceph# reload ceph-mon id=raring
root@raring:/etc/ceph# ls -l /var/run/ceph
total 0
root@raring:/etc/ceph# status ceph-mon id=raring
status: Unknown instance: ceph/raring
</pre>
Note that if the intermediate "/bin/sh" process is not present, the problem does not show. Run *restart* until a "/bin/sh" process shows up as the parent of */usr/bin/ceph-mon*, for instance with the loop sketched below.
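Something along these lines can be used to reach that state (a rough sketch; the grep pattern and the one-second delay are assumptions):
<pre>
# hedged sketch: restart repeatedly until ps shows the intermediate /bin/sh
# wrapper as the parent of ceph-mon
while ! ps -ef | grep -q '[/]bin/sh -e -c /usr/bin/ceph-mon'; do
    restart ceph-mon id=raring
    sleep 1
done
</pre>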
*reload* kills the intermediate shell, notices that it is dead, and runs the *post-stop* script that removes the *asok* file. The *ceph-mon* process is then no longer managed by upstart and cannot be notified.
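For reference, the offending stanza presumably looks something like the following (an illustrative sketch only; the exact contents of the shipped job file and the asok path are assumptions):
<pre>
# illustrative excerpt of /etc/init/ceph-mon.conf (asok path is an assumption)
post-stop script
    # fires when upstart believes the job has stopped; after "reload" kills
    # the intermediate /bin/sh, it runs while the real ceph-mon is still alive
    rm -f /var/run/ceph/ceph-mon.$id.asok
end script
</pre>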
There does not seem to be a convenient workaround. Commenting out the removal of the asok file in */etc/init/ceph-mon.conf* makes it possible to reach the unmanaged ceph-mon, but it will keep writing to the old log because the reload initiated by logrotate will not reach it. Disabling logrotate will prevent losing the logs, but /var/log will grow indefinitely.
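A manual mitigation might be to deliver the signal ourselves, assuming ceph-mon reopens its log on SIGHUP (SIGHUP being what upstart's *reload* sends by default); an untested sketch:
<pre>
# send the orphaned ceph-mon the SIGHUP that the logrotate-initiated
# "reload" can no longer deliver, so it reopens its log after rotation
pkill -HUP -f '/usr/bin/ceph-mon --cluster=ceph -i raring'
</pre>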
h3. Initial report
My /var/run/ceph/*.asok files for the OSDs and the mon were mysteriously gone when I came back in the morning after running overnight. I noticed that they had disappeared at the same time as a log rotation, and sure enough calling "logrotate --force /etc/logrotate.d/ceph" leaves the services running with no socket files.
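The check is straightforward (a sketch of the commands implied above):
<pre>
ls -l /var/run/ceph/*.asok                # sockets present, daemons running
logrotate --force /etc/logrotate.d/ceph
ls -l /var/run/ceph/*.asok                # sockets gone, daemons still running
</pre>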
It looks like the part causing the problem is the "initctl reload ceph-mon id=xxx", which on my system leaves the original service PID running while trying to start a new one at the same time: logs show the new process failing to start with "failed to create new leveldb store" while the existing process continues to function. Presumably it's the new process that deletes the socket files in spite of failing to come up successfully.
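To see the duplicate start attempt (a hedged sketch, assuming the default log location):
<pre>
ps -ef | grep '[c]eph-mon'        # the original PID is still running
grep 'failed to create new leveldb store' /var/log/ceph/ceph-mon.*.log
</pre>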
Package in use is 0.72.2-1raring.
This is kind of severe for anyone using a monitoring system that relies on the socket files to see and talk to the Ceph processes. If we can show that this issue is limited to Ubuntu 13.04, then it is less of a big deal: I wouldn't be surprised if it's quite sensitive to the exact distro version in use.