Bug #7188
Updated by Loïc Dachary almost 10 years ago
h3. Description

Steps to reproduce, using a fresh install of Ubuntu raring (the problem does not show on Ubuntu trusty) and firefly. The */etc/init/ceph-mon.conf* is as follows:

<pre>
description "Ceph MON"

start on ceph-mon
stop on runlevel [!2345] or stopping ceph-mon-all

respawn
respawn limit 5 30

limit nofile 16384 16384

pre-start script
    set -e
    rm -f /tmp/dead
    test -x /usr/bin/ceph-mon || { stop; exit 0; }
    test -d "/var/lib/ceph/mon/${cluster:-ceph}-$id" || { stop; exit 0; }
    install -d -m0755 /var/run/ceph
end script

instance ${cluster:-ceph}/$id
export cluster
export id

# this breaks oneiric
#usage "cluster = name of cluster (defaults to 'ceph'); id = monitor instance id"

exec /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f

post-stop script
    # Cleanup socket in case of segfault
    echo toto > /tmp/dead
    rm -f "/var/run/ceph/ceph-mon.$id.asok"
end script
</pre>

The line of interest is *exec /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f*, which is apparently sometimes exec'ed and sometimes run through a shell without exec.

<pre>
root@raring:/etc/ceph# start ceph-mon id=raring
ceph-mon (ceph/raring) start/running, process 6488
root@raring:/etc/ceph# ps fauwwx | grep ceph-
root      6488  0.0  0.0   4444   624 ?  Ss  16:17  0:00 /bin/sh -e -c /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh
root      6489  1.0  0.8 134772  9116 ?  Sl  16:17  0:00  \_ /usr/bin/ceph-mon --cluster=ceph -i raring -f
root@raring:/etc/ceph# reload ceph-mon id=raring
root@raring:/etc/ceph# ls -l /var/run/ceph
total 0
root@raring:/etc/ceph# status ceph-mon id=raring
status: Unknown instance: ceph/raring
root@raring:/etc/ceph# ps fauwwx | grep ceph-
root      6489  0.0  1.0 135796 10416 ?  Sl  16:17  0:00 /usr/bin/ceph-mon --cluster=ceph -i raring -f
</pre>

Note that if the "/bin/sh" is not present, the problem does not show. Run *restart* until the "/bin/sh" process parent of */usr/bin/ceph-mon* shows up.
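One plausible reason the intermediate shell only sometimes shows up is the shell's "exec optimization": for a lone simple command, dash and bash replace themselves with the command, so no */bin/sh* parent survives, while anything more complex keeps the shell resident as the parent. A minimal sketch, independent of Upstart and Ceph (the @sleep@ commands are stand-ins for the daemon):

```shell
#!/bin/sh
# Not the Ceph job itself: a sketch of the shell's exec optimization, which
# determines whether a /bin/sh parent remains above the real process.

# Lone simple command: dash and bash exec it, so the command takes over the
# shell's pid and no intermediate /bin/sh remains.
sh -c 'sleep 30' &
p1=$!
sleep 1
comm1=$(ps -o comm= -p "$p1" | awk '{print $1}')

# Compound command: the shell has to stay resident as the parent.
sh -c 'sleep 30; :' &
p2=$!
sleep 1
comm2=$(ps -o comm= -p "$p2" | awk '{print $1}')

echo "lone command pid runs as:     $comm1"
echo "compound command pid runs as: $comm2"

# best-effort cleanup
kill "$p1" 2>/dev/null || true
pkill -P "$p2" -x sleep 2>/dev/null || true
kill "$p2" 2>/dev/null || true
```

On Linux with dash or bash as */bin/sh*, the first line reports @sleep@ and the second @sh@: same command string, two different process trees, which matches the intermittent */bin/sh* parent seen in the *ps* output.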
*reload* kills the intermediate shell, notices that it died, and runs the *post-stop* script, which removes the *asok* file. The *ceph-mon* process is no longer managed by Upstart and cannot be notified.

There does not seem to be a convenient workaround. Commenting out the removal of the asok file in */etc/init/ceph-mon.conf* makes it possible to reach the unmanaged ceph-mon, but it will keep writing to the old log file because the reload initiated by logrotate never reaches it. Disabling logrotate would prevent losing the logs, but /var/log would then grow indefinitely.

h3. Initial report

My /var/run/ceph/*.asok files for the OSDs and mon were mysteriously gone when I came back in the morning after running overnight. I noticed that they had disappeared at the same time as a log rotation, and sure enough, calling "logrotate --force /etc/logrotate.d/ceph" leaves the services running with no socket files.

It looks like the part causing the problem is the "initctl reload ceph-mon id=xxx", which on my system leaves the original service PID running while trying to start a new one at the same time: the logs show the new process failing to start with "failed to create new leveldb store" while the existing process continues to function. Presumably it is the new process that deletes the socket files in spite of failing to come up successfully.

The package in use is 0.72.2-1raring.

This is kind of severe for anyone using a monitoring system that relies on the socket files to see and talk to the Ceph processes. If we can show that this issue is limited to Ubuntu 13.04 then it is less of a big deal: I wouldn't be surprised if it is quite sensitive to the exact distro version in use.
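The reload failure analyzed above can be reproduced in miniature without Upstart or Ceph. In this sketch the @sleep@ stands in for ceph-mon, the compound command forces an intermediate shell to stay resident, and the SIGHUP stands in for what Upstart's *reload* sends to the one pid it tracks:

```shell
#!/bin/sh
# Sketch: SIGHUP delivered to an intermediate shell kills the shell but not
# its child, leaving the real worker running unmanaged -- the same shape as
# "reload" hitting the /bin/sh parent of ceph-mon.

# The trailing ':' keeps the shell resident instead of letting it exec the
# single command, mimicking the "/bin/sh -e -c ..." parent in the ps output.
sh -c 'sleep 30; :' &
shell_pid=$!
sleep 1
worker_pid=$(pgrep -P "$shell_pid" -x sleep)  # the process doing the work

kill -HUP "$shell_pid"   # what "reload ceph-mon id=..." sends to the tracked pid
sleep 1

survived=no
kill -0 "$worker_pid" 2>/dev/null && survived=yes
echo "worker survived the reload: $survived"

kill "$worker_pid" 2>/dev/null || true
```

The worker survives the HUP because the signal is addressed to the shell's pid only, not to the process group: the shell dies, the worker is reparented to init, and whoever was tracking the shell's pid concludes the service stopped, exactly as Upstart does before running *post-stop*.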