Bug #7188: Admin socket files are lost on log rotation calling initctl reload (ubuntu 13.04 only) - Ceph - Ceph

Bug #7188

Updated by Loïc Dachary almost 10 years ago

h1. Description 

 Steps to reproduce on a fresh install of Ubuntu raring running firefly (the problem does not show on Ubuntu trusty):
 <pre> 
 root@raring:/etc/ceph# start ceph-mon id=raring 
 ceph-mon (ceph/raring) start/running, process 6488 
 root@raring:/etc/ceph# ps -fauwwx | grep ceph- 
 warning: bad ps syntax, perhaps a bogus '-'? 
 See http://gitorious.org/procps/procps/blobs/master/Documentation/FAQ 
 root        6506    0.0    0.0     6560     640 pts/1      S+     16:17     0:00                            \_ grep --color=auto ceph- 
 root        6488    0.0    0.0     4444     624 ?          Ss     16:17     0:00 /bin/sh -e -c /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh 
 root        6489    1.0    0.8 134772    9116 ?          Sl     16:17     0:00    \_ /usr/bin/ceph-mon --cluster=ceph -i raring -f 
 root@raring:/etc/ceph# reload ceph-mon id=raring 
 root@raring:/etc/ceph# ls -l /var/run/ceph 
 total 0 
 root@raring:/etc/ceph# status ceph-mon id=raring 
 status: Unknown instance: ceph/raring 
 </pre> 
 Note that the problem only shows when the intermediate "/bin/sh" process is present. Run *restart* until a "/bin/sh" process shows up as the parent of */usr/bin/ceph-mon*.

 *reload* kills the intermediate shell; upstart notices that it is dead and runs the *post-stop* script, which removes the *asok* file. The *ceph-mon* process keeps running but is no longer managed by upstart and cannot be notified.
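To tell whether a given host is in the state that triggers the problem, the process table can be scanned for a *ceph-mon* whose parent is the "/bin/sh" wrapper. The following is a minimal sketch, not part of the Ceph packaging: the function name @wrapped_ceph_mons@ is hypothetical, and it assumes the command lines look like the ones in the transcript above.

```shell
# Hypothetical helper (not shipped with Ceph): reads "PID PPID ARGS..." lines
# on stdin and prints the PID of every ceph-mon whose parent is a /bin/sh
# wrapper. The command-line patterns are assumptions taken from the transcript.
wrapped_ceph_mons() {
    awk '
        {
            pid = $1; ppid = $2
            $1 = ""; $2 = ""        # drop the two numeric columns
            sub(/^ +/, "")          # strip the blanks they leave behind
            args[pid] = $0
            parent[pid] = ppid
        }
        END {
            for (p in parent) {
                pp = parent[p]
                if (args[p] ~ /^\/usr\/bin\/ceph-mon / &&
                    (pp in args) && args[pp] ~ /^\/bin\/sh /)
                    print p
            }
        }'
}

# Possible usage on a live system:
#   ps -e -o pid= -o ppid= -o args= | wrapped_ceph_mons
```

Any PID printed belongs to a monitor that would be affected by *reload*; no output means the wrapper shell is absent and the problem should not show.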

 h1. Initial report 

 My /var/run/ceph/*.asok files for the OSDs and the mon were mysteriously gone when I came back in the morning after running overnight. I noticed that they had disappeared at the same time as a log rotation, and sure enough calling "logrotate --force /etc/logrotate.d/ceph" leaves services running with no socket files.

 It looks like the part that's causing the problem is the "initctl reload ceph-mon id=xxx", which on my system is leaving the original service PID running, and trying to start a new one at the same time: logs show the new process failing to start with "failed to create new leveldb store" while the existing process continues to function.    Presumably it's the new process which is deleting the socket files in spite of failing to come up successfully. 

 Package in use is 0.72.2-1raring. 

 This is kind of severe for anyone using a monitoring system that relies on the socket files to see and talk to the Ceph processes. If we can show that this issue is limited to ubuntu 13.04 then this is less of a big deal: I wouldn't be surprised if it's quite sensitive to the exact distro version in use.
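A monitoring setup could detect this condition by checking that the admin socket of a running daemon still exists. The sketch below is only an illustration: the function names @asok_path@ and @asok_present@ are hypothetical, and the path layout assumes the default admin socket naming (@/var/run/ceph/$cluster-$name.asok@), which may be overridden in ceph.conf.

```shell
# Hypothetical helpers (not part of Ceph). The path layout assumes the
# default admin socket naming; adjust it if ceph.conf overrides it.

# Print the expected admin socket path.
#   $1 = daemon type (mon, osd)   $2 = daemon id
#   $3 = cluster name (default: ceph)
#   $4 = run directory (default: /var/run/ceph)
asok_path() {
    printf '%s/%s-%s.%s.asok\n' "${4:-/var/run/ceph}" "${3:-ceph}" "$1" "$2"
}

# Succeed only if that path exists and is a socket.
asok_present() {
    [ -S "$(asok_path "$@")" ]
}

# Possible usage:
#   asok_present mon raring || echo "admin socket missing for mon.raring"
```

Running such a check right after "logrotate --force /etc/logrotate.d/ceph" would show whether a host is affected.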
