Bug #7188
Updated by Loïc Dachary almost 10 years ago
h3. Description

Steps to reproduce, using a fresh install of Ubuntu raring (the problem does not show on Ubuntu trusty) and firefly. The */etc/init/ceph-mon.conf* is as follows:

<pre>
description "Ceph MON"

start on ceph-mon
stop on runlevel [!2345] or stopping ceph-mon-all

respawn
respawn limit 5 30

limit nofile 16384 16384

pre-start script
    set -e
    rm -f /tmp/dead
    test -x /usr/bin/ceph-mon || { stop; exit 0; }
    test -d "/var/lib/ceph/mon/${cluster:-ceph}-$id" || { stop; exit 0; }
    install -d -m0755 /var/run/ceph
end script

instance ${cluster:-ceph}/$id
export cluster
export id

# this breaks oneiric
#usage "cluster = name of cluster (defaults to 'ceph'); id = monitor instance id"

exec /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f

post-stop script
    # Cleanup socket in case of segfault
    echo toto > /tmp/dead
    rm -f "/var/run/ceph/ceph-mon.$id.asok"
end script
</pre>

The line of interest is *exec /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f*, which is apparently sometimes exec'ed and sometimes run through a shell without exec.

<pre>
root@raring:/etc/ceph# start ceph-mon id=raring
ceph-mon (ceph/raring) start/running, process 6488
root@raring:/etc/ceph# ps fauwwx | grep ceph-
root      6488  0.0  0.0   4444   624 ?  Ss  16:17  0:00 /bin/sh -e -c /usr/bin/ceph-mon --cluster="${cluster:-ceph}" -i "$id" -f /bin/sh
root      6489  1.0  0.8 134772  9116 ?  Sl  16:17  0:00  \_ /usr/bin/ceph-mon --cluster=ceph -i raring -f
root@raring:/etc/ceph# reload ceph-mon id=raring
root@raring:/etc/ceph# ls -l /var/run/ceph
total 0
root@raring:/etc/ceph# status ceph-mon id=raring
status: Unknown instance: ceph/raring
root@raring:/etc/ceph# ps fauwwx | grep ceph-
root      6489  0.0  1.0 135796 10416 ?  Sl  16:17  0:00 /usr/bin/ceph-mon --cluster=ceph -i raring -f
</pre>

Note that if the "/bin/sh" is not present, the problem does not show. Run *restart* until the "/bin/sh" process parent of */usr/bin/ceph-mon* shows up.
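One plausible reason the intermediate shell only sometimes shows up is the shell's "exec optimization": for a lone simple command, dash and bash replace themselves with the command, so no */bin/sh* parent survives, while anything more complex keeps the shell resident as the parent. A minimal sketch, independent of Upstart and Ceph (the @sleep@ commands are stand-ins for the daemon):

```shell
#!/bin/sh
# Not the Ceph job itself: a sketch of the shell's exec optimization, which
# determines whether a /bin/sh parent remains above the real process.

# Lone simple command: dash and bash exec it, so the command takes over the
# shell's pid and no intermediate /bin/sh remains.
sh -c 'sleep 30' &
p1=$!
sleep 1
comm1=$(ps -o comm= -p "$p1" | awk '{print $1}')

# Compound command: the shell has to stay resident as the parent.
sh -c 'sleep 30; :' &
p2=$!
sleep 1
comm2=$(ps -o comm= -p "$p2" | awk '{print $1}')

echo "lone command pid runs as:     $comm1"
echo "compound command pid runs as: $comm2"

# best-effort cleanup
kill "$p1" 2>/dev/null || true
pkill -P "$p2" -x sleep 2>/dev/null || true
kill "$p2" 2>/dev/null || true
```

On Linux with dash or bash as */bin/sh*, the first line reports @sleep@ and the second @sh@: same command string, two different process trees, which matches the intermittent */bin/sh* parent seen in the *ps* output.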
*reload* kills the intermediate shell, notices that it died, and runs the *post-stop* script, which removes the *asok* file. The *ceph-mon* process is no longer managed by Upstart and cannot be notified.

There does not seem to be a convenient workaround. Commenting out the removal of the asok file in */etc/init/ceph-mon.conf* makes it possible to reach the unmanaged ceph-mon, but it will keep writing to the old log file because the reload initiated by logrotate never reaches it. Disabling logrotate would prevent losing the logs, but /var/log would then grow indefinitely.

h3. Initial report

My /var/run/ceph/*.asok files for the OSDs and mon were mysteriously gone when I came back in the morning after running overnight. I noticed that they had disappeared at the same time as a log rotation, and sure enough, calling "logrotate --force /etc/logrotate.d/ceph" leaves the services running with no socket files.

It looks like the part causing the problem is the "initctl reload ceph-mon id=xxx", which on my system leaves the original service PID running while trying to start a new one at the same time: the logs show the new process failing to start with "failed to create new leveldb store" while the existing process continues to function. Presumably it is the new process that deletes the socket files in spite of failing to come up successfully.

The package in use is 0.72.2-1raring.

This is kind of severe for anyone using a monitoring system that relies on the socket files to see and talk to the Ceph processes. If we can show that this issue is limited to Ubuntu 13.04 then it is less of a big deal: I wouldn't be surprised if it is quite sensitive to the exact distro version in use.
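The reload failure analyzed above can be reproduced in miniature without Upstart or Ceph. In this sketch the @sleep@ stands in for ceph-mon, the compound command forces an intermediate shell to stay resident, and the SIGHUP stands in for what Upstart's *reload* sends to the one pid it tracks:

```shell
#!/bin/sh
# Sketch: SIGHUP delivered to an intermediate shell kills the shell but not
# its child, leaving the real worker running unmanaged -- the same shape as
# "reload" hitting the /bin/sh parent of ceph-mon.

# The trailing ':' keeps the shell resident instead of letting it exec the
# single command, mimicking the "/bin/sh -e -c ..." parent in the ps output.
sh -c 'sleep 30; :' &
shell_pid=$!
sleep 1
worker_pid=$(pgrep -P "$shell_pid" -x sleep)  # the process doing the work

kill -HUP "$shell_pid"   # what "reload ceph-mon id=..." sends to the tracked pid
sleep 1

survived=no
kill -0 "$worker_pid" 2>/dev/null && survived=yes
echo "worker survived the reload: $survived"

kill "$worker_pid" 2>/dev/null || true
```

The worker survives the HUP because the signal is addressed to the shell's pid only, not to the process group: the shell dies, the worker is reparented to init, and whoever was tracking the shell's pid concludes the service stopped, exactly as Upstart does before running *post-stop*.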