Bug #7072
closedlogrotate not sending HUP signal
0%
Description
I have a cluster where the Ceph Monitors are actually Virtual Machines.
They have relatively small 20GB root drives. According to the ceph hardware requirements 20GB should be enough.
Sometimes I get errors reported in zabbix that there is less than 20% disk space remaining.
I log onto a monitor and check the health. The status shows HEALTH_WARN.
In the past I had some wonky hardware and it was always /var/log where the culprit was.
On the actual machine, disk usage is at 85%. Yet there are no significantly large files sitting around.
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda2 20511356 16420164 3026232 85% /
udev 2019768 4 2019764 1% /dev
tmpfs 817432 176 817256 1% /run
none 5120 0 5120 0% /run/lock
none 2043572 0 2043572 0% /run/shm
cd /var
sudo du -hsx * | sort -rh | head -10
195M lib
175M log
3.5M cache
740K backups
20K spool
4.0K tmp
4.0K opt
4.0K mail
4.0K local
0 run
When I restart ceph, the disk usage drops back down to a respectable level.
admin@ceph-mon2:/usr$ sudo service ceph-all restart
ceph-all stop/waiting
ceph-all start/running
admin@ceph-mon2:/usr$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda2 20511356 1096584 18349812 6% /
udev 2019768 4 2019764 1% /dev
tmpfs 817432 176 817256 1% /run
none 5120 0 5120 0% /run/lock
none 2043572 0 2043572 0% /run/shm
I've set my logrotate to be quite aggressive because of disk filling issues. But it would seem that perhaps the files are not being closed off properly?
Updated by Sage Weil over 10 years ago
Can you identify which files are not being closed by looking at the open files in /proc/`pidof ceph-mon`/fd and seeing which ones are shown as deleted?
Updated by Matt Hook over 10 years ago
It happened on another monitor and this time I was able to capture the output of
sudo lsof | grep ceph
Here I have a ~15GB (16003995057 bytes) file that has been deleted still held open by ceph-mon. This has been rotated by rotatelogs. It should have been closed and cleaned up.
ceph-mon 708 root 3w REG 202,2 16003995057 262236 /var/log/ceph/ceph-mon.ceph-mon1.log.1 (deleted)
The rotatelogs.d/ceph file contains
/var/log/ceph/*.log { rotate 7 daily size 500M compress sharedscripts postrotate if which invoke-rc.d > /dev/null 2>&1 && [ -x `which invoke-rc.d` ]; then invoke-rc.d ceph reload >/dev/null elif which service > /dev/null 2>&1 && [ -x `which service` ]; then service ceph reload >/dev/null fi # Possibly reload twice, but depending on ceph.conf the reload above may be a no-op if which initctl > /dev/null 2>&1 && [ -x `which initctl` ]; then # upstart reload isn't very helpful here: # https://bugs.launchpad.net/upstart/+bug/1012938 initctl list \ | sed -n 's/^\(ceph-\(mon\|osd\|mds\)\+\)[ \t]\+(\([^ \/]\+\)\/\([^ \/]\+\))[ \t]\+start\/.*$/\1 cluster=\3 id=\4/p' \ | while read l; do initctl reload -- $l 2>/dev/null || : done fi endscript missingok }
My crontab contains
30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
There are also not as many log files in /var/log/ceph as I would have been expecting. I expect 7, I'm only seeing 2 after weeks of running.
It seems that ceph-mon is not playing nice with logrotate. Does anyone else have this issue?
Updated by Sage Weil over 10 years ago
- Subject changed from Filesystem filling up by ceph. Restarting recovers space. to logrotate not sending HUP signal
- Status changed from New to Need More Info
- Priority changed from Normal to Urgent
what distribution is this?
Updated by Matt Hook over 10 years ago
Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise
logrotate version: 3.7.8
Kernel version: 3.11.4-031104-generic #201310081221 SMP Tue Oct 8 16:21:57 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
As this is a virtual machine, it was built with xen-create-image using the debootstrap process.
Updated by Robin Johnson over 10 years ago
I can confirm this on Ubuntu 13.04.
If you check with lsof, you can see that the radosgw + ceph-osd files are moved, but the HUP clearly never fired.
ceph-osd 4745 root 3w REG 252,0 422485 146457 /var/log/ceph/ceph-osd.2.log.1 (deleted) ceph-osd 4747 root 3w REG 252,0 468871 146456 /var/log/ceph/ceph-osd.1.log.1 (deleted) radosgw 7092 root 3w REG 252,0 39319309 146463 /var/log/ceph/radosgw.log.1 (deleted)
The problem is that initctl list
does not show the individual osds, or radosgw instances (which seperately needs to be added to the sed still).
root@ceph1:~# initctl list |grep -e ceph -e rados ceph-mds-all start/running ceph-osd-all start/running ceph-mds-all-starter stop/waiting radosgw stop/waiting ceph-osd-all-starter stop/waiting ceph-all start/running ceph-mon-all start/running radosgw-all-starter stop/waiting ceph-mon-all-starter stop/waiting ceph-mon (ceph/ceph1) start/running, process 3543 ceph-create-keys stop/waiting ceph-osd stop/waiting ceph-mds (ceph/ceph1) start/running, process 3534 radosgw-all start/running
Updated by Matt Hook over 10 years ago
Looking at the logrotate source code it would seem that logrotate depends on whatever is defined in postrotate to perform the HUP.
All the examples use something along the lines of
kill -HUP `cat /var/run/xxx.pid`
where xxx.pid is the PID file for the process. I didn't realise this.
So based on the ceph rotatelogs postrotate action and given that 12.04 invoke-d exists on my system it'll run:
invoke-rc.d ceph reload >/dev/null
Followed by:
initctl reload -- ceph-mon cluster=ceph id=ceph-mon3 2>/dev/null
I thought I'd do a manual test.
cd /var/log/ceph
sudo mv ceph.log ceph.log.test
Interestingly, ceph.log was recreated and being written to and the moved file ceph.log.test is no longer growing. So It would appear that perhaps ceph.log gets closed and appended to after each write.
I tried the same with ceph-mon.ceph-mon3.log. i.e. mv ceph-mon.ceph-mon3.log ceph-mon.ceph-mon3.log.test The moved file (.test) is continuing to grow.
I then manually ran initctl reload -- ceph-mon cluster=ceph id=ceph-mon3
And it created a new file ceph-mon.ceph-mon3. So it's clear that there is a problem with rotatelogs postrotate being kicked off or the postrotate script itself.
Updated by Sage Weil about 10 years ago
- Status changed from Need More Info to 12
Updated by Loïc Dachary about 10 years ago
- Status changed from 12 to Fix Under Review
Updated by Sage Weil about 10 years ago
- Status changed from Fix Under Review to Resolved