Bug #7072

closed

logrotate not sending HUP signal

Added by Matt Hook over 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have a cluster where the Ceph monitors are actually virtual machines.
They have relatively small 20 GB root drives. According to the Ceph hardware requirements, 20 GB should be enough.

Sometimes Zabbix reports that there is less than 20% disk space remaining.
I log onto a monitor and check the health; the status shows HEALTH_WARN.
In the past I had some wonky hardware, and the culprit was always /var/log.

On the actual machine, disk usage is at 85%, yet there are no significantly large files sitting around.

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda2 20511356 16420164 3026232 85% /
udev 2019768 4 2019764 1% /dev
tmpfs 817432 176 817256 1% /run
none 5120 0 5120 0% /run/lock
none 2043572 0 2043572 0% /run/shm

cd /var
sudo du -hsx * | sort -rh | head -10
195M lib
175M log
3.5M cache
740K backups
20K spool
4.0K tmp
4.0K opt
4.0K mail
4.0K local
0 run

When I restart ceph, the disk usage drops back down to a respectable level.

admin@ceph-mon2:/usr$ sudo service ceph-all restart
ceph-all stop/waiting
ceph-all start/running
admin@ceph-mon2:/usr$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda2 20511356 1096584 18349812 6% /
udev 2019768 4 2019764 1% /dev
tmpfs 817432 176 817256 1% /run
none 5120 0 5120 0% /run/lock
none 2043572 0 2043572 0% /run/shm

I've set my logrotate to be quite aggressive because of these disk-filling issues, but it would seem that the rotated files are not being closed properly.

Actions #1

Updated by Sage Weil over 10 years ago

Can you identify which files are not being closed by looking at the open files in /proc/`pidof ceph-mon`/fd and seeing which ones are shown as deleted?
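
For example, something along these lines should show any log file descriptors that are still open but deleted (assuming a single ceph-mon process on the host):

ls -l /proc/$(pidof ceph-mon)/fd | grep '(deleted)'
# or, for every daemon on the box, list open files whose link count is zero:
sudo lsof +L1 | grep /var/log/ceph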

Actions #2

Updated by Matt Hook over 10 years ago

It happened on another monitor and this time I was able to capture the output of

sudo lsof | grep ceph

Here I have a ~15 GB (16003995057 bytes) file that has been deleted but is still held open by ceph-mon. It has been rotated by logrotate, so it should have been closed and cleaned up.

ceph-mon    708      root    3w      REG              202,2 16003995057     262236 /var/log/ceph/ceph-mon.ceph-mon1.log.1 (deleted)

The /etc/logrotate.d/ceph file contains

/var/log/ceph/*.log {
    rotate 7
    daily
    size 500M
    compress
    sharedscripts
    postrotate
        if which invoke-rc.d > /dev/null 2>&1 && [ -x `which invoke-rc.d` ]; then
            invoke-rc.d ceph reload >/dev/null
        elif which service > /dev/null 2>&1 && [ -x `which service` ]; then
            service ceph reload >/dev/null
        fi
        # Possibly reload twice, but depending on ceph.conf the reload above may be a no-op
        if which initctl > /dev/null 2>&1 && [ -x `which initctl` ]; then
            # upstart reload isn't very helpful here:
            #   https://bugs.launchpad.net/upstart/+bug/1012938
            initctl list \
            | sed -n 's/^\(ceph-\(mon\|osd\|mds\)\+\)[ \t]\+(\([^ \/]\+\)\/\([^ \/]\+\))[ \t]\+start\/.*$/\1 cluster=\3 id=\4/p' \
            | while read l; do
                initctl reload -- $l 2>/dev/null || :
            done
        fi
    endscript
    missingok
}
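
As an aside, a copytruncate-based stanza would sidestep the need for the daemon to reopen its log at all, at the cost of a small window where lines written during the copy can be lost. This is only a sketch of a possible workaround, not what I'm currently running:

/var/log/ceph/*.log {
    rotate 7
    daily
    size 500M
    compress
    copytruncate
    missingok
}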

My crontab contains

30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
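
To take cron out of the picture, the same rotation can be forced by hand with verbose output, which should show whether the postrotate script is even being run:

sudo /usr/sbin/logrotate -f -v /etc/logrotate.d/ceph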

There are also not as many log files in /var/log/ceph as I would have been expecting: I expect 7, but I'm only seeing 2 after weeks of running.
It seems that ceph-mon is not playing nicely with logrotate. Does anyone else have this issue?

Actions #3

Updated by Sage Weil over 10 years ago

  • Subject changed from Filesystem filling up by ceph. Restarting recovers space. to logrotate not sending HUP signal
  • Status changed from New to Need More Info
  • Priority changed from Normal to Urgent

What distribution is this?

Actions #4

Updated by Matt Hook over 10 years ago

Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise

logrotate version: 3.7.8
Kernel version: 3.11.4-031104-generic #201310081221 SMP Tue Oct 8 16:21:57 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

As this is a virtual machine, it was built with xen-create-image using the debootstrap process.

Actions #5

Updated by Robin Johnson over 10 years ago

I can confirm this on Ubuntu 13.04.

If you check with lsof, you can see that the radosgw and ceph-osd log files have been moved, but the HUP clearly never fired.

ceph-osd   4745                  root    3w      REG              252,0      422485     146457 /var/log/ceph/ceph-osd.2.log.1 (deleted)
ceph-osd   4747                  root    3w      REG              252,0      468871     146456 /var/log/ceph/ceph-osd.1.log.1 (deleted)
radosgw    7092                  root    3w      REG              252,0    39319309     146463 /var/log/ceph/radosgw.log.1 (deleted)

The problem is that initctl list does not show the individual OSDs or the radosgw instances (radosgw separately still needs to be added to the sed as well); see the check after the listing below.

root@ceph1:~# initctl list |grep -e ceph -e rados
ceph-mds-all start/running
ceph-osd-all start/running
ceph-mds-all-starter stop/waiting
radosgw stop/waiting
ceph-osd-all-starter stop/waiting
ceph-all start/running
ceph-mon-all start/running
radosgw-all-starter stop/waiting
ceph-mon-all-starter stop/waiting
ceph-mon (ceph/ceph1) start/running, process 3543
ceph-create-keys stop/waiting
ceph-osd stop/waiting
ceph-mds (ceph/ceph1) start/running, process 3534
radosgw-all start/running
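
For what it's worth, running that listing through the sed expression from the postrotate script should, if I'm reading the pattern correctly, emit only the mon and mds instance jobs, so the OSD daemons and radosgw never get a reload:

initctl list \
| sed -n 's/^\(ceph-\(mon\|osd\|mds\)\+\)[ \t]\+(\([^ \/]\+\)\/\([^ \/]\+\))[ \t]\+start\/.*$/\1 cluster=\3 id=\4/p'
# expected output on the host above (nothing for ceph-osd or radosgw):
# ceph-mon cluster=ceph id=ceph1
# ceph-mds cluster=ceph id=ceph1
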
Actions #6

Updated by Matt Hook over 10 years ago

Looking at the logrotate source code, it would seem that logrotate depends on whatever is defined in postrotate to perform the HUP.
All the examples use something along the lines of

kill -HUP `cat /var/run/xxx.pid`

where xxx.pid is the PID file for the process. I didn't realise this.

So based on the ceph logrotate postrotate action, and given that invoke-rc.d exists on my 12.04 system, it'll run:

invoke-rc.d ceph reload >/dev/null

Followed by:

initctl reload -- ceph-mon cluster=ceph id=ceph-mon3 2>/dev/null

I thought I'd do a manual test.

cd /var/log/ceph
sudo mv ceph.log ceph.log.test

Interestingly, ceph.log was recreated and is being written to, and the moved file ceph.log.test is no longer growing. So it would appear that ceph.log gets opened, appended to, and closed on each write.

I tried the same with ceph-mon.ceph-mon3.log, i.e. mv ceph-mon.ceph-mon3.log ceph-mon.ceph-mon3.log.test. The moved file (.test) continues to grow.

I then manually ran
initctl reload -- ceph-mon cluster=ceph id=ceph-mon3

And it created a new ceph-mon.ceph-mon3.log. So it's clear that there is a problem either with the logrotate postrotate being kicked off at all or with the postrotate script itself.
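
For reference, as far as I understand, upstart's reload just delivers a SIGHUP to the job's main process, so (assuming a single ceph-mon on the host) the same effect can be had by signalling the daemon directly and then checking that nothing is still held open as deleted:

# send SIGHUP straight to the monitor so it reopens its log file
sudo kill -HUP $(pidof ceph-mon)
# confirm the old, rotated file has actually been released
sudo ls -l /proc/$(pidof ceph-mon)/fd | grep '(deleted)'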

Actions #7

Updated by Sage Weil about 10 years ago

  • Status changed from Need More Info to 12

Actions #8

Updated by Loïc Dachary about 10 years ago

  • Assignee set to Loïc Dachary

Actions #9

Updated by Loïc Dachary about 10 years ago

  • Status changed from 12 to Fix Under Review

Actions #10

Updated by Sage Weil about 10 years ago

  • Status changed from Fix Under Review to Resolved