Bug #279

OSD crash on empty pglog

Added by Wido den Hollander almost 14 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During a crushmap change, one of my OSDs crashed on an empty pglog:

Core was generated by `/usr/bin/cosd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
#1  0x00007fa111e215c0 in abort () from /lib/libc.so.6
#2  0x00007fa1126d28e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fa1126d0d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fa1126d0d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fa1126d0e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005b5e78 in ceph::__ceph_assert_fail (assertion=0x5f4f17 "0", file=<value optimized out>, line=2161, 
    func=<value optimized out>) at common/assert.cc:30
#7  0x0000000000527494 in PG::read_log (this=0x168a590, store=<value optimized out>) at osd/PG.cc:2161
#8  0x000000000052a006 in PG::read_state (this=<value optimized out>, store=<value optimized out>) at osd/PG.cc:2369
#9  0x00000000004cd2df in OSD::load_pgs (this=<value optimized out>) at osd/OSD.cc:930
#10 0x00000000004d5188 in OSD::init (this=0x92e010) at osd/OSD.cc:481
#11 0x0000000000455a7b in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:245
(gdb)
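
For context, the top of the backtrace (__cxa_throw -> std::terminate -> abort) is simply what an uncaught assertion exception looks like: ceph::__ceph_assert_fail throws, nothing on the startup path catches it, and the C++ runtime terminates the process with SIGABRT. A minimal, self-contained sketch of that pattern (MY_ASSERT and failed_assertion are illustrative names, not the Ceph source):

#include <cstdlib>
#include <stdexcept>
#include <string>

// Illustrative stand-in for ceph::__ceph_assert_fail(): it throws instead of
// calling abort() directly, so a failed assert surfaces as a C++ exception.
[[noreturn]] static void failed_assertion(const char* expr,
                                          const char* file, int line) {
    throw std::runtime_error(std::string("assert failed: ") + expr +
                             " at " + file + ":" + std::to_string(line));
}

#define MY_ASSERT(cond) \
    do { if (!(cond)) failed_assertion(#cond, __FILE__, __LINE__); } while (0)

int main() {
    // Nothing catches the exception, so the runtime calls std::terminate(),
    // which calls abort() -- the same frames seen at the top of the backtrace.
    MY_ASSERT(0);
    return 0;
}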

The last log lines:

10.07.14_22:12:56.265713 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182
10.07.14_22:12:56.265742 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = 0
10.07.14_22:12:56.265753 7fa1132df720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
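
The check that fails looks like a plain length comparison: the PG's info records how many bytes of log should be on disk (the "expected 182-0=182" appears to be the end-of-log offset minus the start-of-log offset), but the read returns 0 bytes, so the assertion at PG.cc:2161 fires. A rough, self-contained sketch of that shape, with made-up names (read_whole_object, check_log) rather than the actual PG::read_log code:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read an entire on-disk object (here, just a file).
static std::vector<char> read_whole_object(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// The PG info carries the on-disk log bounds; the bytes actually read
// must match (end - start), otherwise the OSD asserts and aborts.
static bool check_log(const std::string& pglog_path,
                      uint64_t log_start,    // 0 in the log line above
                      uint64_t log_end) {    // 182 in the log line above
    std::vector<char> buf = read_whole_object(pglog_path);
    uint64_t expected = log_end - log_start;   // 182 - 0 = 182
    uint64_t got = buf.size();                 // 0: the file is empty
    if (got != expected) {
        std::cerr << "read_log got " << got << " bytes, expected "
                  << log_end << "-" << log_start << "=" << expected << "\n";
        return false;   // the real code asserts here and the OSD aborts
    }
    return true;
}

int main() {
    // Against the zero-length pglog this prints the same mismatch as osd20 did.
    check_log("pglog_1.390_0", 0, 182);
    return 0;
}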

It turns out the pglog has a file size of zero:

root@node09:~# ls -al /srv/ceph/osd20/current/meta/pglog_1.390_0
-rw-r--r-- 1 root root 0 2010-07-14 21:30 /srv/ceph/osd20/current/meta/pglog_1.390_0
root@node09:~#
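
Since one zero-length pglog turned up, it may be worth checking whether any other pglog objects in this OSD's meta directory are empty as well. A small diagnostic sketch (C++17 <filesystem>; the directory path is the one from this report, everything else is illustrative):

#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// List every zero-length pglog_* object under the OSD's meta directory,
// so any other damaged PG logs on this OSD show up before the next restart.
int main() {
    const fs::path meta = "/srv/ceph/osd20/current/meta";
    for (const auto& entry : fs::directory_iterator(meta)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind("pglog_", 0) == 0 && entry.is_regular_file() &&
            entry.file_size() == 0) {
            std::cout << "empty pglog: " << entry.path() << "\n";
        }
    }
    return 0;
}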

The system it runs on has 4 OSDs in total; this OSD has the smallest data partition:

root@node09:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  4.0G  9.2G  31% /
none                  742M  196K  742M   1% /dev
none                  748M     0  748M   0% /dev/shm
none                  748M   68K  748M   1% /var/run
none                  748M     0  748M   0% /var/lock
none                  748M     0  748M   0% /lib/init/rw
/dev/sda7              54G   52G  1.7G  97% /srv/ceph/osd20
/dev/sdb2             462G   48G  415G  11% /srv/ceph/osd21
/dev/sdc2             462G   60G  403G  13% /srv/ceph/osd22
/dev/sdd2             146G   45G  102G  31% /srv/ceph/osd23
root@node09:~#

As you can see, the empty file can't have been caused by a full partition (although it does seem strange that it's this full).

I tried removing that specific pglog, but the OSD simply crashed again:

10.07.14_22:33:32.765619 7f41f11eb720 filestore(/srv/ceph/osd20) read couldn't open /srv/ceph/osd20/current/meta/pglog_1.390_0 errno 2 No such file or directory
10.07.14_22:33:32.765630 7f41f11eb720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = -2
10.07.14_22:33:32.765641 7f41f11eb720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182

At that time my cluster was in the following state:

root@node13:~# ceph -s
10.07.14_22:35:21.889717 7f164b03b710 monclient(hunting): found mon1
10.07.14_22:35:21.911791    pg v48016: 7952 pgs: 1092 active, 3593 active+clean, 44 crashed+replay, 2980 peering, 16 crashed+peering, 52 down+peering, 36 active+degraded, 81 active+clean+degraded, 46 degraded+peering, 3 crashed+degraded+peering, 9 down+degraded+peering; 446 GB data, 1509 GB used, 4820 GB / 6330 GB avail; 444438/1567592 degraded (28.352%)
10.07.14_22:35:21.932535   mds e40: 1/1/1 up {0=up:active}, 1 up:standby(laggy or crashed), 1 up:standby
10.07.14_22:35:21.932578   osd e1119: 30 osds: 29 up, 29 in
10.07.14_22:35:21.932728   log 10.07.14_22:02:05.236888 mon0 2001:16f8:10:2::c3c3:3f9b:6789/0 194 : [INF] osd3 2001:16f8:10:2::c3c3:8f6b:6806/28330 boot
10.07.14_22:35:21.932822   mon e1: 2 mons at 2001:16f8:10:2::c3c3:3f9b:6789/0 2001:16f8:10:2::c3c3:2e5c:6789/0
root@node13:~#

The replication level for both the data and metadata pools was 3.


Files

core.node09.14674.gz (1.59 MB) - Core dump from osd20 - Wido den Hollander, 07/14/2010 01:37 PM
osd.20.log.gz (996 KB) - Logfile from osd20 - Wido den Hollander, 07/14/2010 01:37 PM