Bug #279: OSD crash on empty pglog (closed)
Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Description
During a crushmap change, one of my OSDs crashed on an empty pglog:
Core was generated by `/usr/bin/cosd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
#1  0x00007fa111e215c0 in abort () from /lib/libc.so.6
#2  0x00007fa1126d28e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fa1126d0d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fa1126d0d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fa1126d0e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005b5e78 in ceph::__ceph_assert_fail (assertion=0x5f4f17 "0", file=<value optimized out>, line=2161, func=<value optimized out>) at common/assert.cc:30
#7  0x0000000000527494 in PG::read_log (this=0x168a590, store=<value optimized out>) at osd/PG.cc:2161
#8  0x000000000052a006 in PG::read_state (this=<value optimized out>, store=<value optimized out>) at osd/PG.cc:2369
#9  0x00000000004cd2df in OSD::load_pgs (this=<value optimized out>) at osd/OSD.cc:930
#10 0x00000000004d5188 in OSD::init (this=0x92e010) at osd/OSD.cc:481
#11 0x0000000000455a7b in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:245
(gdb)
The last log lines:
10.07.14_22:12:56.265713 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182
10.07.14_22:12:56.265742 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = 0
10.07.14_22:12:56.265753 7fa1132df720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
It turns out the pglog file has a size of zero:
root@node09:~# ls -al /srv/ceph/osd20/current/meta/pglog_1.390_0
-rw-r--r-- 1 root root 0 2010-07-14 21:30 /srv/ceph/osd20/current/meta/pglog_1.390_0
root@node09:~#
The system it runs on has 4 OSDs in total; this OSD has the smallest data partition:
root@node09:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  4.0G  9.2G  31% /
none                  742M  196K  742M   1% /dev
none                  748M     0  748M   0% /dev/shm
none                  748M   68K  748M   1% /var/run
none                  748M     0  748M   0% /var/lock
none                  748M     0  748M   0% /lib/init/rw
/dev/sda7              54G   52G  1.7G  97% /srv/ceph/osd20
/dev/sdb2             462G   48G  415G  11% /srv/ceph/osd21
/dev/sdc2             462G   60G  403G  13% /srv/ceph/osd22
/dev/sdd2             146G   45G  102G  31% /srv/ceph/osd23
root@node09:~#
As you can see, the empty file could not have been caused by a filled-up partition (although it does seem strange that the partition is so full).
I tried removing the specific pglog file, but the OSD crashed again:
10.07.14_22:33:32.765619 7f41f11eb720 filestore(/srv/ceph/osd20) read couldn't open /srv/ceph/osd20/current/meta/pglog_1.390_0 errno 2 No such file or directory
10.07.14_22:33:32.765630 7f41f11eb720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = -2
10.07.14_22:33:32.765641 7f41f11eb720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
At that time my cluster was in the following state:
root@node13:~# ceph -s
10.07.14_22:35:21.889717 7f164b03b710 monclient(hunting): found mon1
10.07.14_22:35:21.911791    pg v48016: 7952 pgs: 1092 active, 3593 active+clean, 44 crashed+replay, 2980 peering, 16 crashed+peering, 52 down+peering, 36 active+degraded, 81 active+clean+degraded, 46 degraded+peering, 3 crashed+degraded+peering, 9 down+degraded+peering; 446 GB data, 1509 GB used, 4820 GB / 6330 GB avail; 444438/1567592 degraded (28.352%)
10.07.14_22:35:21.932535   mds e40: 1/1/1 up {0=up:active}, 1 up:standby(laggy or crashed), 1 up:standby
10.07.14_22:35:21.932578   osd e1119: 30 osds: 29 up, 29 in
10.07.14_22:35:21.932728   log 10.07.14_22:02:05.236888 mon0 2001:16f8:10:2::c3c3:3f9b:6789/0 194 : [INF] osd3 2001:16f8:10:2::c3c3:8f6b:6806/28330 boot
10.07.14_22:35:21.932822   mon e1: 2 mons at 2001:16f8:10:2::c3c3:3f9b:6789/0 2001:16f8:10:2::c3c3:2e5c:6789/0
root@node13:~#
The replication level for both the data and metadata pools was 3.