Bug #279 (closed): OSD crash on empty pglog

Added by Wido den Hollander almost 14 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During a crushmap change, one of my OSDs crashed on an empty pglog:

Core was generated by `/usr/bin/cosd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
#1  0x00007fa111e215c0 in abort () from /lib/libc.so.6
#2  0x00007fa1126d28e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fa1126d0d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fa1126d0d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fa1126d0e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005b5e78 in ceph::__ceph_assert_fail (assertion=0x5f4f17 "0", file=<value optimized out>, line=2161, 
    func=<value optimized out>) at common/assert.cc:30
#7  0x0000000000527494 in PG::read_log (this=0x168a590, store=<value optimized out>) at osd/PG.cc:2161
#8  0x000000000052a006 in PG::read_state (this=<value optimized out>, store=<value optimized out>) at osd/PG.cc:2369
#9  0x00000000004cd2df in OSD::load_pgs (this=<value optimized out>) at osd/OSD.cc:930
#10 0x00000000004d5188 in OSD::init (this=0x92e010) at osd/OSD.cc:481
#11 0x0000000000455a7b in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:245
(gdb)

The last log lines:

10.07.14_22:12:56.265713 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182
10.07.14_22:12:56.265742 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = 0
10.07.14_22:12:56.265753 7fa1132df720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
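
For context, that last message comes from a bounds check in PG::read_log: the PG expects a certain number of log bytes on disk, the read returns fewer (here zero), and the OSD aborts via assert(0) at osd/PG.cc:2161 (frame #7 in the backtrace above). A minimal sketch of that check, with assumed variable names (illustrative only, not the actual Ceph code):

#include <cassert>
#include <cstdint>
#include <cstdio>

// Illustrative sketch only, not the actual Ceph code; variable names are
// assumptions. It mirrors the check that produced the message above.
int main() {
  uint64_t log_tail = 0;     // on-disk offset of the oldest expected log entry
  uint64_t log_head = 182;   // on-disk offset just past the newest entry
  uint64_t bytes_read = 0;   // the pglog file is empty (or missing), so 0

  uint64_t expected = log_head - log_tail;   // 182 - 0 = 182
  if (bytes_read != expected) {
    std::fprintf(stderr, "read_log got %llu bytes, expected %llu-%llu=%llu\n",
                 (unsigned long long)bytes_read, (unsigned long long)log_head,
                 (unsigned long long)log_tail, (unsigned long long)expected);
    assert(0);  // same effect as the FAILED assert(0) at osd/PG.cc:2161
  }
  return 0;
}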

It turns out the pglog has a file size of zero:

root@node09:~# ls -al /srv/ceph/osd20/current/meta/pglog_1.390_0
-rw-r--r-- 1 root root 0 2010-07-14 21:30 /srv/ceph/osd20/current/meta/pglog_1.390_0
root@node09:~#

The system it is running on has 4 OSDs in total; this OSD has the smallest data partition:

root@node09:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  4.0G  9.2G  31% /
none                  742M  196K  742M   1% /dev
none                  748M     0  748M   0% /dev/shm
none                  748M   68K  748M   1% /var/run
none                  748M     0  748M   0% /var/lock
none                  748M     0  748M   0% /lib/init/rw
/dev/sda7              54G   52G  1.7G  97% /srv/ceph/osd20
/dev/sdb2             462G   48G  415G  11% /srv/ceph/osd21
/dev/sdc2             462G   60G  403G  13% /srv/ceph/osd22
/dev/sdd2             146G   45G  102G  31% /srv/ceph/osd23
root@node09:~#

As you can see, the empty file could not have been caused by a full partition (although it does seem strange that the partition is that full).

I tried removing that specific pglog, but that made the OSD crash again:

10.07.14_22:33:32.765619 7f41f11eb720 filestore(/srv/ceph/osd20) read couldn't open /srv/ceph/osd20/current/meta/pglog_1.390_0 errno 2 No such file or directory
10.07.14_22:33:32.765630 7f41f11eb720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = -2
10.07.14_22:33:32.765641 7f41f11eb720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182

At that time my cluster was in the following state:

root@node13:~# ceph -s
10.07.14_22:35:21.889717 7f164b03b710 monclient(hunting): found mon1
10.07.14_22:35:21.911791    pg v48016: 7952 pgs: 1092 active, 3593 active+clean, 44 crashed+replay, 2980 peering, 16 crashed+peering, 52 down+peering, 36 active+degraded, 81 active+clean+degraded, 46 degraded+peering, 3 crashed+degraded+peering, 9 down+degraded+peering; 446 GB data, 1509 GB used, 4820 GB / 6330 GB avail; 444438/1567592 degraded (28.352%)
10.07.14_22:35:21.932535   mds e40: 1/1/1 up {0=up:active}, 1 up:standby(laggy or crashed), 1 up:standby
10.07.14_22:35:21.932578   osd e1119: 30 osds: 29 up, 29 in
10.07.14_22:35:21.932728   log 10.07.14_22:02:05.236888 mon0 2001:16f8:10:2::c3c3:3f9b:6789/0 194 : [INF] osd3 2001:16f8:10:2::c3c3:8f6b:6806/28330 boot
10.07.14_22:35:21.932822   mon e1: 2 mons at 2001:16f8:10:2::c3c3:3f9b:6789/0 2001:16f8:10:2::c3c3:2e5c:6789/0
root@node13:~#

The replication level for both the data and metadata pools was 3.


Files

core.node09.14674.gz (1.59 MB): Core dump from osd20 (Wido den Hollander, 07/14/2010 01:37 PM)
osd.20.log.gz (996 KB): Logfile from osd20 (Wido den Hollander, 07/14/2010 01:37 PM)

#1 Updated by Sage Weil almost 14 years ago

The real question is how the pglog got to be 0 in the first place. You don't have an older osd.20.log, do you?

(There was a logging bug I just fixed where it was overwriting instead of appending, so you probably don't... :( )

#2 Updated by Wido den Hollander almost 14 years ago

Indeed I don't (no older log), but I've checked, and there are multiple pglog_* files with a file size of 0 on osd20 and on other OSDs.

For example:

root@node02:/srv/ceph/osd7/current/meta# ls -al pglog*|head -n 10
-rw-r--r-- 1 root root      0 2010-07-15 15:48 pglog_0.0p7_0
-rw-r--r-- 1 root root      0 2010-07-14 21:20 pglog_0.1_0
-rw-r--r-- 1 root root      0 2010-07-14 23:16 pglog_0.101_0
-rw-r--r-- 1 root root      0 2010-07-14 23:46 pglog_0.106_0
-rw-r--r-- 1 root root  24388 2010-07-14 16:25 pglog_0.10c_0
-rw-r--r-- 1 root root      0 2010-07-14 23:31 pglog_0.10f_0
-rw-r--r-- 1 root root      0 2010-07-14 23:40 pglog_0.113_0
-rw-r--r-- 1 root root  26754 2010-07-14 16:40 pglog_0.11d_0
-rw-r--r-- 1 root root      0 2010-07-14 23:15 pglog_0.123_0
-rw-r--r-- 1 root root      0 2010-07-14 23:40 pglog_0.127_0
root@node02:/srv/ceph/osd7/current/meta#
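
A small, hedged helper that does the same scan programmatically, listing every zero-size pglog_* object in an OSD's meta directory (illustrative only; the path below is just this node's example):

#include <filesystem>
#include <iostream>
#include <string>

// Hedged helper sketch: list every zero-size pglog_* file in an OSD's meta
// directory. Adjust the path per OSD.
namespace fs = std::filesystem;

int main() {
  const fs::path meta = "/srv/ceph/osd7/current/meta";
  for (const auto& entry : fs::directory_iterator(meta)) {
    const std::string name = entry.path().filename().string();
    if (name.rfind("pglog_", 0) == 0 &&   // name starts with "pglog_"
        entry.is_regular_file() && entry.file_size() == 0) {
      std::cout << entry.path() << '\n';  // candidate for the crash above
    }
  }
  return 0;
}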

#3 Updated by Wido den Hollander almost 14 years ago

I'm seeing this crash on another OSD (osd5) right now.

root@logger:/srv/ceph/issues/297_osd_crash_empty_pglog# tail -n 15 osd.5.log 
10.07.16_11:54:43.545931 7f28f3649720 filestore(/srv/ceph/osd5) collection_getattr /srv/ceph/osd5/current/0.594_head 'ondisklog' = 17
10.07.16_11:54:43.545937 7f28f3649720 osd5 7716 pg[0.594( v 255'277 (250'275,255'277]+backlog n=272 ec=2 les=234 1630/1630/1630) [] r=0 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0 inactive] read_log 24934~273
10.07.16_11:54:43.545949 7f28f3649720 filestore(/srv/ceph/osd5) read /srv/ceph/osd5/current/meta/pglog_0.594_0 24934~273
10.07.16_11:54:43.545984 7f28f3649720 filestore(/srv/ceph/osd5) read couldn't open /srv/ceph/osd5/current/meta/pglog_0.594_0 errno 2 No such file or directory
10.07.16_11:54:43.545990 7f28f3649720 filestore(/srv/ceph/osd5) read /srv/ceph/osd5/current/meta/pglog_0.594_0 24934~273 = -2
10.07.16_11:54:43.545996 7f28f3649720 osd5 7716 pg[0.594( v 255'277 (250'275,255'277]+backlog n=272 ec=2 les=234 1630/1630/1630) [] r=0 (info mismatch, log(250'275,0'0]+backlog) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 25207-24934=273
osd/PG.cc: In function 'void PG::read_log(ObjectStore*)':
osd/PG.cc:2161: FAILED assert(0)
 1: (PG::read_state(ObjectStore*)+0x846) [0x5299a6]
 2: (OSD::load_pgs()+0x12f) [0x4ce70f]
 3: (OSD::init()+0x346) [0x4d3df6]
 4: (main()+0x119b) [0x4559ab]
 5: (__libc_start_main()+0xfd) [0x7f28f2172c4d]
 6: /usr/bin/cosd() [0x454639]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
root@logger:/srv/ceph/issues/297_osd_crash_empty_pglog# 
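
To make the arithmetic in that last read_log line explicit: the 'ondisklog' collection attribute records the on-disk bounds of the log, and read_log reads exactly the span between them. A minimal sketch of how the 24934~273 extent and the "expected 25207-24934=273" figure fit together (not Ceph source; the struct and field names are assumptions about what the attribute encodes):

#include <cstdint>
#include <cstdio>

// Illustrative sketch only, not Ceph source; the struct and field names are
// assumptions about what the 'ondisklog' attribute encodes.
struct OndiskLogBounds {
  uint64_t tail;  // offset of the oldest log entry still on disk
  uint64_t head;  // offset just past the newest entry
};

int main() {
  OndiskLogBounds b{24934, 25207};        // values visible in the osd5 log above
  uint64_t offset = b.tail;               // 24934
  uint64_t length = b.head - b.tail;      // 25207 - 24934 = 273
  std::printf("read pglog_0.594_0 %llu~%llu\n",
              (unsigned long long)offset, (unsigned long long)length);
  // The object store read at this extent fails with ENOENT because the file
  // is gone, so read_log sees 0 bytes instead of 273 and hits the same assert.
  return 0;
}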

As you (Sage) asked, I've made a copy of the executable and the core dump.

They can be found at "logger.ceph.widodh.nl" in /srv/ceph/issues/297_osd_crash_empty_pglog

#4 Updated by Wido den Hollander almost 14 years ago

By the way, old logs are not available due to the NFS issues I had yesterday.

There is also enough disk space on osd5:

root@node02:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  1.8G   12G  14% /
none                  2.0G  220K  2.0G   1% /dev
none                  2.0G     0  2.0G   0% /dev/shm
none                  2.0G   72K  2.0G   1% /var/run
none                  2.0G     0  2.0G   0% /var/lock
none                  2.0G     0  2.0G   0% /lib/init/rw
/dev/sdc2             146G   52G   95G  36% /srv/ceph/osd6
/dev/sdd2             230G   44G  186G  20% /srv/ceph/osd7
/dev/sda7             208G   42G  167G  20% /srv/ceph/osd4
/dev/sdb2             146G   54G   93G  37% /srv/ceph/osd5
root@node02:~# 

I did not remove pglog_0.594; it seems to have vanished.

#5 Updated by Wido den Hollander almost 14 years ago

The same issue occurred on osd12; I've uploaded the logs and core dump to the same directory on "logger.ceph.widodh.nl".

So far, 3 of the 30 OSDs are having this issue:

  • osd5
  • osd12
  • osd20

#6 Updated by Sage Weil almost 14 years ago

I think this is fixed by e986721b06973a9379a37961e49ce334c480fa1a. It is not a disk-full problem at all.
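
For anyone hitting this on an older build, here is a hedged sketch of the general shape such a fix could take: instead of asserting when the on-disk pglog is shorter than the recorded bounds, read_log could fall back to an empty log and let peering/recovery rebuild the entries from other replicas. This is illustrative only and is not a claim about what the referenced commit actually does:

#include <cstdint>
#include <cstdio>

// Illustrative only; names are assumptions and this is not a claim about what
// commit e986721b actually changes.
struct PgLog {
  uint64_t tail = 0;
  uint64_t head = 0;
};

// Tolerant variant of the failing check: on a short read, fall back to an
// empty log instead of aborting the OSD.
bool read_log_tolerant(PgLog& log, uint64_t bytes_on_disk) {
  uint64_t expected = log.head - log.tail;
  if (bytes_on_disk < expected) {
    std::fprintf(stderr,
                 "pglog short read (%llu of %llu bytes), treating log as empty\n",
                 (unsigned long long)bytes_on_disk, (unsigned long long)expected);
    log.tail = log.head = 0;   // recovery would then rebuild the missing entries
    return false;
  }
  return true;
}

int main() {
  PgLog log{0, 182};
  read_log_tolerant(log, 0);   // the scenario from this bug: empty pglog file
  return 0;
}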

#7 Updated by Sage Weil almost 14 years ago

  • Status changed from New to Resolved