Bug #279
OSD crash on empty pglog (closed)
Description
During a crushmap change, one of my OSDs crashed on an empty pglog.
Core was generated by `/usr/bin/cosd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
#1  0x00007fa111e215c0 in abort () from /lib/libc.so.6
#2  0x00007fa1126d28e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fa1126d0d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fa1126d0d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fa1126d0e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005b5e78 in ceph::__ceph_assert_fail (assertion=0x5f4f17 "0", file=<value optimized out>, line=2161, func=<value optimized out>) at common/assert.cc:30
#7  0x0000000000527494 in PG::read_log (this=0x168a590, store=<value optimized out>) at osd/PG.cc:2161
#8  0x000000000052a006 in PG::read_state (this=<value optimized out>, store=<value optimized out>) at osd/PG.cc:2369
#9  0x00000000004cd2df in OSD::load_pgs (this=<value optimized out>) at osd/OSD.cc:930
#10 0x00000000004d5188 in OSD::init (this=0x92e010) at osd/OSD.cc:481
#11 0x0000000000455a7b in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:245
(gdb)
The last log lines:
10.07.14_22:12:56.265713 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182
10.07.14_22:12:56.265742 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = 0
10.07.14_22:12:56.265753 7fa1132df720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
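For context, the assert fires because the byte range promised by the PG's recorded log bounds (182-0=182 here) does not match what is actually read back from the pglog object. Below is a minimal, self-contained sketch of that kind of check; it is not the Ceph source, and all names in it (OnDiskLogBounds, read_range, read_log_checked) are made up for illustration only.

#include <cstdint>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Byte range of the persisted PG log, as recorded in the PG's metadata
// (this is what produces "expected 182-0=182" in the report).
struct OnDiskLogBounds {
  uint64_t tail;  // offset where the persisted log starts
  uint64_t head;  // offset one past the last persisted entry
};

// Read up to `len` bytes at offset `off` from a file, returning whatever
// is actually there (a missing or empty file yields an empty buffer).
static std::vector<char> read_range(const std::string& path,
                                    uint64_t off, uint64_t len) {
  std::ifstream f(path, std::ios::binary);
  if (!f) return {};
  std::vector<char> buf(len);
  f.seekg(static_cast<std::streamoff>(off));
  f.read(buf.data(), static_cast<std::streamsize>(len));
  buf.resize(static_cast<size_t>(f.gcount()));
  return buf;
}

// Sketch of the consistency check that aborts the OSD: if the bytes on
// disk do not match the bounds in the metadata, give up loudly.
void read_log_checked(const std::string& pglog_path,
                      const OnDiskLogBounds& bounds) {
  const uint64_t expected = bounds.head - bounds.tail;
  std::vector<char> buf = read_range(pglog_path, bounds.tail, expected);
  if (buf.size() != expected) {
    throw std::runtime_error("read_log got " + std::to_string(buf.size()) +
                             " bytes, expected " + std::to_string(expected));
  }
  // ... decode log entries from buf ...
}

int main() {
  try {
    // A zero-length pglog file combined with bounds of [0, 182)
    // reproduces the mismatch seen in the log above.
    read_log_checked("pglog_1.390_0", OnDiskLogBounds{0, 182});
  } catch (const std::exception& e) {
    std::cerr << e.what() << '\n';  // "read_log got 0 bytes, expected 182"
  }
}

In cosd the same mismatch ends in a FAILED assert(0) rather than an exception, which is why the daemon aborts during load_pgs at startup.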
It turns out the pglog has a zero filesize:
root@node09:~# ls -al /srv/ceph/osd20/current/meta/pglog_1.390_0
-rw-r--r-- 1 root root 0 2010-07-14 21:30 /srv/ceph/osd20/current/meta/pglog_1.390_0
root@node09:~#
The system it is running on has 4 OSDs in total, and this OSD has the smallest data partition:
root@node09:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  4.0G  9.2G  31% /
none                  742M  196K  742M   1% /dev
none                  748M     0  748M   0% /dev/shm
none                  748M   68K  748M   1% /var/run
none                  748M     0  748M   0% /var/lock
none                  748M     0  748M   0% /lib/init/rw
/dev/sda7              54G   52G  1.7G  97% /srv/ceph/osd20
/dev/sdb2             462G   48G  415G  11% /srv/ceph/osd21
/dev/sdc2             462G   60G  403G  13% /srv/ceph/osd22
/dev/sdd2             146G   45G  102G  31% /srv/ceph/osd23
root@node09:~#
As you can see, the empty file could not have been caused by a filled-up partition (although it does seem strange that the partition is 97% full).
I tried removing that specific pglog file, but the OSD crashed again on startup:
10.07.14_22:33:32.765619 7f41f11eb720 filestore(/srv/ceph/osd20) read couldn't open /srv/ceph/osd20/current/meta/pglog_1.390_0 errno 2 No such file or directory
10.07.14_22:33:32.765630 7f41f11eb720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = -2
10.07.14_22:33:32.765641 7f41f11eb720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
At that time my cluster was in the following state:
root@node13:~# ceph -s
10.07.14_22:35:21.889717 7f164b03b710 monclient(hunting): found mon1
10.07.14_22:35:21.911791    pg v48016: 7952 pgs: 1092 active, 3593 active+clean, 44 crashed+replay, 2980 peering, 16 crashed+peering, 52 down+peering, 36 active+degraded, 81 active+clean+degraded, 46 degraded+peering, 3 crashed+degraded+peering, 9 down+degraded+peering; 446 GB data, 1509 GB used, 4820 GB / 6330 GB avail; 444438/1567592 degraded (28.352%)
10.07.14_22:35:21.932535   mds e40: 1/1/1 up {0=up:active}, 1 up:standby(laggy or crashed), 1 up:standby
10.07.14_22:35:21.932578   osd e1119: 30 osds: 29 up, 29 in
10.07.14_22:35:21.932728   log 10.07.14_22:02:05.236888 mon0 2001:16f8:10:2::c3c3:3f9b:6789/0 194 : [INF] osd3 2001:16f8:10:2::c3c3:8f6b:6806/28330 boot
10.07.14_22:35:21.932822   mon e1: 2 mons at 2001:16f8:10:2::c3c3:3f9b:6789/0 2001:16f8:10:2::c3c3:2e5c:6789/0
root@node13:~#
The replication level for both the data and metadata pools was 3.
Updated by Sage Weil almost 14 years ago
The real question is how the pglog got to be 0 in the first place. You don't have an older osd.20.log, do you?
(There was a logging bug I just fixed where the log was being overwritten instead of appended to, so you probably don't. :( )
Updated by Wido den Hollander almost 14 years ago
Indeed I don't (no older log), but I've checked, and there are multiple pglog_* files with a size of 0 bytes on osd20 and on other OSDs.
For example:
root@node02:/srv/ceph/osd7/current/meta# ls -al pglog*|head -n 10
-rw-r--r-- 1 root root     0 2010-07-15 15:48 pglog_0.0p7_0
-rw-r--r-- 1 root root     0 2010-07-14 21:20 pglog_0.1_0
-rw-r--r-- 1 root root     0 2010-07-14 23:16 pglog_0.101_0
-rw-r--r-- 1 root root     0 2010-07-14 23:46 pglog_0.106_0
-rw-r--r-- 1 root root 24388 2010-07-14 16:25 pglog_0.10c_0
-rw-r--r-- 1 root root     0 2010-07-14 23:31 pglog_0.10f_0
-rw-r--r-- 1 root root     0 2010-07-14 23:40 pglog_0.113_0
-rw-r--r-- 1 root root 26754 2010-07-14 16:40 pglog_0.11d_0
-rw-r--r-- 1 root root     0 2010-07-14 23:15 pglog_0.123_0
-rw-r--r-- 1 root root     0 2010-07-14 23:40 pglog_0.127_0
root@node02:/srv/ceph/osd7/current/meta#
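To gauge how widespread this is, a small standalone scanner can count the zero-length pglog objects per OSD. The sketch below is hypothetical (not a Ceph tool) and assumes C++17 std::filesystem; the meta directory, e.g. /srv/ceph/osd7/current/meta, is passed as the only argument.

#include <cstddef>
#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// List every zero-length pglog_* file under the given OSD meta directory
// and print a summary count at the end.
int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <osd-meta-dir>\n";
    return 1;
  }
  std::size_t empty = 0, total = 0;
  for (const auto& entry : fs::directory_iterator(argv[1])) {
    const std::string name = entry.path().filename().string();
    if (!entry.is_regular_file() || name.rfind("pglog_", 0) != 0)
      continue;                          // only look at pglog_* objects
    ++total;
    if (entry.file_size() == 0) {
      ++empty;
      std::cout << entry.path().string() << '\n';
    }
  }
  std::cout << empty << " of " << total << " pglog files are empty\n";
  return 0;
}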
Updated by Wido den Hollander almost 14 years ago
I'm seeing this crash on another OSD (osd5) right now.
root@logger:/srv/ceph/issues/297_osd_crash_empty_pglog# tail -n 15 osd.5.log
10.07.16_11:54:43.545931 7f28f3649720 filestore(/srv/ceph/osd5) collection_getattr /srv/ceph/osd5/current/0.594_head 'ondisklog' = 17
10.07.16_11:54:43.545937 7f28f3649720 osd5 7716 pg[0.594( v 255'277 (250'275,255'277]+backlog n=272 ec=2 les=234 1630/1630/1630) [] r=0 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0 inactive] read_log 24934~273
10.07.16_11:54:43.545949 7f28f3649720 filestore(/srv/ceph/osd5) read /srv/ceph/osd5/current/meta/pglog_0.594_0 24934~273
10.07.16_11:54:43.545984 7f28f3649720 filestore(/srv/ceph/osd5) read couldn't open /srv/ceph/osd5/current/meta/pglog_0.594_0 errno 2 No such file or directory
10.07.16_11:54:43.545990 7f28f3649720 filestore(/srv/ceph/osd5) read /srv/ceph/osd5/current/meta/pglog_0.594_0 24934~273 = -2
10.07.16_11:54:43.545996 7f28f3649720 osd5 7716 pg[0.594( v 255'277 (250'275,255'277]+backlog n=272 ec=2 les=234 1630/1630/1630) [] r=0 (info mismatch, log(250'275,0'0]+backlog) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 25207-24934=273
osd/PG.cc: In function 'void PG::read_log(ObjectStore*)':
osd/PG.cc:2161: FAILED assert(0)
 1: (PG::read_state(ObjectStore*)+0x846) [0x5299a6]
 2: (OSD::load_pgs()+0x12f) [0x4ce70f]
 3: (OSD::init()+0x346) [0x4d3df6]
 4: (main()+0x119b) [0x4559ab]
 5: (__libc_start_main()+0xfd) [0x7f28f2172c4d]
 6: /usr/bin/cosd() [0x454639]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
root@logger:/srv/ceph/issues/297_osd_crash_empty_pglog#
As you (Sage) asked, I've made a copy of the executable and the coredump.
They can be found at "logger.ceph.widodh.nl" in /srv/ceph/issues/297_osd_crash_empty_pglog
Updated by Wido den Hollander almost 14 years ago
By the way, old logs are not available due to the NFS issues I had yesterday.
There is also enough disk space on osd5:
root@node02:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  1.8G   12G  14% /
none                  2.0G  220K  2.0G   1% /dev
none                  2.0G     0  2.0G   0% /dev/shm
none                  2.0G   72K  2.0G   1% /var/run
none                  2.0G     0  2.0G   0% /var/lock
none                  2.0G     0  2.0G   0% /lib/init/rw
/dev/sdc2             146G   52G   95G  36% /srv/ceph/osd6
/dev/sdd2             230G   44G  186G  20% /srv/ceph/osd7
/dev/sda7             208G   42G  167G  20% /srv/ceph/osd4
/dev/sdb2             146G   54G   93G  37% /srv/ceph/osd5
root@node02:~#
I did not remove pglog_0.594; it seems to have vanished on its own.
Updated by Wido den Hollander almost 14 years ago
The same issue occurred on osd12; I've uploaded the logs and core dump to the same directory on "logger.ceph.widodh.nl".
So far, 3 of the 30 OSDs are hitting this issue:
- osd5
- osd12
- osd20
Updated by Sage Weil almost 14 years ago
I think this is fixed by e986721b06973a9379a37961e49ce334c480fa1a. Not a disk-full problem at all.