Bug #279: OSD crash on empty pglog (closed)
Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Description
During a crushmap change, one of my OSDs crashed on an empty pglog:
Core was generated by `/usr/bin/cosd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007fa111e1da75 in raise () from /lib/libc.so.6
#1  0x00007fa111e215c0 in abort () from /lib/libc.so.6
#2  0x00007fa1126d28e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fa1126d0d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fa1126d0d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fa1126d0e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005b5e78 in ceph::__ceph_assert_fail (assertion=0x5f4f17 "0", file=<value optimized out>, line=2161, func=<value optimized out>) at common/assert.cc:30
#7  0x0000000000527494 in PG::read_log (this=0x168a590, store=<value optimized out>) at osd/PG.cc:2161
#8  0x000000000052a006 in PG::read_state (this=<value optimized out>, store=<value optimized out>) at osd/PG.cc:2369
#9  0x00000000004cd2df in OSD::load_pgs (this=<value optimized out>) at osd/OSD.cc:930
#10 0x00000000004d5188 in OSD::init (this=0x92e010) at osd/OSD.cc:481
#11 0x0000000000455a7b in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:245
(gdb)
The last log lines:
10.07.14_22:12:56.265713 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182
10.07.14_22:12:56.265742 7fa1132df720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = 0
10.07.14_22:12:56.265753 7fa1132df720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
It turns out the pglog file has a size of zero:
root@node09:~# ls -al /srv/ceph/osd20/current/meta/pglog_1.390_0
-rw-r--r-- 1 root root 0 2010-07-14 21:30 /srv/ceph/osd20/current/meta/pglog_1.390_0
root@node09:~#
The system it runs on has 4 OSDs in total; this OSD has the smallest data partition:
root@node09:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              14G  4.0G  9.2G  31% /
none                  742M  196K  742M   1% /dev
none                  748M     0  748M   0% /dev/shm
none                  748M   68K  748M   1% /var/run
none                  748M     0  748M   0% /var/lock
none                  748M     0  748M   0% /lib/init/rw
/dev/sda7              54G   52G  1.7G  97% /srv/ceph/osd20
/dev/sdb2             462G   48G  415G  11% /srv/ceph/osd21
/dev/sdc2             462G   60G  403G  13% /srv/ceph/osd22
/dev/sdd2             146G   45G  102G  31% /srv/ceph/osd23
root@node09:~#
As you can see, the empty file could not have been caused by a filled-up partition (although it does seem strange that the partition is so full).
I tried removing the specific pglog file, but the OSD crashed again:
10.07.14_22:33:32.765619 7f41f11eb720 filestore(/srv/ceph/osd20) read couldn't open /srv/ceph/osd20/current/meta/pglog_1.390_0 errno 2 No such file or directory
10.07.14_22:33:32.765630 7f41f11eb720 filestore(/srv/ceph/osd20) read /srv/ceph/osd20/current/meta/pglog_1.390_0 0~182 = -2
10.07.14_22:33:32.765641 7f41f11eb720 osd20 1033 pg[1.390( v 250'34 (250'32,250'34] n=5 ec=2 les=993 979/979/979) [] r=0 (info mismatch, log(250'32,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inactive] read_log got 0 bytes, expected 182-0=182
At that time my cluster was in the following state:
root@node13:~# ceph -s
10.07.14_22:35:21.889717 7f164b03b710 monclient(hunting): found mon1
10.07.14_22:35:21.911791    pg v48016: 7952 pgs: 1092 active, 3593 active+clean, 44 crashed+replay, 2980 peering, 16 crashed+peering, 52 down+peering, 36 active+degraded, 81 active+clean+degraded, 46 degraded+peering, 3 crashed+degraded+peering, 9 down+degraded+peering; 446 GB data, 1509 GB used, 4820 GB / 6330 GB avail; 444438/1567592 degraded (28.352%)
10.07.14_22:35:21.932535   mds e40: 1/1/1 up {0=up:active}, 1 up:standby(laggy or crashed), 1 up:standby
10.07.14_22:35:21.932578   osd e1119: 30 osds: 29 up, 29 in
10.07.14_22:35:21.932728   log 10.07.14_22:02:05.236888 mon0 2001:16f8:10:2::c3c3:3f9b:6789/0 194 : [INF] osd3 2001:16f8:10:2::c3c3:8f6b:6806/28330 boot
10.07.14_22:35:21.932822   mon e1: 2 mons at 2001:16f8:10:2::c3c3:3f9b:6789/0 2001:16f8:10:2::c3c3:2e5c:6789/0
root@node13:~#
The replication level for both the data and metadata pools was 3.