Bug #345
Status: closed
OSD crash: PG::read_state
Description
This might be a duplicate of #279, but I'm not sure.
This morning I saw that 4 of my 12 OSDs were down (most of them killed by the OOM killer, even though I'm using tcmalloc).
I tried to start them again, but then osd5 crashed:
10.08.11_09:12:35.468138 7f17fb9d3720 osd5 2559 pg[0.18d( v 889'2876 lc 0'0 (889'2874,889'2876]+backlog n=2758 ec=2 les=2497 2552/2552/879) [] r=0 (info mismatch, log(0'0,0'0]) mlcod 0'0 inactive] read_log 0~250978
10.08.11_09:12:35.468157 7f17fb9d3720 filestore(/srv/ceph/osd5) read /srv/ceph/osd5/current/meta/pglog_0.18d_0 0~250978
10.08.11_09:12:35.468218 7f17fb9d3720 filestore(/srv/ceph/osd5) read couldn't open /srv/ceph/osd5/current/meta/pglog_0.18d_0 errno 2 No such file or directory
10.08.11_09:12:35.468228 7f17fb9d3720 filestore(/srv/ceph/osd5) read /srv/ceph/osd5/current/meta/pglog_0.18d_0 0~250978 = -2
10.08.11_09:12:35.468237 7f17fb9d3720 osd5 2559 pg[0.18d( v 889'2876 lc 0'0 (889'2874,889'2876]+backlog n=2758 ec=2 les=2497 2552/2552/879) [] r=0 (info mismatch, log(889'2874,0'0]+backlog) (log bound mismatch, empty) mlcod 0'0 inactive] read_log got 0 bytes, expected 250978-0=250978
osd/PG.cc: In function 'void PG::read_log(ObjectStore*)':
osd/PG.cc:2168: FAILED assert(0)
 1: (PG::read_state(ObjectStore*)+0x846) [0x532746]
 2: (OSD::load_pgs()+0x145) [0x4e6b25]
 3: (OSD::init()+0x4b8) [0x4e7508]
 4: (main()+0x1d72) [0x458022]
 5: (__libc_start_main()+0xfd) [0x7f17fa295c4d]
 6: /usr/bin/cosd() [0x456099]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
As the log says, /srv/ceph/osd5/current/meta/pglog_0.18d_0 is missing from the disk (the filesystem is not full).
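To check a whole logfile for this symptom, the filestore "couldn't open" lines can be matched and the affected PG id pulled out of the pglog filename. This is only a rough sketch: the regular expression is derived from the log lines quoted in this report, not from the Ceph source, and the names MISSING_RE and find_missing_pglogs are made up for illustration.

```python
import re

# Pattern modelled on the "read couldn't open ... errno 2" lines quoted in this
# report. It is an assumption about the log format, not taken from Ceph itself.
MISSING_RE = re.compile(
    r"filestore\((?P<osd_dir>[^)]+)\) read couldn't open "
    r"(?P<path>\S+/pglog_(?P<pgid>[0-9a-f]+\.[0-9a-f]+)_\d+) errno 2"
)


def find_missing_pglogs(log_text):
    """Return (osd_dir, pgid, path) for each missing pglog reported in log_text."""
    return [
        (m.group("osd_dir"), m.group("pgid"), m.group("path"))
        for m in MISSING_RE.finditer(log_text)
    ]


# One line copied from the osd5 log in this report.
sample = (
    "10.08.11_09:12:35.468218 7f17fb9d3720 filestore(/srv/ceph/osd5) "
    "read couldn't open /srv/ceph/osd5/current/meta/pglog_0.18d_0 "
    "errno 2 No such file or directory"
)

for osd_dir, pgid, path in find_missing_pglogs(sample):
    print(f"{osd_dir}: pg {pgid} has no log file at {path}")
```

For the sample line this reports pg 0.18d under /srv/ceph/osd5, which matches the PG named in the assert output.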
This looks like Christian Brunner's post on the ML (cosd dying after start), but I am not using the rbd branch on my client; I'm running the latest unstable (0eb6cd49f6e3ec523787d09cf08d3179be270db4).
As mentioned on the ML, I tried a scrub, but that fails:
root@node14:~# ceph osd scrub 5
10.08.11_09:24:22.453954 mon <- [osd,scrub,5]
10.08.11_09:24:22.455788 mon0 -> 'unknown command scrub' (-22)
root@node14:~#
I've uploaded the core, logfile and binary to logger.ceph.widodh.nl in the directory /srv/ceph/issues/cosd_crash_pg_read_state
Updated by Wido den Hollander over 13 years ago
I just had the same crash on another OSD. This OSD had some trouble with cephx, so I restarted it, and then it crashed with the same message:
10.08.11_10:45:23.178767 7f4947d2e720 osd11 2820 pg[0.137( v 889'2760 lc 0'0 (889'2759,889'2760] n=2643 ec=2 les=2731 2764/2764/2764) [] r=0 (info mismatch, log(0'0,0'0]) mlcod 0'0 inactive] read_log 237510~3003
10.08.11_10:45:23.178790 7f4947d2e720 filestore(/srv/ceph/osd11) read /srv/ceph/osd11/current/meta/pglog_0.137_0 237510~3003
10.08.11_10:45:23.178866 7f4947d2e720 filestore(/srv/ceph/osd11) read couldn't open /srv/ceph/osd11/current/meta/pglog_0.137_0 errno 2 No such file or directory
10.08.11_10:45:23.178879 7f4947d2e720 filestore(/srv/ceph/osd11) read /srv/ceph/osd11/current/meta/pglog_0.137_0 237510~3003 = -2
10.08.11_10:45:23.178891 7f4947d2e720 osd11 2820 pg[0.137( v 889'2760 lc 0'0 (889'2759,889'2760] n=2643 ec=2 les=2731 2764/2764/2764) [] r=0 (info mismatch, log(889'2759,0'0]) (log bound mismatch, empty) mlcod 0'0 inactive] read_log got 0 bytes, expected 240513-237510=3003
osd/PG.cc: In function 'void PG::read_log(ObjectStore*)':
osd/PG.cc:2168: FAILED assert(0)
 1: (PG::read_state(ObjectStore*)+0x846) [0x532746]
 2: (OSD::load_pgs()+0x145) [0x4e6b25]
 3: (OSD::init()+0x4b8) [0x4e7508]
 4: (main()+0x1d72) [0x458022]
 5: (__libc_start_main()+0xfd) [0x7f49465f0c4d]
 6: /usr/bin/cosd() [0x456099]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I added the log and coredump of osd11 to the same directory at logger.ceph.widodh.nl
Updated by Wido den Hollander over 13 years ago
I checked the code; it seems you have to specify the full OSD name or * to scrub:
ceph osd scrub osd11
Or
ceph osd scrub '*'
Now my OSDs are scrubbing, and I'll check whether I can start the OSDs again.
Updated by Sage Weil over 13 years ago
Fixed by fd080d538e9594ed6203b20e2c65a91f5aaae2d4.
For any of these that aren't starting, just run 'rmdir /srv/ceph/osd$num/current/$badpgid', where $badpgid in the above case is 0.137, and start cosd again.
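If several OSDs are affected, the rmdir step could be wrapped in a small helper. The following is a minimal sketch, assuming the directory layout is exactly the one quoted in the comment above; bad_pg_dir and remove_bad_pg_dir are hypothetical names, and the helper defaults to a dry run so nothing is deleted by accident.

```python
import os


def bad_pg_dir(osd_num, pgid, root="/srv/ceph"):
    """Build the PG directory path in the form quoted in the comment above."""
    return f"{root}/osd{osd_num}/current/{pgid}"


def remove_bad_pg_dir(osd_num, pgid, root="/srv/ceph", dry_run=True):
    """Remove the PG directory, or only print the command when dry_run is set."""
    path = bad_pg_dir(osd_num, pgid, root)
    if dry_run:
        print(f"would run: rmdir {path}")
    else:
        # Like rmdir(1), os.rmdir only removes an empty directory and raises
        # OSError if the directory is missing or still has contents.
        os.rmdir(path)
    return path


# Dry run for the osd11 case from this report.
remove_bad_pg_dir(11, "0.137")
```

Passing dry_run=False performs the actual removal; cosd still has to be started again afterwards, as the comment says.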