Bug #1356

closed

OSD crashes during recovery with OSDMap::decode(ceph::buffer::list&)

Added by Wido den Hollander over 12 years ago. Updated over 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: OSD
Target version:
% Done: 0%


Description

Hi,

Like I said some time ago, I've been seeing this kind of crash lately.

I just tried to start my cluster (40 OSDs) up again and, as always, it started bouncing around. During this bouncing I saw a couple of OSDs go down with:

(gdb) bt
#0  0x00007f8afb7967bb in raise () from /lib/libpthread.so.0
#1  0x000000000057cdc3 in reraise_fatal (signum=2382) at global/signal_handler.cc:59
#2  0x000000000057d38c in handle_fatal_signal (signum=<value optimized out>) at global/signal_handler.cc:106
#3  <signal handler called>
#4  0x00007f8afa3c9a75 in raise () from /lib/libc.so.6
#5  0x00007f8afa3cd5c0 in abort () from /lib/libc.so.6
#6  0x00007f8afac7f8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007f8afac7dd16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007f8afac7dd43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007f8afac7de3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x000000000049c3a6 in ceph::buffer::list::iterator::advance (this=0x7fff9327f4c0, len=2, dest=0x7fff9327f56c "\377\177") at ./include/buffer.h:315
#11 ceph::buffer::list::iterator::copy (this=0x7fff9327f4c0, len=2, dest=0x7fff9327f56c "\377\177") at ./include/buffer.h:369
#12 0x00000000005630c8 in OSDMap::decode(ceph::buffer::list&) ()
#13 0x000000000052f571 in OSD::get_map (this=0x1f1aca0, epoch=8974) at osd/OSD.cc:3492
#14 0x000000000053a5ce in OSD::init (this=0x1f1aca0) at osd/OSD.cc:555
#15 0x000000000049a6ca in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:298
(gdb)
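
Frames #5 through #11 show an exception thrown from a 2-byte buffer copy that nothing catches, so std::terminate() ends in abort(). The sketch below illustrates that pattern in isolation, assuming the buffer handed to the decoder is shorter than the decoder expects (the trace itself does not confirm why the copy failed). Cursor, decode_version, and try_decode_version are hypothetical names for illustration, not Ceph's API.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a buffer iterator; NOT ceph::buffer::list::iterator.
struct Cursor {
    const std::vector<char>& data;
    std::size_t off = 0;

    explicit Cursor(const std::vector<char>& d) : data(d) {}

    // Copy len bytes to dest; throw if fewer than len bytes remain.
    void copy(std::size_t len, char* dest) {
        if (len > data.size() - off)
            throw std::out_of_range("end of buffer");
        std::memcpy(dest, data.data() + off, len);
        off += len;
    }
};

// Hypothetical decoder that starts by reading a 2-byte field,
// matching len=2 in frame #11. An empty or 1-byte buffer makes it throw.
unsigned short decode_version(const std::vector<char>& raw) {
    Cursor c(raw);
    unsigned short v = 0;
    c.copy(sizeof(v), reinterpret_cast<char*>(&v));
    return v;
}

// Guarded variant: reports failure instead of letting the exception
// propagate out of the caller, where it would reach std::terminate().
bool try_decode_version(const std::vector<char>& raw, unsigned short& out) {
    try {
        out = decode_version(raw);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

Calling decode_version on an empty buffer without a surrounding try block would abort the process in the same way as frames #5 to #9; try_decode_version returns false instead, which would let a caller log the epoch and buffer length before giving up.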

The log level was set very low, since raising it puts even more load on the nodes, but this is what I do have:

2011-08-04 16:28:37.243041 7f8afbbb7720 journal read_entry 417271808 : seq 1523780 483 bytes
2011-08-04 16:28:37.243138 7f8afbbb7720 journal read_entry 417280000 : seq 1523781 2547 bytes
2011-08-04 16:28:37.243256 7f8afbbb7720 journal read_entry 417288192 : seq 1523782 483 bytes
2011-08-04 16:28:37.243351 7f8afbbb7720 journal read_entry 417296384 : seq 1523783 3677 bytes
2011-08-04 16:28:37.243445 7f8afbbb7720 journal read_entry 417304576 : seq 1523784 483 bytes
2011-08-04 16:28:37.248665 7f8afbbb7720 journal read_entry 417312768 : seq 1523785 502916 bytes
2011-08-04 16:28:37.253398 7f8afbbb7720 journal  kernel version is 2.6.39
2011-08-04 16:28:37.253994 7f8afbbb7720 journal _open /dev/data/journal0 fd 12: 1996488704 bytes, block size 4096 bytes, directio = 1
*** Caught signal (Aborted) **
 in thread 0x7f8afbbb7720
 ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
 1: /usr/bin/cosd() [0x57d154]
 2: (()+0xf8f0) [0x7f8afb7968f0]
 3: (gsignal()+0x35) [0x7f8afa3c9a75]
 4: (abort()+0x180) [0x7f8afa3cd5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f8afac7f8e5]
 6: (()+0xcad16) [0x7f8afac7dd16]
 7: (()+0xcad43) [0x7f8afac7dd43]
 8: (()+0xcae3e) [0x7f8afac7de3e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x156) [0x49c3a6]
 10: (OSDMap::decode(ceph::buffer::list&)+0x78) [0x5630c8]
 11: (OSD::get_map(unsigned int)+0x221) [0x52f571]
 12: (OSD::init()+0x47e) [0x53a5ce]
 13: (main()+0x25ea) [0x49a6ca]
 14: (__libc_start_main()+0xfd) [0x7f8afa3b4c4d]
 15: /usr/bin/cosd() [0x497cd9]

There is no real way to reproduce it. If I start this particular OSD again it will probably go on, but it could also crash; you never know.


Files

osdmap.8974_0 (32.5 KB): osdmap 8974 from osd.0. Wido den Hollander, 08/04/2011 11:50 AM
8974.osdmap (32.5 KB): osdmap 8974 from the monitor. Wido den Hollander, 08/04/2011 11:52 AM
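
Since the same epoch is attached twice, once from osd.0 and once from the monitor, one way to check whether the two copies agree would be a plain byte comparison. The sketch below is an illustration only; slurp and first_difference are hypothetical helper names, and nothing here parses the osdmap format.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read a whole file into memory; yields an empty vector if it cannot be opened.
std::vector<char> slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Offset of the first differing byte, or -1 if the buffers are identical.
// A length mismatch counts as a difference at the shorter length.
long long first_difference(const std::vector<char>& a,
                           const std::vector<char>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i])
            return static_cast<long long>(i);
    }
    if (a.size() != b.size())
        return static_cast<long long>(n);
    return -1;
}
```

For example, first_difference(slurp("osdmap.8974_0"), slurp("8974.osdmap")) would return -1 if the OSD's copy and the monitor's copy are byte-identical, and the offset of the first mismatch otherwise.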

Related issues 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #1486: osd: 0-length meta/pginfo_* files (Resolved, 09/01/2011)
