Bug #1356
Status: closed
OSD crashes during recovery in OSDMap::decode(ceph::buffer::list&)
Description
Hi,
As I mentioned a while ago, I've been seeing this kind of crash lately.
I just tried to start my cluster (40 OSDs) up again, and as always the OSDs started bouncing around; during this bouncing I saw a couple of them go down with:
(gdb) bt
#0  0x00007f8afb7967bb in raise () from /lib/libpthread.so.0
#1  0x000000000057cdc3 in reraise_fatal (signum=2382) at global/signal_handler.cc:59
#2  0x000000000057d38c in handle_fatal_signal (signum=<value optimized out>) at global/signal_handler.cc:106
#3  <signal handler called>
#4  0x00007f8afa3c9a75 in raise () from /lib/libc.so.6
#5  0x00007f8afa3cd5c0 in abort () from /lib/libc.so.6
#6  0x00007f8afac7f8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007f8afac7dd16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007f8afac7dd43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007f8afac7de3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x000000000049c3a6 in ceph::buffer::list::iterator::advance (this=0x7fff9327f4c0, len=2, dest=0x7fff9327f56c "\377\177") at ./include/buffer.h:315
#11 ceph::buffer::list::iterator::copy (this=0x7fff9327f4c0, len=2, dest=0x7fff9327f56c "\377\177") at ./include/buffer.h:369
#12 0x00000000005630c8 in OSDMap::decode(ceph::buffer::list&) ()
#13 0x000000000052f571 in OSD::get_map (this=0x1f1aca0, epoch=8974) at osd/OSD.cc:3492
#14 0x000000000053a5ce in OSD::init (this=0x1f1aca0) at osd/OSD.cc:555
#15 0x000000000049a6ca in main (argc=<value optimized out>, argv=<value optimized out>) at cosd.cc:298
(gdb)
I kept the logging level very low, since increasing it hurts the nodes even more, but what I do have is:
2011-08-04 16:28:37.243041 7f8afbbb7720 journal read_entry 417271808 : seq 1523780 483 bytes
2011-08-04 16:28:37.243138 7f8afbbb7720 journal read_entry 417280000 : seq 1523781 2547 bytes
2011-08-04 16:28:37.243256 7f8afbbb7720 journal read_entry 417288192 : seq 1523782 483 bytes
2011-08-04 16:28:37.243351 7f8afbbb7720 journal read_entry 417296384 : seq 1523783 3677 bytes
2011-08-04 16:28:37.243445 7f8afbbb7720 journal read_entry 417304576 : seq 1523784 483 bytes
2011-08-04 16:28:37.248665 7f8afbbb7720 journal read_entry 417312768 : seq 1523785 502916 bytes
2011-08-04 16:28:37.253398 7f8afbbb7720 journal kernel version is 2.6.39
2011-08-04 16:28:37.253994 7f8afbbb7720 journal _open /dev/data/journal0 fd 12: 1996488704 bytes, block size 4096 bytes, directio = 1
*** Caught signal (Aborted) **
 in thread 0x7f8afbbb7720
 ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
 1: /usr/bin/cosd() [0x57d154]
 2: (()+0xf8f0) [0x7f8afb7968f0]
 3: (gsignal()+0x35) [0x7f8afa3c9a75]
 4: (abort()+0x180) [0x7f8afa3cd5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f8afac7f8e5]
 6: (()+0xcad16) [0x7f8afac7dd16]
 7: (()+0xcad43) [0x7f8afac7dd43]
 8: (()+0xcae3e) [0x7f8afac7de3e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x156) [0x49c3a6]
 10: (OSDMap::decode(ceph::buffer::list&)+0x78) [0x5630c8]
 11: (OSD::get_map(unsigned int)+0x221) [0x52f571]
 12: (OSD::init()+0x47e) [0x53a5ce]
 13: (main()+0x25ea) [0x49a6ca]
 14: (__libc_start_main()+0xfd) [0x7f8afa3b4c4d]
 15: /usr/bin/cosd() [0x497cd9]
There is no reliable way to reproduce it: if I start this particular OSD again it will probably come up fine, but it could also crash again; you never know.