Bug #1875
osd: ReplicatedPG::do_op
Description
I just noticed two OSDs (osd.11 and osd.20) go down in my cluster.
The backtrace is the same for both OSDs:
Core was generated by `/usr/bin/ceph-osd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f085d298f2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f085d298f2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000005f99e2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x00000000005f9b9d in handle_fatal_signal (signum=6) at global/signal_handler.cc:106
#3  <signal handler called>
#4  0x00007f085b8163a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f085b819b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f085c0d4d7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f085c0d2f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f085c0d2f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f085c0d304e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005cde57 in copy_out (dest=<optimized out>, l=<optimized out>, o=<optimized out>, this=<optimized out>) at ./include/buffer.h:193
#11 ceph::buffer::list::iterator::copy (this=0x7f084dca91a0, len=4, dest=0x7f084dca93c4 "") at common/buffer.cc:493
#12 0x00000000004be3f0 in decode_raw<unsigned long long> (t=@0x7f084dca93c0, p=...) at ./include/encoding.h:56
#13 decode (p=..., v=@0x7f084dca9178) at ./include/encoding.h:99
#14 decode (i=..., p=...) at ./include/object.h:194
#15 decode (this=0x7f084dca9170, bl=...) at ./include/object.h:347
#16 decode (p=..., c=...) at ./include/object.h:355
#17 ReplicatedPG::do_pg_op (this=0x527e000, op=0x375fd80) at osd/ReplicatedPG.cc:285
#18 0x00000000004ec8e5 in ReplicatedPG::do_op (this=0x527e000, op=0x375fd80) at osd/ReplicatedPG.cc:420
#19 0x0000000000530edd in OSD::dequeue_op (this=0x2a48000, pg=0x527e000) at osd/OSD.cc:5532
#20 0x00000000005cbbc6 in ThreadPool::worker (this=0x2a48408) at common/WorkQueue.cc:54
#21 0x000000000055368d in ThreadPool::WorkThread::entry (this=<optimized out>) at ./common/WorkQueue.h:120
#22 0x00007f085d290efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#23 0x00007f085b8c189d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x0000000000000000 in ?? ()
(gdb)
Both core files have almost exactly the same timestamp:
root@atom2:~# stat /core.atom2.31856
  File: `/core.atom2.31856'
  Size: 630030336  Blocks: 81136  IO Block: 4096  regular file
Device: fc00h/64512d  Inode: 62  Links: 1
Access: (0600/-rw-------)  Uid: ( 0/ root)  Gid: ( 0/ root)
Access: 2012-01-04 14:15:28.106616697 +0100
Modify: 2012-01-04 14:15:28.522611737 +0100
Change: 2012-01-04 14:15:28.522611737 +0100
root@atom2:~#
root@atom5:~# stat /core.atom5.22990
  File: `/core.atom5.22990'
  Size: 614453248  Blocks: 55392  IO Block: 4096  regular file
Device: fc00h/64512d  Inode: 2833  Links: 1
Access: (0600/-rw-------)  Uid: ( 0/ root)  Gid: ( 0/ root)
Access: 2012-01-04 14:15:02.692203774 +0100
Modify: 2012-01-04 14:15:02.880204313 +0100
Change: 2012-01-04 14:15:02.880204313 +0100
root@atom5:~#
For some mysterious reason all my log files are empty, so I don't have logs for this one. It will probably be hard to track it down without logs.
Both OSDs had been running for about 3 weeks.
The version of both OSDs: ceph version 0.39-140-ge5f4910 (e5f49104ab62ba7bc42cf6ecf41c9257b46585f7)
Associated revisions
osd: return EINVAL on bad PGLS[_FILTER] handle
Fixes: #1875
Signed-off-by: Sage Weil <sage@newdream.net>
History
#1 Updated by Sage Weil almost 12 years ago
- Assignee set to Sage Weil
- Target version set to v0.40
The PGLS iterator handle format was recently changed, and this crashed while decoding it. My guess is an old binary tried to list pg contents.
Fixing it up so that it will return EINVAL on a bad handle instead of crashing! :)
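For readers following along, here is a minimal, self-contained sketch of the pattern described above: decoding a client-supplied handle can throw when the buffer is too short, and the caller turns that exception into -EINVAL instead of letting it reach std::terminate() (frames #8 and #9 in the backtrace). This is not the actual Ceph patch. BufferReader, ListHandle, encode_handle, decode_handle and do_pgls are hypothetical names, and the handle layout (a 64-bit value followed by a length-prefixed name) is an assumption made only for illustration.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the exception a buffer iterator throws
// when asked to copy more bytes than the buffer holds.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("read past end of buffer") {}
};

// Hypothetical bounds-checked reader over a byte buffer.
class BufferReader {
  const std::vector<uint8_t>& buf;
  std::size_t off = 0;

  void require(std::size_t len) const {
    if (len > buf.size() - off) throw end_of_buffer();
  }

public:
  explicit BufferReader(const std::vector<uint8_t>& b) : buf(b) {}

  // Copy len raw bytes into dest, or throw if they are not available.
  void copy(std::size_t len, void* dest) {
    require(len);
    std::memcpy(dest, buf.data() + off, len);
    off += len;
  }

  // Read len bytes as a string, or throw if they are not available.
  std::string copy_string(std::size_t len) {
    require(len);
    std::string s(reinterpret_cast<const char*>(buf.data() + off), len);
    off += len;
    return s;
  }
};

// Hypothetical listing handle; the real format is not shown in this report.
struct ListHandle {
  uint64_t snap;
  std::string name;
};

// Encode in host byte order (an illustration-only simplification).
std::vector<uint8_t> encode_handle(const ListHandle& h) {
  std::vector<uint8_t> out;
  auto append = [&out](const void* p, std::size_t n) {
    const uint8_t* b = static_cast<const uint8_t*>(p);
    out.insert(out.end(), b, b + n);
  };
  uint32_t len = static_cast<uint32_t>(h.name.size());
  append(&h.snap, sizeof(h.snap));
  append(&len, sizeof(len));
  append(h.name.data(), len);
  return out;
}

// Throws end_of_buffer if the handle is truncated or malformed.
ListHandle decode_handle(const std::vector<uint8_t>& bl) {
  BufferReader p(bl);
  ListHandle h{};
  uint32_t len = 0;
  p.copy(sizeof(h.snap), &h.snap);
  p.copy(sizeof(len), &len);
  h.name = p.copy_string(len);
  return h;
}

// Returns 0 on success, or -EINVAL if the handle cannot be decoded.
int do_pgls(const std::vector<uint8_t>& raw_handle, ListHandle* out) {
  try {
    *out = decode_handle(raw_handle);
  } catch (const end_of_buffer&) {
    return -EINVAL;  // bad handle from the client: reject it, do not abort
  }
  return 0;
}
```

With this shape, a handle in an unexpected format (for example one sent by an older client) makes do_pgls return -EINVAL to that caller, while the daemon keeps running.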
#2 Updated by Sage Weil almost 12 years ago
- Status changed from New to Resolved