Bug #1875

closed

osd: ReplicatedPG::do_op

Added by Wido den Hollander over 12 years ago. Updated over 12 years ago.

Status: Resolved
Priority: Normal
Category: OSD
% Done: 0%

Description

I just noticed two OSDs (osd.11 and osd.20) go down in my cluster.

Both OSDs have the same backtrace:

Core was generated by `/usr/bin/ceph-osd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f085d298f2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f085d298f2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000005f99e2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x00000000005f9b9d in handle_fatal_signal (signum=6) at global/signal_handler.cc:106
#3  <signal handler called>
#4  0x00007f085b8163a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f085b819b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f085c0d4d7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f085c0d2f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f085c0d2f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f085c0d304e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005cde57 in copy_out (dest=<optimized out>, l=<optimized out>, o=<optimized out>, this=<optimized out>) at ./include/buffer.h:193
#11 ceph::buffer::list::iterator::copy (this=0x7f084dca91a0, len=4, dest=0x7f084dca93c4 "") at common/buffer.cc:493
#12 0x00000000004be3f0 in decode_raw<unsigned long long> (t=@0x7f084dca93c0, p=...) at ./include/encoding.h:56
#13 decode (p=..., v=@0x7f084dca9178) at ./include/encoding.h:99
#14 decode (i=..., p=...) at ./include/object.h:194
#15 decode (this=0x7f084dca9170, bl=...) at ./include/object.h:347
#16 decode (p=..., c=...) at ./include/object.h:355
#17 ReplicatedPG::do_pg_op (this=0x527e000, op=0x375fd80) at osd/ReplicatedPG.cc:285
#18 0x00000000004ec8e5 in ReplicatedPG::do_op (this=0x527e000, op=0x375fd80) at osd/ReplicatedPG.cc:420
#19 0x0000000000530edd in OSD::dequeue_op (this=0x2a48000, pg=0x527e000) at osd/OSD.cc:5532
#20 0x00000000005cbbc6 in ThreadPool::worker (this=0x2a48408) at common/WorkQueue.cc:54
#21 0x000000000055368d in ThreadPool::WorkThread::entry (this=<optimized out>) at ./common/WorkQueue.h:120
#22 0x00007f085d290efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#23 0x00007f085b8c189d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x0000000000000000 in ?? ()
(gdb)

Both core files have almost the exact same timestamp:

root@atom2:~# stat /core.atom2.31856 
  File: `/core.atom2.31856'
  Size: 630030336     Blocks: 81136      IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 62          Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-01-04 14:15:28.106616697 +0100
Modify: 2012-01-04 14:15:28.522611737 +0100
Change: 2012-01-04 14:15:28.522611737 +0100
root@atom2:~#
root@atom5:~# stat /core.atom5.22990 
  File: `/core.atom5.22990'
  Size: 614453248     Blocks: 55392      IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 2833        Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-01-04 14:15:02.692203774 +0100
Modify: 2012-01-04 14:15:02.880204313 +0100
Change: 2012-01-04 14:15:02.880204313 +0100
root@atom5:~#

For some mysterious reason all my log files are empty, so I have no logs for this crash. It will probably be hard to track down without them.

Both OSDs had been running for about 3 weeks.

The version of both OSDs: ceph version 0.39-140-ge5f4910 (e5f49104ab62ba7bc42cf6ecf41c9257b46585f7)

Actions #1

Updated by Sage Weil over 12 years ago

  • Assignee set to Sage Weil
  • Target version set to v0.40

The PGLS iterator handle format was recently changed, and this crashed while decoding a handle. My guess is that an old binary tried to list PG contents.

Fixing it up so that it will return EINVAL on a bad handle instead of crashing! :)
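
A minimal sketch of that approach, not the actual Ceph change: the decode of a client-supplied handle is wrapped in a try/catch, so a short or malformed buffer becomes an -EINVAL return value instead of an uncaught exception that ends in std::terminate() and abort(), as in the backtrace above. The names Reader, ListHandle and handle_list_request are hypothetical stand-ins for illustration, and the 8-byte handle layout is an assumption rather than the real PGLS handle format.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a buffer iterator: copy() throws when asked
// to read past the end of the data, much like the copy() call in the
// backtrace did.
class Reader {
public:
  explicit Reader(const std::string& data) : data_(data), pos_(0) {}

  void copy(std::size_t len, void* dest) {
    if (len > data_.size() - pos_)
      throw std::out_of_range("read past end of buffer");
    std::memcpy(dest, data_.data() + pos_, len);
    pos_ += len;
  }

private:
  std::string data_;
  std::size_t pos_;
};

// Hypothetical listing handle: a single 64-bit position.
// This is an assumed layout, not Ceph's real handle format.
struct ListHandle {
  std::uint64_t position = 0;

  void decode(Reader& r) { r.copy(sizeof(position), &position); }
};

// Decode a handle received from a client. A handle that is too short
// (for example one written in an older format) yields -EINVAL; the
// exception never escapes into the worker thread.
int handle_list_request(const std::string& raw_handle, ListHandle* out) {
  try {
    Reader r(raw_handle);
    ListHandle h;
    h.decode(r);
    *out = h;
    return 0;
  } catch (const std::out_of_range&) {
    return -EINVAL;
  }
}
```

With this shape, a 4-byte buffer passed to handle_list_request is rejected with -EINVAL, while an 8-byte buffer decodes normally and returns 0.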

Actions #2

Updated by Sage Weil over 12 years ago

  • Status changed from New to Resolved