Bug #1875: osd: ReplicatedPG::do_op - Ceph

Bug #1875

closed

osd: ReplicatedPG::do_op

Added by Wido den Hollander over 12 years ago. Updated over 12 years ago.

Status: Resolved

Priority: Normal

Assignee: Sage Weil

Category: OSD

Target version: v0.40

Description

I just noticed that two OSDs (osd.11 and osd.20) went down in my cluster.

Both OSDs show the same backtrace (this one is from osd.20):

Core was generated by `/usr/bin/ceph-osd -i 20 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f085d298f2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f085d298f2b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000005f99e2 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  0x00000000005f9b9d in handle_fatal_signal (signum=6) at global/signal_handler.cc:106
#3  <signal handler called>
#4  0x00007f085b8163a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f085b819b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f085c0d4d7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f085c0d2f26 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f085c0d2f53 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f085c0d304e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000005cde57 in copy_out (dest=<optimized out>, l=<optimized out>, o=<optimized out>, this=<optimized out>) at ./include/buffer.h:193
#11 ceph::buffer::list::iterator::copy (this=0x7f084dca91a0, len=4, dest=0x7f084dca93c4 "") at common/buffer.cc:493
#12 0x00000000004be3f0 in decode_raw<unsigned long long> (t=@0x7f084dca93c0, p=...) at ./include/encoding.h:56
#13 decode (p=..., v=@0x7f084dca9178) at ./include/encoding.h:99
#14 decode (i=..., p=...) at ./include/object.h:194
#15 decode (this=0x7f084dca9170, bl=...) at ./include/object.h:347
#16 decode (p=..., c=...) at ./include/object.h:355
#17 ReplicatedPG::do_pg_op (this=0x527e000, op=0x375fd80) at osd/ReplicatedPG.cc:285
#18 0x00000000004ec8e5 in ReplicatedPG::do_op (this=0x527e000, op=0x375fd80) at osd/ReplicatedPG.cc:420
#19 0x0000000000530edd in OSD::dequeue_op (this=0x2a48000, pg=0x527e000) at osd/OSD.cc:5532
#20 0x00000000005cbbc6 in ThreadPool::worker (this=0x2a48408) at common/WorkQueue.cc:54
#21 0x000000000055368d in ThreadPool::WorkThread::entry (this=<optimized out>) at ./common/WorkQueue.h:120
#22 0x00007f085d290efc in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#23 0x00007f085b8c189d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x0000000000000000 in ?? ()
(gdb)
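
For readers unfamiliar with this pattern: frames #5 to #11 show a C++ exception thrown from the 4-byte buffer copy (`__cxa_throw`) ending in `std::terminate()` and `abort()`, which is what happens when no handler catches the exception. The sketch below illustrates that failure mode and one possible guard. It is not Ceph code: `ByteReader`, `decode_u32` and `try_decode_u32` are hypothetical names, and `std::out_of_range` is only a stand-in for whatever exception the real buffer code throws.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a buffer iterator; NOT ceph::buffer::list.
class ByteReader {
public:
    explicit ByteReader(const std::vector<unsigned char>& data) : data_(data) {}

    // Copy len bytes into dest; throw if fewer than len bytes remain.
    void copy(std::size_t len, void* dest) {
        if (len > data_.size() - pos_) {
            throw std::out_of_range("read past end of buffer");
        }
        std::memcpy(dest, data_.data() + pos_, len);
        pos_ += len;
    }

private:
    const std::vector<unsigned char>& data_;
    std::size_t pos_ = 0;
};

// Decode a 4-byte value. On short input the exception propagates to the
// caller; if no caller catches it, the runtime calls std::terminate().
std::uint32_t decode_u32(ByteReader& r) {
    std::uint32_t v = 0;
    r.copy(sizeof(v), &v);
    return v;
}

// Guarded variant (an assumption about one possible fix, not the actual
// patch): report a decode failure instead of letting the exception escape.
bool try_decode_u32(const std::vector<unsigned char>& payload,
                    std::uint32_t& out) {
    ByteReader r(payload);
    try {
        out = decode_u32(r);
        return true;
    } catch (const std::out_of_range&) {
        return false;  // caller can reject the malformed request
    }
}
```

With a 2-byte payload, `try_decode_u32` returns `false`, whereas calling `decode_u32` directly without a surrounding `try` block would abort the process in the same way as the trace above.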

The two core files were written within about 26 seconds of each other:

root@atom2:~# stat /core.atom2.31856 
  File: `/core.atom2.31856'
  Size: 630030336     Blocks: 81136      IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 62          Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-01-04 14:15:28.106616697 +0100
Modify: 2012-01-04 14:15:28.522611737 +0100
Change: 2012-01-04 14:15:28.522611737 +0100
root@atom2:~#

root@atom5:~# stat /core.atom5.22990 
  File: `/core.atom5.22990'
  Size: 614453248     Blocks: 55392      IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 2833        Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-01-04 14:15:02.692203774 +0100
Modify: 2012-01-04 14:15:02.880204313 +0100
Change: 2012-01-04 14:15:02.880204313 +0100
root@atom5:~#

For some unknown reason all my log files are empty, so I have no logs for this crash. It will probably be hard to track down without them.

Both OSDs had been running for about 3 weeks.

Both OSDs run the same version: ceph version 0.39-140-ge5f4910 (e5f49104ab62ba7bc42cf6ecf41c9257b46585f7)