Bug #460 (closed)

OSD crash: ReplicatedPG::push_to_replica / Rb_tree

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Description

After my cluster recovered from the latest crashes, I wanted to check if my RBD data was still intact.

This caused osd0 to crash:

Core was generated by `/usr/bin/cosd -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  std::_Rb_tree<snapid_t, std::pair<snapid_t const, unsigned long>, std::_Select1st<std::pair<snapid_t const, unsigned long> >, std::less<snapid_t>, std::allocator<std::pair<snapid_t const, unsigned long> > >::_M_begin (this=0x2218000, 
    snapset=..., soid=<value optimized out>, missing=..., data_subset=..., clone_subsets=...)
    at /usr/include/c++/4.4/bits/stl_tree.h:482
482          { return static_cast<_Link_type>(this->_M_impl._M_header._M_parent); }
(gdb) bt
#0  std::_Rb_tree<snapid_t, std::pair<snapid_t const, unsigned long>, std::_Select1st<std::pair<snapid_t const, unsigned long> >, std::less<snapid_t>, std::allocator<std::pair<snapid_t const, unsigned long> > >::_M_begin (this=0x2218000, 
    snapset=..., soid=<value optimized out>, missing=..., data_subset=..., clone_subsets=...)
    at /usr/include/c++/4.4/bits/stl_tree.h:482
#1  std::_Rb_tree<snapid_t, std::pair<snapid_t const, unsigned long>, std::_Select1st<std::pair<snapid_t const, unsigned long> >, std::less<snapid_t>, std::allocator<std::pair<snapid_t const, unsigned long> > >::lower_bound (this=0x2218000, 
    snapset=..., soid=<value optimized out>, missing=..., data_subset=..., clone_subsets=...)
    at /usr/include/c++/4.4/bits/stl_tree.h:745
#2  std::map<snapid_t, unsigned long, std::less<snapid_t>, std::allocator<std::pair<snapid_t const, unsigned long> > >::lower_bound (this=0x2218000, snapset=..., soid=<value optimized out>, missing=..., data_subset=..., clone_subsets=...)
    at /usr/include/c++/4.4/bits/stl_map.h:701
#3  std::map<snapid_t, unsigned long, std::less<snapid_t>, std::allocator<std::pair<snapid_t const, unsigned long> > >::operator[] (this=0x2218000, snapset=..., soid=<value optimized out>, missing=..., data_subset=..., clone_subsets=...)
    at /usr/include/c++/4.4/bits/stl_map.h:447
#4  ReplicatedPG::calc_clone_subsets (this=0x2218000, snapset=..., soid=<value optimized out>, missing=..., 
    data_subset=..., clone_subsets=...) at osd/ReplicatedPG.cc:2613
#5  0x000000000049571e in ReplicatedPG::push_to_replica (this=0x2218000, obc=<value optimized out>, soid=..., peer=8)
    at osd/ReplicatedPG.cc:2831
#6  0x0000000000496083 in ReplicatedPG::recover_object_replicas (this=0x2218000, soid=...) at osd/ReplicatedPG.cc:3682
#7  0x00000000004964ab in ReplicatedPG::recover_replicas (this=0x2218000, max=<value optimized out>)
    at osd/ReplicatedPG.cc:3715
#8  0x000000000049f0ba in ReplicatedPG::start_recovery_ops (this=0x2218000, max=1) at osd/ReplicatedPG.cc:3524
#9  0x00000000004d7c6c in OSD::do_recovery (this=0x1332000, pg=0x2218000) at osd/OSD.cc:4332
#10 0x00000000005c6c0f in ThreadPool::worker (this=0x13325f8) at common/WorkQueue.cc:44
#11 0x00000000004fd9ed in ThreadPool::WorkThread::entry() ()
#12 0x000000000046e82a in Thread::_entry_func (arg=0x2218000) at ./common/Thread.h:39
#13 0x00007fcfa13459ca in start_thread () from /lib/libpthread.so.0
#14 0x00007fcfa02fd6fd in clone () from /lib/libc.so.6
#15 0x0000000000000000 in ?? ()

Restarting the OSD caused it to crash again within a few seconds.

The core, binary and logs are available on logger.pcextreme.nl:/srv/ceph/issues/osd_crash_rb_tree
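
For context on the trace: frames #0-#4 show std::map::operator[] inlined into ReplicatedPG::calc_clone_subsets, crashing in _Rb_tree::_M_begin at stl_tree.h:482. The minimal stand-alone sketch below is not Ceph code; it only reproduces the shape of that failure: indexing a std::map that lives in memory that was never properly constructed walks a garbage tree root and faults on the first dereference. The struct and member names are placeholders.

#include <cstring>
#include <map>

// Placeholder types; the real SnapSet in the OSD code is more involved.
struct snapid_t {
  unsigned long long val;
  bool operator<(const snapid_t &o) const { return val < o.val; }
};

struct FakeSnapSet {
  // Stand-in for the map indexed around ReplicatedPG.cc:2613 (frame #3).
  std::map<snapid_t, unsigned long> clone_size;
};

int main() {
  // Memory that never went through FakeSnapSet's constructor, standing in for
  // a SnapSet that was never decoded because its attr was missing.
  alignas(FakeSnapSet) unsigned char raw[sizeof(FakeSnapSet)];
  std::memset(raw, 0xab, sizeof(raw));
  FakeSnapSet *ss = reinterpret_cast<FakeSnapSet *>(raw);

  // Undefined behaviour on purpose: operator[] calls lower_bound, which starts
  // at _Rb_tree::_M_begin and dereferences the (garbage) tree root pointer,
  // the same segfault as frame #0 of the backtrace above.
  return static_cast<int>(ss->clone_size[snapid_t{1}]);
}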

Actions #1

Updated by Wido den Hollander over 13 years ago

I tried starting the OSD again, but it crashed once more with almost the same backtrace:

Core was generated by `/usr/bin/cosd -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004b3d84 in operator<<(std::ostream&, SnapSet const&) ()
(gdb) bt
#0  0x00000000004b3d84 in operator<<(std::ostream&, SnapSet const&) ()
#1  0x00000000004956e5 in ReplicatedPG::push_to_replica (this=0x4539400, obc=<value optimized out>, soid=..., peer=8)
    at osd/ReplicatedPG.cc:2829
#2  0x0000000000496083 in ReplicatedPG::recover_object_replicas (this=0x4539400, soid=...) at osd/ReplicatedPG.cc:3682
#3  0x00000000004964ab in ReplicatedPG::recover_replicas (this=0x4539400, max=<value optimized out>)
    at osd/ReplicatedPG.cc:3715
#4  0x000000000049f0ba in ReplicatedPG::start_recovery_ops (this=0x4539400, max=1) at osd/ReplicatedPG.cc:3524
#5  0x00000000004d7c6c in OSD::do_recovery (this=0x212a000, pg=0x4539400) at osd/OSD.cc:4332
#6  0x00000000005c6c0f in ThreadPool::worker (this=0x212a5f8) at common/WorkQueue.cc:44
#7  0x00000000004fd9ed in ThreadPool::WorkThread::entry() ()
#8  0x000000000046e82a in Thread::_entry_func (arg=0x853780) at ./common/Thread.h:39
#9  0x00007ff3722809ca in start_thread () from /lib/libpthread.so.0
#10 0x00007ff3712386fd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

The last few log lines:

2010-10-05 06:22:05.142665 7ff367b0e710 osd0 18755 pg[3.138( v 11803'15716 (11803'15713,11803'15716]+backlog n=32 ec=2 les=18753 18752/18752/18752) [0,8] r=0 rops=1 lcod 0'0 mlcod 0'0 active] push_to_replica rb.0.5.0000000001ba/1a v11801'12950 size 0 to osd8
2010-10-05 06:22:05.142688 7ff367b0e710 filestore(/srv/ceph/osd.0) getattr /srv/ceph/osd.0/current/3.138_head/rb.0.5.0000000001ba_head 'snapset'
2010-10-05 06:22:05.142698 7ff367b0e710 filestore(/srv/ceph/osd.0) getattr /srv/ceph/osd.0/current/3.138_head/rb.0.5.0000000001ba_head 'snapset' = -2
2010-10-05 06:22:05.142707 7ff367b0e710 filestore(/srv/ceph/osd.0) getattr /srv/ceph/osd.0/current/3.138_head/rb.0.5.0000000001ba_snapdir 'snapset'
2010-10-05 06:22:05.142716 7ff367b0e710 filestore(/srv/ceph/osd.0) getattr /srv/ceph/osd.0/current/3.138_head/rb.0.5.0000000001ba_snapdir 'snapset' = -2

I added the new core and log files to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_rb_tree

Actions #2

Updated by Wido den Hollander over 13 years ago

I'm now seeing this crash on multiple OSDs.

Added some coredumps to the collection on the logger machine.

Actions #3

Updated by Wido den Hollander over 13 years ago

I just saw this crash again.

Used "cdebugpack" to gather the right files.

Added "issue_460_node02.tar.gz" to the directory on the logger machine.

Actions #4

Updated by Sage Weil over 13 years ago

The problem here is that we don't have the snapset attr. This happens when there is no _head and no _snapset object. This shouldn't ever happen... I'm pretty sure this is an object I manually futzed with the other day. Is there any timeline on getting node07 and 12 up? I'm wondering if there might be a copy on one of those hosts.
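
Since the crash boils down to this missing attr, here is a minimal stand-alone sketch (not Ceph code) of the lookup described above: the SnapSet is stored as a 'snapset' attr on either the _head object or the _snapdir object, and in the log from the first update both getattr calls return -2 (-ENOENT). The helper name and the on-disk xattr key ("user.ceph.snapset") below are assumptions for illustration only; the sketch just uses the Linux xattr API to report whether either object file carries the attr.

#include <sys/types.h>
#include <sys/xattr.h>
#include <iostream>
#include <string>
#include <vector>

// Try to read the raw snapset attr from the head object file, then from the
// snapdir object file, mirroring the two getattr lines in the log above.
// Returns true and fills 'out' if either file carries the attr.
bool load_snapset_attr(const std::string &head_path,
                       const std::string &snapdir_path,
                       std::vector<char> *out) {
  for (const std::string &path : {head_path, snapdir_path}) {
    ssize_t len = getxattr(path.c_str(), "user.ceph.snapset", nullptr, 0);
    if (len < 0)
      continue;  // attr missing or unreadable on this object: try the next one
    out->resize(static_cast<size_t>(len));
    if (getxattr(path.c_str(), "user.ceph.snapset", out->data(), out->size()) >= 0)
      return true;
  }
  return false;  // neither _head nor _snapdir has the attr: the crash condition
}

int main(int argc, char **argv) {
  if (argc < 3) {
    std::cerr << "usage: " << argv[0] << " <head-object-file> <snapdir-object-file>\n";
    return 1;
  }
  std::vector<char> raw;
  if (load_snapset_attr(argv[1], argv[2], &raw))
    std::cout << "snapset attr present, " << raw.size() << " bytes\n";
  else
    std::cout << "no snapset attr on either object\n";
  return 0;
}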

Actions #5

Updated by Wido den Hollander over 13 years ago

node07 and node12 have been online again for about 12 hours.

Actions #6

Updated by Tony Butler over 13 years ago

Sage Weil wrote:

This shouldn't ever happen...

I have this happening on quite a few OSDs in my test cluster. My general "fix" is to unmount, reformat, rebuild as a fresh OSD, and toss it back in; that usually works for a while, but eventually the same thing seems to happen again. I have noticed it may occur less often when I just run mkbtrfs on the entire device (no partitioning at all), but then the btrfs crash happens more often. I am using larger RAID-0 volumes and otherwise have to use GPT to make a partition, so I'm not sure whether that has anything to do with it.

I'm running Gentoo kernel 2.6.35-gentoo-r10 and the latest git as of a day or two ago (a7ed2ee05dc7453942018d7876401c28d3918214), and I will be updating to the very latest (0ff6e41d76cf65007df139e5c924dd9392372f4c) now. I am not using the RBD driver, but the Ceph kernel client that came with the Gentoo kernel.

Actions #7

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce