Bug #13074

closed

tcmalloc segfaults in Pipe::writer() when bufferlist goes out of scope

Added by Brad Hubbard over 8 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

@
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: /usr/bin/ceph-osd() [0xacb3ba]
2: (()+0x10340) [0x7faea044e340]
3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x103) [0x7faea067fac3]
4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x1b) [0x7faea067fb7b]
5: (operator delete(void*)+0x1f8) [0x7faea068ef68]
6: (std::_Rb_tree<int, std::pair<int const, std::list<Message*, std::allocator<Message*> > >, std::_Select1st<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >, std::less<int>, std::allocator<std::pair<int const, std::list<Message*, std::allocator<Message*> > > > >::_M_erase(std::_Rb_tree_node<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >*)+0x58) [0xca2438]
7: (std::_Rb_tree<int, std::pair<int const, std::list<Message*, std::allocator<Message*> > >, std::_Select1st<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >, std::less<int>, std::allocator<std::pair<int const, std::list<Message*, std::allocator<Message*> > > > >::erase(int const&)+0xdf) [0xca252f]
8: (Pipe::writer()+0x93c) [0xca097c]
9: (Pipe::Writer::entry()+0xd) [0xca40dd]
10: (()+0x8182) [0x7faea0446182]
11: (clone()+0x6d) [0x7fae9e9b100d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

$ grep bcfa5c -A5 -B35 dump.txt
bcfa27: 0f 88 e3 09 00 00 js bd0410 <_ZN4Pipe6writerEv+0x12f0>
ldout(msgr->cct,1) << "writer error sending " << m << ", "
<< cpp_strerror(errno) << dendl;
fault();
}
m->put();
bcfa2d: 4c 89 e7 mov %r12,%rdi
bcfa30: e8 9b 1b ad ff callq 6a15d0 <_ZN16RefCountedObject3putEv>
ptr(const char *d, unsigned l);
ptr(const ptr& p);
ptr(const ptr& p, unsigned o, unsigned l);
ptr& operator= (const ptr& p);
~ptr() {
release();
bcfa35: 49 8d 7f 18 lea 0x18(%r15),%rdi
bcfa39: e8 02 86 f9 ff callq b68040 <_ZN4ceph6buffer3ptr7releaseEv>
}
#endif

// This is what actually destroys the list.
~_List_base() _GLIBCXX_NOEXCEPT { _M_clear(); }
bcfa3e: 4c 89 ff mov %r15,%rdi
bcfa41: e8 2a af a6 ff callq 63a970 <_ZNSt10_List_baseIN4ceph6buffer3ptrESaIS2_EE8_M_clearEv>
bcfa46: e9 2d f7 ff ff jmpq bcf178 <_ZN4Pipe6writerEv+0x58>
bcfa4b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
if (!p->second.empty()) {
m = p->second.front();
p->second.pop_front();
}
if (p->second.empty())
out_q.erase(p->first);
bcfa50: 48 8d 70 20 lea 0x20(%rax),%rsi
bcfa54: 4c 89 ef mov %r13,%rdi
bcfa57: e8 34 97 00 00 callq bd9190 <_ZNSt8_Rb_treeIiSt4pairIKiSt4listIP7MessageSaIS4_EEESt10_Select1stIS7_ESt4lessIiESaIS7_EE5eraseERS1_>
bcfa5c: e9 dd fc ff ff jmpq bcf73e <_ZN4Pipe6writerEv+0x61e> <--------------------------------------------HERE
bcfa61: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
if (state != STATE_CONNECTING && state != STATE_WAIT && state != STATE_STANDBY &&
(is_queued() || in_seq > in_seq_acked)) {
@

The marked address is the instruction after the call, so we were most likely inside the call on the line above it. Which function is that?

$ c++filt _ZNSt8_Rb_treeIiSt4pairIKiSt4listIP7MessageSaIS4_EEESt10_Select1stIS7_ESt4lessIiESaIS7_EE5eraseERS1_
std::_Rb_tree<int, std::pair<int const, std::list<Message*, std::allocator<Message*> > >, std::_Select1st<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >, std::less<int>, std::allocator<std::pair<int const, std::list<Message*, std::allocator<Message*> > > > >::erase(int const&)

And that matches frame 7 so we are in the right place.
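
The c++filt step can also be done from code. This is a minimal sketch, assuming a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`. The helper name `demangle` is made up for illustration and is not part of Ceph.

```cpp
#include <cassert>
#include <cstdlib>    // std::free
#include <cxxabi.h>   // abi::__cxa_demangle (Itanium C++ ABI, GCC/Clang)
#include <string>

// Hypothetical helper: demangle an Itanium-ABI symbol name, much as c++filt does.
// Returns the input unchanged when the name cannot be demangled.
std::string demangle(const std::string& mangled) {
  int status = 0;
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0) {
    return mangled;  // not a valid mangled name (or allocation failed)
  }
  assert(out != nullptr);  // status 0 means a malloc()ed buffer was returned
  std::string result(out);
  std::free(out);          // the caller owns the buffer and must free() it
  return result;
}
```

Passing `_ZN4ceph6buffer3ptr7releaseEv` to this helper should give `ceph::buffer::ptr::release()`, the same text c++filt prints for that symbol.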

The relevant code is this.

1821 bufferlist blist = m->get_payload(); <------------NOTE
1822 blist.append(m->get_middle());
1823 blist.append(m->get_data());
1824
1825 pipe_lock.Unlock();
1826
1827 ldout(msgr->cct,20) << "writer sending " << m->get_seq() << " " << m << dendl;
1828 int rc = write_message(header, footer, blist);
1829
1830 pipe_lock.Lock();
1831 if (rc < 0) {
1832 ldout(msgr->cct,1) << "writer error sending " << m << ", "
1833 << cpp_strerror(errno) << dendl;
1834 fault();
1835 }
1836 m->put();
1837 }


256 class CEPH_BUFFER_API list {
257 // my private bits
258 std::list<ptr> _buffers; <------------NOTE

There's the std::list we've been looking for, and it's a list of "ptr"
objects, which fits in with this.


$ c++filt _ZN4ceph6buffer3ptr7releaseEv
ceph::buffer::ptr::release()

So let's look at that definition.

169 class CEPH_BUFFER_API ptr {
170 raw *_raw;
171 unsigned _off, _len;
172
173 void release();
174
175 public:
176 ptr() : _raw(0), _off(0), _len(0) {}
177 ptr(raw *r);
178 ptr(unsigned l);
179 ptr(const char *d, unsigned l);
180 ptr(const ptr& p);
181 ptr(const ptr& p, unsigned o, unsigned l);
182 ptr& operator= (const ptr& p);
183 ~ptr() {
184 release();
185 }

We've already seen this code in the disassembly above, so we are
definitely on the right track.

So when the blist variable goes out of scope the _buffers member variable of
type std::list<ptr> gets destroyed and that is why we end up in the standard
library function below.

$ c++filt _ZNSt10_List_baseIN4ceph6buffer3ptrESaIS2_EE8_M_clearEv
std::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::_M_clear()

So there is a problem with the bufferlist variable "blist" when it goes out of scope and attempts to destroy its std::list of ceph::buffer::ptr. This is similar to http://tracker.ceph.com/issues/3678 but the fix for that issue, commit d16ad9263d7b1d3c096f56c56e9631fae8509651, is already in 0.94.2, so this is possibly another race.
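
The destruction chain described here can be sketched in isolation. The types below (`Ptr`, `List`, `send_one`) are simplified, hypothetical stand-ins rather than the real ceph::buffer classes. The sketch only records when each element's `release()` runs relative to the end of the enclosing scope.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ceph::buffer::ptr: its destructor calls release().
struct Ptr {
  std::string name;
  std::vector<std::string>* log;  // not owned; records the order of events
  Ptr(std::string n, std::vector<std::string>* l) : name(std::move(n)), log(l) {}
  Ptr(const Ptr&) = delete;       // keep the sketch free of hidden copies
  ~Ptr() { release(); }
  void release() { log->push_back("release " + name); }
};

// Hypothetical stand-in for ceph::buffer::list, mirroring `std::list<ptr> _buffers`.
struct List {
  std::list<Ptr> buffers;
  void append(const std::string& name, std::vector<std::string>* log) {
    buffers.emplace_back(name, log);
  }
};

// Mimics the shape of the loop body: build blist, "send" it, leave scope.
std::vector<std::string> send_one() {
  std::vector<std::string> log;
  {
    List blist;
    blist.append("payload", &log);
    blist.append("middle", &log);
    blist.append("data", &log);
    log.push_back("write_message");
    assert(log.size() == 1);  // nothing has been released yet
  }  // blist leaves scope: ~List runs ~std::list, which runs ~Ptr per element
  log.push_back("after scope");
  return log;
}
```

`send_one()` returns a log that starts with `write_message`, contains three `release ...` entries, and ends with `after scope`. All the releases therefore happen at the closing brace, which corresponds to line 1837 in the listing.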

Actions #1

Updated by Greg Farnum over 8 years ago

I'm not quite following all the jumps between different code blocks here. What scenario is this crash appearing in? Do you have a core dump we can explore?

Actions #2

Updated by Brad Hubbard over 8 years ago

Sorry this isn't clearer, Greg. Unfortunately the failed formatting isn't helping, and I can't find a way to edit my original post to fix it.

What I'm trying to describe above is that I believe this is happening when we are destructing the blist variable of type bufferlist as it goes out of scope on line 1837 above.

bufferlist is a typedef for buffer::list and a buffer::list object contains a "std::list<ptr> _buffers;" member. So what we are seeing here is a std::list<ptr> variable going out of scope and being destroyed. It looks like one of the ceph::buffer::ptr in the std::list is invalid and when we pass it to tcmalloc to delete we get a segfault. So I suspect this is a race on one of the ceph::buffer::ptr. We don't have a core dump at this stage I'm afraid.
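
To make the suspected failure mode concrete, here is a minimal sketch of a reference-counted buffer handle. It assumes that each ptr shares ownership of a raw allocation through a counter. `Raw`, `Ptr`, `g_live` and `live_after_sharing` are illustrative names, not the actual Ceph implementation.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> g_live{0};  // number of Raw objects currently alive

// Hypothetical stand-in for a refcounted raw buffer.
struct Raw {
  std::atomic<int> nref{0};
  Raw() { ++g_live; }
  ~Raw() { --g_live; }
};

// Hypothetical stand-in for a ptr that shares ownership of a Raw.
class Ptr {
  Raw* raw_;
 public:
  explicit Ptr(Raw* r) : raw_(r) { if (raw_) raw_->nref.fetch_add(1); }
  Ptr(const Ptr& o) : raw_(o.raw_) { if (raw_) raw_->nref.fetch_add(1); }
  Ptr& operator=(const Ptr&) = delete;
  ~Ptr() { release(); }
  void release() {
    if (!raw_) return;
    // fetch_sub returns the previous count; only the holder that saw 1 deletes.
    int prev = raw_->nref.fetch_sub(1);
    assert(prev >= 1);  // a count below 1 here would mean an over-release
    if (prev == 1) delete raw_;
    raw_ = nullptr;
  }
};

// Two handles share one Raw; it must be freed exactly once when both are gone.
int live_after_sharing() {
  {
    Ptr a(new Raw);
    Ptr b(a);
    assert(g_live == 1);
  }
  return g_live.load();
}
```

In this sketch the counter is atomic, so exactly one holder frees the allocation. If the decrement were not atomic, or a handle were released twice, the same memory could be handed to the allocator twice, which is one way a crash inside `operator delete` can arise.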

Actions #3

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by Samuel Just over 8 years ago

  • Status changed from New to Need More Info

We'll need more information about the conditions this occurred in.

Actions #5

Updated by Sage Weil over 8 years ago

This looks like heap corruption or use-after-free or similar. How do you reproduce it?

Actions #6

Updated by Brad Hubbard over 8 years ago

I've asked the original reporter of this issue to add any information he can.

Actions #7

Updated by Brad Hubbard over 8 years ago

"This occurred on a system under moderate load - has not happened since and I do not know how to reproduce."

Actions #8

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to Can't reproduce
  • Priority changed from Urgent to High