Bug #13074
closed
tcmalloc segfaults in Pipe::writer() when bufferlist goes out of scope
Description
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: /usr/bin/ceph-osd() [0xacb3ba]
2: (()+0x10340) [0x7faea044e340]
3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x103) [0x7faea067fac3]
4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x1b) [0x7faea067fb7b]
5: (operator delete(void*)+0x1f8) [0x7faea068ef68]
6: (std::_Rb_tree<int, std::pair<int const, std::list<Message*, std::allocator<Message*> > >, std::_Select1st<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >, std::less<int>, std::allocator<std::pair<int const, std::list<Message*, std::allocator<Message*> > > > >::_M_erase(std::_Rb_tree_node<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >*)+0x58) [0xca2438]
7: (std::_Rb_tree<int, std::pair<int const, std::list<Message*, std::allocator<Message*> > >, std::_Select1st<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >, std::less<int>, std::allocator<std::pair<int const, std::list<Message*, std::allocator<Message*> > > > >::erase(int const&)+0xdf) [0xca252f]
8: (Pipe::writer()+0x93c) [0xca097c]
9: (Pipe::Writer::entry()+0xd) [0xca40dd]
10: (()+0x8182) [0x7faea0446182]
11: (clone()+0x6d) [0x7fae9e9b100d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
$ grep bcfa5c -A5 -B35 dump.txt
ldout(msgr->cct,1) << "writer error sending " << m << ", "
bcfa27: 0f 88 e3 09 00 00 js bd0410 <_ZN4Pipe6writerEv+0x12f0>
<< cpp_strerror(errno) << dendl;
fault();
}
m->put();
bcfa2d: 4c 89 e7 mov %r12,%rdi
bcfa30: e8 9b 1b ad ff callq 6a15d0 <_ZN16RefCountedObject3putEv>
ptr(const char *d, unsigned l);
ptr(const ptr& p);
ptr(const ptr& p, unsigned o, unsigned l);
ptr& operator= (const ptr& p);
~ptr() {
release();
bcfa35: 49 8d 7f 18 lea 0x18(%r15),%rdi
bcfa39: e8 02 86 f9 ff callq b68040 <_ZN4ceph6buffer3ptr7releaseEv>
}
#endif
// This is what actually destroys the list.
~_List_base() _GLIBCXX_NOEXCEPT
{ _M_clear(); }
bcfa3e: 4c 89 ff mov %r15,%rdi
bcfa41: e8 2a af a6 ff callq 63a970 <_ZNSt10_List_baseIN4ceph6buffer3ptrESaIS2_EE8_M_clearEv>
bcfa46: e9 2d f7 ff ff jmpq bcf178 <_ZN4Pipe6writerEv+0x58>
bcfa4b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
if (!p->second.empty()) {
m = p->second.front();
p->second.pop_front();
}
if (p->second.empty())
out_q.erase(p->first);
bcfa50: 48 8d 70 20 lea 0x20(%rax),%rsi
bcfa54: 4c 89 ef mov %r13,%rdi
bcfa57: e8 34 97 00 00 callq bd9190 <_ZNSt8_Rb_treeIiSt4pairIKiSt4listIP7MessageSaIS4_EEESt10_Select1stIS7_ESt4lessIiESaIS7_EE5eraseERS1_>
bcfa5c: e9 dd fc ff ff jmpq bcf73e <_ZN4Pipe6writerEv+0x61e> <--------------------------------------------HERE
bcfa61: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
if (state != STATE_CONNECTING && state != STATE_WAIT && state != STATE_STANDBY &&
(is_queued() || in_seq > in_seq_acked)) {
So the crash is likely in the call on the line just above the marker. Which function is that?
$ c++filt _ZNSt8_Rb_treeIiSt4pairIKiSt4listIP7MessageSaIS4_EEESt10_Select1stIS7_ESt4lessIiESaIS7_EE5eraseERS1_
std::_Rb_tree<int, std::pair<int const, std::list<Message*, std::allocator<Message*> > >, std::_Select1st<std::pair<int const, std::list<Message*, std::allocator<Message*> > > >, std::less<int>, std::allocator<std::pair<int const, std::list<Message*, std::allocator<Message*> > > > >::erase(int const&)
And that matches frame 7 so we are in the right place.
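For anyone who wants to do the same lookup from inside a program rather than with c++filt, here is a minimal sketch. It assumes a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle is hypothetical and not part of Ceph.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: demangle an Itanium C++ ABI symbol name,
// roughly what c++filt does on the command line.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer that the caller must free.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

Passing the symbols from the disassembly, for example demangle("_ZN4ceph6buffer3ptr7releaseEv"), should give the same text that c++filt prints.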
The relevant code is this.
1821 bufferlist blist = m->get_payload(); <------------NOTE
1822 blist.append(m->get_middle());
1823 blist.append(m->get_data());
1824
1825 pipe_lock.Unlock();
1826
1827 ldout(msgr->cct,20) << "writer sending " << m->get_seq() << " " << m << dendl;
1828 int rc = write_message(header, footer, blist);
1829
1830 pipe_lock.Lock();
1831 if (rc < 0) {
1832 ldout(msgr->cct,1) << "writer error sending " << m << ", "
1833 << cpp_strerror(errno) << dendl;
1834 fault();
1835 }
1836 m->put();
1837 }
256 class CEPH_BUFFER_API list {
257 // my private bits
258 std::list<ptr> _buffers; <------------NOTE
There's the std::list we've been looking for, and it's a list of "ptr"
objects, which fits in with this.
$ c++filt _ZN4ceph6buffer3ptr7releaseEv
ceph::buffer::ptr::release()
So let's look at that definition.
169 class CEPH_BUFFER_API ptr {
170 raw *_raw;
171 unsigned _off, _len;
172
173 void release();
174
175 public:
176 ptr() : _raw(0), _off(0), _len(0) {}
177 ptr(raw *r);
178 ptr(unsigned l);
179 ptr(const char *d, unsigned l);
180 ptr(const ptr& p);
181 ptr(const ptr& p, unsigned o, unsigned l);
182 ptr& operator= (const ptr& p);
183 ~ptr() {
184 release();
185 }
We've seen the code above in the disassembly in my last comment so we are
definitely on the right track.
So when the blist variable goes out of scope the _buffers member variable of
type std::list<ptr> gets destroyed and that is why we end up in the standard
library function below.
$ c++filt _ZNSt10_List_baseIN4ceph6buffer3ptrESaIS2_EE8_M_clearEv
std::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::_M_clear()
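The destruction order described here can be sketched with simplified stand-ins. Ptr, BufferList, call_log and writer_scope below are hypothetical names that only record the order of calls. They are not Ceph's classes, and nothing is actually allocated or freed.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Records the order in which the interesting calls happen.
std::vector<std::string> call_log;

// Hypothetical stand-in for ceph::buffer::ptr: the destructor calls release().
struct Ptr {
    void release() { call_log.push_back("Ptr::release"); }
    ~Ptr() { release(); }
};

// Hypothetical stand-in for ceph::buffer::list: it owns a std::list of Ptr.
struct BufferList {
    std::list<Ptr> _buffers;
    ~BufferList() { call_log.push_back("~BufferList"); }
};

// Mimics the block in Pipe::writer() where blist is a local variable.
void writer_scope(int n) {
    assert(n >= 0);
    BufferList blist;
    for (int i = 0; i < n; ++i) {
        blist._buffers.emplace_back();  // constructed in place, no temporaries
    }
    call_log.push_back("end of scope");
    // blist is destroyed here: the ~BufferList body runs first, then the
    // _buffers member is destroyed, which runs ~Ptr() on every element.
}
```

With two elements the log reads "end of scope", "~BufferList", "Ptr::release", "Ptr::release", which is the same chain as the closing brace of the block, then _M_clear(), then ptr::release() in the disassembly.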
So there is a problem with the bufferlist variable "blist" when it goes out of scope and attempts to destroy its std::list of ceph::buffer::ptr. This is similar to http://tracker.ceph.com/issues/3678 but, since commit d16ad9263d7b1d3c096f56c56e9631fae8509651 is already in 0.94.2, this is possibly another race.
Updated by Greg Farnum over 8 years ago
I'm not quite following all the jumps between different code blocks here. What scenario is this crash appearing in? Do you have a core dump we can explore?
Updated by Brad Hubbard over 8 years ago
Sorry this isn't clearer, Greg. Unfortunately, the failed formatting isn't helping, and I can't find a way to edit my original post to fix it.
What I'm trying to describe above is that I believe this is happening when we are destructing the blist variable of type bufferlist as it goes out of scope on line 1837 above.
bufferlist is a typedef for buffer::list and a buffer::list object contains a "std::list<ptr> _buffers;" member. So what we are seeing here is a std::list<ptr> variable going out of scope and being destroyed. It looks like one of the ceph::buffer::ptr in the std::list is invalid and when we pass it to tcmalloc to delete we get a segfault. So I suspect this is a race on one of the ceph::buffer::ptr. We don't have a core dump at this stage I'm afraid.
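To illustrate the class of failure suspected here (not a diagnosis of this crash), the sketch below models a reference count that has lost one increment, as a racy, non-atomic update could cause. Raw, Ptr and release_after_free are hypothetical names. The code sets a flag where real code would free the buffer, so it is safe to run.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted raw buffer.
struct Raw {
    int nref = 0;        // how many holders the buffer believes it has
    bool freed = false;  // set where real code would free the buffer
};

// Hypothetical stand-in for a ptr that shares ownership of a Raw.
struct Ptr {
    Raw* _raw;
    explicit Ptr(Raw* r) : _raw(r) {
        assert(r != nullptr);
        ++_raw->nref;
    }
    void release() {
        if (_raw) {
            if (--_raw->nref == 0) {
                _raw->freed = true;  // last reference: real code would free here
            }
            _raw = nullptr;
        }
    }
    ~Ptr() { release(); }
};

// Returns true if the buffer was freed while another Ptr still pointed at it.
bool release_after_free(bool lose_one_increment) {
    Raw raw;
    Ptr a(&raw);
    Ptr b(&raw);
    if (lose_one_increment) {
        --raw.nref;  // models an increment lost to a race: 2 holders, count of 1
    }
    a.release();
    bool stale = raw.freed;  // b has not released yet, so "freed" here means b dangles
    b.release();
    return stale;
}
```

With a correct count the buffer is only freed when the last holder lets go. With the lost increment the first release() frees it while b still refers to it, so whatever b does next touches freed memory, and the allocator may be the first place that notices.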
Updated by Samuel Just over 8 years ago
- Status changed from New to Need More Info
We'll need more information about the conditions this occurred in.
Updated by Sage Weil over 8 years ago
This looks like heap corruption or use-after-free or similar. How do you reproduce it?
Updated by Brad Hubbard over 8 years ago
I've asked the original reporter of this issue to add any information he can.
Updated by Brad Hubbard over 8 years ago
"This occurred on a system under moderate load - has not happened since and I do not know how to reproduce."
Updated by Sage Weil over 8 years ago
- Status changed from Need More Info to Can't reproduce
- Priority changed from Urgent to High