Bug #23966
closedDeleting a pool with active notify linger ops can result in seg fault
0%
Description
It's possible that if a notification is sent while a pool is being deleted, the Objecter will fail the Op w/ -ENOENT when it detects the pool deletion. However, the OSDs might not have seen the new map yet and might accept the op and issue the MWatchNotify op back to the client. Since the client doesn't expect to receive a MWatchNotify on a failed Op, it had already completed the 'on_notify_finish' callback but didn't reset the pointer to NULL.
Updated by Jason Dillaman almost 6 years ago
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
Updated by Jason Dillaman almost 6 years ago
- Status changed from In Progress to Fix Under Review
Updated by Kefu Chai almost 6 years ago
Updated by Kefu Chai almost 6 years ago
Updated by Kefu Chai almost 6 years ago
the test still fails with the fixes above: /a/kchai-2018-05-06_15:50:41-rados-wip-kefu-testing-2018-05-06-2204-distro-basic-smithi/2483532/remote/smithi042/coredump/1525626866.17646.core
#0 0x0000000000000000 in ?? () #1 0x00007f75b5b4bc39 in Context::complete (this=0x55a240bb9670, r=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/include/Context.h:77 #2 0x00007f75b5b8e21c in Objecter::_linger_commit (this=0x55a240b5c930, info=0x55a240bb9850, r=-2, outbl=...) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:596 #3 0x00007f75b5b4bc39 in Context::complete (this=0x55a240b678a0, r=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/include/Context.h:77 #4 0x00007f75b5ba3cce in Objecter::handle_osd_op_reply (this=this@entry=0x55a240b5c930, m=m@entry=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:3549 #5 0x00007f75b5baf5cb in Objecter::ms_dispatch (this=0x55a240b5c930, m=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:973 #6 0x00007f75b5bb4d62 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.h:2114 #7 0x00007f75ace6c486 in Messenger::ms_fast_dispatch (m=0x7f7560001d70, this=0x55a240b5b010) at /build/ceph-13.1.0-86-g56ac91f/src/msg/Messenger.h:638 #8 DispatchQueue::fast_dispatch (this=0x55a240b5b1b8, m=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/msg/DispatchQueue.cc:71 #9 0x00007f75acefc700 in Pipe::reader (this=0x55a240bbe510) at /build/ceph-13.1.0-86-g56ac91f/src/msg/simple/Pipe.cc:1774 #10 0x00007f75acf036ed in Pipe::Reader::entry (this=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/msg/simple/Pipe.h:51 #11 0x00007f75b58b46ba in ?? ()
by looking at the backtrace. the test executable was still able to lookup the vtable of the C_notify_Finish
instance for its complete() method. but it failed to do so when it tries to lookup C_notify_Finish::finish()
. it resolved it to 0x0000000000000000
.
(gdb) f 2 #2 0x00007f75b5b8e21c in Objecter::_linger_commit (this=0x55a240b5c930, info=0x55a240bb9850, r=-2, outbl=...) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:596 (gdb) p *('librados::(anonymous namespace)::C_notify_Finish'*)info->on_notify_finish $10 = {<Context> = {_vptr.Context = 0x7f75b5e178c8 <vtable for librados::(anonymous namespace)::C_notify_Finish+16>}, cct = 0x55a240a864c0, ctx = 0x7fff99633810, objecter = 0x55a240b5c930, linger_op = 0x55a240bb9850, reply_bl = {_buffers = {<std::__cxx11::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >> = { _M_impl = {<std::allocator<std::_List_node<ceph::buffer::ptr> >> = {<__gnu_cxx::new_allocator<std::_List_node<ceph::buffer::ptr> >> = {<No data fields>}, <No data fields>}, _M_node = {<std::__detail::_List_node_base> = {_M_next = 0x55a240bb9698, _M_prev = 0x55a240bb9698}, _M_storage = { _M_storage = "\000\000\000\000\000\000\000"}}}}, <No data fields>}, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {<ceph::buffer::list::iterator_impl<false>> = {<std::iterator<std::forward_iterator_tag, char, long, char*, char&>> = {<No data fields>}, bl = 0x55a240bb9698, ls = 0x55a240bb9698, off = 0, p = {_M_node = 0x55a240bb9698}, p_off = 0}, <No data fields>}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}, preply_bl = 0x0, preply_buf = 0x0, preply_buf_len = 0x0} (gdb) info vtbl info->on_notify_finish vtable for 'Context' @ 0x7f75b5e178c8 (subobject @ 0x55a240bb9670): [0]: 0x7f75b5b5a6b0 <librados::(anonymous namespace)::C_notify_Finish::finish(int)> [1]: 0x7f75b5b4bc20 <Context::sync_finish(int)> [2]: 0x7f75b5b5c620 <librados::(anonymous namespace)::C_notify_Finish::~C_notify_Finish()> [3]: 0x7f75b5b5c650 <librados::(anonymous namespace)::C_notify_Finish::~C_notify_Finish()> [4]: 0x7f75b5b4bc30 <Context::complete(int)> [5]: 0x7f75b5b4c3f0 <Context::sync_complete(int)>
but both vtable and the member variables of C_notify_Finish look sane.
========== Dump of assembler code for function Objecter::_linger_commit(Objecter::LingerOp*, int, ceph::buffer::list&): 0x00007f75b5b8e20f <+223>: test %rdi,%rdi 0x00007f75b5b8e212 <+226>: je 0x7f75b5b8e1b5 <Objecter::_linger_commit(Objecter::LingerOp*, int, ceph::buffer::list&)+133> 0x00007f75b5b8e214 <+228>: mov (%rdi),%rax 0x00007f75b5b8e217 <+231>: mov %ebp,%esi 0x00007f75b5b8e219 <+233>: callq *0x20(%rax) => 0x00007f75b5b8e21c <+236>: movq $0x0,0x2c8(%rbx) ========== (gdb) disass complete Dump of assembler code for function Context::complete(int): 0x00007f75b5b4bc30 <+0>: push %rbx 0x00007f75b5b4bc31 <+1>: mov (%rdi),%rax 0x00007f75b5b4bc34 <+4>: mov %rdi,%rbx 0x00007f75b5b4bc37 <+7>: callq *(%rax) => 0x00007f75b5b4bc39 <+9>: mov (%rbx),%rax 0x00007f75b5b4bc3c <+12>: mov %rbx,%rdi 0x00007f75b5b4bc3f <+15>: pop %rbx 0x00007f75b5b4bc40 <+16>: mov 0x18(%rax),%rax 0x00007f75b5b4bc44 <+20>: jmpq *%rax
(gdb) p /x *(long long*)(*(long long*)($rbx)) $17 = 0x7f75b5b5a6b0
and the address of C_notify_Finish::finish(int)
should resolve, per the stored $rbx.
Updated by Kefu Chai almost 6 years ago
will reset the member variables of C_notify_Finish in its dtor for debugging, to see if it has been destroyed or not when the segfault is triggered.
Updated by Jason Dillaman almost 6 years ago
Alternative Mimic fix: https://github.com/ceph/ceph/pull/21859
Updated by Kefu Chai almost 6 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Casey Bodley almost 6 years ago
- Copied to Backport #24058: jewel: Deleting a pool with active notify linger ops can result in seg fault added
Updated by Casey Bodley almost 6 years ago
- Copied to Backport #24059: luminous: Deleting a pool with active notify linger ops can result in seg fault added
Updated by Kefu Chai almost 6 years ago
forward-port for master: https://github.com/ceph/ceph/pull/21831
Updated by Yuri Weinstein almost 6 years ago
Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved