Project

General

Profile

Bug #23966

Deleting a pool with active notify linger ops can result in seg fault

Added by Jason Dillaman over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
Start date:
05/02/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous,jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

It's possible that if a notification is sent while a pool is being deleted, the Objecter will fail the Op w/ -ENOENT when it detects the pool deletion. However, the OSDs might not have seen the new map yet and might accept the op and issue the MWatchNotify op back to the client. Since the client doesn't expect to receive a MWatchNotify on a failed Op, it had already completed the 'on_notify_finish' callback but didn't reset the pointer to NULL.

http://qa-proxy.ceph.com/teuthology/trociny-2018-05-02_11:06:00-rbd-wip-mgolub-testing-distro-basic-smithi/2466798/teuthology.log


Related issues

Copied to RADOS - Backport #24058: jewel: Deleting a pool with active notify linger ops can result in seg fault Resolved
Copied to RADOS - Backport #24059: luminous: Deleting a pool with active notify linger ops can result in seg fault Resolved

History

#1 Updated by Jason Dillaman over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman

#2 Updated by Jason Dillaman over 1 year ago

  • Status changed from In Progress to Need Review

#5 Updated by Kefu Chai over 1 year ago

the test still fails with the fixes above: /a/kchai-2018-05-06_15:50:41-rados-wip-kefu-testing-2018-05-06-2204-distro-basic-smithi/2483532/remote/smithi042/coredump/1525626866.17646.core

#0  0x0000000000000000 in ?? ()
#1  0x00007f75b5b4bc39 in Context::complete (this=0x55a240bb9670, r=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/include/Context.h:77
#2  0x00007f75b5b8e21c in Objecter::_linger_commit (this=0x55a240b5c930, info=0x55a240bb9850, r=-2, outbl=...) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:596
#3  0x00007f75b5b4bc39 in Context::complete (this=0x55a240b678a0, r=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/include/Context.h:77
#4  0x00007f75b5ba3cce in Objecter::handle_osd_op_reply (this=this@entry=0x55a240b5c930, m=m@entry=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:3549
#5  0x00007f75b5baf5cb in Objecter::ms_dispatch (this=0x55a240b5c930, m=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:973
#6  0x00007f75b5bb4d62 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.h:2114
#7  0x00007f75ace6c486 in Messenger::ms_fast_dispatch (m=0x7f7560001d70, this=0x55a240b5b010) at /build/ceph-13.1.0-86-g56ac91f/src/msg/Messenger.h:638
#8  DispatchQueue::fast_dispatch (this=0x55a240b5b1b8, m=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/msg/DispatchQueue.cc:71
#9  0x00007f75acefc700 in Pipe::reader (this=0x55a240bbe510) at /build/ceph-13.1.0-86-g56ac91f/src/msg/simple/Pipe.cc:1774
#10 0x00007f75acf036ed in Pipe::Reader::entry (this=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/msg/simple/Pipe.h:51
#11 0x00007f75b58b46ba in ?? ()

by looking at the backtrace. the test executable was still able to lookup the vtable of the C_notify_Finish instance for its complete() method. but it failed to do so when it tries to lookup C_notify_Finish::finish(). it resolved it to 0x0000000000000000.

(gdb) f 2
#2  0x00007f75b5b8e21c in Objecter::_linger_commit (this=0x55a240b5c930, info=0x55a240bb9850, r=-2, outbl=...) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:596

(gdb) p *('librados::(anonymous namespace)::C_notify_Finish'*)info->on_notify_finish
$10 = {<Context> = {_vptr.Context = 0x7f75b5e178c8 <vtable for librados::(anonymous namespace)::C_notify_Finish+16>}, cct = 0x55a240a864c0, ctx = 0x7fff99633810,
  objecter = 0x55a240b5c930, linger_op = 0x55a240bb9850, reply_bl = {_buffers = {<std::__cxx11::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >> = {
        _M_impl = {<std::allocator<std::_List_node<ceph::buffer::ptr> >> = {<__gnu_cxx::new_allocator<std::_List_node<ceph::buffer::ptr> >> = {<No data fields>}, <No data fields>},
_M_node = {<std::__detail::_List_node_base> = {_M_next = 0x55a240bb9698, _M_prev = 0x55a240bb9698}, _M_storage = {
              _M_storage = "\000\000\000\000\000\000\000"}}}}, <No data fields>}, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0},
    last_p = {<ceph::buffer::list::iterator_impl<false>> = {<std::iterator<std::forward_iterator_tag, char, long, char*, char&>> = {<No data fields>}, bl = 0x55a240bb9698,
        ls = 0x55a240bb9698, off = 0, p = {_M_node = 0x55a240bb9698}, p_off = 0}, <No data fields>}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1},
  preply_bl = 0x0, preply_buf = 0x0, preply_buf_len = 0x0}

(gdb) info vtbl info->on_notify_finish
vtable for 'Context' @ 0x7f75b5e178c8 (subobject @ 0x55a240bb9670):
[0]: 0x7f75b5b5a6b0 <librados::(anonymous namespace)::C_notify_Finish::finish(int)>
[1]: 0x7f75b5b4bc20 <Context::sync_finish(int)>
[2]: 0x7f75b5b5c620 <librados::(anonymous namespace)::C_notify_Finish::~C_notify_Finish()>
[3]: 0x7f75b5b5c650 <librados::(anonymous namespace)::C_notify_Finish::~C_notify_Finish()>
[4]: 0x7f75b5b4bc30 <Context::complete(int)>
[5]: 0x7f75b5b4c3f0 <Context::sync_complete(int)>

but both vtable and the member variables of C_notify_Finish look sane.

==========

Dump of assembler code for function Objecter::_linger_commit(Objecter::LingerOp*, int, ceph::buffer::list&):
   0x00007f75b5b8e20f <+223>:   test   %rdi,%rdi
   0x00007f75b5b8e212 <+226>:   je     0x7f75b5b8e1b5 <Objecter::_linger_commit(Objecter::LingerOp*, int, ceph::buffer::list&)+133>
   0x00007f75b5b8e214 <+228>:   mov    (%rdi),%rax
   0x00007f75b5b8e217 <+231>:   mov    %ebp,%esi
   0x00007f75b5b8e219 <+233>:   callq  *0x20(%rax)
=> 0x00007f75b5b8e21c <+236>:   movq   $0x0,0x2c8(%rbx)

==========

(gdb) disass complete
Dump of assembler code for function Context::complete(int):
   0x00007f75b5b4bc30 <+0>:     push   %rbx
   0x00007f75b5b4bc31 <+1>:     mov    (%rdi),%rax
   0x00007f75b5b4bc34 <+4>:     mov    %rdi,%rbx
   0x00007f75b5b4bc37 <+7>:     callq  *(%rax)
=> 0x00007f75b5b4bc39 <+9>:     mov    (%rbx),%rax
   0x00007f75b5b4bc3c <+12>:    mov    %rbx,%rdi
   0x00007f75b5b4bc3f <+15>:    pop    %rbx
   0x00007f75b5b4bc40 <+16>:    mov    0x18(%rax),%rax
   0x00007f75b5b4bc44 <+20>:    jmpq   *%rax
(gdb) p /x *(long long*)(*(long long*)($rbx))
$17 = 0x7f75b5b5a6b0

and the address of C_notify_Finish::finish(int) should resolve, per the stored $rbx.

#6 Updated by Kefu Chai over 1 year ago

will reset the member variables of C_notify_Finish in its dtor for debugging, to see if it has been destroyed or not when the segfault is triggered.

#8 Updated by Kefu Chai over 1 year ago

  • Status changed from Need Review to Pending Backport

#9 Updated by Casey Bodley over 1 year ago

  • Copied to Backport #24058: jewel: Deleting a pool with active notify linger ops can result in seg fault added

#10 Updated by Casey Bodley over 1 year ago

  • Copied to Backport #24059: luminous: Deleting a pool with active notify linger ops can result in seg fault added

#11 Updated by Kefu Chai over 1 year ago

#13 Updated by Nathan Cutler about 1 year ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF