Project

General

Profile

Actions

Bug #23966

closed

Deleting a pool with active notify linger ops can result in seg fault

Added by Jason Dillaman almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Jason Dillaman
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous,jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It's possible that if a notification is sent while a pool is being deleted, the Objecter will fail the Op w/ -ENOENT when it detects the pool deletion. However, the OSDs might not have seen the new map yet and might accept the op and issue the MWatchNotify op back to the client. Since the client doesn't expect to receive a MWatchNotify on a failed Op, it had already completed the 'on_notify_finish' callback but didn't reset the pointer to NULL.

http://qa-proxy.ceph.com/teuthology/trociny-2018-05-02_11:06:00-rbd-wip-mgolub-testing-distro-basic-smithi/2466798/teuthology.log


Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #24058: jewel: Deleting a pool with active notify linger ops can result in seg faultResolvedKefu ChaiActions
Copied to RADOS - Backport #24059: luminous: Deleting a pool with active notify linger ops can result in seg faultResolvedPrashant DActions
Actions #1

Updated by Jason Dillaman almost 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
Actions #2

Updated by Jason Dillaman almost 6 years ago

  • Status changed from In Progress to Fix Under Review
Actions #5

Updated by Kefu Chai almost 6 years ago

the test still fails with the fixes above: /a/kchai-2018-05-06_15:50:41-rados-wip-kefu-testing-2018-05-06-2204-distro-basic-smithi/2483532/remote/smithi042/coredump/1525626866.17646.core

#0  0x0000000000000000 in ?? ()
#1  0x00007f75b5b4bc39 in Context::complete (this=0x55a240bb9670, r=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/include/Context.h:77
#2  0x00007f75b5b8e21c in Objecter::_linger_commit (this=0x55a240b5c930, info=0x55a240bb9850, r=-2, outbl=...) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:596
#3  0x00007f75b5b4bc39 in Context::complete (this=0x55a240b678a0, r=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/include/Context.h:77
#4  0x00007f75b5ba3cce in Objecter::handle_osd_op_reply (this=this@entry=0x55a240b5c930, m=m@entry=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:3549
#5  0x00007f75b5baf5cb in Objecter::ms_dispatch (this=0x55a240b5c930, m=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:973
#6  0x00007f75b5bb4d62 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.h:2114
#7  0x00007f75ace6c486 in Messenger::ms_fast_dispatch (m=0x7f7560001d70, this=0x55a240b5b010) at /build/ceph-13.1.0-86-g56ac91f/src/msg/Messenger.h:638
#8  DispatchQueue::fast_dispatch (this=0x55a240b5b1b8, m=0x7f7560001d70) at /build/ceph-13.1.0-86-g56ac91f/src/msg/DispatchQueue.cc:71
#9  0x00007f75acefc700 in Pipe::reader (this=0x55a240bbe510) at /build/ceph-13.1.0-86-g56ac91f/src/msg/simple/Pipe.cc:1774
#10 0x00007f75acf036ed in Pipe::Reader::entry (this=<optimized out>) at /build/ceph-13.1.0-86-g56ac91f/src/msg/simple/Pipe.h:51
#11 0x00007f75b58b46ba in ?? ()

by looking at the backtrace. the test executable was still able to lookup the vtable of the C_notify_Finish instance for its complete() method. but it failed to do so when it tries to lookup C_notify_Finish::finish(). it resolved it to 0x0000000000000000.

(gdb) f 2
#2  0x00007f75b5b8e21c in Objecter::_linger_commit (this=0x55a240b5c930, info=0x55a240bb9850, r=-2, outbl=...) at /build/ceph-13.1.0-86-g56ac91f/src/osdc/Objecter.cc:596

(gdb) p *('librados::(anonymous namespace)::C_notify_Finish'*)info->on_notify_finish
$10 = {<Context> = {_vptr.Context = 0x7f75b5e178c8 <vtable for librados::(anonymous namespace)::C_notify_Finish+16>}, cct = 0x55a240a864c0, ctx = 0x7fff99633810,
  objecter = 0x55a240b5c930, linger_op = 0x55a240bb9850, reply_bl = {_buffers = {<std::__cxx11::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >> = {
        _M_impl = {<std::allocator<std::_List_node<ceph::buffer::ptr> >> = {<__gnu_cxx::new_allocator<std::_List_node<ceph::buffer::ptr> >> = {<No data fields>}, <No data fields>},
_M_node = {<std::__detail::_List_node_base> = {_M_next = 0x55a240bb9698, _M_prev = 0x55a240bb9698}, _M_storage = {
              _M_storage = "\000\000\000\000\000\000\000"}}}}, <No data fields>}, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0},
    last_p = {<ceph::buffer::list::iterator_impl<false>> = {<std::iterator<std::forward_iterator_tag, char, long, char*, char&>> = {<No data fields>}, bl = 0x55a240bb9698,
        ls = 0x55a240bb9698, off = 0, p = {_M_node = 0x55a240bb9698}, p_off = 0}, <No data fields>}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1},
  preply_bl = 0x0, preply_buf = 0x0, preply_buf_len = 0x0}

(gdb) info vtbl info->on_notify_finish
vtable for 'Context' @ 0x7f75b5e178c8 (subobject @ 0x55a240bb9670):
[0]: 0x7f75b5b5a6b0 <librados::(anonymous namespace)::C_notify_Finish::finish(int)>
[1]: 0x7f75b5b4bc20 <Context::sync_finish(int)>
[2]: 0x7f75b5b5c620 <librados::(anonymous namespace)::C_notify_Finish::~C_notify_Finish()>
[3]: 0x7f75b5b5c650 <librados::(anonymous namespace)::C_notify_Finish::~C_notify_Finish()>
[4]: 0x7f75b5b4bc30 <Context::complete(int)>
[5]: 0x7f75b5b4c3f0 <Context::sync_complete(int)>

but both vtable and the member variables of C_notify_Finish look sane.

==========

Dump of assembler code for function Objecter::_linger_commit(Objecter::LingerOp*, int, ceph::buffer::list&):
   0x00007f75b5b8e20f <+223>:   test   %rdi,%rdi
   0x00007f75b5b8e212 <+226>:   je     0x7f75b5b8e1b5 <Objecter::_linger_commit(Objecter::LingerOp*, int, ceph::buffer::list&)+133>
   0x00007f75b5b8e214 <+228>:   mov    (%rdi),%rax
   0x00007f75b5b8e217 <+231>:   mov    %ebp,%esi
   0x00007f75b5b8e219 <+233>:   callq  *0x20(%rax)
=> 0x00007f75b5b8e21c <+236>:   movq   $0x0,0x2c8(%rbx)

==========

(gdb) disass complete
Dump of assembler code for function Context::complete(int):
   0x00007f75b5b4bc30 <+0>:     push   %rbx
   0x00007f75b5b4bc31 <+1>:     mov    (%rdi),%rax
   0x00007f75b5b4bc34 <+4>:     mov    %rdi,%rbx
   0x00007f75b5b4bc37 <+7>:     callq  *(%rax)
=> 0x00007f75b5b4bc39 <+9>:     mov    (%rbx),%rax
   0x00007f75b5b4bc3c <+12>:    mov    %rbx,%rdi
   0x00007f75b5b4bc3f <+15>:    pop    %rbx
   0x00007f75b5b4bc40 <+16>:    mov    0x18(%rax),%rax
   0x00007f75b5b4bc44 <+20>:    jmpq   *%rax
(gdb) p /x *(long long*)(*(long long*)($rbx))
$17 = 0x7f75b5b5a6b0

and the address of C_notify_Finish::finish(int) should resolve, per the stored $rbx.

Actions #6

Updated by Kefu Chai almost 6 years ago

will reset the member variables of C_notify_Finish in its dtor for debugging, to see if it has been destroyed or not when the segfault is triggered.

Actions #7

Updated by Jason Dillaman almost 6 years ago

Actions #8

Updated by Kefu Chai almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #9

Updated by Casey Bodley almost 6 years ago

  • Copied to Backport #24058: jewel: Deleting a pool with active notify linger ops can result in seg fault added
Actions #10

Updated by Casey Bodley almost 6 years ago

  • Copied to Backport #24059: luminous: Deleting a pool with active notify linger ops can result in seg fault added
Actions #11

Updated by Kefu Chai almost 6 years ago

Actions #13

Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF