Bug #39022
closed
msgr: segmentation fault in handle_auth_request
Description
2019-03-21 21:52:01.297 7fd96dd5e700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fd96dd5e700 thread_name:msgr-worker-1

 ceph version 14.2.0-273-g5076cbe (5076cbe4cf3d7aa19784336bc2406352d0560d63) octopus (dev)
 1: (()+0xf5d0) [0x7fd9726a05d0]
 2: (ProtocolV2::handle_auth_request(ceph::buffer::v14_2_0::list&)+0x335) [0x7fd974ab3c85]
 3: (ProtocolV2::handle_frame_payload()+0xc3) [0x7fd974abcf33]
 4: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x7fd974abd2f0]
 5: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x238) [0x7fd974abd628]
 6: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x7fd974aa31d4]
 7: (AsyncConnection::process()+0x186) [0x7fd974a72186]
 8: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa15) [0x7fd974ac68a5]
 9: (()+0x5523b5) [0x7fd974acb3b5]
 10: (()+0x7e327f) [0x7fd974d5c27f]
 11: (()+0x7dd5) [0x7fd972698dd5]
 12: (clone()+0x6d) [0x7fd971348ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
From: /ceph/teuthology-archive/pdonnell-2019-03-21_19:28:46-fs-wip-pdonnell-testing-20190321.172825-distro-basic-smithi/3756550/remote/smithi051/log/ceph-mds.c.log.gz
Core (RHEL7 distro): /ceph/teuthology-archive/pdonnell-2019-03-21_19:28:46-fs-wip-pdonnell-testing-20190321.172825-distro-basic-smithi/3756550/remote/smithi051/coredump/1553205121.138728.core
I haven't ruled out that the MDS is responsible, nor that this PR is responsible: https://github.com/ceph/ceph/pull/26348
Updated by Patrick Donnelly about 5 years ago
- Description updated (diff)
Another: /ceph/teuthology-archive/pdonnell-2019-04-06_02:21:29-fs-wip-pdonnell-testing-20190405.231924-distro-basic-smithi/3814500/teuthology.log
branch: https://github.com/ceph/ceph-ci/commits/wip-pdonnell-testing-20190405.231924
Updated by Brad Hubbard about 5 years ago
I've looked at both cores now and they are both the same issue.
(gdb) f
#4 0x00007fc40dfe9643 in ProtocolV2::handle_auth_request(ceph::buffer::v14_2_0::list&) () at /usr/src/debug/ceph-15.0.0-122-gcf4d304/src/msg/async/ProtocolV2.cc:2122
2122 auth_meta->con_mode = messenger->auth_server->pick_con_mode(
(gdb) disass /m
...
2122 auth_meta->con_mode = messenger->auth_server->pick_con_mode(
   0x00007fc40dfe9623 <+771>: mov 0x18(%rbp),%rax
   0x00007fc40dfe9627 <+775>: mov 0x10(%rbp),%rcx
   0x00007fc40dfe962b <+779>: mov 0x28(%rbp),%rdx
   0x00007fc40dfe962f <+783>: mov 0x118(%rax),%rdi
   0x00007fc40dfe9636 <+790>: mov 0x70(%rcx),%esi
   0x00007fc40dfe9639 <+793>: mov (%rdx),%edx
   0x00007fc40dfe963b <+795>: lea 0x98(%r12),%rcx
=> 0x00007fc40dfe9643 <+803>: mov (%rdi),%rax
   0x00007fc40dfe9646 <+806>: mov 0x20(%rax),%rax
(gdb) i r rdi
rdi 0x0 0
(gdb) i r rax
rax 0x55ee6d266900 94482521811200
(gdb) p messenger
$1 = (AsyncMessenger *) 0x55ee6d266900
(gdb) p ((AsyncMessenger*)0x0)->auth_server
Cannot access memory at address 0x118

So this is happening while we are trying to access the messenger auth_server, which should be set to the mon client and should be the same as messenger->auth_client.

(gdb) l MDSDaemon::init()
...
484 messenger->set_auth_client(monc);
485 messenger->set_auth_server(monc);
486 monc->set_handle_authentication_dispatcher(this);
(gdb) p messenger->auth_client
$8 = (AuthClient *) 0x7fff57e297b8
(gdb) p messenger->auth_server
$9 = (AuthServer *) 0x7fff57e297c0
(gdb) p mds->monc
$11 = (MonClient *) 0x7fff57e297a0

It's not clear to me at all how RDI became 0.
That's all I've got for now but hope to get back to this as time permits.
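For readers following the gdb session above: `mov 0x118(%rax),%rdi` loads `messenger->auth_server`, and the faulting `mov (%rdi),%rax` reads that object's vtable pointer for the virtual `pick_con_mode` call, so a null `auth_server` at that moment would produce exactly this crash. The three different addresses printed for `auth_client`, `auth_server` and `monc` are also what one would expect if a single object is viewed through different base classes. The sketch below illustrates both points with hypothetical stand-in types. The class bodies, the `Dispatcher` base, the `same_object` and `pick_con_mode_checked` helpers and the return values are all assumptions for illustration, not Ceph's actual code, and the null check is not the fix that was applied to this bug.

```cpp
#include <cassert>

// Hypothetical stand-ins; only the names mirror the gdb session.
struct Dispatcher {
  virtual ~Dispatcher() = default;
  long pad = 0;  // gives the first base some size so later bases are offset
};
struct AuthClient {
  virtual ~AuthClient() = default;
};
struct AuthServer {
  virtual ~AuthServer() = default;
  virtual int pick_con_mode(int peer_type, int auth_method) = 0;
};
// Assumption: the mon client implements both auth interfaces, as the two
// set_auth_*(monc) calls in MDSDaemon::init() suggest.
struct MonClient : Dispatcher, AuthClient, AuthServer {
  int pick_con_mode(int, int) override { return 1; }  // illustrative value
};
struct Messenger {
  AuthClient* auth_client = nullptr;
  AuthServer* auth_server = nullptr;
};

// True when both base pointers are non-null and refer to the same complete
// object, even though their raw addresses differ.
bool same_object(const AuthClient* c, const AuthServer* s) {
  return c && s &&
         dynamic_cast<const void*>(c) == dynamic_cast<const void*>(s);
}

// Guarded variant of the crashing call: returns -1 instead of dereferencing
// a null auth_server (the unguarded call would read the vtable via null).
int pick_con_mode_checked(const Messenger& m, int peer_type, int auth_method) {
  if (m.auth_server == nullptr) {
    return -1;
  }
  return m.auth_server->pick_con_mode(peer_type, auth_method);
}

// Small walkthrough of the two cases.
void demo() {
  MonClient monc;
  Messenger m;
  assert(pick_con_mode_checked(m, 0, 0) == -1);  // auth_server not yet set
  m.auth_client = &monc;  // implicit conversion adjusts the address
  m.auth_server = &monc;  // different adjustment, same object
  assert(same_object(m.auth_client, m.auth_server));
  assert(pick_con_mode_checked(m, 0, 0) == 1);
}
```

Calling `demo()` from a `main` exercises both cases. This does not explain how RDI became 0 in the core; it only shows that the differing pointer values are not themselves suspicious and what an unset or cleared `auth_server` would look like at the call site.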
Updated by Sage Weil over 4 years ago
- Status changed from New to Can't reproduce
I seem to remember fixing this, although I can't find the patch now. In any case, if it's not occurring any more, let's close it.