Bug #39022

closed

msgr: segmentation fault in handle_auth_request

Added by Patrick Donnelly about 5 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-03-21 21:52:01.297 7fd96dd5e700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fd96dd5e700 thread_name:msgr-worker-1

 ceph version 14.2.0-273-g5076cbe (5076cbe4cf3d7aa19784336bc2406352d0560d63) octopus (dev)
 1: (()+0xf5d0) [0x7fd9726a05d0]
 2: (ProtocolV2::handle_auth_request(ceph::buffer::v14_2_0::list&)+0x335) [0x7fd974ab3c85]
 3: (ProtocolV2::handle_frame_payload()+0xc3) [0x7fd974abcf33]
 4: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x7fd974abd2f0]
 5: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x238) [0x7fd974abd628]
 6: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x7fd974aa31d4]
 7: (AsyncConnection::process()+0x186) [0x7fd974a72186]
 8: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa15) [0x7fd974ac68a5]
 9: (()+0x5523b5) [0x7fd974acb3b5]
 10: (()+0x7e327f) [0x7fd974d5c27f]
 11: (()+0x7dd5) [0x7fd972698dd5]
 12: (clone()+0x6d) [0x7fd971348ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

From: /ceph/teuthology-archive/pdonnell-2019-03-21_19:28:46-fs-wip-pdonnell-testing-20190321.172825-distro-basic-smithi/3756550/remote/smithi051/log/ceph-mds.c.log.gz

Core (RHEL7 distro): /ceph/teuthology-archive/pdonnell-2019-03-21_19:28:46-fs-wip-pdonnell-testing-20190321.172825-distro-basic-smithi/3756550/remote/smithi051/coredump/1553205121.138728.core

I haven't ruled out that the MDS is responsible, nor that this PR is responsible: https://github.com/ceph/ceph/pull/26348

Actions #1

Updated by Patrick Donnelly about 5 years ago

  • Description updated (diff)
Actions #2

Updated by Patrick Donnelly about 5 years ago

  • Description updated (diff)
Actions #3

Updated by Patrick Donnelly about 5 years ago

  • Description updated (diff)

Another: /ceph/teuthology-archive/pdonnell-2019-04-06_02:21:29-fs-wip-pdonnell-testing-20190405.231924-distro-basic-smithi/3814500/teuthology.log

branch: https://github.com/ceph/ceph-ci/commits/wip-pdonnell-testing-20190405.231924

Actions #4

Updated by Brad Hubbard about 5 years ago

I've looked at both cores now and they are both the same issue.

(gdb) f
#4  0x00007fc40dfe9643 in ProtocolV2::handle_auth_request(ceph::buffer::v14_2_0::list&) () at /usr/src/debug/ceph-15.0.0-122-gcf4d304/src/msg/async/ProtocolV2.cc:2122
2122      auth_meta->con_mode = messenger->auth_server->pick_con_mode(
(gdb) disass /m
...
2122      auth_meta->con_mode = messenger->auth_server->pick_con_mode(
   0x00007fc40dfe9623 <+771>:   mov    0x18(%rbp),%rax
   0x00007fc40dfe9627 <+775>:   mov    0x10(%rbp),%rcx
   0x00007fc40dfe962b <+779>:   mov    0x28(%rbp),%rdx
   0x00007fc40dfe962f <+783>:   mov    0x118(%rax),%rdi
   0x00007fc40dfe9636 <+790>:   mov    0x70(%rcx),%esi
   0x00007fc40dfe9639 <+793>:   mov    (%rdx),%edx
   0x00007fc40dfe963b <+795>:   lea    0x98(%r12),%rcx
=> 0x00007fc40dfe9643 <+803>:   mov    (%rdi),%rax
   0x00007fc40dfe9646 <+806>:   mov    0x20(%rax),%rax
(gdb) i r rdi  
rdi            0x0      0
(gdb) i r rax
rax            0x55ee6d266900   94482521811200
(gdb) p messenger
$1 = (AsyncMessenger *) 0x55ee6d266900
(gdb) p ((AsyncMessenger*)0x0)->auth_server
Cannot access memory at address 0x118
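
A minimal sketch of what the faulting instructions appear to do, using hypothetical stand-in types (SketchMessenger and SketchAuthServer are not the Ceph classes; the padding exists only to place the pointer at offset 0x118, as in the gdb output). The member load at 0x118 yields the auth_server pointer, and the next load reads the vtable through it, so a null pointer faults before pick_con_mode is ever entered. The checked helper is an illustration of the failure mode, not a proposed fix.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an interface with a virtual pick_con_mode().
struct SketchAuthServer {
  virtual ~SketchAuthServer() = default;
  virtual int pick_con_mode(int peer_type) { return peer_type + 1; }
};

// Hypothetical stand-in for the messenger; pad mirrors the 0x118 offset.
struct SketchMessenger {
  char pad[0x118];
  SketchAuthServer* auth_server = nullptr;
};

// Performs the virtual call only if auth_server is set.
// Returns false (leaving out untouched) when the pointer is null,
// which is the state RDI was in at the faulting instruction.
bool pick_con_mode_checked(const SketchMessenger& m, int peer_type, int& out) {
  if (m.auth_server == nullptr) {
    return false;  // an unchecked call here would read the vtable at address 0
  }
  out = m.auth_server->pick_con_mode(peer_type);
  return true;
}
```

Calling pick_con_mode_checked on a freshly constructed SketchMessenger returns false; after assigning auth_server it forwards to the virtual function.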

So the crash happens while dereferencing the messenger's auth_server, which should be set to the mon client, the same object that messenger->auth_client points to.

(gdb) l MDSDaemon::init()
...
484       messenger->set_auth_client(monc);
485       messenger->set_auth_server(monc);
486       monc->set_handle_authentication_dispatcher(this);

(gdb) p messenger->auth_client
$8 = (AuthClient *) 0x7fff57e297b8
(gdb) p messenger->auth_server
$9 = (AuthServer *) 0x7fff57e297c0
(gdb) p mds->monc
$11 = (MonClient *) 0x7fff57e297a0
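
The three addresses above differ (auth_client is monc + 0x18, auth_server is monc + 0x20), which is consistent with one object viewed through different base-class pointers. The sketch below shows that effect with hypothetical stand-in classes; it assumes the mon client derives from both the auth client and auth server interfaces, and the Demo* names are illustrative, not Ceph's definitions. The exact offsets depend on the real class layout and are not reproduced here.

```cpp
#include <cassert>

// Hypothetical stand-ins for the base interfaces.
struct DemoDispatcher { virtual ~DemoDispatcher() = default; };
struct DemoAuthClient { virtual ~DemoAuthClient() = default; };
struct DemoAuthServer { virtual ~DemoAuthServer() = default; };

// One object with three polymorphic bases (assumed inheritance shape).
struct DemoMonClient : DemoDispatcher, DemoAuthClient, DemoAuthServer {};

struct DemoMessenger {
  DemoAuthClient* auth_client = nullptr;
  DemoAuthServer* auth_server = nullptr;
};

// True if both base pointers refer to the same DemoMonClient object.
bool same_object(DemoAuthClient* c, DemoAuthServer* s) {
  return dynamic_cast<DemoMonClient*>(c) == dynamic_cast<DemoMonClient*>(s);
}
```

Assigning the address of one DemoMonClient to both members stores two different raw addresses, because each implicit conversion adjusts the pointer to its base subobject, yet same_object reports that they refer to the same object.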

It's not clear to me at all how RDI became 0.

That's all I've got for now, but I hope to get back to this as time permits.
Actions #5

Updated by Sage Weil over 4 years ago

  • Status changed from New to Can't reproduce

I seem to remember fixing this, although I can't find the patch now. In any case, if it's not occurring any more, let's close it.
