Project

General

Profile

Bug #43808

segv in MonClient::handle_auth_done

Added by Sage Weil about 1 month ago. Updated 9 days ago.

Status:
Need More Info
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

  -157> 2020-01-24T10:35:17.317+0000 1200a700 -1 *** Caught signal (Segmentation fault) **
 in thread 1200a700 thread_name:msgr-worker-1

 ceph version 15.0.0-9704-gb1192ba (b1192ba8ca76fda4688de6ceec97751d65ca697f) octopus (dev)
 1: (()+0x12dd0) [0xc253dd0]
 2: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x78) [0xf244c8]
 3: (ProtocolV2::handle_auth_done(ceph::buffer::v14_2_0::list&)+0x2f6) [0x10b6b06]
 4: (ProtocolV2::handle_frame_payload()+0x1e3) [0x10c4e23]
 5: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x10c50d0]
 6: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x475) [0x10c56a5]
 7: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x10abe5c]
 8: (AsyncConnection::process()+0x8a9) [0x10826b9]
 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0xed4447]
 10: (()+0xdd1a2c) [0xed9a2c]
 11: (()+0xc2a33) [0xcb7ba33]
 12: (()+0x82de) [0xc2492de]
 13: (clone()+0x43) [0xd4e44b3]

/a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4698040

Related issues

Related to RADOS - Bug #44062: LibRadosWatchNotify.WatchNotify failure Triaged

History

#1 Updated by Sage Weil about 1 month ago

no core, no real clue in the log.

trying to reproduce here: http://pulpito.front.sepia.ceph.com:80/sage-2020-01-24_17:17:32-rados:verify-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/
from: teuthology-suite -s rados/verify --filter validater/valgrind.yaml -c wip-sage4-testing-2020-01-23-1347 -k distro -m smithi -p 90 --subset 1/9 -N 3

#2 Updated by Sage Weil about 1 month ago

  • Status changed from New to Need More Info

#3 Updated by Sage Weil about 1 month ago

  • Target version set to v15.0.0

#4 Updated by Sage Weil 28 days ago

/a/sage-2020-01-29_20:14:58-rados-wip-sage-testing-2020-01-29-1034-distro-basic-smithi/4718136

ceph-osd.4.log

description: rados/verify/{centos_latest.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml}
d-thrash/default/{default.yaml thrashosds-health.yaml} msgr-failures/few.yaml msgr/async-v1only.yaml
objectstore/bluestore-stupid.yaml rados.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml}

#5 Updated by Sage Weil 22 days ago

/a/sage-2020-02-05_03:10:48-rados-wip-sage2-testing-2020-02-04-1448-distro-basic-smithi/4733235

ceph-osd.2.log

description: rados/verify/{centos_latest.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml}
d-thrash/default/{default.yaml thrashosds-health.yaml} msgr-failures/few.yaml msgr/async-v1only.yaml
objectstore/filestore-xfs.yaml rados.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml}

#6 Updated by Brad Hubbard 13 days ago

  • Related to Bug #44062: LibRadosWatchNotify.WatchNotify failure added

#7 Updated by Brad Hubbard 13 days ago

  • Crash signature updated (diff)
  • Affected Versions v0.80.11 added

#8 Updated by Brad Hubbard 13 days ago

  • Affected Versions v15.0.0 added
  • Affected Versions deleted (v0.80.11)

#9 Updated by Brad Hubbard 13 days ago

See https://tracker.ceph.com/issues/44062#note-9

{
    "crash_id": "2020-02-13T01:23:06.403017Z_b067bba7-fd83-4557-ae3d-a7b04d2d6c17",
    "timestamp": "2020-02-13T01:23:06.403017Z",
    "process_name": "memcheck-amd64-",
    "entity_name": "osd.1",
    "ceph_version": "15.1.0-441-g39dbd5e",
    "utsname_hostname": "smithi103",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-147.el8.x86_64",
    "utsname_version": "#1 SMP Wed Dec 4 21:51:45 UTC 2019",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8 (Core)",
    "backtrace": [
        "(()+0x12dc0) [0xc23fdc0]",
        "(MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x78) [0xf159b8]",
        "(ProtocolV2::handle_auth_done(ceph::buffer::v14_2_0::list&)+0x350) [0x10a7d40]",
        "(ProtocolV2::handle_frame_payload()+0x1e3) [0x10b6013]",
        "(ProtocolV2::handle_read_frame_dispatch()+0x160) [0x10b62c0]",
        "(ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x37d) [0x10b679d]",
        "(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x109cdfc]",
        "(AsyncConnection::process()+0x8a9) [0x1073499]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0xec5d37]",
        "(()+0xdc331c) [0xecb31c]",
        "(()+0xc2b23) [0xcb67b23]",
        "(()+0x82de) [0xc2352de]",
        "(clone()+0x43) [0xd4d0133]" 
    ]
}

The only entries in the log that stand out to me are the following, although they are more than a second before the crash and probably the result of an injected error.

2020-02-13T01:23:05.310+0000 127f7700  1 -- 172.21.15.103:0/20065 >> [v2:172.21.15.109:6814/20141,v1:172.21.15.109:6815/20141] conn(0x14d683b0 msgr2=0x3dfa8d80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send send error: (32) Broken pipe
2020-02-13T01:23:05.310+0000 127f7700  1 --2- 172.21.15.103:0/20065 >> [v2:172.21.15.109:6814/20141,v1:172.21.15.109:6815/20141] conn(0x14d683b0 0x3dfa8d80 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).write client ident write failed r=-32 ((32) Broken pipe)

From the valgrind log.

<error>
  <unique>0x0</unique>
  <tid>5</tid>
  <threadname>msgr-worker-2</threadname>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 8</what>
  <stack>
    <frame>
      <ip>0xF159B8</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&amp;, CryptoKey*, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt;*)</fn>
      <dir>/usr/src/debug/ceph-15.1.0-441.g39dbd5e.el8.x86_64/src/mon</dir>
      <file>MonClient.cc</file>
      <line>1559</line>
...
  <auxwhat>Address 0x0 is not stack'd, malloc'd or (recently) free'd</auxwhat>

So this looks like a null pointer read.

That's here.

1512 int MonClient::handle_auth_done(                                                                                                                                                                                                        
1513   Connection *con,                                                                                                                                                                                                                      
1514   AuthConnectionMeta *auth_meta,                                                                                                                                                                                                        
1515   uint64_t global_id,                                                                                                                                                                                                                   
1516   uint32_t con_mode,                                                                                                                                                                                                                    
1517   const ceph::buffer::list& bl,                                                                                                                                                                                                         
1518   CryptoKey *session_key,                                                                                                                                                                                                               
1519   std::string *connection_secret)                                                                                                                                                                                                       
1520 {
...
1556   } else {                                                                                                                                                                                                                              
1557     // verify authorizer reply                                                                                                                                                                                                          
1558     auto p = bl.begin();                                                                                                                                                                                                                
1559     if (!auth_meta->authorizer->verify_reply(p, &auth_meta->connection_secret)) {                                                                                                                                                       
1560       ldout(cct, 0) << __func__ << " failed verifying authorizer reply" 

Bad bufferlist maybe?

Looking at the disassembled ceph-osd (objdump -rdS).

This is our function and we segfaulted at an offset into the function of +0x78 according to the stack trace so that's 0xe0d940+0x78=0xe0d9b8.

0000000000e0d940 <_ZN9MonClient16handle_auth_doneEP10ConnectionP18AuthConnectionMetamjRKN4ceph6buffer7v14_2_04listEP9CryptoKeyPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE@@Base>:
...
  e0d950:       41 54                   push   %r12
  e0d952:       49 89 d4                mov    %rdx,%r12
...
  e0d9a6:       e8 d5 ed fc ff          callq  ddc780 <_ZN4ceph6buffer7v14_2_04list13iterator_implILb1EEC1EPKS2_j@@Base>
  e0d9ab:       49 8b 7c 24 78          mov    0x78(%r12),%rdi
  e0d9b0:       49 8d 54 24 58          lea    0x58(%r12),%rdx
  e0d9b5:       48 89 de                mov    %rbx,%rsi
  e0d9b8:       48 8b 07                mov    (%rdi),%rax  <===

To be continued.

#10 Updated by Brad Hubbard 10 days ago

Using gdb to break on MonClient::handle_auth_done in a local vstart cluster I see the following.

(gdb) p ((AuthConnectionMeta *)0x0)->connection_secret
Cannot access memory at address 0x58
(gdb) p ((AuthConnectionMeta *)0x0)->authorizer
Cannot access memory at address 0x78

They match the offsets we observed in the objdump assembly output and this leaads me to believe that the cause was that the authorizer field was NULL. A ceph_assert to check for that should catch it so that's the next strategy.

#11 Updated by Brad Hubbard 9 days ago

  • Crash signature updated (diff)

Also available in: Atom PDF