Project

General

Profile

Bug #43808

segv in MonClient::handle_auth_done

Added by Sage Weil 6 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

  -157> 2020-01-24T10:35:17.317+0000 1200a700 -1 *** Caught signal (Segmentation fault) **
 in thread 1200a700 thread_name:msgr-worker-1

 ceph version 15.0.0-9704-gb1192ba (b1192ba8ca76fda4688de6ceec97751d65ca697f) octopus (dev)
 1: (()+0x12dd0) [0xc253dd0]
 2: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x78) [0xf244c8]
 3: (ProtocolV2::handle_auth_done(ceph::buffer::v14_2_0::list&)+0x2f6) [0x10b6b06]
 4: (ProtocolV2::handle_frame_payload()+0x1e3) [0x10c4e23]
 5: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x10c50d0]
 6: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x475) [0x10c56a5]
 7: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x10abe5c]
 8: (AsyncConnection::process()+0x8a9) [0x10826b9]
 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0xed4447]
 10: (()+0xdd1a2c) [0xed9a2c]
 11: (()+0xc2a33) [0xcb7ba33]
 12: (()+0x82de) [0xc2492de]
 13: (clone()+0x43) [0xd4e44b3]

/a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4698040

Related issues

Related to RADOS - Bug #44062: LibRadosWatchNotify.WatchNotify failure Resolved
Copied to Messengers - Backport #44516: nautilus: segv in MonClient::handle_auth_done Resolved

History

#1 Updated by Sage Weil 6 months ago

no core, no real clue in the log.

trying to reproduce here: http://pulpito.front.sepia.ceph.com:80/sage-2020-01-24_17:17:32-rados:verify-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/
from: teuthology-suite -s rados/verify --filter validater/valgrind.yaml -c wip-sage4-testing-2020-01-23-1347 -k distro -m smithi -p 90 --subset 1/9 -N 3

#2 Updated by Sage Weil 6 months ago

  • Status changed from New to Need More Info

#3 Updated by Sage Weil 6 months ago

  • Target version set to v15.0.0

#4 Updated by Sage Weil 5 months ago

/a/sage-2020-01-29_20:14:58-rados-wip-sage-testing-2020-01-29-1034-distro-basic-smithi/4718136

ceph-osd.4.log

description: rados/verify/{centos_latest.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml}
d-thrash/default/{default.yaml thrashosds-health.yaml} msgr-failures/few.yaml msgr/async-v1only.yaml
objectstore/bluestore-stupid.yaml rados.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml}

#5 Updated by Sage Weil 5 months ago

/a/sage-2020-02-05_03:10:48-rados-wip-sage2-testing-2020-02-04-1448-distro-basic-smithi/4733235

ceph-osd.2.log

description: rados/verify/{centos_latest.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml}
d-thrash/default/{default.yaml thrashosds-health.yaml} msgr-failures/few.yaml msgr/async-v1only.yaml
objectstore/filestore-xfs.yaml rados.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml}

#6 Updated by Brad Hubbard 5 months ago

  • Related to Bug #44062: LibRadosWatchNotify.WatchNotify failure added

#7 Updated by Brad Hubbard 5 months ago

  • Crash signature updated (diff)
  • Affected Versions v0.80.11 added

#8 Updated by Brad Hubbard 5 months ago

  • Affected Versions v15.0.0 added
  • Affected Versions deleted (v0.80.11)

#9 Updated by Brad Hubbard 5 months ago

See https://tracker.ceph.com/issues/44062#note-9

{
    "crash_id": "2020-02-13T01:23:06.403017Z_b067bba7-fd83-4557-ae3d-a7b04d2d6c17",
    "timestamp": "2020-02-13T01:23:06.403017Z",
    "process_name": "memcheck-amd64-",
    "entity_name": "osd.1",
    "ceph_version": "15.1.0-441-g39dbd5e",
    "utsname_hostname": "smithi103",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-147.el8.x86_64",
    "utsname_version": "#1 SMP Wed Dec 4 21:51:45 UTC 2019",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8 (Core)",
    "backtrace": [
        "(()+0x12dc0) [0xc23fdc0]",
        "(MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x78) [0xf159b8]",
        "(ProtocolV2::handle_auth_done(ceph::buffer::v14_2_0::list&)+0x350) [0x10a7d40]",
        "(ProtocolV2::handle_frame_payload()+0x1e3) [0x10b6013]",
        "(ProtocolV2::handle_read_frame_dispatch()+0x160) [0x10b62c0]",
        "(ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x37d) [0x10b679d]",
        "(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x109cdfc]",
        "(AsyncConnection::process()+0x8a9) [0x1073499]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0xec5d37]",
        "(()+0xdc331c) [0xecb31c]",
        "(()+0xc2b23) [0xcb67b23]",
        "(()+0x82de) [0xc2352de]",
        "(clone()+0x43) [0xd4d0133]" 
    ]
}

The only entries in the log that stand out to me are the following, although they are more than a second before the crash and probably the result of an injected error.

2020-02-13T01:23:05.310+0000 127f7700  1 -- 172.21.15.103:0/20065 >> [v2:172.21.15.109:6814/20141,v1:172.21.15.109:6815/20141] conn(0x14d683b0 msgr2=0x3dfa8d80 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send send error: (32) Broken pipe
2020-02-13T01:23:05.310+0000 127f7700  1 --2- 172.21.15.103:0/20065 >> [v2:172.21.15.109:6814/20141,v1:172.21.15.109:6815/20141] conn(0x14d683b0 0x3dfa8d80 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).write client ident write failed r=-32 ((32) Broken pipe)

From the valgrind log.

<error>
  <unique>0x0</unique>
  <tid>5</tid>
  <threadname>msgr-worker-2</threadname>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 8</what>
  <stack>
    <frame>
      <ip>0xF159B8</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&amp;, CryptoKey*, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt;*)</fn>
      <dir>/usr/src/debug/ceph-15.1.0-441.g39dbd5e.el8.x86_64/src/mon</dir>
      <file>MonClient.cc</file>
      <line>1559</line>
...
  <auxwhat>Address 0x0 is not stack'd, malloc'd or (recently) free'd</auxwhat>

So this looks like a null pointer read.

That's here.

1512 int MonClient::handle_auth_done(                                                                                                                                                                                                        
1513   Connection *con,                                                                                                                                                                                                                      
1514   AuthConnectionMeta *auth_meta,                                                                                                                                                                                                        
1515   uint64_t global_id,                                                                                                                                                                                                                   
1516   uint32_t con_mode,                                                                                                                                                                                                                    
1517   const ceph::buffer::list& bl,                                                                                                                                                                                                         
1518   CryptoKey *session_key,                                                                                                                                                                                                               
1519   std::string *connection_secret)                                                                                                                                                                                                       
1520 {
...
1556   } else {                                                                                                                                                                                                                              
1557     // verify authorizer reply                                                                                                                                                                                                          
1558     auto p = bl.begin();                                                                                                                                                                                                                
1559     if (!auth_meta->authorizer->verify_reply(p, &auth_meta->connection_secret)) {                                                                                                                                                       
1560       ldout(cct, 0) << __func__ << " failed verifying authorizer reply" 

Bad bufferlist maybe?

Looking at the disassembled ceph-osd (objdump -rdS).

This is our function and we segfaulted at an offset into the function of +0x78 according to the stack trace so that's 0xe0d940+0x78=0xe0d9b8.

0000000000e0d940 <_ZN9MonClient16handle_auth_doneEP10ConnectionP18AuthConnectionMetamjRKN4ceph6buffer7v14_2_04listEP9CryptoKeyPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE@@Base>:
...
  e0d950:       41 54                   push   %r12
  e0d952:       49 89 d4                mov    %rdx,%r12
...
  e0d9a6:       e8 d5 ed fc ff          callq  ddc780 <_ZN4ceph6buffer7v14_2_04list13iterator_implILb1EEC1EPKS2_j@@Base>
  e0d9ab:       49 8b 7c 24 78          mov    0x78(%r12),%rdi
  e0d9b0:       49 8d 54 24 58          lea    0x58(%r12),%rdx
  e0d9b5:       48 89 de                mov    %rbx,%rsi
  e0d9b8:       48 8b 07                mov    (%rdi),%rax  <===

To be continued.

#10 Updated by Brad Hubbard 5 months ago

Using gdb to break on MonClient::handle_auth_done in a local vstart cluster I see the following.

(gdb) p ((AuthConnectionMeta *)0x0)->connection_secret
Cannot access memory at address 0x58
(gdb) p ((AuthConnectionMeta *)0x0)->authorizer
Cannot access memory at address 0x78

They match the offsets we observed in the objdump assembly output and this leaads me to believe that the cause was that the authorizer field was NULL. A ceph_assert to check for that should catch it so that's the next strategy.

#11 Updated by Brad Hubbard 5 months ago

  • Crash signature updated (diff)

#12 Updated by Neha Ojha 4 months ago

/a/yuriw-2020-03-03_20:34:30-rados-master-distro-basic-smithi/4822427/ - trying to capture logs before the job goes dead

#13 Updated by Sage Weil 4 months ago

/a/sage-2020-03-03_19:20:19-rados-wip-sage2-testing-2020-03-03-0959-distro-basic-smithi/4821464

got the osd log.

  <kind>InvalidRead</kind>
  <what>Invalid read of size 8</what>
  <stack>
    <frame>
      <ip>0xF1E698</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v15_2_0::list const&amp;, CryptoKey*, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt;*)</fn>
      <dir>/usr/src/debug/ceph-15.1.0-1606.gecd8670.el8.x86_64/src/mon</dir>
      <file>MonClient.cc</file>
      <line>1559</line>
    </frame>
    <frame>
      <ip>0x10B4CBF</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&amp;)</fn>
      <dir>/usr/src/debug/ceph-15.1.0-1606.gecd8670.el8.x86_64/src/msg/async</dir>
      <file>ProtocolV2.cc</file>
      <line>1846</line>
    </frame>
    <frame>
      <ip>0x10C2E42</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ProtocolV2::handle_frame_payload()</fn>
      <dir>/usr/src/debug/ceph-15.1.0-1606.gecd8670.el8.x86_64/src/msg/async</dir>
      <file>ProtocolV2.cc</file>
      <line>1246</line>
    </frame>

#14 Updated by Sage Weil 4 months ago

ok, in one thread, we are faulting:

2020-03-04T18:46:48.246+0000 12807700  1 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2020-03-04T18:46:48.247+0000 12807700  1 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=SESSION_RECONNECTING pgs=863 cs=41 l=0 rx=0 tx=0).write reconnect write failed r=-32 ((32) Broken pipe)
2020-03-04T18:46:48.247+0000 12807700 10 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=SESSION_RECONNECTING pgs=863 cs=41 l=0 rx=0 tx=0)._fault
2020-03-04T18:46:48.275+0000 12807700  5 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=SESSION_RECONNECTING pgs=863 cs=41 l=0 rx=0 tx=0).reset_recv_state
2020-03-04T18:46:48.275+0000 12807700  1 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=START_CONNECT pgs=863 cs=42 l=0 rx=0 tx=0)._fault waiting 0.200000
...

which submits the reset_recv_state in the eventcenter,
  connection->center->submit_to(connection->center->get_id(), [this] {
    ldout(cct, 5) << "reset_recv_state reseting crypto handlers" << dendl;

but, the submitting threads goes on to do new and interesting things,
2020-03-04T18:46:48.829+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTING l=0).process
2020-03-04T18:46:48.940+0000 12807700 10 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTING_RE l=0).process connect successfully, ready to send banner
2020-03-04T18:46:48.941+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=START_CONNECT pgs=863 cs=42 l=0 rx=0 tx=0).read_event
2020-03-04T18:46:48.941+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=START_CONNECT pgs=863 cs=42 l=0 rx=0 tx=0).start_client_banner_exchange
2020-03-04T18:46:48.941+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=BANNER_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0)._banner_exchange
2020-03-04T18:46:48.962+0000 12807700 10 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 26 remaining bytes 0
2020-03-04T18:46:48.962+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=10
2020-03-04T18:46:49.025+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=BANNER_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0)._handle_peer_banner r=0
2020-03-04T18:46:49.025+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=16
2020-03-04T18:46:49.025+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=BANNER_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0)._handle_peer_banner_payload r=0
2020-03-04T18:46:49.025+0000 12807700  1 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=BANNER_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2020-03-04T18:46:49.032+0000 12807700 10 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 85 remaining bytes 0
2020-03-04T18:46:49.032+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).read_frame
2020-03-04T18:46:49.033+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=32
2020-03-04T18:46:49.033+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_preamble_main r=0
2020-03-04T18:46:49.033+0000 12807700 10 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_preamble_main got new segment: len=36 align=8
2020-03-04T18:46:49.033+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).read_frame_segment
2020-03-04T18:46:49.033+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=36
2020-03-04T18:46:49.033+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_segment r=0
2020-03-04T18:46:49.033+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=17
2020-03-04T18:46:49.034+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_epilogue_main r=0
2020-03-04T18:46:49.034+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_epilogue_main message integrity check success:  expected_crc=855695412 calculated_crc=855695412
2020-03-04T18:46:49.034+0000 12807700 10 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_dispatch tag=1
2020-03-04T18:46:49.034+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_hello payload.length()=36
2020-03-04T18:46:49.034+0000 12807700  5 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:172.21.15.180:47886/0
2020-03-04T18:46:49.034+0000 12807700  5 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_hello getsockname says I am 172.21.15.180:47886 when talking to v2:172.21.15.173:6809/534448
2020-03-04T18:46:49.035+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=HELLO_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).post_client_banner_exchange
2020-03-04T18:46:49.035+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).send_auth_request peer_type 4 auth_client 0x1ffefff190
2020-03-04T18:46:49.035+0000 12807700 10 monclient: get_auth_request con 0xfa60910 auth_method 2
2020-03-04T18:46:49.047+0000 12807700 10 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 243 remaining bytes 0
2020-03-04T18:46:49.047+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).read_frame
2020-03-04T18:46:49.047+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=32

like sending a new auth request.

then the reset_recv_state event actually happens in the event thread,

2020-03-04T18:46:49.651+0000 12807700  5 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 crc :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).reset_recv_state reseting crypto handlers

which clears auth_meta,

    auth_meta.reset(new AuthConnectionMeta);

and then we do more interesting things like process the auth response

2020-03-04T18:46:50.783+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).process
2020-03-04T18:46:50.783+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=32
2020-03-04T18:46:50.797+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_preamble_main r=0
2020-03-04T18:46:50.797+0000 12807700 10 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_preamble_main got new segment: len=52 align=8
2020-03-04T18:46:50.798+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).read_frame_segment
2020-03-04T18:46:50.798+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=52
2020-03-04T18:46:50.798+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_segment r=0
2020-03-04T18:46:50.798+0000 12807700 20 -- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 msgr2=0x14ccac20 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=17
2020-03-04T18:46:50.798+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_epilogue_main r=0
2020-03-04T18:46:50.798+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_epilogue_main message integrity check success:  expected_crc=2822581327 calculated_crc=2822581327
2020-03-04T18:46:50.798+0000 12807700 10 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_read_frame_dispatch tag=6
2020-03-04T18:46:50.798+0000 12807700 20 --2- v2:172.21.15.180:6816/3161509 >> v2:172.21.15.173:6809/534448 conn(0xfa60910 0x14ccac20 unknown :-1 s=AUTH_CONNECTING pgs=863 cs=42 l=0 rx=0 tx=0).handle_auth_done payload.length()=52

...and crash because the auth_meta was pulled out from underneath us.

#15 Updated by Sage Weil 4 months ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Radoslaw Zarzynski

This is related to 8a86719b062cd3b4b6c64a1c027564c092600117, which made this reset work async.

#16 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from In Progress to Fix Under Review

#17 Updated by Sage Weil 4 months ago

  • Pull request ID set to 33742

#18 Updated by Sage Weil 4 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus

I'm guessing this should go to nautilus?

#19 Updated by Radoslaw Zarzynski 4 months ago

  • Copied to Backport #44516: nautilus: segv in MonClient::handle_auth_done added

#20 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF