Project

General

Profile

Actions

Bug #40026

closed

segfaults in ProtocolV2::write_message()

Added by Kefu Chai almost 5 years ago. Updated about 4 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
AsyncMessenger
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

da4b6d3718217e1315bac3f77b0626dc8731be67c6bdbc979abb7bfc84f77322

Crash signature (v2):

Description

2019-05-23T23:11:43.053 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp: [ RUN      ] LibRadosTierPP.FlushWriteRaces
2019-05-23T23:11:43.053 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp: *** Caught signal (Segmentation fault) **
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  in thread 7faddb59e700 thread_name:msgr-worker-0
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  ceph version 15.0.0-1350-ga320639 (a32063954d0e8ec0931671f858f620ab0a5c011c) octopus (dev)
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  1: (()+0x12890) [0x7fae07b2e890]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  2: (ProtocolV2::write_message(Message*, bool)+0x6cf) [0x7fadff038fbf]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  3: (ProtocolV2::write_event()+0x183) [0x7fadff04ed13]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  4: (AsyncConnection::handle_write()+0x43) [0x7fadff00b9e3]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x135f) [0x7fadff05de7f]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  6: (()+0x52f40b) [0x7fadff06340b]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  7: (()+0xbe733) [0x7fadfe1e8733]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  8: (()+0x76db) [0x7fae07b236db]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  9: (clone()+0x3f) [0x7fadfdc4288f]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp: 2019-05-23 23:11:43.047 7faddb59e700 -1 *** Caught signal (Segmentation fault) **
...
2019-05-23T23:11:43.063 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:     -3> 2019-05-23 23:11:31.035 7fadf8fd4700  5 asok(0x5555f6e729b0) entry start
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:     -2> 2019-05-23 23:11:31.063 7fadf8fd4700  5 asok(0x5555f6e729b0) AdminSocket: request '0' '' to 0x5555f6f37ec0 returned 1 bytes
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:     -1> 2019-05-23 23:11:32.851 7fadf8fd4700  5 asok(0x5555f6e729b0) AdminSocket: request '0' '' to 0x5555f6f37ec0 returned 1 bytes
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:      0> 2019-05-23 23:11:43.047 7faddb59e700 -1 *** Caught signal (Segmentation fault) **
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp:  in thread 7faddb59e700 thread_name:msgr-worker-0

/a/kchai-2019-05-23_21:46:20-rados-wip-kefu-testing-2019-05-23-0009-distro-basic-smithi/3973056

it was a test branch but the branch didn't contain any messenger related changes.


Related issues 2 (0 open2 closed)

Has duplicate RADOS - Bug #41214: Segmentation fault in mon/test_config_key_caps.shDuplicate08/12/2019

Actions
Is duplicate of Ceph - Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part)ResolvedRadoslaw Zarzynski

Actions
Actions #1

Updated by Brad Hubbard over 4 years ago

  • Has duplicate Bug #41214: Segmentation fault in mon/test_config_key_caps.sh added
Actions #2

Updated by Brad Hubbard over 4 years ago

  • Category set to AsyncMessenger
  • Status changed from New to 12
  • Assignee set to Brad Hubbard
  • Source set to Development

From the instance in #41214

(gdb) bt
#0  0x00007f3942914dba in ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer (session_stream_handlers=..., this=0x7f393cd3a580)
    at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/frames_v2.h:270
#1  ProtocolV2::write_message (this=this@entry=0x7f39380f7710, m=m@entry=0x7f3920002fc0, more=more@entry=false) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:515
#2  0x00007f394292de5b in ProtocolV2::write_event (this=0x7f39380f7710) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:614
#3  0x00007f39428e7ff3 in AsyncConnection::handle_write (this=0x7f39380f5190) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/AsyncConnection.cc:692
#4  0x00007f394293d9c7 in EventCenter::process_events (this=this@entry=0x7f39380f3a90, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f393cd3ae88)
    at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/Event.cc:441
#5  0x00007f39429421d8 in NetworkStack::<lambda()>::operator() (__closure=0x7f39380f32e8) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/Stack.cc:53
#6  std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#7  0x00007f3942c42b1f in execute_native_thread_routine () from /usr/lib/ceph/libceph-common.so.0
#8  0x00007f394d28d6ba in start_thread (arg=0x7f393cd3b700) at pthread_create.c:333
#9  0x00007f394cfc341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) l
265             epilogue_secure_block_t epilogue;
266             ::memset(&epilogue, 0, sizeof(epilogue));
267             ceph::bufferlist epilogue_bl;
268             epilogue_bl.append(reinterpret_cast<const char*>(&epilogue),
269                                sizeof(epilogue));
270             session_stream_handlers.tx->authenticated_encrypt_update(epilogue_bl);
271           }
272           return session_stream_handlers.tx->authenticated_encrypt_final();
273         } else {
274           // plain mode
(gdb) x/i $pc
=> 0x7f3942914dba <ProtocolV2::write_message(Message*, bool)+1754>:     mov    (%rdi),%rax
(gdb) i reg rdi
rdi            0x0      0

Still analysing this.

Actions #3

Updated by Brad Hubbard over 4 years ago

We obtain the value for %rdi from an object at the address of %r15.

   0x00007f3942914dae <+1742>:  callq  0x7f394296b900 <ceph::buffer::v14_2_0::list::append(char const*, unsigned int)>
   0x00007f3942914db3 <+1747>:  mov    0x40(%r15),%rdi
   0x00007f3942914db7 <+1751>:  mov    %rbp,%rsi
=> 0x00007f3942914dba <+1754>:  mov    (%rdi),%rax

(gdb) i r r15
r15            0x7f39380f7710   139883730401040
(gdb) up
#1  ProtocolV2::write_message (this=this@entry=0x7f39380f7710, m=m@entry=0x7f3920002fc0, more=more@entry=false) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:515
515       connection->outcoming_bl.append(message.get_buffer(session_stream_handlers));
(gdb) whatis this
type = ProtocolV2 * const

(gdb) p ((ProtocolV2*)0x0)->session_stream_handlers->tx
Cannot access memory at address 0x40
(gdb) p &this->session_stream_handlers->tx
$10 = (std::unique_ptr<ceph::crypto::onwire::TxHandler, std::default_delete<ceph::crypto::onwire::TxHandler> > *) 0x7f39380f7750
(gdb) x/a 0x7f39380f7750
0x7f39380f7750: 0x0

So it looks like we segfaulted because the session_stream_handlers were not set (rx is 0x0 as well).

Actions #4

Updated by Brad Hubbard over 4 years ago

We are in a block that checks whether session_stream_handlers.tx is valid so I'm going to assume for now that this is a race and look for another thread that cleared the session_stream_handlers.

238 public:                                                                                                                                                                                                                                  
239   ceph::bufferlist get_buffer(                                                                                                                                                                                                           
240     ceph::crypto::onwire::rxtx_t &session_stream_handlers)                                                                                                                                                                               
241   {                                                                                                                                                                                                                                      
242     fill_preamble();                                                                                                                                                                                                                     
243     if (session_stream_handlers.tx) {
Actions #5

Updated by Brad Hubbard over 4 years ago

Thread 3 is shutting down the mon client.

Thread 3 (Thread 0x7f393ef3c700 (LWP 14894)):                                                                         
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185                                                                                                                                           
#1  0x00007f3942bf385c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/ceph/libceph-common.so.0                                                                                                             
#2  0x00007f394265baff in Finisher::wait_for_empty (this=this@entry=0x7f393ef3b300) at /build/ceph-14.2.2-258-gd95a3b7/src/common/Finisher.cc:35                                                                                             
#3  0x00007f3942998328 in MonClient::shutdown (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:498                                                                                                   
#4  0x00007f394299d6a6 in MonClient::get_monmap_and_config (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:177
(gdb) t
[Current thread is 1 (Thread 0x7f393cd3b700 (LWP 14896))]
(gdb) f
#3  0x00007f39428e7ff3 in AsyncConnection::handle_write (this=0x7f39380f5190) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/AsyncConnection.cc:692
692       protocol->write_event();
(gdb) p this->msgr
$43 = (Messenger *) 0x7f39380edfe0
(gdb) t
[Current thread is 3 (Thread 0x7f393ef3c700 (LWP 14894))]
(gdb) f
#4  0x00007f394299d6a6 in MonClient::get_monmap_and_config (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:177
177       shutdown();
(gdb) p messenger
$44 = (Messenger *) 0x7f39380edfe0

So both threads share the same Messenger and are potentially manipulating the same structures. Not clear to me yet what is happening here.

Actions #6

Updated by Brad Hubbard over 4 years ago

I managed to reproduce this rerunning the job from #41214 after three or four repetitions. The coredump is similar but also quite different so hopefully will shed more light on the issue.

Actions #7

Updated by Brad Hubbard over 4 years ago

OK, I think I see what's happening here. I'll write it up and work out a PR when I get a chance. Basically, when we call MonClient::get_monmap_and_config we end up calling MonClient::shutdown which results in the connection being destructed which is not good while it's still being used in the thread write_message :P More soon.

Actions #8

Updated by Kefu Chai over 4 years ago

2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc: [ RUN      ] LibRadosMisc.ShutdownRace
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc: *** Caught signal (Segmentation fault) **
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  in thread 7f993df83700 thread_name:msgr-worker-2
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  ceph version 15.0.0-5728-g2069c7f (2069c7f06f156d5465a98a88331dfd8bcb6b79f4) octopus (dev)
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  1: (()+0xf5d0) [0x7f9ac18ad5d0]
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  2: (ProtocolV2::write_message(Message*, bool)+0x6c4) [0x7f9ab896f514]
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  3: (ProtocolV2::write_event()+0x17b) [0x7f9ab898597b]
2019-10-03T10:53:13.388 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  4: (AsyncConnection::handle_write()+0x40) [0x7f9ab8941950]
2019-10-03T10:53:13.388 INFO:tasks.workunit.client.0.mira109.stdout:                 api_misc:  5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1427) [0x7f9ab8992877]

/a//kchai-2019-10-03_10:23:44-rados-wip-kefu-testing-2019-10-02-1028-distro-basic-mira/4353765

Actions #9

Updated by Brad Hubbard over 4 years ago

Unfortunately, the latest core is not useful.

(gdb) t a a bt
warning: Couldn't find general-purpose registers in core file.
PC register is not available

I can only view the thread that faulted which is the same as above. I am following up on this however.

Actions #10

Updated by Kefu Chai over 4 years ago

2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout:                 api_misc:  1: (()+0x12890) [0x7fd5f1dd3890]
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout:                 api_misc:  2: (ProtocolV2::write_message(Message*, bool)+0x5f3) [0x7fd5e94ad333]
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout:                 api_misc:  3: (ProtocolV2::write_event()+0x17b) [0x7fd5e94c32ab]
2019-10-14T07:22:34.779 INFO:tasks.workunit.client.0.smithi089.stdout:                 api_misc:  4: (AsyncConnection::handle_write()+0x43) [0x7fd5e947deb3]
2019-10-14T07:22:34.779 INFO:tasks.workunit.client.0.smithi089.stdout:                 api_misc:  5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned lon
g, std::ratio<1l, 1000000000l> >*)+0x13bf) [0x7fd5e94d080f]

/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412648

/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412640

2019-10-14T07:45:10.437 INFO:tasks.workunit.client.0.smithi072.stdout:                 api_misc:  in thread 7f00517ba700 thread_name:msgr-worker-0
....
2019-10-14T07:45:15.013 INFO:tasks.workunit.client.0.smithi072.stderr:bash: line 1: 14793 Segmentation fault      (core dumped) ceph_test_rados_api_misc 2>&1

/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412668

/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412647

Actions #11

Updated by Brad Hubbard over 4 years ago

Thanks a lot Kefu. I'll take a look at these new instances today.

Actions #13

Updated by Kefu Chai over 4 years ago

  • Related to Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added
Actions #14

Updated by Brad Hubbard over 4 years ago

Yep, this definitely looks like a dup of #42026 to me.

Actions #15

Updated by Brad Hubbard over 4 years ago

  • Related to deleted (Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part))
Actions #16

Updated by Brad Hubbard over 4 years ago

  • Is duplicate of Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added
Actions #17

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #18

Updated by Brad Hubbard over 4 years ago

  • Status changed from New to Duplicate
Actions #19

Updated by David Zafman about 4 years ago

  • Pull request ID set to 30717
Actions #20

Updated by David Zafman about 4 years ago

  • Pull request ID deleted (30717)
Actions #21

Updated by Neha Ojha about 4 years ago

  • Crash signature (v1) updated (diff)
Actions

Also available in: Atom PDF