Bug #40026: segfaults in ProtocolV2::write_message()
Status: Closed
da4b6d3718217e1315bac3f77b0626dc8731be67c6bdbc979abb7bfc84f77322
Description
2019-05-23T23:11:43.053 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: [ RUN ] LibRadosTierPP.FlushWriteRaces
2019-05-23T23:11:43.053 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: *** Caught signal (Segmentation fault) **
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: in thread 7faddb59e700 thread_name:msgr-worker-0
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: ceph version 15.0.0-1350-ga320639 (a32063954d0e8ec0931671f858f620ab0a5c011c) octopus (dev)
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 1: (()+0x12890) [0x7fae07b2e890]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 2: (ProtocolV2::write_message(Message*, bool)+0x6cf) [0x7fadff038fbf]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 3: (ProtocolV2::write_event()+0x183) [0x7fadff04ed13]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 4: (AsyncConnection::handle_write()+0x43) [0x7fadff00b9e3]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x135f) [0x7fadff05de7f]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 6: (()+0x52f40b) [0x7fadff06340b]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 7: (()+0xbe733) [0x7fadfe1e8733]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 8: (()+0x76db) [0x7fae07b236db]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 9: (clone()+0x3f) [0x7fadfdc4288f]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 2019-05-23 23:11:43.047 7faddb59e700 -1 *** Caught signal (Segmentation fault) **
...
2019-05-23T23:11:43.063 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: -3> 2019-05-23 23:11:31.035 7fadf8fd4700 5 asok(0x5555f6e729b0) entry start
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: -2> 2019-05-23 23:11:31.063 7fadf8fd4700 5 asok(0x5555f6e729b0) AdminSocket: request '0' '' to 0x5555f6f37ec0 returned 1 bytes
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: -1> 2019-05-23 23:11:32.851 7fadf8fd4700 5 asok(0x5555f6e729b0) AdminSocket: request '0' '' to 0x5555f6f37ec0 returned 1 bytes
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 0> 2019-05-23 23:11:43.047 7faddb59e700 -1 *** Caught signal (Segmentation fault) **
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: in thread 7faddb59e700 thread_name:msgr-worker-0
/a/kchai-2019-05-23_21:46:20-rados-wip-kefu-testing-2019-05-23-0009-distro-basic-smithi/3973056
It was a test branch, but it didn't contain any messenger-related changes.
Updated by Brad Hubbard over 4 years ago
- Has duplicate Bug #41214: Segmentation fault in mon/test_config_key_caps.sh added
Updated by Brad Hubbard over 4 years ago
- Category set to AsyncMessenger
- Status changed from New to 12
- Assignee set to Brad Hubbard
- Source set to Development
From the instance in #41214
(gdb) bt
#0  0x00007f3942914dba in ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer (session_stream_handlers=..., this=0x7f393cd3a580) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/frames_v2.h:270
#1  ProtocolV2::write_message (this=this@entry=0x7f39380f7710, m=m@entry=0x7f3920002fc0, more=more@entry=false) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:515
#2  0x00007f394292de5b in ProtocolV2::write_event (this=0x7f39380f7710) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:614
#3  0x00007f39428e7ff3 in AsyncConnection::handle_write (this=0x7f39380f5190) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/AsyncConnection.cc:692
#4  0x00007f394293d9c7 in EventCenter::process_events (this=this@entry=0x7f39380f3a90, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f393cd3ae88) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/Event.cc:441
#5  0x00007f39429421d8 in NetworkStack::<lambda()>::operator() (__closure=0x7f39380f32e8) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/Stack.cc:53
#6  std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#7  0x00007f3942c42b1f in execute_native_thread_routine () from /usr/lib/ceph/libceph-common.so.0
#8  0x00007f394d28d6ba in start_thread (arg=0x7f393cd3b700) at pthread_create.c:333
#9  0x00007f394cfc341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) l
265           epilogue_secure_block_t epilogue;
266           ::memset(&epilogue, 0, sizeof(epilogue));
267           ceph::bufferlist epilogue_bl;
268           epilogue_bl.append(reinterpret_cast<const char*>(&epilogue),
269                              sizeof(epilogue));
270           session_stream_handlers.tx->authenticated_encrypt_update(epilogue_bl);
271         }
272         return session_stream_handlers.tx->authenticated_encrypt_final();
273       } else {
274         // plain mode
(gdb) x/i $pc
=> 0x7f3942914dba <ProtocolV2::write_message(Message*, bool)+1754>: mov (%rdi),%rax
(gdb) i reg rdi
rdi            0x0      0
Still analysing this.
Updated by Brad Hubbard over 4 years ago
We obtain the value for %rdi by loading from offset 0x40 of the object %r15 points at (which is `this`).
   0x00007f3942914dae <+1742>: callq  0x7f394296b900 <ceph::buffer::v14_2_0::list::append(char const*, unsigned int)>
   0x00007f3942914db3 <+1747>: mov    0x40(%r15),%rdi
   0x00007f3942914db7 <+1751>: mov    %rbp,%rsi
=> 0x00007f3942914dba <+1754>: mov    (%rdi),%rax
(gdb) i r r15
r15            0x7f39380f7710   139883730401040
(gdb) up
#1  ProtocolV2::write_message (this=this@entry=0x7f39380f7710, m=m@entry=0x7f3920002fc0, more=more@entry=false) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:515
515       connection->outcoming_bl.append(message.get_buffer(session_stream_handlers));
(gdb) whatis this
type = ProtocolV2 * const
(gdb) p ((ProtocolV2*)0x0)->session_stream_handlers->tx
Cannot access memory at address 0x40
(gdb) p &this->session_stream_handlers->tx
$10 = (std::unique_ptr<ceph::crypto::onwire::TxHandler, std::default_delete<ceph::crypto::onwire::TxHandler> > *) 0x7f39380f7750
(gdb) x/a 0x7f39380f7750
0x7f39380f7750: 0x0
So it looks like we segfaulted because the session_stream_handlers were not set (rx is 0x0 as well).
Updated by Brad Hubbard over 4 years ago
We are in a block that checks whether session_stream_handlers.tx is valid, so I'm going to assume for now that this is a race, and look for another thread that cleared the session_stream_handlers.
238   public:
239     ceph::bufferlist get_buffer(
240         ceph::crypto::onwire::rxtx_t &session_stream_handlers)
241     {
242       fill_preamble();
243       if (session_stream_handlers.tx) {
Updated by Brad Hubbard over 4 years ago
Thread 3 is shutting down the mon client.
Thread 3 (Thread 0x7f393ef3c700 (LWP 14894)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f3942bf385c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/ceph/libceph-common.so.0
#2  0x00007f394265baff in Finisher::wait_for_empty (this=this@entry=0x7f393ef3b300) at /build/ceph-14.2.2-258-gd95a3b7/src/common/Finisher.cc:35
#3  0x00007f3942998328 in MonClient::shutdown (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:498
#4  0x00007f394299d6a6 in MonClient::get_monmap_and_config (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:177
(gdb) t
[Current thread is 1 (Thread 0x7f393cd3b700 (LWP 14896))]
(gdb) f
#3  0x00007f39428e7ff3 in AsyncConnection::handle_write (this=0x7f39380f5190) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/AsyncConnection.cc:692
692       protocol->write_event();
(gdb) p this->msgr
$43 = (Messenger *) 0x7f39380edfe0
(gdb) t
[Current thread is 3 (Thread 0x7f393ef3c700 (LWP 14894))]
(gdb) f
#4  0x00007f394299d6a6 in MonClient::get_monmap_and_config (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:177
177       shutdown();
(gdb) p messenger
$44 = (Messenger *) 0x7f39380edfe0
So both threads share the same Messenger and are potentially manipulating the same structures. Not clear to me yet what is happening here.
Updated by Brad Hubbard over 4 years ago
I managed to reproduce this by rerunning the job from #41214 three or four times. The coredump is similar, but different enough that it will hopefully shed more light on the issue.
Updated by Brad Hubbard over 4 years ago
OK, I think I see what's happening here. I'll write it up and work out a PR when I get a chance. Basically, when we call MonClient::get_monmap_and_config we end up calling MonClient::shutdown, which results in the connection being destructed while it is still in use by write_message in a messenger thread, which is not good :P More soon.
Updated by Kefu Chai over 4 years ago
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: [ RUN ] LibRadosMisc.ShutdownRace
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: *** Caught signal (Segmentation fault) **
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: in thread 7f993df83700 thread_name:msgr-worker-2
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: ceph version 15.0.0-5728-g2069c7f (2069c7f06f156d5465a98a88331dfd8bcb6b79f4) octopus (dev)
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 1: (()+0xf5d0) [0x7f9ac18ad5d0]
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 2: (ProtocolV2::write_message(Message*, bool)+0x6c4) [0x7f9ab896f514]
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 3: (ProtocolV2::write_event()+0x17b) [0x7f9ab898597b]
2019-10-03T10:53:13.388 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 4: (AsyncConnection::handle_write()+0x40) [0x7f9ab8941950]
2019-10-03T10:53:13.388 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1427) [0x7f9ab8992877]
/a//kchai-2019-10-03_10:23:44-rados-wip-kefu-testing-2019-10-02-1028-distro-basic-mira/4353765
Updated by Brad Hubbard over 4 years ago
Unfortunately, the latest core is not useful.
(gdb) t a a bt
warning: Couldn't find general-purpose registers in core file.
PC register is not available
I can only view the thread that faulted, which is the same as above. I am following up on this, however.
Updated by Kefu Chai over 4 years ago
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 1: (()+0x12890) [0x7fd5f1dd3890]
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 2: (ProtocolV2::write_message(Message*, bool)+0x5f3) [0x7fd5e94ad333]
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 3: (ProtocolV2::write_event()+0x17b) [0x7fd5e94c32ab]
2019-10-14T07:22:34.779 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 4: (AsyncConnection::handle_write()+0x43) [0x7fd5e947deb3]
2019-10-14T07:22:34.779 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x13bf) [0x7fd5e94d080f]
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412648
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412640
2019-10-14T07:45:10.437 INFO:tasks.workunit.client.0.smithi072.stdout: api_misc: in thread 7f00517ba700 thread_name:msgr-worker-0
....
2019-10-14T07:45:15.013 INFO:tasks.workunit.client.0.smithi072.stderr:bash: line 1: 14793 Segmentation fault (core dumped) ceph_test_rados_api_misc 2>&1
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412668
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412647
Updated by Brad Hubbard over 4 years ago
Thanks a lot Kefu. I'll take a look at these new instances today.
Updated by Kefu Chai over 4 years ago
for more failures, see http://pulpito.ceph.com/kchai-2019-10-15_01:31:27-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-mira/
also, this issue might be a dup of #42026
Updated by Kefu Chai over 4 years ago
- Related to Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added
Updated by Brad Hubbard over 4 years ago
Yep, this definitely looks like a dup of #42026 to me.
Updated by Brad Hubbard over 4 years ago
- Related to deleted (Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part))
Updated by Brad Hubbard over 4 years ago
- Is duplicate of Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added