Bug #40026
closed
segfaults in ProtocolV2::write_message()
Added by Kefu Chai almost 5 years ago.
Updated about 4 years ago.
Crash signature (v1):
da4b6d3718217e1315bac3f77b0626dc8731be67c6bdbc979abb7bfc84f77322
Description
2019-05-23T23:11:43.053 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: [ RUN ] LibRadosTierPP.FlushWriteRaces
2019-05-23T23:11:43.053 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: *** Caught signal (Segmentation fault) **
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: in thread 7faddb59e700 thread_name:msgr-worker-0
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: ceph version 15.0.0-1350-ga320639 (a32063954d0e8ec0931671f858f620ab0a5c011c) octopus (dev)
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 1: (()+0x12890) [0x7fae07b2e890]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 2: (ProtocolV2::write_message(Message*, bool)+0x6cf) [0x7fadff038fbf]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 3: (ProtocolV2::write_event()+0x183) [0x7fadff04ed13]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 4: (AsyncConnection::handle_write()+0x43) [0x7fadff00b9e3]
2019-05-23T23:11:43.054 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x135f) [0x7fadff05de7f]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 6: (()+0x52f40b) [0x7fadff06340b]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 7: (()+0xbe733) [0x7fadfe1e8733]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 8: (()+0x76db) [0x7fae07b236db]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 9: (clone()+0x3f) [0x7fadfdc4288f]
2019-05-23T23:11:43.055 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 2019-05-23 23:11:43.047 7faddb59e700 -1 *** Caught signal (Segmentation fault) **
...
2019-05-23T23:11:43.063 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: -3> 2019-05-23 23:11:31.035 7fadf8fd4700 5 asok(0x5555f6e729b0) entry start
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: -2> 2019-05-23 23:11:31.063 7fadf8fd4700 5 asok(0x5555f6e729b0) AdminSocket: request '0' '' to 0x5555f6f37ec0 returned 1 bytes
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: -1> 2019-05-23 23:11:32.851 7fadf8fd4700 5 asok(0x5555f6e729b0) AdminSocket: request '0' '' to 0x5555f6f37ec0 returned 1 bytes
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: 0> 2019-05-23 23:11:43.047 7faddb59e700 -1 *** Caught signal (Segmentation fault) **
2019-05-23T23:11:43.064 INFO:tasks.workunit.client.0.smithi023.stdout: api_tier_pp: in thread 7faddb59e700 thread_name:msgr-worker-0
/a/kchai-2019-05-23_21:46:20-rados-wip-kefu-testing-2019-05-23-0009-distro-basic-smithi/3973056
It was a test branch, but it didn't contain any messenger-related changes.
- Has duplicate Bug #41214: Segmentation fault in mon/test_config_key_caps.sh added
- Category set to AsyncMessenger
- Status changed from New to 12
- Assignee set to Brad Hubbard
- Source set to Development
From the instance in #41214
(gdb) bt
#0 0x00007f3942914dba in ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer (session_stream_handlers=..., this=0x7f393cd3a580)
at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/frames_v2.h:270
#1 ProtocolV2::write_message (this=this@entry=0x7f39380f7710, m=m@entry=0x7f3920002fc0, more=more@entry=false) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:515
#2 0x00007f394292de5b in ProtocolV2::write_event (this=0x7f39380f7710) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:614
#3 0x00007f39428e7ff3 in AsyncConnection::handle_write (this=0x7f39380f5190) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/AsyncConnection.cc:692
#4 0x00007f394293d9c7 in EventCenter::process_events (this=this@entry=0x7f39380f3a90, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f393cd3ae88)
at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/Event.cc:441
#5 0x00007f39429421d8 in NetworkStack::<lambda()>::operator() (__closure=0x7f39380f32e8) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/Stack.cc:53
#6 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#7 0x00007f3942c42b1f in execute_native_thread_routine () from /usr/lib/ceph/libceph-common.so.0
#8 0x00007f394d28d6ba in start_thread (arg=0x7f393cd3b700) at pthread_create.c:333
#9 0x00007f394cfc341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) l
265 epilogue_secure_block_t epilogue;
266 ::memset(&epilogue, 0, sizeof(epilogue));
267 ceph::bufferlist epilogue_bl;
268 epilogue_bl.append(reinterpret_cast<const char*>(&epilogue),
269 sizeof(epilogue));
270 session_stream_handlers.tx->authenticated_encrypt_update(epilogue_bl);
271 }
272 return session_stream_handlers.tx->authenticated_encrypt_final();
273 } else {
274 // plain mode
(gdb) x/i $pc
=> 0x7f3942914dba <ProtocolV2::write_message(Message*, bool)+1754>: mov (%rdi),%rax
(gdb) i reg rdi
rdi 0x0 0
Still analysing this.
We obtain the value for %rdi by loading offset 0x40 from the object whose address is in %r15 (0x7f39380f7710, which is `this`, the ProtocolV2 instance).
0x00007f3942914dae <+1742>: callq 0x7f394296b900 <ceph::buffer::v14_2_0::list::append(char const*, unsigned int)>
0x00007f3942914db3 <+1747>: mov 0x40(%r15),%rdi
0x00007f3942914db7 <+1751>: mov %rbp,%rsi
=> 0x00007f3942914dba <+1754>: mov (%rdi),%rax
(gdb) i r r15
r15 0x7f39380f7710 139883730401040
(gdb) up
#1 ProtocolV2::write_message (this=this@entry=0x7f39380f7710, m=m@entry=0x7f3920002fc0, more=more@entry=false) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/ProtocolV2.cc:515
515 connection->outcoming_bl.append(message.get_buffer(session_stream_handlers));
(gdb) whatis this
type = ProtocolV2 * const
(gdb) p ((ProtocolV2*)0x0)->session_stream_handlers->tx
Cannot access memory at address 0x40
(gdb) p &this->session_stream_handlers->tx
$10 = (std::unique_ptr<ceph::crypto::onwire::TxHandler, std::default_delete<ceph::crypto::onwire::TxHandler> > *) 0x7f39380f7750
(gdb) x/a 0x7f39380f7750
0x7f39380f7750: 0x0
So it looks like we segfaulted because the session_stream_handlers were not set (rx is 0x0 as well).
We are in a block that checks whether session_stream_handlers.tx is valid, so for now I'm going to assume this is a race and look for another thread that cleared the session_stream_handlers.
238 public:
239 ceph::bufferlist get_buffer(
240 ceph::crypto::onwire::rxtx_t &session_stream_handlers)
241 {
242 fill_preamble();
243 if (session_stream_handlers.tx) {
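The hazard in the guard above can be sketched deterministically. This is a minimal illustration, not Ceph code: `TxHandler` and `rxtx_t` are hypothetical stand-ins for the `ceph::crypto::onwire` types, and the "concurrent shutdown" is simulated by an in-thread `reset()` between the check and the use.

```cpp
#include <memory>

// Hypothetical stand-ins for the ceph::msgr v2 onwire types; names are
// illustrative, not the real definitions.
struct TxHandler { };
struct rxtx_t { std::unique_ptr<TxHandler> tx; };

// Returns true if a guard taken before the reset is now stale: the check
// observed a live tx, but the pointer it guards has since been cleared.
bool stale_guard_after_reset(rxtx_t &handlers) {
  bool guard = static_cast<bool>(handlers.tx);  // Frame::get_buffer's check
  handlers.tx.reset();  // a concurrent shutdown clears the handlers here
  // A subsequent handlers.tx->authenticated_encrypt_final() would now load
  // a null pointer (the 0x40(%r15) load above) and fault on dereference.
  return guard && !handlers.tx;
}
```

The window between `if (session_stream_handlers.tx)` and the later `session_stream_handlers.tx->...` calls is exactly where another thread tearing down the session can slip in.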
Thread 3 is shutting down the mon client.
Thread 3 (Thread 0x7f393ef3c700 (LWP 14894)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f3942bf385c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/ceph/libceph-common.so.0
#2 0x00007f394265baff in Finisher::wait_for_empty (this=this@entry=0x7f393ef3b300) at /build/ceph-14.2.2-258-gd95a3b7/src/common/Finisher.cc:35
#3 0x00007f3942998328 in MonClient::shutdown (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:498
#4 0x00007f394299d6a6 in MonClient::get_monmap_and_config (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:177
(gdb) t
[Current thread is 1 (Thread 0x7f393cd3b700 (LWP 14896))]
(gdb) f
#3 0x00007f39428e7ff3 in AsyncConnection::handle_write (this=0x7f39380f5190) at /build/ceph-14.2.2-258-gd95a3b7/src/msg/async/AsyncConnection.cc:692
692 protocol->write_event();
(gdb) p this->msgr
$43 = (Messenger *) 0x7f39380edfe0
(gdb) t
[Current thread is 3 (Thread 0x7f393ef3c700 (LWP 14894))]
(gdb) f
#4 0x00007f394299d6a6 in MonClient::get_monmap_and_config (this=this@entry=0x7f393ef3aef0) at /build/ceph-14.2.2-258-gd95a3b7/src/mon/MonClient.cc:177
177 shutdown();
(gdb) p messenger
$44 = (Messenger *) 0x7f39380edfe0
So both threads share the same Messenger and are potentially manipulating the same structures. Not clear to me yet what is happening here.
I managed to reproduce this by rerunning the job from #41214 three or four times. The coredump is similar but differs enough that it will hopefully shed more light on the issue.
OK, I think I see what's happening here. I'll write it up and work out a PR when I get a chance. Basically, when we call MonClient::get_monmap_and_config we end up calling MonClient::shutdown, which results in the connection being destructed while it's still in use by the thread running write_message :P More soon.
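The shape of the fix can be sketched with shared ownership. This is only an illustration under an assumption: Ceph actually uses intrusive reference counting (ConnectionRef), and `std::shared_ptr` stands in for it here.

```cpp
#include <memory>

// Hypothetical sketch: the worker thread pins the connection with its own
// reference before entering the write path, so a concurrent
// MonClient::shutdown() dropping the messenger's reference cannot
// destruct the object mid-write.
struct Connection {
  bool alive = true;
  ~Connection() { alive = false; }
};

// Simulates: worker pins, shutdown releases, and the worker still holds a
// live object for the duration of its write.
bool connection_survives_shutdown() {
  auto owner = std::make_shared<Connection>();  // messenger's reference
  std::shared_ptr<Connection> pinned = owner;   // worker pins before writing
  owner.reset();                                // shutdown drops its reference
  return pinned && pinned->alive;               // write path sees a live object
}
```

Without the pinned reference, `owner.reset()` would destruct the connection while the worker is still inside `write_message`, which is the crash seen in the backtraces above.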
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: [ RUN ] LibRadosMisc.ShutdownRace
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: *** Caught signal (Segmentation fault) **
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: in thread 7f993df83700 thread_name:msgr-worker-2
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: ceph version 15.0.0-5728-g2069c7f (2069c7f06f156d5465a98a88331dfd8bcb6b79f4) octopus (dev)
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 1: (()+0xf5d0) [0x7f9ac18ad5d0]
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 2: (ProtocolV2::write_message(Message*, bool)+0x6c4) [0x7f9ab896f514]
2019-10-03T10:53:13.387 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 3: (ProtocolV2::write_event()+0x17b) [0x7f9ab898597b]
2019-10-03T10:53:13.388 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 4: (AsyncConnection::handle_write()+0x40) [0x7f9ab8941950]
2019-10-03T10:53:13.388 INFO:tasks.workunit.client.0.mira109.stdout: api_misc: 5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1427) [0x7f9ab8992877]
/a//kchai-2019-10-03_10:23:44-rados-wip-kefu-testing-2019-10-02-1028-distro-basic-mira/4353765
Unfortunately, the latest core is not useful.
(gdb) t a a bt
warning: Couldn't find general-purpose registers in core file.
PC register is not available
I can only view the thread that faulted which is the same as above. I am following up on this however.
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 1: (()+0x12890) [0x7fd5f1dd3890]
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 2: (ProtocolV2::write_message(Message*, bool)+0x5f3) [0x7fd5e94ad333]
2019-10-14T07:22:34.778 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 3: (ProtocolV2::write_event()+0x17b) [0x7fd5e94c32ab]
2019-10-14T07:22:34.779 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 4: (AsyncConnection::handle_write()+0x43) [0x7fd5e947deb3]
2019-10-14T07:22:34.779 INFO:tasks.workunit.client.0.smithi089.stdout: api_misc: 5: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x13bf) [0x7fd5e94d080f]
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412648
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412640
2019-10-14T07:45:10.437 INFO:tasks.workunit.client.0.smithi072.stdout: api_misc: in thread 7f00517ba700 thread_name:msgr-worker-0
....
2019-10-14T07:45:15.013 INFO:tasks.workunit.client.0.smithi072.stderr:bash: line 1: 14793 Segmentation fault (core dumped) ceph_test_rados_api_misc 2>&1
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412668
/a/kchai-2019-10-14_06:20:56-rados-wip-kefu-testing-2019-10-14-1111-distro-basic-smithi/4412647
Thanks a lot Kefu. I'll take a look at these new instances today.
- Related to Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added
Yep, this definitely looks like a dup of #42026 to me.
- Related to deleted (Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part))
- Is duplicate of Bug #42026: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added
- Status changed from 12 to New
- Status changed from New to Duplicate
- Pull request ID set to 30717
- Pull request ID deleted (30717)
- Crash signature (v1) updated (diff)