Bug #63008
librados: race condition in rados_ping_monitor can cause segmentation fault
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
MonClient
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
In the go-ceph CI we noticed a sporadic segmentation fault in the test for rados_ping_monitor (see https://github.com/ceph/go-ceph/issues/921).
Further investigation showed that we can also reproduce the segmentation fault with a simple C loop:
#include <stdio.h>
#include <rados/librados.h>

/* Ping monitor "a" in a tight loop; the fault is sporadic, so many
 * iterations are needed to trigger it. */
static inline int test_ping(rados_t c) {
    char *outstr = NULL;
    size_t outlen = 0;
    int ret = 0;
    for (int i = 0; i < 10000; ++i) {
        printf("X%d ", i);
        ret = rados_ping_monitor(c, "a", &outstr, &outlen);
        rados_buffer_free(outstr);  /* safe to call on NULL */
        outstr = NULL;              /* don't free a stale pointer if a later call fails */
    }
    return ret;
}
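For reference, a minimal single-threaded harness along the following lines can drive the loop. The initialization shown here (default client id, default ceph.conf search path) is our assumption for illustration, not taken from the original report:

#include <stdio.h>
#include <rados/librados.h>

int main(void) {
    rados_t cluster;
    int ret = rados_create(&cluster, NULL);     /* default client id */
    if (ret < 0)
        return 1;
    ret = rados_conf_read_file(cluster, NULL);  /* default ceph.conf search */
    if (ret == 0)
        ret = rados_connect(cluster);
    if (ret == 0)
        ret = test_ping(cluster);               /* reproducer from above */
    printf("\nlast rados_ping_monitor ret: %d\n", ret);
    rados_shutdown(cluster);
    return ret < 0 ? 1 : 0;
}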
The backtrace looks like this:
Core was generated by `./rados.test'.
Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (Thread 0x7ff5aeffe700 (LWP 6184))]
#0  runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
#1  0x00000000004585db in runtime.raisebadsignal (sig=11, c=0x7ff5aeffa850) at /opt/go/src/runtime/signal_unix.go:967
#2  0x0000000000458a27 in runtime.badsignal (sig=11, c=0x7ff5aeffa850) at /opt/go/src/runtime/signal_unix.go:1076
#3  0x0000000000457368 in runtime.sigtrampgo (sig=11, info=0x7ff5aeffa9f0, ctx=0x7ff5aeffa8c0) at /opt/go/src/runtime/signal_unix.go:468
#4  0x0000000000478a66 in runtime.sigtramp () at /opt/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  MonConnection::get_auth_request (this=0x0, method=0x17257a0, preferred_modes=0x7ff5aeffb680, bl=0x7ff5aeffb6a0, entity_name=..., want_keys=0, keyring=0x7ffd722b3130) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/mon/MonClient.cc:1702
#7  0x00007ff6147305eb in non-virtual thunk to MonClientPinger::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*) () from /usr/lib64/ceph/libceph-common.so.2
#8  0x00007ff6146a3672 in ProtocolV2::send_auth_request (this=0x2264020, allowed_methods=...) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1692
#9  0x00007ff6146a3ddc in ProtocolV2::send_auth_request (this=0x2264020) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.h:217
#10 ProtocolV2::post_client_banner_exchange (this=0x2264020) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1680
#11 0x00007ff61469d62c in ProtocolV2::run_continuation (this=0x2264020, continuation=...) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#12 0x00007ff614665ae9 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x2261fd0) at /usr/include/c++/8/bits/std_function.h:682
#13 AsyncConnection::process (this=0x2261c30) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/AsyncConnection.cc:454
#14 0x00007ff6146bfc87 in EventCenter::process_events (this=this@entry=0x17f15e0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7ff5aeffbee8) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/build/boost/include/boost/container/new_allocator.hpp:165
#15 0x00007ff6146c619c in NetworkStack::<lambda()>::operator() (__closure=0x171f938, __closure=0x171f938) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#17 0x00007ff6129f3b23 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#18 0x00007ff61d5d51ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007ff61d029e73 in clone () from /lib64/libc.so.6
The crash happens because the mc member of the MonClientPinger is NULL:
(gdb) fr 11
#11 0x00007f99544cd62c in ProtocolV2::run_continuation (this=0x1e89930, continuation=...) at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
47          CONTINUATION_RUN(continuation)
(gdb) print ((MonClientPinger*)(messenger->auth_client)).mc
$1 = std::unique_ptr<MonConnection> = {get() = 0x0}
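The backtrace and the NULL mc are consistent with a race between the messenger's auth callback and the pinger tearing down its connection. The following stand-alone sketch shows the same pattern; the names and the use of plain pthreads are hypothetical illustrations of the suspected mechanism, not Ceph source:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct mon_connection { int auth_method; };

/* Stands in for MonClientPinger::mc (hypothetical). */
static struct mon_connection *mc;

/* Stands in for the messenger thread calling get_auth_request(). */
static void *auth_callback(void *arg) {
    (void)arg;
    /* If the main thread has already torn mc down, this dereferences
     * NULL, matching this=0x0 in frame #6 of the backtrace. */
    printf("auth method: %d\n", mc->auth_method);
    return NULL;
}

int main(void) {
    mc = malloc(sizeof(*mc));
    mc->auth_method = 2;

    pthread_t t;
    pthread_create(&t, NULL, auth_callback, NULL);

    /* Stands in for the pinger finishing and dropping its connection
     * while the callback may still be in flight. Whether a given run
     * crashes depends on scheduling, which would explain why the
     * failure in CI is sporadic. */
    free(mc);
    mc = NULL;

    pthread_join(t, NULL);
    return 0;
}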
Updated by Radoslaw Zarzynski 7 months ago
- Assignee set to Nitzan Mordechai
Huh, looks like a problem (a race condition?) within MonClient. Nitzan, would you mind taking a look?
Updated by Nitzan Mordechai 6 months ago
Sven Anderson, can you please let me know how you initialized the rados_t c?
Did you use multiple threads to invoke test_ping?
Any other conditions for the cluster itself? Did you start it with vstart?