Bug #63008

open

rados: race condition in rados_ping_monitor can cause segmentation fault

Added by Sven Anderson 7 months ago. Updated 6 months ago.

Status:
Need More Info
Priority:
Normal
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Component(RADOS):
MonClient
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In the go-ceph CI we noticed a sporadic segmentation fault in the test for rados_ping_monitor (see https://github.com/ceph/go-ceph/issues/921).

Further investigation showed that we can also trigger the segmentation fault with a simple C loop:

#include <stdio.h>
#include <rados/librados.h>

inline int test_ping(rados_t c) {
    char *outstr = NULL;
    size_t outlen = 0;
    int ret;
    for (int i = 0; i < 10000; ++i) {
        printf("X%d ", i);
        ret = rados_ping_monitor(c, "a", &outstr, &outlen);
        rados_buffer_free(outstr);
        outstr = NULL;  /* don't free a stale pointer if the next ping fails */
    }
    return ret;
}

The backtrace looks like this:

Core was generated by `./rados.test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
154        RET
[Current thread is 1 (Thread 0x7ff5aeffe700 (LWP 6184))]
To enable execution of this file add
    add-auto-load-safe-path /opt/go/src/runtime/runtime-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path" 
#0  runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
#1  0x00000000004585db in runtime.raisebadsignal (sig=11, c=0x7ff5aeffa850)
    at /opt/go/src/runtime/signal_unix.go:967
#2  0x0000000000458a27 in runtime.badsignal (sig=11, c=0x7ff5aeffa850)
    at /opt/go/src/runtime/signal_unix.go:1076
#3  0x0000000000457368 in runtime.sigtrampgo (sig=11, info=0x7ff5aeffa9f0, 
    ctx=0x7ff5aeffa8c0) at /opt/go/src/runtime/signal_unix.go:468
#4  0x0000000000478a66 in runtime.sigtramp ()
    at /opt/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  MonConnection::get_auth_request (this=0x0, method=0x17257a0, 
    preferred_modes=0x7ff5aeffb680, bl=0x7ff5aeffb6a0, entity_name=..., 
    want_keys=0, keyring=0x7ffd722b3130)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/mon/MonClient.cc:1702
#7  0x00007ff6147305eb in non-virtual thunk to MonClientPinger::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*) ()
   from /usr/lib64/ceph/libceph-common.so.2
#8  0x00007ff6146a3672 in ProtocolV2::send_auth_request (this=0x2264020, 
    allowed_methods=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1692
#9  0x00007ff6146a3ddc in ProtocolV2::send_auth_request (this=0x2264020)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.h:217
#10 ProtocolV2::post_client_banner_exchange (this=0x2264020)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1680
#11 0x00007ff61469d62c in ProtocolV2::run_continuation (this=0x2264020, 
    continuation=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#12 0x00007ff614665ae9 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, 
    this=0x2261fd0) at /usr/include/c++/8/bits/std_function.h:682
#13 AsyncConnection::process (this=0x2261c30)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/AsyncConnection.cc:454
#14 0x00007ff6146bfc87 in EventCenter::process_events (
    this=this@entry=0x17f15e0, timeout_microseconds=<optimized out>, 
    timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7ff5aeffbee8)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/build/boost/include/boost/container/new_allocator.hpp:165
#15 0x00007ff6146c619c in NetworkStack::<lambda()>::operator() (
    __closure=0x171f938, __closure=0x171f938)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/8/bits/std_function.h:297
#17 0x00007ff6129f3b23 in execute_native_thread_routine ()
   from /lib64/libstdc++.so.6
#18 0x00007ff61d5d51ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007ff61d029e73 in clone () from /lib64/libc.so.6

The crash happens because the mc member of the MonClientPinger is NULL:

(gdb) fr 11
#11 0x00007f99544cd62c in ProtocolV2::run_continuation (this=0x1e89930, 
    continuation=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
47        CONTINUATION_RUN(continuation)
(gdb) print ((MonClientPinger*)(messenger->auth_client)).mc
$1 = std::unique_ptr<MonConnection> = {get() = 0x0}
#1

Updated by Radoslaw Zarzynski 7 months ago

  • Assignee set to Nitzan Mordechai

Huh, looks like a problem (a race condition?) within MonClient.

Nitzan, would you mind taking a look?

#2

Updated by Nitzan Mordechai 6 months ago

Sven Anderson, can you please let me know how you initialized rados_t c?
Did you use multiple threads to invoke test_ping?
Were there any other special conditions for the cluster itself? Did you start it with vstart?

#3

Updated by Neha Ojha 6 months ago

  • Status changed from New to Need More Info
