Bug #50371

Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp

Added by Brad Hubbard about 3 years ago. Updated 25 days ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
segfault
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/nojha-2021-04-14_00:54:53-rados-master-distro-basic-smithi/6044164

2021-04-14T01:40:16.439 INFO:tasks.workunit.client.0.smithi079.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:bash: line 1: 36189 Segmentation fault      (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:     36190 Done                    | tee ceph_test_rados_api_watch_notify_pp.log
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:     36192 Done                    | sed "s/^/      api_watch_notify_pp: /" 

Related issues (1 total: 0 open, 1 closed)

Related to RADOS - Bug #50042: rados/test.sh: api_watch_notify failures (Resolved, assigned to Nitzan Mordechai)

Actions #1

Updated by Brad Hubbard about 3 years ago

Looking at the coredump.

(gdb) bt
#0  0x00007f9bf380a752 in WatchInfo::handle_error (err=-107, cookie=94290589651712, this=0x55c1bd1a6ec0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:1605
#1  WatchInfo::operator() (bl=..., notifier_id=0, cookie=94290589651712, notify_id=<optimized out>, ec=..., this=0x55c1bd1a6ec0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:1614
#2  fu2::abi_310::detail::invocation::invoke<WatchInfo&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list> (callable=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:126
#3  fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (data=<optimized out>, capacity=<optimized out>, args#0=..., args#1=<optimized out>, args#2=94290589651712, args#3=0, args#4=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:518
#4  0x00007f9bf3857b97 in fu2::abi_310::detail::type_erasure::tables::vtable<fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >::invoke<0ul, fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long const&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list>(fu2::abi_310::detail::type_erasure::data_accessor*&&, unsigned long const&, boost::system::error_code&&, unsigned long&&, unsigned long&&, unsigned long&&, ceph::buffer::v15_2_0::list&&) const (this=0x55c1bd1a1700) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:1121
#5  fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >::invoke<0ul, fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list>(fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >&, boost::system::error_code&&, unsigned long&&, unsigned long&&, unsigned long&&, ceph::buffer::v15_2_0::list&&) (erasure=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:1125
#6  fu2::abi_310::detail::type_erasure::invocation_table::operator_impl<0ul, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::operator()(boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (args#4=..., args#3=0, args#2=94290589651712, args#1=0, args#0=..., this=0x55c1bd1a16f0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:702
#7  CB_DoWatchError::operator() (this=<synthetic pointer>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/osdc/Objecter.cc:642
#8  boost::asio::asio_handler_invoke<CB_DoWatchError> (function=<synthetic pointer>...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/handler_invoke_hook.hpp:88
#9  boost_asio_handler_invoke_helpers::invoke<CB_DoWatchError, CB_DoWatchError> (context=<synthetic pointer>..., function=<synthetic pointer>...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#10 boost::asio::detail::handler_work<CB_DoWatchError, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0u>, void>::complete<CB_DoWatchError> (this=<synthetic pointer>, handler=<synthetic pointer>..., function=<synthetic pointer>...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_work.hpp:425
#11 boost::asio::detail::completion_handler<CB_DoWatchError, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0u> >::do_complete (owner=<optimized out>, base=0x7f9bc0002280) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/completion_handler.hpp:74
#12 0x00007f9bf381ba95 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x55c1bd0a7e80, this=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#13 boost::asio::detail::strand_service::do_complete (owner=0x55c1bd0a7e80, base=0x55c1bd1d5160, ec=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/strand_service.ipp:169
#14 0x00007f9bf381dd88 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x55c1bd0a7e80, this=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#15 boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=0x55c1bd0a7e80) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/scheduler.ipp:481
#16 boost::asio::detail::scheduler::run (this=0x55c1bd0a7e80, ec=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/scheduler.ipp:204
#17 0x00007f9bf382315a in boost::asio::io_context::run (this=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/impl/io_context.ipp:63
#18 ceph::async::io_context_pool::start(short)::{lambda()#1}::operator()() const (__closure=0x55c1bd1b2528) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/common/async/context_pool.h:68
#19 std::__invoke_impl<void, ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::__invoke_other, ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#20 std::__invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::__invoke_result&&, (ceph::async::io_context_pool::start(short)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#21 std::invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::invoke_result&&, (ceph::async::io_context_pool::start(short)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/8/functional:81
#22 make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}::operator()<{lambda()#1}> (fun=..., this=0x55c1bd1b2530) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/common/Thread.h:79
#23 std::__invoke_impl<void, make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}>(std::__invoke_other, make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}&&, {lambda()#1}&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#24 std::__invoke<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}>(make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}&&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#25 std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (this=0x55c1bd1b2528) at /usr/include/c++/8/thread:244
#26 std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> >::operator()() (this=0x55c1bd1b2528) at /usr/include/c++/8/thread:253
#27 std::thread::_State_impl<std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> > >::_M_run() (this=0x55c1bd1b2520) at /usr/include/c++/8/thread:196
#28 0x00007f9bf0a65ba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#29 0x00007f9bf354314a in start_thread () from /lib64/libpthread.so.0
#30 0x00007f9bf0142f23 in clone () from /lib64/libc.so.6

Thread 2's LWP matches the segfaulting PID from the log entry (36189) and it is the main() thread, so that makes sense.

(gdb) info thread 2
  Id   Target Id                         Frame 
  2    Thread 0x7f9bf3d11540 (LWP 36189) 0x00007f9bf35492fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Why did it segfault?

(gdb) x/i $pc
=> 0x7f9bf380a752 <fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)+114>:   callq  *0x18(%rax)
(gdb) info reg rax
rax            0x3331              13105

So we segfaulted dereferencing a bogus address. Where did it come from?

(gdb) disass /s
1605          ctx2->handle_error(cookie, err);
   0x00007f9bf380a749 <+105>:   mov    (%rdi),%rax
   0x00007f9bf380a74c <+108>:   mov    %r13d,%edx
   0x00007f9bf380a74f <+111>:   mov    %rbp,%rsi
=> 0x00007f9bf380a752 <+114>:   callq  *0x18(%rax)
(gdb) i r rdi
rdi            0x7ffd93cb5b60      140727083031392
(gdb) x/a 0x7ffd93cb5b60
0x7ffd93cb5b60: 0x3331
(gdb) x/a 0x7ffd93cb5b60
0x7ffd93cb5b60: 0x3331
(gdb) p ctx2
$15 = (librados::v14_2_0::WatchCtx2 *) 0x7ffd93cb5b60
(gdb) p this
$17 = (WatchInfo * const) 0x55c1bd1a6ec0
(gdb) p *ctx2
$16 = {
  _vptr.WatchCtx2 = 0x3331
}
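
For reference, the 0x18 offset matches the usual Itanium C++ ABI vtable layout for librados::WatchCtx2, assuming the standard two destructor slots followed by the virtuals in declaration order (a sketch; handle_notify's real signature takes a bufferlist&, simplified here):

#include <cstdint>

struct WatchCtx2Layout {
  virtual ~WatchCtx2Layout() = default;                    // vtable slots 0x00 / 0x08
  virtual void handle_notify(uint64_t notify_id, uint64_t cookie,
                             uint64_t notifier_id) = 0;    // slot 0x10
  virtual void handle_error(uint64_t cookie, int err) = 0; // slot 0x18
};

// ctx2->handle_error(cookie, err) therefore lowers to roughly:
//   mov   (%rdi),%rax     ; load the vptr from the object (here 0x3331)
//   callq *0x18(%rax)     ; call the handle_error slot, faulting on 0x3331+0x18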

So the WatchCtx2 pointer in our WatchInfo object is bad: it points into the stack at 0x7ffd93cb5b60, and the vtable pointer stored there has been clobbered (0x3331).

Let's look for those values in the other threads.

We only see the WatchInfo* in thread 1, and the address of the WatchCtx2* doesn't show up anywhere else, but the bogus vptr value (0x3331) does show up in the dump of thread 2.

#6  0x000055c1bcaea9bd in LibRadosWatchNotifyPP_AioWatchNotify2_Test::TestBody (this=0x55c1bd093e60) at /usr/include/c++/8/ext/new_allocator.h:79
        gtest_ar = {
          success_ = true,
          message_ = std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >> = {
            get() = 0x7f9bf04046c0 <_IO_2_1_stdout_>
          }
        }
        buf = '\314' <repeats 127 times>, <incomplete sequence \314>
        bl1 = {
          _buffers = {
            _root = {
              next = 0x55c1bd0c1de0
            },
            _tail = 0x55c1bd0c1de0
          },
          _carriage = 0x55c1bd0c1de0,
          _len = 128,
          _num = 1,
          static always_empty_bptr = {
            <ceph::buffer::v15_2_0::ptr_hook> = {
              next = 0x0
            },
            <ceph::buffer::v15_2_0::ptr> = {
              _raw = 0x0,
              _off = 0,
              _len = 0
            }, <No data fields>}
        }
        handle = 140307781909789
        ctx = {
          <librados::v14_2_0::WatchCtx2> = {
            _vptr.WatchCtx2 = 0x3331
          },
          members of WatchNotifyTestCtx2:
          notify = 0x0
        }

(gdb) t 2
[Switching to thread 2 (Thread 0x7f9bf3d11540 (LWP 36189))]
#0  0x00007f9bf35492fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f9bf35492fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9bf0a5f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f9bf37fac4b in std::condition_variable::wait<librados::IoCtxImpl::operate(const object_t&, ObjectOperation*, ceph::real_time*, int)::<lambda()> > (__p=..., __lock=..., this=0x7ffd93cb5770) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:667
#3  librados::IoCtxImpl::operate (this=0x55c1bd1ade80, oid=..., o=<optimized out>, pmtime=<optimized out>, flags=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:667
#4  0x00007f9bf37fb015 in librados::IoCtxImpl::write (this=0x55c1bd1ade80, oid=..., bl=..., len=len@entry=128, off=off@entry=0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:594
#5  0x00007f9bf37cf9b1 in librados::v14_2_0::IoCtx::write (this=this@entry=0x55c1bd093e80, oid="foo", bl=..., len=len@entry=128, off=off@entry=0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/librados_cxx.cc:1252
#6  0x000055c1bcaea9bd in LibRadosWatchNotifyPP_AioWatchNotify2_Test::TestBody (this=0x55c1bd093e60) at /usr/include/c++/8/ext/new_allocator.h:79
#7  0x000055c1bcb2f0be in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x55c1bcb43709 "the test body", method=<optimized out>, object=0x55c1bd093e60) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2586
#8  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x55c1bd093e60, method=<optimized out>, location=location@entry=0x55c1bcb43709 "the test body") at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2641
#9  0x000055c1bcb21ccb in testing::Test::Run (this=0x55c1bd093e60) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2680
#10 testing::Test::Run (this=0x55c1bd093e60) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2670
#11 0x000055c1bcb21e35 in testing::TestInfo::Run (this=0x55c1bd0aed70) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2858
#12 testing::TestInfo::Run (this=0x55c1bd0aed70) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2831
#13 0x000055c1bcb22391 in testing::TestSuite::Run (this=0x55c1bd0abef0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:3012
#14 testing::TestSuite::Run (this=0x55c1bd0abef0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2991
#15 0x000055c1bcb23fa5 in testing::internal::UnitTestImpl::RunAllTests (this=0x55c1bd0a0de0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:5720
#16 0x000055c1bcb2f62e in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x55c1bcb44d08 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x55c1bd0a0de0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2586
#17 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x55c1bd0a0de0, method=<optimized out>, location=location@entry=0x55c1bcb44d08 "auxiliary test code (environments or event listeners)") at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2641
#18 0x000055c1bcb21f00 in testing::UnitTest::Run (this=0x55c1bcd677a0 <testing::UnitTest::GetInstance()::instance>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/include/gtest/gtest.h:1429
#19 0x000055c1bcae4a5d in RUN_ALL_TESTS () at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/include/gtest/gtest.h:2486
#20 main (argc=<optimized out>, argv=0x7ffd93cb60e8) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googlemock/src/gmock_main.cc:70

After looking at this for some time I think we may have a problem with the reuse of the ioctx and, specifically, with unwatch leaving the LingerOp.handle in an inconsistent state. It looks like we are using a LingerOp.handle that is populated with local variables that have gone out of scope. Need to look into this further.
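
To make the suspected pattern concrete, here is a minimal sketch against the public librados C++ API (illustrative only, not the actual test code):

#include <rados/librados.hpp>
#include <cstdint>

// Suspected pattern: a stack-allocated WatchCtx2 whose frame is reused
// while the Objecter's linger handle still points at it, so a late
// handle_error(-107) dispatches through a clobbered vtable pointer.
struct TestCtx2 : public librados::WatchCtx2 {
  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist& bl) override {}
  void handle_error(uint64_t cookie, int err) override {
    // runs on an io_context_pool thread, possibly after the frame below is gone
  }
};

void one_iteration(librados::IoCtx& ioctx) {
  TestCtx2 ctx;                    // a local, like ctx in the gtest TestBody
  uint64_t handle = 0;
  ioctx.watch2("foo", &handle, &ctx);
  // ... notify traffic ...
  ioctx.unwatch2(handle);          // acked by the OSD, but a queued linger
                                   // ping error can still invoke
                                   // ctx.handle_error() afterwards
}                                  // ctx's stack slot is now free for reuse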

Actions #2

Updated by Brad Hubbard about 3 years ago

Here's the log from the coredump that I believe should pinpoint the issue, but I'll need to analyse it further today.

2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:39 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch unwatch cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x55c1bdc5a670 con 0x55c1bd1ace70
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 5
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter tick
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter  pinging osd that serves lingering tid 11 (osd.4)
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter _send_linger_ping 11 now 1842.638147s
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1c5740 osd=4 4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 4 40
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _send_op 40 to 53.1 on osd.4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:40 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch ping cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x7f9bc0003ee0 con 0x55c1bd1ace70
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter _maybe_request_map subscribing (onetime) to next osd map
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 31 monclient: _renew_subs
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 31 monclient: _send_mon_message to mon.a at v1:172.21.15.79:6789/0
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.79:6789/0 -- mon_subscribe({osdmap=84}) v3 -- 0x7f9bc0004290 con 0x7f9bd00618d0
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- ping magic: 0 v1 -- 0x7f9bc00044e0 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 <== osd.4 v1:172.21.15.103:6805/31037 12 ==== osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8 ==== 147+0+0 (unknown 2284314154 0 0) 0x7f9be0001060 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter ms_dispatch 0x55c1bd1f1c60 osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter in handle_osd_op_reply
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 7 subsystem: 14 client.4816.objecter handle_osd_op_reply 39 ondisk uv 6 in 53.1 attempt 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter  op 0 rval 0 len 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter handle_osd_op_reply completed tid 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _finish_op 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 6
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _session_op_remove 4 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 <== osd.4 v1:172.21.15.103:6805/31037 13 ==== osd_op_reply(40 foo [watch ping cookie 94290589651712] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 ==== 147+0+0 (unknown 1918824261 0 0) 0x7f9be0001060 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter ms_dispatch 0x55c1bd1f1c60 osd_op_reply(40 foo [watch ping cookie 94290589651712] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter in handle_osd_op_reply
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 7 subsystem: 14 client.4816.objecter handle_osd_op_reply 40 ondisk uv 0 in 53.1 attempt 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter  op 0 rval -107 len 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter handle_osd_op_reply completed tid 40
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _finish_op 40
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 5
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _session_op_remove 4 40
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 5 subsystem: 14 client.4816.objecter 0 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter _linger_ping 11 sent 1842.638147s gen 0 = osd:107 (last_error system:0 register_gen 0)
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 15 librados: write oid=foo nspace=test-rados-api-smithi079-36189-13
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _op_submit op 0x55c1bd1a4fe0
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter _calc_target epoch 83 base foo @50;test-rados-api-smithi079-36189-13 precalc_pgid 0 pgid 0.0 is_write
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter _calc_target target foo @53;test-rados-api-smithi079-36189-13 -> pgid 53.247618ce
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _calc_target  raw pgid 53.247618ce -> actual 53.2 acting [1,2] primary 1
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter _get_session s=0x55c1bd1b8470 osd=1 3
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _op_submit oid foo '@50;test-rados-api-smithi079-36189-13' '@53;test-rados-api-smithi079-36189-13' [write 0~128 in=128b] tid 41 osd.1
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1b8470 osd=1 3
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 1 41
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _send_op 41 to 53.2 on osd.1
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.79:6800/30996 -- osd_op(unknown.0.0:41 53.2 53:73186e24:test-rados-api-smithi079-36189-13::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x55c1bdc5a670 con 0x55c1bd1d5ee0
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1b8470 osd=1 4
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bec1a6700 priority: 10 subsystem: 15 librados: handle_error cookie 94290589651712 err -107
2021-04-14 01:40:16.406168 thread: 0x7f9beb9a5700 priority: 20 subsystem: 14 client.4816.objecter _linger_cancel linger_id=11
2021-04-14 01:40:16.406168 thread: 0x7f9beb9a5700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 3
2021-04-14 01:40:16.406168 thread: 0x7f9beb9a5700 priority: 15 subsystem: 14 client.4816.objecter _session_linger_op_remove 4 11
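
If I'm reading the tail of that log right, the unwatch (tid 39) completes, the linger ping (tid 40) comes back -107, and librados dispatches handle_error for cookie 94290589651712 only after the main thread has already started the next test's write (note the object namespace moving from ...-12 to ...-13). If that's the window, the usual client-side guard is to drain callbacks before the ctx dies; a minimal sketch against the standard librados API (illustrative, not necessarily the fix the test needs):

#include <rados/librados.hpp>
#include <cstdint>

// Drain queued watch/notify callbacks before the WatchCtx2 goes out of
// scope, so a late handle_error() cannot run against a dead stack frame.
void safe_teardown(librados::Rados& cluster, librados::IoCtx& ioctx,
                   uint64_t handle) {
  ioctx.unwatch2(handle);
  cluster.watch_flush();  // blocks until pending callbacks have finished
  // only after this point is it safe to destroy the WatchCtx2
}
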
Actions #3

Updated by Brad Hubbard about 3 years ago

  • Related to Bug #50042: rados/test.sh: api_watch_notify failures added
Actions #4

Updated by Brad Hubbard about 3 years ago

  • Status changed from New to Closed

This looks like an issue that only comes about because of the problem seen in https://tracker.ceph.com/issues/50042. Closing this as we shouldn't see it again once the related tracker is fixed.

Actions #5

Updated by Laura Flores 12 months ago

  • Tags set to test-failure
  • Status changed from Closed to New

/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253544

2023-04-29T06:53:02.055 INFO:tasks.workunit.client.0.smithi019.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1

...

2023-04-29T06:53:02.168 INFO:tasks.workunit.client.0.smithi019.stderr:bash: line 1: 149990 Segmentation fault      (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2023-04-29T06:53:02.168 INFO:tasks.workunit.client.0.smithi019.stderr:     149991 Done                    | tee ceph_test_rados_api_watch_notify_pp.log
2023-04-29T06:53:02.168 INFO:tasks.workunit.client.0.smithi019.stderr:     149992 Done                    | sed "s/^/      api_watch_notify_pp: /" 

Reopening this since there was another instance in main, and the fix for https://tracker.ceph.com/issues/50042 was merged several months ago.

Actions #6

Updated by Brad Hubbard 12 months ago

Laura Flores wrote:

/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253544
[...]

Reopening this since there was another instance in main, and the fix for https://tracker.ceph.com/issues/50042 was merged several months ago.

I'll try and take a look at the coredump tomorrow to see if I can confirm it's the same issue.

Actions #7

Updated by Laura Flores 12 months ago

Thanks Brad! Let me know how I can help.

I found another instance in Pacific:
/a/yuriw-2023-05-06_14:41:44-rados-pacific-release-distro-default-smithi/7264110

There is a core file and an executable available at /a/yuriw-2023-05-06_14:41:44-rados-pacific-release-distro-default-smithi/7264110/remote/smithi055/coredump.

Actions #8

Updated by Laura Flores 12 months ago

  • Backport set to pacific
Actions #9

Updated by Brad Hubbard 12 months ago

Looking at the kern.log.gz file to get some hints this time.

2023-04-29T06:53:02.007129+00:00 smithi019 kernel: io_context_pool[179499]: segfault at 52 ip 00007f318fc13373 sp 00007f31856249c0 error 4 in librados.so.2.0.0[7f318fb86000+127000]
(gdb) info threads
  Id   Target Id                          Frame 
* 1    Thread 0x7f3185627700 (LWP 179499) 0x00007f318fc13373 in WatchInfo::handle_error (err=-107, cookie=94770927113072, this=0x5631936e1a80) at /usr/src/debug/ceph-18.0.0-3608.g9cd39654.el8.x86_64/src/librados/IoCtxImpl.cc:1610
  2    Thread 0x7f318762b700 (LWP 178238) futex_wait_cancelable (private=0, expected=0, futex_word=0x5631935db7a4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7f318808c700 (LWP 178243) 0x00007f318c540027 in epoll_wait (epfd=3, events=0x563193650360, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  4    Thread 0x7f3185e28700 (LWP 178244) 0x00007f318c540027 in epoll_wait (epfd=6, events=0x5631935ee530, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f317de28700 (LWP 178245) 0x00007f318c540027 in epoll_wait (epfd=9, events=0x5631936a4170, maxevents=5000, timeout=10000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7f3186e2a700 (LWP 179497) futex_abstimed_wait_cancelable (private=0, abstime=0x7f3186e27d60, expected=0, futex_word=0x7f3174001858) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  7    Thread 0x7f3186629700 (LWP 179498) futex_wait_cancelable (private=0, expected=0, futex_word=0x56319357a45c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  8    Thread 0x7f3184e26700 (LWP 179502) 0x00007f318c5350e1 in __GI___poll (fds=fds@entry=0x7f3184e23d20, nfds=nfds@entry=2, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  9    Thread 0x7f3188b22880 (LWP 149990) futex_wait_cancelable (private=0, expected=0, futex_word=0x7fff1e68ebc8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  10   Thread 0x7f317ffff700 (LWP 179504) futex_abstimed_wait_cancelable (private=0, abstime=0x7f317fffcd40, expected=0, futex_word=0x563193708a44) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  11   Thread 0x7f317f7fe700 (LWP 179505) futex_wait_cancelable (private=0, expected=0, futex_word=0x5631936059d0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  12   Thread 0x7f317effd700 (LWP 179506) futex_wait_cancelable (private=0, expected=0, futex_word=0x563193605b50) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  13   Thread 0x7f317e7fc700 (LWP 179508) futex_abstimed_wait_cancelable (private=0, abstime=0x7f317e7f9d30, expected=0, futex_word=0x5631935c49f8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  14   Thread 0x7f317d627700 (LWP 179945) futex_wait_cancelable (private=0, expected=0, futex_word=0x5631935c4ef8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88

Thread 1.

(gdb) x/i $pc
=> 0x7f318fc13373 <fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)+195>:   mov    0x18(%rax),%rax
(gdb) info reg rax
rax            0x3a                58
(gdb)  i r rdi
rdi            0x7fff1e68ef40      140733703581504
(gdb) x/a 0x7fff1e68ef40
0x7fff1e68ef40: 0x3a
(gdb) p/d 0x3a
$1 = 58
(gdb) p ctx2
$2 = (librados::v14_2_0::WatchCtx2 *) 0x7fff1e68ef40
(gdb) p *ctx2
$3 = {_vptr.WatchCtx2 = 0x3a}
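
As a cross-check against the kernel line above: the faulting instruction here loads the vtable slot (mov 0x18(%rax),%rax), so the faulting address is the bogus vptr plus the slot offset, 0x3a + 0x18 = 0x52, exactly the "segfault at 52" the kernel reported. Likewise the faulting ip is 0x7f318fc13373 - 0x7f318fb86000 = offset 0x8d373 into librados.so, squarely inside WatchInfo::handle_error.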

So this looks like the same issue. I'll try and dig for more info here tomorrow as I've run out of time today.

Actions #10

Updated by Laura Flores 9 months ago

/a/yuriw-2023-08-02_20:21:03-rados-wip-yuri3-testing-2023-08-01-0825-pacific-distro-default-smithi/7358894

Actions #11

Updated by Aishwarya Mathuria 7 months ago

/a/yuriw-2023-10-05_21:43:37-rados-wip-yuri6-testing-2023-10-04-0901-distro-default-smithi/7412033

Actions #12

Updated by Laura Flores 5 months ago

/a/yuriw-2023-12-08_21:24:43-rados-wip-yuri5-testing-2023-12-07-0740-pacific-distro-default-smithi/7485076

Actions #13

Updated by Radoslaw Zarzynski 4 months ago

bump up

Actions #14

Updated by Laura Flores 25 days ago

/a/yuriw-2024-03-24_22:19:24-rados-wip-yuri10-testing-2024-03-24-1159-distro-default-smithi/7620506
