Bug #50371
Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp
Status: Open
Description
/a/nojha-2021-04-14_00:54:53-rados-master-distro-basic-smithi/6044164
2021-04-14T01:40:16.439 INFO:tasks.workunit.client.0.smithi079.stdout: api_watch_notify_pp: [ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:bash: line 1: 36189 Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr: 36190 Done | tee ceph_test_rados_api_watch_notify_pp.log
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr: 36192 Done | sed "s/^/ api_watch_notify_pp: /"
Updated by Brad Hubbard about 3 years ago
Looking at the coredump.
(gdb) bt
#0  0x00007f9bf380a752 in WatchInfo::handle_error (err=-107, cookie=94290589651712, this=0x55c1bd1a6ec0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:1605
#1  WatchInfo::operator() (bl=..., notifier_id=0, cookie=94290589651712, notify_id=<optimized out>, ec=..., this=0x55c1bd1a6ec0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:1614
#2  fu2::abi_310::detail::invocation::invoke<WatchInfo&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list> (callable=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:126
#3  fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (data=<optimized out>, capacity=<optimized out>, args#0=..., args#1=<optimized out>, args#2=94290589651712, args#3=0, args#4=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:518
#4  0x00007f9bf3857b97 in fu2::abi_310::detail::type_erasure::tables::vtable<fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >::invoke<0ul, fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long const&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list>(fu2::abi_310::detail::type_erasure::data_accessor*&&, unsigned long const&, boost::system::error_code&&, unsigned long&&, unsigned long&&, unsigned long&&, ceph::buffer::v15_2_0::list&&) const (this=0x55c1bd1a1700) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:1121
#5  fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >::invoke<0ul, fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list>(fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >&, boost::system::error_code&&, unsigned long&&, unsigned long&&, unsigned long&&, ceph::buffer::v15_2_0::list&&) (erasure=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:1125
#6  fu2::abi_310::detail::type_erasure::invocation_table::operator_impl<0ul, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::operator()(boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (args#4=..., args#3=0, args#2=94290589651712, args#1=0, args#0=..., this=0x55c1bd1a16f0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/include/function2.hpp:702
#7  CB_DoWatchError::operator() (this=<synthetic pointer>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/osdc/Objecter.cc:642
#8  boost::asio::asio_handler_invoke<CB_DoWatchError> (function=<synthetic pointer>...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/handler_invoke_hook.hpp:88
#9  boost_asio_handler_invoke_helpers::invoke<CB_DoWatchError, CB_DoWatchError> (context=<synthetic pointer>..., function=<synthetic pointer>...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#10 boost::asio::detail::handler_work<CB_DoWatchError, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0u>, void>::complete<CB_DoWatchError> (this=<synthetic pointer>, handler=<synthetic pointer>..., function=<synthetic pointer>...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_work.hpp:425
#11 boost::asio::detail::completion_handler<CB_DoWatchError, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0u> >::do_complete (owner=<optimized out>, base=0x7f9bc0002280) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/completion_handler.hpp:74
#12 0x00007f9bf381ba95 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x55c1bd0a7e80, this=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#13 boost::asio::detail::strand_service::do_complete (owner=0x55c1bd0a7e80, base=0x55c1bd1d5160, ec=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/strand_service.ipp:169
#14 0x00007f9bf381dd88 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x55c1bd0a7e80, this=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#15 boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=0x55c1bd0a7e80) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/scheduler.ipp:481
#16 boost::asio::detail::scheduler::run (this=0x55c1bd0a7e80, ec=...) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/scheduler.ipp:204
#17 0x00007f9bf382315a in boost::asio::io_context::run (this=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/impl/io_context.ipp:63
#18 ceph::async::io_context_pool::start(short)::{lambda()#1}::operator()() const (__closure=0x55c1bd1b2528) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/common/async/context_pool.h:68
#19 std::__invoke_impl<void, ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::__invoke_other, ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#20 std::__invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::__invoke_result&&, (ceph::async::io_context_pool::start(short)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#21 std::invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::invoke_result&&, (ceph::async::io_context_pool::start(short)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/8/functional:81
#22 make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}::operator()<{lambda()#1}> (fun=..., this=0x55c1bd1b2530) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/common/Thread.h:79
#23 std::__invoke_impl<void, make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}>(std::__invoke_other, make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}&&, {lambda()#1}&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#24 std::__invoke<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}>(make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}&&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#25 std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (this=0x55c1bd1b2528) at /usr/include/c++/8/thread:244
#26 std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> >::operator()() (this=0x55c1bd1b2528) at /usr/include/c++/8/thread:253
#27 std::thread::_State_impl<std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> > >::_M_run() (this=0x55c1bd1b2520) at /usr/include/c++/8/thread:196
#28 0x00007f9bf0a65ba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#29 0x00007f9bf354314a in start_thread () from /lib64/libpthread.so.0
#30 0x00007f9bf0142f23 in clone () from /lib64/libc.so.6
Thread 2 matches the segfaulting PID from the log entry (36189) and is the main() thread, so that makes sense.
(gdb) info thread 2
  Id   Target Id                           Frame
  2    Thread 0x7f9bf3d11540 (LWP 36189)   0x00007f9bf35492fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Why did it segfault?
(gdb) x/i $pc
=> 0x7f9bf380a752 <fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)+114>: callq *0x18(%rax)
(gdb) info reg rax
rax            0x3331              13105
So we segfaulted dereferencing a bogus address. Where did it come from?
(gdb) disass /s
1605            ctx2->handle_error(cookie, err);
   0x00007f9bf380a749 <+105>: mov    (%rdi),%rax
   0x00007f9bf380a74c <+108>: mov    %r13d,%edx
   0x00007f9bf380a74f <+111>: mov    %rbp,%rsi
=> 0x00007f9bf380a752 <+114>: callq  *0x18(%rax)
(gdb) i r rdi
rdi            0x7ffd93cb5b60      140727083031392
(gdb) x/a 0x7ffd93cb5b60
0x7ffd93cb5b60: 0x3331
(gdb) p ctx2
$15 = (librados::v14_2_0::WatchCtx2 *) 0x7ffd93cb5b60
(gdb) p this
$17 = (WatchInfo * const) 0x55c1bd1a6ec0
(gdb) p *ctx2
$16 = {_vptr.WatchCtx2 = 0x3331}
So the WatchCtx2 pointer in our WatchInfo object is bad.
Let's look for those values in the other threads.
We only see the WatchInfo* in thread 1, and the address of the WatchCtx2* doesn't show up anywhere else, but the bogus value does appear in the dump of thread 2.
#6  0x000055c1bcaea9bd in LibRadosWatchNotifyPP_AioWatchNotify2_Test::TestBody (this=0x55c1bd093e60) at /usr/include/c++/8/ext/new_allocator.h:79
        gtest_ar = {success_ = true, message_ = std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >> = {get() = 0x7f9bf04046c0 <_IO_2_1_stdout_>}}
        buf = '\314' <repeats 127 times>, <incomplete sequence \314>
        bl1 = {_buffers = {_root = {next = 0x55c1bd0c1de0}, _tail = 0x55c1bd0c1de0}, _carriage = 0x55c1bd0c1de0, _len = 128, _num = 1, static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}
        handle = 140307781909789
        ctx = {<librados::v14_2_0::WatchCtx2> = {_vptr.WatchCtx2 = 0x3331}, members of WatchNotifyTestCtx2: notify = 0x0}
(gdb) t 2
[Switching to thread 2 (Thread 0x7f9bf3d11540 (LWP 36189))]
#0  0x00007f9bf35492fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f9bf35492fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9bf0a5f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f9bf37fac4b in std::condition_variable::wait<librados::IoCtxImpl::operate(const object_t&, ObjectOperation*, ceph::real_time*, int)::<lambda()> > (__p=..., __lock=..., this=0x7ffd93cb5770) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:667
#3  librados::IoCtxImpl::operate (this=0x55c1bd1ade80, oid=..., o=<optimized out>, pmtime=<optimized out>, flags=<optimized out>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:667
#4  0x00007f9bf37fb015 in librados::IoCtxImpl::write (this=0x55c1bd1ade80, oid=..., bl=..., len=len@entry=128, off=off@entry=0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/IoCtxImpl.cc:594
#5  0x00007f9bf37cf9b1 in librados::v14_2_0::IoCtx::write (this=this@entry=0x55c1bd093e80, oid="foo", bl=..., len=len@entry=128, off=off@entry=0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/librados/librados_cxx.cc:1252
#6  0x000055c1bcaea9bd in LibRadosWatchNotifyPP_AioWatchNotify2_Test::TestBody (this=0x55c1bd093e60) at /usr/include/c++/8/ext/new_allocator.h:79
#7  0x000055c1bcb2f0be in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x55c1bcb43709 "the test body", method=<optimized out>, object=0x55c1bd093e60) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2586
#8  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x55c1bd093e60, method=<optimized out>, location=location@entry=0x55c1bcb43709 "the test body") at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2641
#9  0x000055c1bcb21ccb in testing::Test::Run (this=0x55c1bd093e60) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2680
#10 testing::Test::Run (this=0x55c1bd093e60) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2670
#11 0x000055c1bcb21e35 in testing::TestInfo::Run (this=0x55c1bd0aed70) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2858
#12 testing::TestInfo::Run (this=0x55c1bd0aed70) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2831
#13 0x000055c1bcb22391 in testing::TestSuite::Run (this=0x55c1bd0abef0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:3012
#14 testing::TestSuite::Run (this=0x55c1bd0abef0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2991
#15 0x000055c1bcb23fa5 in testing::internal::UnitTestImpl::RunAllTests (this=0x55c1bd0a0de0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:5720
#16 0x000055c1bcb2f62e in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x55c1bcb44d08 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x55c1bd0a0de0) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2586
#17 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x55c1bd0a0de0, method=<optimized out>, location=location@entry=0x55c1bcb44d08 "auxiliary test code (environments or event listeners)") at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/src/gtest.cc:2641
#18 0x000055c1bcb21f00 in testing::UnitTest::Run (this=0x55c1bcd677a0 <testing::UnitTest::GetInstance()::instance>) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/include/gtest/gtest.h:1429
#19 0x000055c1bcae4a5d in RUN_ALL_TESTS () at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googletest/include/gtest/gtest.h:2486
#20 main (argc=<optimized out>, argv=0x7ffd93cb60e8) at /usr/src/debug/ceph-17.0.0-3132.g8f57c508.el8.x86_64/src/googletest/googlemock/src/gmock_main.cc:70
After looking at this for some time, I think we may have a problem with the reuse of the ioctx and, specifically, with unwatch leaving the LingerOp.handle in an inconsistent state. It looks like we are using a LingerOp.handle that was populated with local variables that have since gone out of scope. I need to look into this further.
Updated by Brad Hubbard about 3 years ago
Here's the log from the coredump that I believe should pinpoint the issue, but I'll need to analyse it further today.
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:39 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch unwatch cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x55c1bdc5a670 con 0x55c1bd1ace70
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 5
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter tick
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter pinging osd that serves lingering tid 11 (osd.4)
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter _send_linger_ping 11 now 1842.638147s
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1c5740 osd=4 4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 4 40
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _send_op 40 to 53.1 on osd.4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:40 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch ping cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x7f9bc0003ee0 con 0x55c1bd1ace70
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter _maybe_request_map subscribing (onetime) to next osd map
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 31 monclient: _renew_subs
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 31 monclient: _send_mon_message to mon.a at v1:172.21.15.79:6789/0
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.79:6789/0 -- mon_subscribe({osdmap=84}) v3 -- 0x7f9bc0004290 con 0x7f9bd00618d0
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- ping magic: 0 v1 -- 0x7f9bc00044e0 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 <== osd.4 v1:172.21.15.103:6805/31037 12 ==== osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8 ==== 147+0+0 (unknown 2284314154 0 0) 0x7f9be0001060 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter ms_dispatch 0x55c1bd1f1c60 osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter in handle_osd_op_reply
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 7 subsystem: 14 client.4816.objecter handle_osd_op_reply 39 ondisk uv 6 in 53.1 attempt 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter op 0 rval 0 len 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter handle_osd_op_reply completed tid 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _finish_op 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 6
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _session_op_remove 4 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 <== osd.4 v1:172.21.15.103:6805/31037 13 ==== osd_op_reply(40 foo [watch ping cookie 94290589651712] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 ==== 147+0+0 (unknown 1918824261 0 0) 0x7f9be0001060 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter ms_dispatch 0x55c1bd1f1c60 osd_op_reply(40 foo [watch ping cookie 94290589651712] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter in handle_osd_op_reply
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 7 subsystem: 14 client.4816.objecter handle_osd_op_reply 40 ondisk uv 0 in 53.1 attempt 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter op 0 rval -107 len 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter handle_osd_op_reply completed tid 40
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _finish_op 40
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 5
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _session_op_remove 4 40
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 5 subsystem: 14 client.4816.objecter 0 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter _linger_ping 11 sent 1842.638147s gen 0 = osd:107 (last_error system:0 register_gen 0)
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 15 librados: write oid=foo nspace=test-rados-api-smithi079-36189-13
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _op_submit op 0x55c1bd1a4fe0
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter _calc_target epoch 83 base foo @50;test-rados-api-smithi079-36189-13 precalc_pgid 0 pgid 0.0 is_write
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter _calc_target target foo @53;test-rados-api-smithi079-36189-13 -> pgid 53.247618ce
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _calc_target raw pgid 53.247618ce -> actual 53.2 acting [1,2] primary 1
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter _get_session s=0x55c1bd1b8470 osd=1 3
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _op_submit oid foo '@50;test-rados-api-smithi079-36189-13' '@53;test-rados-api-smithi079-36189-13' [write 0~128 in=128b] tid 41 osd.1
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1b8470 osd=1 3
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 1 41
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _send_op 41 to 53.2 on osd.1
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.79:6800/30996 -- osd_op(unknown.0.0:41 53.2 53:73186e24:test-rados-api-smithi079-36189-13::foo:head [write 0~128 in=128b] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x55c1bdc5a670 con 0x55c1bd1d5ee0
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1b8470 osd=1 4
2021-04-14 01:40:16.376168 thread: 0x7f9bf3d11540 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bec1a6700 priority: 10 subsystem: 15 librados: handle_error cookie 94290589651712 err -107
2021-04-14 01:40:16.406168 thread: 0x7f9beb9a5700 priority: 20 subsystem: 14 client.4816.objecter _linger_cancel linger_id=11
2021-04-14 01:40:16.406168 thread: 0x7f9beb9a5700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 3
2021-04-14 01:40:16.406168 thread: 0x7f9beb9a5700 priority: 15 subsystem: 14 client.4816.objecter _session_linger_op_remove 4 11
Updated by Brad Hubbard about 3 years ago
- Related to Bug #50042: rados/test.sh: api_watch_notify failures added
Updated by Brad Hubbard about 3 years ago
- Status changed from New to Closed
This looks like an issue that only comes about because of the problem seen in https://tracker.ceph.com/issues/50042. Closing this as we shouldn't see it again once the related tracker is fixed.
Updated by Laura Flores 12 months ago
- Tags set to test-failure
- Status changed from Closed to New
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253544
2023-04-29T06:53:02.055 INFO:tasks.workunit.client.0.smithi019.stdout: api_watch_notify_pp: [ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
...
2023-04-29T06:53:02.168 INFO:tasks.workunit.client.0.smithi019.stderr:bash: line 1: 149990 Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2023-04-29T06:53:02.168 INFO:tasks.workunit.client.0.smithi019.stderr: 149991 Done | tee ceph_test_rados_api_watch_notify_pp.log
2023-04-29T06:53:02.168 INFO:tasks.workunit.client.0.smithi019.stderr: 149992 Done | sed "s/^/ api_watch_notify_pp: /"
Reopening this since there was another instance in main, and the fix for https://tracker.ceph.com/issues/50042 was merged several months ago.
Updated by Brad Hubbard 12 months ago
Laura Flores wrote:
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253544
[...]
Reopening this since there was another instance in main, and the fix for https://tracker.ceph.com/issues/50042 was merged several months ago.
I'll try and take a look at the coredump tomorrow to see if I can confirm it's the same issue.
Updated by Laura Flores 12 months ago
Thanks Brad! Let me know how I can help.
I found another instance in Pacific:
/a/yuriw-2023-05-06_14:41:44-rados-pacific-release-distro-default-smithi/7264110
There is a core file and an executable available at /a/yuriw-2023-05-06_14:41:44-rados-pacific-release-distro-default-smithi/7264110/remote/smithi055/coredump.
Updated by Brad Hubbard 12 months ago
Looking at the kern.log.gz file to get some hints this time.
2023-04-29T06:53:02.007129+00:00 smithi019 kernel: io_context_pool[179499]: segfault at 52 ip 00007f318fc13373 sp 00007f31856249c0 error 4 in librados.so.2.0.0[7f318fb86000+127000]
(gdb) info threads
  Id   Target Id                            Frame
* 1    Thread 0x7f3185627700 (LWP 179499)   0x00007f318fc13373 in WatchInfo::handle_error (err=-107, cookie=94770927113072, this=0x5631936e1a80) at /usr/src/debug/ceph-18.0.0-3608.g9cd39654.el8.x86_64/src/librados/IoCtxImpl.cc:1610
  2    Thread 0x7f318762b700 (LWP 178238)   futex_wait_cancelable (private=0, expected=0, futex_word=0x5631935db7a4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7f318808c700 (LWP 178243)   0x00007f318c540027 in epoll_wait (epfd=3, events=0x563193650360, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  4    Thread 0x7f3185e28700 (LWP 178244)   0x00007f318c540027 in epoll_wait (epfd=6, events=0x5631935ee530, maxevents=5000, timeout=30000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f317de28700 (LWP 178245)   0x00007f318c540027 in epoll_wait (epfd=9, events=0x5631936a4170, maxevents=5000, timeout=10000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7f3186e2a700 (LWP 179497)   futex_abstimed_wait_cancelable (private=0, abstime=0x7f3186e27d60, expected=0, futex_word=0x7f3174001858) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  7    Thread 0x7f3186629700 (LWP 179498)   futex_wait_cancelable (private=0, expected=0, futex_word=0x56319357a45c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  8    Thread 0x7f3184e26700 (LWP 179502)   0x00007f318c5350e1 in __GI___poll (fds=fds@entry=0x7f3184e23d20, nfds=nfds@entry=2, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  9    Thread 0x7f3188b22880 (LWP 149990)   futex_wait_cancelable (private=0, expected=0, futex_word=0x7fff1e68ebc8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  10   Thread 0x7f317ffff700 (LWP 179504)   futex_abstimed_wait_cancelable (private=0, abstime=0x7f317fffcd40, expected=0, futex_word=0x563193708a44) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  11   Thread 0x7f317f7fe700 (LWP 179505)   futex_wait_cancelable (private=0, expected=0, futex_word=0x5631936059d0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  12   Thread 0x7f317effd700 (LWP 179506)   futex_wait_cancelable (private=0, expected=0, futex_word=0x563193605b50) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  13   Thread 0x7f317e7fc700 (LWP 179508)   futex_abstimed_wait_cancelable (private=0, abstime=0x7f317e7f9d30, expected=0, futex_word=0x5631935c49f8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  14   Thread 0x7f317d627700 (LWP 179945)   futex_wait_cancelable (private=0, expected=0, futex_word=0x5631935c4ef8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
Thread 1.
(gdb) x/i $pc
=> 0x7f318fc13373 <fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)+195>: mov 0x18(%rax),%rax
(gdb) info reg rax
rax            0x3a                58
(gdb) i r rdi
rdi            0x7fff1e68ef40      140733703581504
(gdb) x/a 0x7fff1e68ef40
0x7fff1e68ef40: 0x3a
(gdb) p/d 0x3a
$1 = 58
(gdb) p ctx2
$2 = (librados::v14_2_0::WatchCtx2 *) 0x7fff1e68ef40
(gdb) p *ctx2
$3 = {_vptr.WatchCtx2 = 0x3a}
So this looks like the same issue. I'll try and dig for more info here tomorrow as I've run out of time today.
Updated by Laura Flores 9 months ago
/a/yuriw-2023-08-02_20:21:03-rados-wip-yuri3-testing-2023-08-01-0825-pacific-distro-default-smithi/7358894
Updated by Aishwarya Mathuria 7 months ago
/a/yuriw-2023-10-05_21:43:37-rados-wip-yuri6-testing-2023-10-04-0901-distro-default-smithi/7412033
Updated by Laura Flores 5 months ago
/a/yuriw-2023-12-08_21:24:43-rados-wip-yuri5-testing-2023-12-07-0740-pacific-distro-default-smithi/7485076
Updated by Laura Flores 25 days ago
/a/yuriw-2024-03-24_22:19:24-rados-wip-yuri10-testing-2024-03-24-1159-distro-default-smithi/7620506