Bug #57152

segfault in librados via libcephsqlite

Added by Patrick Donnelly 4 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
librados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a post on the ML about a segfault in the mgr:

"[ceph-users] Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug External"

The user kindly shared the core dump, which is available here:

smithi098 in /tmp/sqlite-seg

root@smithi098:/tmp/sqlite-seg# gdb -q  /usr/bin/sqlite3
...
(gdb) core CoreDump 
[New LWP 481006]
[New LWP 481010]
[New LWP 480968]
[New LWP 481002]
[New LWP 481013]
[New LWP 481003]
[New LWP 481004]
[New LWP 481011]
[New LWP 481014]
[New LWP 481005]
[New LWP 481015]
[New LWP 481012]
[New LWP 481045]
[New LWP 481016]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `sqlite3'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
599     ./src/include/buffer.h: No such file or directory.
[Current thread is 1 (Thread 0x7fc1b6d7c700 (LWP 481006))]
(gdb) bt
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
#1  ceph::buffer::v15_2_0::list::buffers_t::operator= (other=..., this=0x55d3974d3ed0) at ./src/include/buffer.h:505
#2  ceph::buffer::v15_2_0::list::operator= (this=0x55d3974d3ed0, other=...) at ./src/include/buffer.h:970
#3  0x00007fc1bad35606 in Message::claim_data (bl=..., this=0x7fc1a001b680) at ./src/msg/Message.h:431
#4  Objecter::handle_osd_op_reply (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:3509
#5  0x00007fc1bad37064 in Objecter::ms_dispatch (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:984
#6  0x00007fc1bad3add6 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at ./src/osdc/Objecter.h:2659
#7  0x00007fc1ba588b00 in Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/DispatchQueue.cc:75
#8  DispatchQueue::fast_dispatch (this=0x55d3974356a0, m=...) at ./src/msg/DispatchQueue.cc:74
#9  0x00007fc1ba68b794 in DispatchQueue::fast_dispatch (m=0x7fc1a001b680, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:67
#10 ProtocolV2::handle_message (this=<optimized out>) at ./src/msg/async/ProtocolV2.cc:1496
#11 0x00007fc1ba6a17f0 in ProtocolV2::handle_read_frame_dispatch (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1157
#12 0x00007fc1ba6a1ae5 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1347
#13 0x00007fc1ba6a2e14 in ProtocolV2::handle_read_frame_epilogue_main (this=0x55d3974ee540, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1324
#14 0x00007fc1ba686b79 in ProtocolV2::run_continuation (this=0x55d3974ee540, continuation=...) at ./src/msg/async/ProtocolV2.cc:49
#15 0x00007fc1ba649504 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55d3973b9c08)
    at /usr/include/c++/9/bits/std_function.h:683
#16 AsyncConnection::process (this=0x55d3973b9870) at ./src/msg/async/AsyncConnection.cc:454
#17 0x00007fc1ba6ac185 in EventCenter::process_events (this=this@entry=0x55d3974392a0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7fc1b6d7be28) at /usr/include/c++/9/bits/basic_ios.h:282
#18 0x00007fc1ba6b464b in NetworkStack::<lambda()>::operator() (__closure=<optimized out>, __closure=<optimized out>) at ./src/msg/async/Stack.cc:50
#19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#20 0x00007fc1b9f6bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007fc1bb404609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fc1bb323133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fc1bb1cf740 (LWP 480968))]
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d397944770) at ../sysdeps/nptl/futex-internal.h:183
183     ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d397944770) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d397944720, cond=0x55d397944748) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x55d397944748, mutex=0x55d397944720) at pthread_cond_wait.c:647
#3  0x00007fc1b9f65e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fc1bac9ee5b in std::condition_variable::wait<librados::AioCompletionImpl::wait_for_complete()::{lambda()#1}>(std::unique_lock<std::mutex>&, librados::AioCompletionImpl::wait_for_complete()::{lambda()#1}) (__p=..., __lock=..., this=0x55d397944748) at ./src/librados/AioCompletionImpl.h:63
#5  librados::AioCompletionImpl::wait_for_complete (this=0x55d397944720) at ./src/librados/AioCompletionImpl.h:63
#6  librados::v14_2_0::AioCompletion::wait_for_complete (this=<optimized out>) at ./src/librados/librados_cxx.cc:1050
#7  0x00007fc1bae649c0 in SimpleRADOSStriper::read (this=0x55d3974e6c70, data=data@entry=0x55d3974eee38, len=len@entry=65536, off=off@entry=4129788)
    at /usr/include/c++/9/bits/unique_ptr.h:360
#8  0x00007fc1bae2d428 in Read (file=0x55d3974dd260, buf=0x55d3974eee38, len=65536, off=4129788) at /usr/include/c++/9/bits/unique_ptr.h:360
#9  0x000055d396b6eccb in sqlite3OsRead (offset=<optimized out>, amt=<optimized out>, pBuf=0x55d3974eee38, id=0x55d3974dd260) at sqlite3.c:53758
#10 pager_playback_one_page (pPager=pPager@entry=0x55d3974dcfd8, pOffset=pOffset@entry=0x55d3974dd038, pDone=pDone@entry=0x0, isMainJrnl=isMainJrnl@entry=1, isSavepnt=isSavepnt@entry=0)
    at sqlite3.c:53758
#11 0x000055d396b6f7ac in pager_playback (pPager=pPager@entry=0x55d3974dcfd8, isHot=1) at sqlite3.c:54349
#12 0x000055d396b70c2d in sqlite3PagerSharedLock (pPager=0x55d3974dcfd8) at sqlite3.c:56768
#13 0x000055d396b713f8 in lockBtree (pBt=0x55d397343f98) at sqlite3.c:67193
#14 sqlite3BtreeBeginTrans (p=0x55d3974dd698, wrflag=wrflag@entry=0, pSchemaVersion=pSchemaVersion@entry=0x0) at sqlite3.c:2032
#15 0x000055d396ba6c21 in sqlite3InitOne (db=0x55d3973b7eb8, iDb=iDb@entry=0, pzErrMsg=pzErrMsg@entry=0x7ffecd8e4078, mFlags=mFlags@entry=0) at sqlite3.c:127249
#16 0x000055d396ba6dec in sqlite3Init (db=db@entry=0x55d3973b7eb8, pzErrMsg=pzErrMsg@entry=0x7ffecd8e4078) at sqlite3.c:127434
#17 0x000055d396ba6e2f in sqlite3ReadSchema (pParse=pParse@entry=0x7ffecd8e4070) at sqlite3.c:127460
#18 0x000055d396bb489d in sqlite3Pragma (pParse=pParse@entry=0x7ffecd8e4070, pId1=pId1@entry=0x7ffecd8e36b8, pId2=pId2@entry=0x7ffecd8e36d0, pValue=pValue@entry=0x0, 
    minusFlag=minusFlag@entry=0) at sqlite3.c:124887
#19 0x000055d396bb6e79 in yy_reduce (yyLookahead=<optimized out>, pParse=0x7ffecd8e4070, yyLookaheadToken=..., yyruleno=239, yypParser=0x7ffecd8e3670) at sqlite3.c:156709
#20 sqlite3Parser (yyminor=..., yymajor=<optimized out>, yyp=0x7ffecd8e3670) at sqlite3.c:26203
#21 sqlite3RunParser (pParse=0x7ffecd8e4070, zSql=0x55d396bf007e "", pzErrMsg=0x7ffecd8e4068) at sqlite3.c:27477
#22 0x000055d396bbbd33 in sqlite3Prepare (db=db@entry=0x55d3973b7eb8, zSql=zSql@entry=0x55d396bf006a "PRAGMA database_list", nBytes=nBytes@entry=-1, prepFlags=prepFlags@entry=128, 
    pReprepare=pReprepare@entry=0x0, ppStmt=ppStmt@entry=0x7ffecd8e4360, pzTail=0x0) at sqlite3.c:127661
#23 0x000055d396bbc0c7 in sqlite3LockAndPrepare (db=0x55d3973b7eb8, zSql=0x55d396bf006a "PRAGMA database_list", nBytes=-1, prepFlags=prepFlags@entry=128, pOld=pOld@entry=0x0, 
    ppStmt=0x7ffecd8e4360, pzTail=0x0) at sqlite3.c:127733
#24 0x000055d396bbc1da in sqlite3_prepare_v2 (db=<optimized out>, zSql=<optimized out>, nBytes=<optimized out>, ppStmt=<optimized out>, pzTail=<optimized out>) at sqlite3.c:127817
#25 0x000055d396b059e4 in do_meta_command (zLine=<optimized out>, p=<optimized out>) at shell.c:17672
#26 0x000055d396b09ff1 in process_input (p=0x7ffecd8e5c60) at shell.c:18365
#27 0x000055d396ae9c27 in main (argc=<optimized out>, argv=<optimized out>) at shell.c:19156
(gdb) frame 7
#7  0x00007fc1bae649c0 in SimpleRADOSStriper::read (this=0x55d3974e6c70, data=data@entry=0x55d3974eee38, len=len@entry=65536, off=off@entry=4129788)
    at /usr/include/c++/9/bits/unique_ptr.h:360
360           get() const noexcept
(gdb) print reads
$1 = std::vector of length 2, capacity 2 = {{first = {_buffers = {_root = {next = 0x7fc1a00068b0}, _tail = 0x7fc1a00068b0}, 
      _carriage = 0x7fc1bae0d180 <ceph::buffer::v15_2_0::list::always_empty_bptr>, _len = 0, _num = 0, static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {
          next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}, second = std::unique_ptr<librados::v14_2_0::AioCompletion> = {
      get() = 0x7fc1a0000be0}}, {first = {_buffers = {_root = {next = 0x7fc1a00068d8}, _tail = 0x7fc1a00068d8}, _carriage = 0x7fc1bae0d180 <ceph::buffer::v15_2_0::list::always_empty_bptr>, 
      _len = 0, _num = 0, static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}, 
    second = std::unique_ptr<librados::v14_2_0::AioCompletion> = {get() = 0x55d3978dc290}}}

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fc1b6d7c700 (LWP 481006))]
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
599     ./src/include/buffer.h: No such file or directory.
(gdb) bt
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
#1  ceph::buffer::v15_2_0::list::buffers_t::operator= (other=..., this=0x55d3974d3ed0) at ./src/include/buffer.h:505
#2  ceph::buffer::v15_2_0::list::operator= (this=0x55d3974d3ed0, other=...) at ./src/include/buffer.h:970
#3  0x00007fc1bad35606 in Message::claim_data (bl=..., this=0x7fc1a001b680) at ./src/msg/Message.h:431
#4  Objecter::handle_osd_op_reply (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:3509
#5  0x00007fc1bad37064 in Objecter::ms_dispatch (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:984
#6  0x00007fc1bad3add6 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at ./src/osdc/Objecter.h:2659
#7  0x00007fc1ba588b00 in Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/DispatchQueue.cc:75
#8  DispatchQueue::fast_dispatch (this=0x55d3974356a0, m=...) at ./src/msg/DispatchQueue.cc:74
#9  0x00007fc1ba68b794 in DispatchQueue::fast_dispatch (m=0x7fc1a001b680, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:67
#10 ProtocolV2::handle_message (this=<optimized out>) at ./src/msg/async/ProtocolV2.cc:1496
#11 0x00007fc1ba6a17f0 in ProtocolV2::handle_read_frame_dispatch (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1157
#12 0x00007fc1ba6a1ae5 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1347
#13 0x00007fc1ba6a2e14 in ProtocolV2::handle_read_frame_epilogue_main (this=0x55d3974ee540, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1324
#14 0x00007fc1ba686b79 in ProtocolV2::run_continuation (this=0x55d3974ee540, continuation=...) at ./src/msg/async/ProtocolV2.cc:49
#15 0x00007fc1ba649504 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55d3973b9c08)
    at /usr/include/c++/9/bits/std_function.h:683
#16 AsyncConnection::process (this=0x55d3973b9870) at ./src/msg/async/AsyncConnection.cc:454
#17 0x00007fc1ba6ac185 in EventCenter::process_events (this=this@entry=0x55d3974392a0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7fc1b6d7be28) at /usr/include/c++/9/bits/basic_ios.h:282
#18 0x00007fc1ba6b464b in NetworkStack::<lambda()>::operator() (__closure=<optimized out>, __closure=<optimized out>) at ./src/msg/async/Stack.cc:50
#19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#20 0x00007fc1b9f6bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007fc1bb404609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fc1bb323133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) print *this
$2 = {_root = {next = 0x0}, _tail = 0x55d39730c010}

relevant cephsqlite code is here:

https://github.com/ceph/ceph/blob/v17.2.3/src/SimpleRADOSStriper.cc#L491-L501

I'm not seeing an obvious API violation on the libcephsqlite side. It looks like a bug in librados?


Related issues

Copied to RADOS - Backport #57372: quincy: segfault in librados via libcephsqlite Resolved
Copied to RADOS - Backport #57373: pacific: segfault in librados via libcephsqlite Resolved

History

#1 Updated by Matan Breizman 4 months ago

  • Assignee set to Matan Breizman

#2 Updated by Patrick Donnelly 4 months ago

The reporter on the ML shared the .mgr pool in question:

ceph-post-file: 29b36dc5-dba8-463d-bc0e-1349020f984f

#3 Updated by Matan Breizman 4 months ago

There seems to be an issue with the gcc version used to compile; I noticed a similar issue when compiling `examples/librados` with gcc 8.5 and 11.2:
  • 8.5: 'make' builds successfully.
  • 11.2:
    In file included from ../../src/include/rados/librados.hpp:11,
                     from hello_world.cc:14:
    ../../src/include/rados/buffer.h: In member function ‘void ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose()’:
    ../../src/include/rados/buffer.h:599:20: error: implicitly-declared ‘ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>& ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator=(const ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>&)’ is deprecated [-Werror=deprecated-copy]
      599 |           it = it->next;
          |                    ^~~~
    ../../src/include/rados/buffer.h:442:9: note: because ‘ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>’ has user-provided ‘ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<T>::buffers_iterator(const ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<T>&) [with T = ceph::buffer::v15_2_0::ptr_node]’
      442 |         buffers_iterator(const buffers_iterator<T>& other)
          |         ^~~~~~~~~~~~~~~~
    
    

From the bt shared here, buffers_t::clear_and_dispose seems to cause the segfault as well.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
599     ./src/include/buffer.h: No such file or directory.

#4 Updated by Matan Breizman 4 months ago

  • Pull request ID set to 47672

This PR seems to resolve the compilation errors mentioned above.
Please let me know your thoughts.

#5 Updated by Patrick Donnelly 4 months ago

Matan Breizman wrote:

This PR seems to resolve the compilation errors mentioned above.
Please let me know your thoughts.

So the problem is that gcc 8.5, which compiles successfully, generates code which causes the segfault?

#6 Updated by Matan Breizman 4 months ago

So the problem is that gcc 8.5, which compiles successfully, generates code which causes the segfault?

Given the error message `./src/include/buffer.h: No such file or directory.`, and the fact that buffer.h fails to compile with a newer version of gcc, is it possible that buffer was never actually built correctly and that caused the segfault?

#7 Updated by Patrick Donnelly 4 months ago

Matan Breizman wrote:

So the problem is that gcc 8.5, which compiles successfully, generates code which causes the segfault?

Given the error message `./src/include/buffer.h: No such file or directory.`, and the fact that buffer.h fails to compile with a newer version of gcc, is it possible that buffer was never actually built correctly and that caused the segfault?

I find that unlikely. What version of gcc was used to compile the Ubuntu packages on ceph.com for Debian?

The error message appears because the source file is unavailable to gdb, I think.

#8 Updated by Patrick Donnelly 4 months ago

Patrick Donnelly wrote:

Matan Breizman wrote:

So the problem is that gcc 8.5, which compiles successfully, generates code which causes the segfault?

Given the error message `./src/include/buffer.h: No such file or directory.`, and the fact that buffer.h fails to compile with a newer version of gcc, is it possible that buffer was never actually built correctly and that caused the segfault?

I find that unlikely.

It seems to me the problem is that `it.cur` is NULL, which I think indicates a violation of an invariant in the buffer library: that `buffers_t::_root.next` is never NULL. See:

(gdb) bt
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
#1  ceph::buffer::v15_2_0::list::buffers_t::operator= (other=..., this=0x55d3974d3ed0) at ./src/include/buffer.h:505
#2  ceph::buffer::v15_2_0::list::operator= (this=0x55d3974d3ed0, other=...) at ./src/include/buffer.h:970
#3  0x00007fc1bad35606 in Message::claim_data (bl=..., this=0x7fc1a001b680) at ./src/msg/Message.h:431
#4  Objecter::handle_osd_op_reply (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:3509
#5  0x00007fc1bad37064 in Objecter::ms_dispatch (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:984
#6  0x00007fc1bad3add6 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at ./src/osdc/Objecter.h:2659
#7  0x00007fc1ba588b00 in Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/DispatchQueue.cc:75
#8  DispatchQueue::fast_dispatch (this=0x55d3974356a0, m=...) at ./src/msg/DispatchQueue.cc:74
#9  0x00007fc1ba68b794 in DispatchQueue::fast_dispatch (m=0x7fc1a001b680, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:67
#10 ProtocolV2::handle_message (this=<optimized out>) at ./src/msg/async/ProtocolV2.cc:1496
#11 0x00007fc1ba6a17f0 in ProtocolV2::handle_read_frame_dispatch (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1157
#12 0x00007fc1ba6a1ae5 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1347
#13 0x00007fc1ba6a2e14 in ProtocolV2::handle_read_frame_epilogue_main (this=0x55d3974ee540, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1324
#14 0x00007fc1ba686b79 in ProtocolV2::run_continuation (this=0x55d3974ee540, continuation=...) at ./src/msg/async/ProtocolV2.cc:49
#15 0x00007fc1ba649504 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55d3973b9c08) at /usr/include/c++/9/bits/std_function.h:683
#16 AsyncConnection::process (this=0x55d3973b9870) at ./src/msg/async/AsyncConnection.cc:454
#17 0x00007fc1ba6ac185 in EventCenter::process_events (this=this@entry=0x55d3974392a0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7fc1b6d7be28) at /usr/include/c++/9/bits/basic_ios.h:282
#18 0x00007fc1ba6b464b in NetworkStack::<lambda()>::operator() (__closure=<optimized out>, __closure=<optimized out>) at ./src/msg/async/Stack.cc:50
#19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#20 0x00007fc1b9f6bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007fc1bb404609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fc1bb323133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) print it
$1 = {cur = 0x0}
(gdb) print *this
$2 = {_root = {next = 0x0}, _tail = 0x55d39730c010}

I'm not an expert on this code but, from looking around briefly, it seems this->_root.next should never be NULL.

#9 Updated by Matan Breizman 4 months ago

  • Pull request ID deleted (47672)

#10 Updated by Matan Breizman 4 months ago

I have managed to reproduce a similar segfault.
The relevant code:
https://github.com/ceph/ceph/blob/main/src/SimpleRADOSStriper.cc#L504-L511

#5  librados::AioCompletionImpl::wait_for_complete (this=0x55d397944720) at ./src/librados/AioCompletionImpl.h:63
#6  librados::v14_2_0::AioCompletion::wait_for_complete (this=<optimized out>) at ./src/librados/librados_cxx.cc:1050
#7  0x00007fc1bae649c0 in SimpleRADOSStriper::read (this=0x55d3974e6c70, data=data@entry=0x55d3974eee38, len=len@entry=65536, off=off@entry=4129788)
    at /usr/include/c++/9/bits/unique_ptr.h:360

It seems the issue comes from the way we handle AioCompletion; here is a simplified version of SimpleRADOSStriper::read that reproduces the segfault.

https://gist.github.com/Matan-B/c790b9526fa4c40c0e8d19688ae36b20

Patrick, is there a reason for wait_for_complete() to be called in a different loop rather than in the one where we call aio_read? From the attached gist, it looks like doing so may solve this issue.

#11 Updated by Patrick Donnelly 4 months ago

Matan Breizman wrote:

I have managed to reproduce a similar segfault.
The relevant code:
https://github.com/ceph/ceph/blob/main/src/SimpleRADOSStriper.cc#L504-L511
[...]

It seems the issue comes from the way we handle AioCompletion; here is a simplified version of SimpleRADOSStriper::read that reproduces the segfault.

https://gist.github.com/Matan-B/c790b9526fa4c40c0e8d19688ae36b20

Patrick, is there a reason for wait_for_complete() to be called in a different loop rather than in the one where we call aio_read?

To parallelize the read requests!

From the attached gist, it looks like doing so may solve this issue.

It's interesting that making the requests sequential makes the segfault go away, but that only indicates there's a serious issue with the multithreaded aio requests. I suggest digging deeper into that.

#12 Updated by Matan Breizman 3 months ago

  • Pull request ID set to 47841

#13 Updated by Matan Breizman 3 months ago

  • Status changed from New to Pending Backport

#14 Updated by Backport Bot 3 months ago

  • Copied to Backport #57372: quincy: segfault in librados via libcephsqlite added

#15 Updated by Backport Bot 3 months ago

  • Copied to Backport #57373: pacific: segfault in librados via libcephsqlite added

#16 Updated by Backport Bot 3 months ago

  • Tags set to backport_processed

#17 Updated by Matan Breizman about 2 months ago

  • Status changed from Pending Backport to Resolved
