Bug #57152

segfault in librados via libcephsqlite

Added by Patrick Donnelly over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Category: -
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): librados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a post on the ceph-users mailing list about a segfault in the mgr:

"[ceph-users] Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug External"

The user kindly shared the core dump which is available here:

smithi098 in /tmp/sqlite-seg

root@smithi098:/tmp/sqlite-seg# gdb -q  /usr/bin/sqlite3
...
(gdb) core CoreDump 
[New LWP 481006]
[New LWP 481010]
[New LWP 480968]
[New LWP 481002]
[New LWP 481013]
[New LWP 481003]
[New LWP 481004]
[New LWP 481011]
[New LWP 481014]
[New LWP 481005]
[New LWP 481015]
[New LWP 481012]
[New LWP 481045]
[New LWP 481016]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `sqlite3'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
599     ./src/include/buffer.h: No such file or directory.
[Current thread is 1 (Thread 0x7fc1b6d7c700 (LWP 481006))]
(gdb) bt
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
#1  ceph::buffer::v15_2_0::list::buffers_t::operator= (other=..., this=0x55d3974d3ed0) at ./src/include/buffer.h:505
#2  ceph::buffer::v15_2_0::list::operator= (this=0x55d3974d3ed0, other=...) at ./src/include/buffer.h:970
#3  0x00007fc1bad35606 in Message::claim_data (bl=..., this=0x7fc1a001b680) at ./src/msg/Message.h:431
#4  Objecter::handle_osd_op_reply (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:3509
#5  0x00007fc1bad37064 in Objecter::ms_dispatch (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:984
#6  0x00007fc1bad3add6 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at ./src/osdc/Objecter.h:2659
#7  0x00007fc1ba588b00 in Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/DispatchQueue.cc:75
#8  DispatchQueue::fast_dispatch (this=0x55d3974356a0, m=...) at ./src/msg/DispatchQueue.cc:74
#9  0x00007fc1ba68b794 in DispatchQueue::fast_dispatch (m=0x7fc1a001b680, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:67
#10 ProtocolV2::handle_message (this=<optimized out>) at ./src/msg/async/ProtocolV2.cc:1496
#11 0x00007fc1ba6a17f0 in ProtocolV2::handle_read_frame_dispatch (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1157
#12 0x00007fc1ba6a1ae5 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1347
#13 0x00007fc1ba6a2e14 in ProtocolV2::handle_read_frame_epilogue_main (this=0x55d3974ee540, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1324
#14 0x00007fc1ba686b79 in ProtocolV2::run_continuation (this=0x55d3974ee540, continuation=...) at ./src/msg/async/ProtocolV2.cc:49
#15 0x00007fc1ba649504 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55d3973b9c08)
    at /usr/include/c++/9/bits/std_function.h:683
#16 AsyncConnection::process (this=0x55d3973b9870) at ./src/msg/async/AsyncConnection.cc:454
#17 0x00007fc1ba6ac185 in EventCenter::process_events (this=this@entry=0x55d3974392a0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7fc1b6d7be28) at /usr/include/c++/9/bits/basic_ios.h:282
#18 0x00007fc1ba6b464b in NetworkStack::<lambda()>::operator() (__closure=<optimized out>, __closure=<optimized out>) at ./src/msg/async/Stack.cc:50
#19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#20 0x00007fc1b9f6bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007fc1bb404609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fc1bb323133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fc1bb1cf740 (LWP 480968))]
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d397944770) at ../sysdeps/nptl/futex-internal.h:183
183     ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d397944770) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d397944720, cond=0x55d397944748) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x55d397944748, mutex=0x55d397944720) at pthread_cond_wait.c:647
#3  0x00007fc1b9f65e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fc1bac9ee5b in std::condition_variable::wait<librados::AioCompletionImpl::wait_for_complete()::{lambda()#1}>(std::unique_lock<std::mutex>&, librados::AioCompletionImpl::wait_for_complete()::{lambda()#1}) (__p=..., __lock=..., this=0x55d397944748) at ./src/librados/AioCompletionImpl.h:63
#5  librados::AioCompletionImpl::wait_for_complete (this=0x55d397944720) at ./src/librados/AioCompletionImpl.h:63
#6  librados::v14_2_0::AioCompletion::wait_for_complete (this=<optimized out>) at ./src/librados/librados_cxx.cc:1050
#7  0x00007fc1bae649c0 in SimpleRADOSStriper::read (this=0x55d3974e6c70, data=data@entry=0x55d3974eee38, len=len@entry=65536, off=off@entry=4129788)
    at /usr/include/c++/9/bits/unique_ptr.h:360
#8  0x00007fc1bae2d428 in Read (file=0x55d3974dd260, buf=0x55d3974eee38, len=65536, off=4129788) at /usr/include/c++/9/bits/unique_ptr.h:360
#9  0x000055d396b6eccb in sqlite3OsRead (offset=<optimized out>, amt=<optimized out>, pBuf=0x55d3974eee38, id=0x55d3974dd260) at sqlite3.c:53758
#10 pager_playback_one_page (pPager=pPager@entry=0x55d3974dcfd8, pOffset=pOffset@entry=0x55d3974dd038, pDone=pDone@entry=0x0, isMainJrnl=isMainJrnl@entry=1, isSavepnt=isSavepnt@entry=0)
    at sqlite3.c:53758
#11 0x000055d396b6f7ac in pager_playback (pPager=pPager@entry=0x55d3974dcfd8, isHot=1) at sqlite3.c:54349
#12 0x000055d396b70c2d in sqlite3PagerSharedLock (pPager=0x55d3974dcfd8) at sqlite3.c:56768
#13 0x000055d396b713f8 in lockBtree (pBt=0x55d397343f98) at sqlite3.c:67193
#14 sqlite3BtreeBeginTrans (p=0x55d3974dd698, wrflag=wrflag@entry=0, pSchemaVersion=pSchemaVersion@entry=0x0) at sqlite3.c:2032
#15 0x000055d396ba6c21 in sqlite3InitOne (db=0x55d3973b7eb8, iDb=iDb@entry=0, pzErrMsg=pzErrMsg@entry=0x7ffecd8e4078, mFlags=mFlags@entry=0) at sqlite3.c:127249
#16 0x000055d396ba6dec in sqlite3Init (db=db@entry=0x55d3973b7eb8, pzErrMsg=pzErrMsg@entry=0x7ffecd8e4078) at sqlite3.c:127434
#17 0x000055d396ba6e2f in sqlite3ReadSchema (pParse=pParse@entry=0x7ffecd8e4070) at sqlite3.c:127460
#18 0x000055d396bb489d in sqlite3Pragma (pParse=pParse@entry=0x7ffecd8e4070, pId1=pId1@entry=0x7ffecd8e36b8, pId2=pId2@entry=0x7ffecd8e36d0, pValue=pValue@entry=0x0, 
    minusFlag=minusFlag@entry=0) at sqlite3.c:124887
#19 0x000055d396bb6e79 in yy_reduce (yyLookahead=<optimized out>, pParse=0x7ffecd8e4070, yyLookaheadToken=..., yyruleno=239, yypParser=0x7ffecd8e3670) at sqlite3.c:156709
#20 sqlite3Parser (yyminor=..., yymajor=<optimized out>, yyp=0x7ffecd8e3670) at sqlite3.c:26203
#21 sqlite3RunParser (pParse=0x7ffecd8e4070, zSql=0x55d396bf007e "", pzErrMsg=0x7ffecd8e4068) at sqlite3.c:27477
#22 0x000055d396bbbd33 in sqlite3Prepare (db=db@entry=0x55d3973b7eb8, zSql=zSql@entry=0x55d396bf006a "PRAGMA database_list", nBytes=nBytes@entry=-1, prepFlags=prepFlags@entry=128, 
    pReprepare=pReprepare@entry=0x0, ppStmt=ppStmt@entry=0x7ffecd8e4360, pzTail=0x0) at sqlite3.c:127661
#23 0x000055d396bbc0c7 in sqlite3LockAndPrepare (db=0x55d3973b7eb8, zSql=0x55d396bf006a "PRAGMA database_list", nBytes=-1, prepFlags=prepFlags@entry=128, pOld=pOld@entry=0x0, 
    ppStmt=0x7ffecd8e4360, pzTail=0x0) at sqlite3.c:127733
#24 0x000055d396bbc1da in sqlite3_prepare_v2 (db=<optimized out>, zSql=<optimized out>, nBytes=<optimized out>, ppStmt=<optimized out>, pzTail=<optimized out>) at sqlite3.c:127817
#25 0x000055d396b059e4 in do_meta_command (zLine=<optimized out>, p=<optimized out>) at shell.c:17672
#26 0x000055d396b09ff1 in process_input (p=0x7ffecd8e5c60) at shell.c:18365
#27 0x000055d396ae9c27 in main (argc=<optimized out>, argv=<optimized out>) at shell.c:19156
(gdb) frame 7
#7  0x00007fc1bae649c0 in SimpleRADOSStriper::read (this=0x55d3974e6c70, data=data@entry=0x55d3974eee38, len=len@entry=65536, off=off@entry=4129788)
    at /usr/include/c++/9/bits/unique_ptr.h:360
360           get() const noexcept
(gdb) print reads
$1 = std::vector of length 2, capacity 2 = {{first = {_buffers = {_root = {next = 0x7fc1a00068b0}, _tail = 0x7fc1a00068b0}, 
      _carriage = 0x7fc1bae0d180 <ceph::buffer::v15_2_0::list::always_empty_bptr>, _len = 0, _num = 0, static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {
          next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}, second = std::unique_ptr<librados::v14_2_0::AioCompletion> = {
      get() = 0x7fc1a0000be0}}, {first = {_buffers = {_root = {next = 0x7fc1a00068d8}, _tail = 0x7fc1a00068d8}, _carriage = 0x7fc1bae0d180 <ceph::buffer::v15_2_0::list::always_empty_bptr>, 
      _len = 0, _num = 0, static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}, 
    second = std::unique_ptr<librados::v14_2_0::AioCompletion> = {get() = 0x55d3978dc290}}}

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fc1b6d7c700 (LWP 481006))]
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
599     ./src/include/buffer.h: No such file or directory.
(gdb) bt
#0  0x00007fc1bacfac5b in ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55d3974d3ed0) at ./src/include/buffer.h:599
#1  ceph::buffer::v15_2_0::list::buffers_t::operator= (other=..., this=0x55d3974d3ed0) at ./src/include/buffer.h:505
#2  ceph::buffer::v15_2_0::list::operator= (this=0x55d3974d3ed0, other=...) at ./src/include/buffer.h:970
#3  0x00007fc1bad35606 in Message::claim_data (bl=..., this=0x7fc1a001b680) at ./src/msg/Message.h:431
#4  Objecter::handle_osd_op_reply (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:3509
#5  0x00007fc1bad37064 in Objecter::ms_dispatch (this=0x55d397435b90, m=0x7fc1a001b680) at ./src/osdc/Objecter.cc:984
#6  0x00007fc1bad3add6 in non-virtual thunk to Objecter::ms_fast_dispatch(Message*) () at ./src/osdc/Objecter.h:2659
#7  0x00007fc1ba588b00 in Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/DispatchQueue.cc:75
#8  DispatchQueue::fast_dispatch (this=0x55d3974356a0, m=...) at ./src/msg/DispatchQueue.cc:74
#9  0x00007fc1ba68b794 in DispatchQueue::fast_dispatch (m=0x7fc1a001b680, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:67
#10 ProtocolV2::handle_message (this=<optimized out>) at ./src/msg/async/ProtocolV2.cc:1496
#11 0x00007fc1ba6a17f0 in ProtocolV2::handle_read_frame_dispatch (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1157
#12 0x00007fc1ba6a1ae5 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55d3974ee540) at ./src/msg/async/ProtocolV2.cc:1347
#13 0x00007fc1ba6a2e14 in ProtocolV2::handle_read_frame_epilogue_main (this=0x55d3974ee540, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1324
#14 0x00007fc1ba686b79 in ProtocolV2::run_continuation (this=0x55d3974ee540, continuation=...) at ./src/msg/async/ProtocolV2.cc:49
#15 0x00007fc1ba649504 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55d3973b9c08)
    at /usr/include/c++/9/bits/std_function.h:683
#16 AsyncConnection::process (this=0x55d3973b9870) at ./src/msg/async/AsyncConnection.cc:454
#17 0x00007fc1ba6ac185 in EventCenter::process_events (this=this@entry=0x55d3974392a0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7fc1b6d7be28) at /usr/include/c++/9/bits/basic_ios.h:282
#18 0x00007fc1ba6b464b in NetworkStack::<lambda()>::operator() (__closure=<optimized out>, __closure=<optimized out>) at ./src/msg/async/Stack.cc:50
#19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#20 0x00007fc1b9f6bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007fc1bb404609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fc1bb323133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) print *this
$2 = {_root = {next = 0x0}, _tail = 0x55d39730c010}

The relevant cephsqlite code is here:

https://github.com/ceph/ceph/blob/v17.2.3/src/SimpleRADOSStriper.cc#L491-L501

I'm not seeing anything obviously wrong in terms of an API violation. It looks like a bug in librados?
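
For context, here is a minimal sketch (C++, assuming a pool IoCtx is already open) of the aio-read-per-extent pattern that code follows. This is not the actual SimpleRADOSStriper implementation; the object naming, striping math, and error handling below are placeholders. Thread 3 above is parked in the wait_for_complete() step of this pattern while thread 1 (the messenger dispatch thread) crashes assigning into one of the out-bufferlists from Objecter::handle_osd_op_reply.

// Minimal sketch of a striped aio-read (illustrative only; not the actual
// SimpleRADOSStriper code -- object naming and striping math are placeholders).
#include <rados/librados.hpp>
#include <sys/types.h>
#include <algorithm>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

ssize_t striped_read(librados::IoCtx& ioctx, uint64_t object_size,
                     char* out, size_t len, uint64_t off)
{
  struct Releaser {
    void operator()(librados::AioCompletion* c) const { c->release(); }
  };
  using completion_ptr = std::unique_ptr<librados::AioCompletion, Releaser>;

  // Each pending read owns the bufferlist that librados fills in from the
  // OSD reply on the messenger dispatch thread. aio_read() keeps a raw
  // pointer to it, so reserve up front: the vector must not reallocate
  // while reads are in flight.
  std::vector<std::pair<librados::bufferlist, completion_ptr>> reads;
  reads.reserve(len / object_size + 2);

  uint64_t cur = off;
  size_t remaining = len;
  while (remaining > 0) {
    uint64_t objno = cur / object_size;
    uint64_t objoff = cur % object_size;
    size_t chunk = std::min<uint64_t>(remaining, object_size - objoff);

    reads.emplace_back();
    auto& [bl, comp] = reads.back();
    comp.reset(librados::Rados::aio_create_completion());

    std::string oid = "stripe." + std::to_string(objno);  // placeholder naming
    if (int rc = ioctx.aio_read(oid, comp.get(), &bl, chunk, objoff); rc < 0)
      return rc;

    cur += chunk;
    remaining -= chunk;
  }

  // Wait for every outstanding read (thread 3 in the backtrace is parked in
  // this step), then copy the returned data out. Short reads and holes are
  // not handled here.
  size_t copied = 0;
  for (auto& [bl, comp] : reads) {
    comp->wait_for_complete();
    if (int rc = comp->get_return_value(); rc < 0)
      return rc;
    if (bl.length())
      std::memcpy(out + copied, bl.c_str(), bl.length());
    copied += bl.length();
  }
  return copied;
}

The constraint the sketch tries to respect is that each bufferlist handed to aio_read() must stay at a stable address until its completion fires, since the reply is copied into it on the dispatch thread.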


Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #57372: quincy: segfault in librados via libcephsqlite (Resolved, Matan Breizman)
Copied to RADOS - Backport #57373: pacific: segfault in librados via libcephsqlite (Resolved, Matan Breizman)