Bug #49963

Crash in OSD::ms_fast_dispatch due to call to null vtable function

Added by Brad Hubbard about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Kefu Chai
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
40394
Crash signature (v1):
Crash signature (v2):

Description

/a/sage-2021-03-24_06:13:24-upgrade:octopus-x-wip-sage-testing-2021-03-23-2309-distro-basic-smithi/5993446

#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x000055e42b5b53ca in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:87
#2  handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:332
#3  <signal handler called>
#4  0x0000000000000000 in ?? ()
#5  0x000055e42af9f587 in OSD::ms_fast_dispatch (this=0x55e42fefa000, m=0x55e43026cb00) at ./src/osd/OSD.cc:7186
#6  0x000055e42b954323 in Dispatcher::ms_fast_dispatch2 (m=..., this=0x55e42fefa000) at ./src/msg/Dispatcher.h:84
#7  Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/Messenger.h:685
#8  DispatchQueue::fast_dispatch (this=0x55e42fd72c28, m=...) at ./src/msg/DispatchQueue.cc:74
#9  0x000055e42b98758b in DispatchQueue::fast_dispatch (m=0x55e43026cb00, this=<optimized out>) at ./src/msg/DispatchQueue.h:203
#10 ProtocolV2::handle_message (this=this@entry=0x55e43257c500) at ./src/msg/async/ProtocolV2.cc:1482
#11 0x000055e42b998d90 in ProtocolV2::handle_read_frame_dispatch (this=this@entry=0x55e43257c500) at ./src/msg/async/ProtocolV2.cc:1140
#12 0x000055e42b998ee9 in ProtocolV2::_handle_read_frame_epilogue_main (this=this@entry=0x55e43257c500) at ./src/msg/async/ProtocolV2.cc:1328
#13 0x000055e42b99a77e in ProtocolV2::handle_read_frame_epilogue_main (this=0x55e43257c500, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1303
#14 0x000055e42b982a74 in ProtocolV2::run_continuation (this=0x55e43257c500, continuation=...) at ./src/msg/async/ProtocolV2.cc:47
#15 0x000055e42b95bda8 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55e4324457a0) at /usr/include/c++/7/bits/std_function.h:706
#16 AsyncConnection::process (this=0x55e432445400) at ./src/msg/async/AsyncConnection.cc:454
#17 0x000055e42b7a29ed in EventCenter::process_events (this=this@entry=0x55e42f0cf4c0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f40f17de668)
    at ./src/msg/async/Event.cc:422
#18 0x000055e42b7a7b80 in NetworkStack::<lambda()>::operator() (__closure=0x55e42f1a3238) at ./src/msg/async/Stack.cc:52
#19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#20 0x00007f40f53536df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007f40f5c706db in start_thread (arg=0x7f40f17e1700) at pthread_create.c:463
#22 0x00007f40f4a1071f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#4  0x0000000000000000 in ?? ()

That's a telltale sign that we called a null function pointer.
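As a minimal illustration (plain C++, not Ceph code), calling through a null function pointer produces exactly this pattern: the program counter ends up at 0x0 and the frame below it is the caller.

#include <cstdio>

int main() {
  void (*fn)() = nullptr;   // stand-in for a zeroed vtable slot
  fn();                     // SIGSEGV; gdb shows "0x0000000000000000 in ?? ()" above the caller's frame
  std::puts("never reached");
  return 0;
}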

7185      // note sender epoch, min req's epoch
7186      op->sent_epoch = static_cast<MOSDFastDispatchOp*>(m)->get_map_epoch();
   0x000055e42af9f57e <+670>:   mov    (%rbx),%rax
   0x000055e42af9f581 <+673>:   mov    %rbx,%rdi
   0x000055e42af9f584 <+676>:   callq  *0x48(%rax) <--------------- HERE

./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:
199     ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp: No such file or directory.
=> 0x000055e42af9f587 <+679>:   mov    -0xa8(%rbp),%rdx
   0x000055e42af9f58e <+686>:   test   %rdx,%rdx
   0x000055e42af9f591 <+689>:   je     0x55e42af9f93d <OSD::ms_fast_dispatch(Message*)+1629>

The faulting instruction is the one just before the saved instruction pointer, the indirect call at +676; the return address points past the call, which is common in my experience.

(gdb) info reg rax
rax            0x55e42c879cf8   94438487989496
(gdb) p/x 0x55e42c879cf8+0x48
$10 = 0x55e42c879d40
(gdb) x/a 0x55e42c879d40
0x55e42c879d40 <_ZTV12MOSDPGCreate>:    0x0
$ c++filt _ZTV12MOSDPGCreate
vtable for MOSDPGCreate

So at least that section of the vtable for the message, which at this point looked like an MOSDPGCreate, was zeroed out.

It's curious that $rax hints at MOSDPGInfo.

(gdb) x/a $rax
0x55e42c879cf8 <_ZTV10MOSDPGInfo+16>:   0x55e42b747710 <MOSDPGInfo::~MOSDPGInfo()>
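For reference, a minimal sketch (assuming the usual Itanium C++ ABI with 8-byte vtable slots) of the arithmetic gdb is doing above: the faulting call goes through the slot at byte offset 0x48, i.e. virtual-function index 9, of whatever vtable %rax points at.

#include <cstdio>

int main() {
  const unsigned long vptr   = 0x55e42c879cf8UL;  // %rax: the object's vtable pointer
  const unsigned long offset = 0x48UL;            // operand of "callq *0x48(%rax)"
  std::printf("slot address 0x%lx, slot index %lu\n",
              vptr + offset, offset / sizeof(void*));
  // Prints "slot address 0x55e42c879d40, slot index 9" -- the address gdb resolves
  // to _ZTV12MOSDPGCreate above, and the word stored there is 0x0.
  return 0;
}
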
Actions #1

Updated by Brad Hubbard about 3 years ago

m->header.type is 85

#define MSG_OSD_PG_INFO        85

So the message actually was an MOSDPGInfo.

Neha pointed out https://github.com/ceph/ceph/pull/40056, which would explain a lot: this was an upgrade test, "2021-03-24T12:12:27.461 INFO:teuthology.task.print:**** done ceph.restart of all mons and 2/3 of osds" is seen before the crash, and osd.10, which sent the pg_info, was still running octopus.

Actions #2

Updated by Neha Ojha about 3 years ago

/a/sage-2021-03-24_06:13:24-upgrade:octopus-x-wip-sage-testing-2021-03-23-2309-distro-basic-smithi/5993446

  roles:
  - - mon.a
    - mgr.x
    - osd.0
    - osd.1
    - osd.2
    - osd.3
  - - mon.b
    - osd.4
    - osd.5
    - osd.6
    - osd.7
  - - mon.c
  - - osd.8
    - osd.9
    - osd.10
    - osd.11
  - - client.0
2021-03-24T12:12:27.461 INFO:teuthology.task.print:**** done ceph.restart of all mons and 2/3 of osds
2021-03-24T12:12:27.463 INFO:teuthology.run_tasks:Running task parallel...
2021-03-24T12:12:27.550 INFO:teuthology.task.parallel:starting parallel...
...
2021-03-24T12:13:35.737 INFO:tasks.ceph.osd.2.smithi082.stderr:*** Caught signal (Segmentation fault) **
2021-03-24T12:13:35.737 INFO:tasks.ceph.osd.2.smithi082.stderr: in thread 7f40f17e1700 thread_name:msgr-worker-2

This means osd.10 falls in the last third of the OSDs, i.e. among the OSDs that had not yet been restarted.

2021-03-24T12:13:35.741+0000 7f40f17e1700  1 -- [v2:172.21.15.82:6818/16256,v1:172.21.15.82:6819/16256] <== osd.10 v2:172.21.15.106:6818/12975 122 ==== pg_info((query:119 sent:119 1.d( v 20'3 (0'0,20'3] local-lis/les=118/119 n=0 ec=118/15 lis/c=118/112 les/c/f=119/113/0 sis=118) ([112,117] all_participants=7,11 intervals=([112,117] acting 7,11))) epoch 119) v6 ==== 1107+0+0 (crc 0 0 0) 0x55e43026cb00 con 0x55e432445400
2021-03-24T12:13:35.741+0000 7f40f17e1700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f40f17e1700 thread_name:msgr-worker-2
Actions #3

Updated by Neha Ojha about 3 years ago

  • Status changed from New to Triaged
Actions #4

Updated by Brad Hubbard about 3 years ago

  • Assignee set to Greg Farnum
  • Source deleted (Q/A)
  • Severity deleted (3 - minor)

So the issue is that we never should have gone this deep into the OSD::ms_fast_dispatch function.

Previously we would have returned early here:

case MSG_OSD_PG_INFO:
  return handle_fast_pg_info(static_cast<MOSDPGInfo*>(m));

That code was removed by https://github.com/ceph/ceph/pull/40056/commits/b37e959238e75690d38873744515fa54413f4393, so we dropped further down the function into uncharted/untested waters.

get_map_epoch() is a pure virtual function in MOSDFastDispatchOp, so the call has to look up get_map_epoch() in the vtable of whatever object the base pointer actually points to. While there are message types that implement get_map_epoch(), MOSDPGInfo is not one of them, so we go off into the weeds looking for it in the vtable.
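A minimal sketch (with hypothetical stand-in classes, not the real Ceph hierarchy) of why that goes wrong: downcasting to a class the object does not actually derive from and then calling a virtual through it indexes a vtable slot the object's real vtable never provided, and if the word sitting there happens to be zero the call lands on address 0, exactly as in the backtrace above.

#include <cstdint>
#include <cstdio>

struct Message {                               // stand-in for Ceph's Message base
  virtual ~Message() = default;
};

struct FastDispatchOp : Message {              // stand-in for MOSDFastDispatchOp
  virtual uint32_t get_map_epoch() const = 0;  // the slot ms_fast_dispatch indexes
};

struct PGInfo : Message {                      // stand-in for MOSDPGInfo: derives from Message,
};                                             // not from FastDispatchOp, no get_map_epoch()

int main() {
  Message* m = new PGInfo;
  // What the fallthrough dispatch effectively does: downcast to FastDispatchOp
  // (compiles fine, it is just a base-to-derived static_cast) and call a virtual
  // the object never implemented.  The generated code loads the vptr and makes an
  // indirect call through get_map_epoch()'s slot; for a PGInfo that slot lies past
  // the end of its real vtable, so the call target is whatever word sits there --
  // in the crash above, a zero, hence the jump to address 0x0.
  auto* op = static_cast<FastDispatchOp*>(m);   // undefined behaviour: *m is not a FastDispatchOp
  std::printf("%u\n", op->get_map_epoch());     // likely SIGSEGV
  delete m;
  return 0;
}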

Actions #5

Updated by Kefu Chai about 3 years ago

  • Status changed from Triaged to Fix Under Review
  • Assignee changed from Greg Farnum to Kefu Chai
  • Regression changed from No to Yes
Actions #6

Updated by Kefu Chai about 3 years ago

  • Pull request ID set to 40394
Actions #7

Updated by Sage Weil about 3 years ago

  • Status changed from Fix Under Review to Resolved