Bug #49963
closedCrash in OSD::ms_fast_dispatch due to call to null vtable function
0%
Description
/a/sage-2021-03-24_06:13:24-upgrade:octopus-x-wip-sage-testing-2021-03-23-2309-distro-basic-smithi/5993446
#0 raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x000055e42b5b53ca in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:87 #2 handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:332 #3 <signal handler called> #4 0x0000000000000000 in ?? () #5 0x000055e42af9f587 in OSD::ms_fast_dispatch (this=0x55e42fefa000, m=0x55e43026cb00) at ./src/osd/OSD.cc:7186 #6 0x000055e42b954323 in Dispatcher::ms_fast_dispatch2 (m=..., this=0x55e42fefa000) at ./src/msg/Dispatcher.h:84 #7 Messenger::ms_fast_dispatch (m=..., this=<optimized out>) at ./src/msg/Messenger.h:685 #8 DispatchQueue::fast_dispatch (this=0x55e42fd72c28, m=...) at ./src/msg/DispatchQueue.cc:74 #9 0x000055e42b98758b in DispatchQueue::fast_dispatch (m=0x55e43026cb00, this=<optimized out>) at ./src/msg/DispatchQueue.h:203 #10 ProtocolV2::handle_message (this=this@entry=0x55e43257c500) at ./src/msg/async/ProtocolV2.cc:1482 #11 0x000055e42b998d90 in ProtocolV2::handle_read_frame_dispatch (this=this@entry=0x55e43257c500) at ./src/msg/async/ProtocolV2.cc:1140 #12 0x000055e42b998ee9 in ProtocolV2::_handle_read_frame_epilogue_main (this=this@entry=0x55e43257c500) at ./src/msg/async/ProtocolV2.cc:1328 #13 0x000055e42b99a77e in ProtocolV2::handle_read_frame_epilogue_main (this=0x55e43257c500, buffer=..., r=0) at ./src/msg/async/ProtocolV2.cc:1303 #14 0x000055e42b982a74 in ProtocolV2::run_continuation (this=0x55e43257c500, continuation=...) at ./src/msg/async/ProtocolV2.cc:47 #15 0x000055e42b95bda8 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55e4324457a0) at /usr/include/c++/7/bits/std_function.h:706 #16 AsyncConnection::process (this=0x55e432445400) at ./src/msg/async/AsyncConnection.cc:454 #17 0x000055e42b7a29ed in EventCenter::process_events (this=this@entry=0x55e42f0cf4c0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f40f17de668) at ./src/msg/async/Event.cc:422 #18 0x000055e42b7a7b80 in NetworkStack::<lambda()>::operator() (__closure=0x55e42f1a3238) at ./src/msg/async/Stack.cc:52 #19 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316 #20 0x00007f40f53536df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #21 0x00007f40f5c706db in start_thread (arg=0x7f40f17e1700) at pthread_create.c:463 #22 0x00007f40f4a1071f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#4 0x0000000000000000 in ?? ()
That's a telltale that we called a null function.
7185 // note sender epoch, min req's epoch 7186 op->sent_epoch = static_cast<MOSDFastDispatchOp*>(m)->get_map_epoch(); 0x000055e42af9f57e <+670>: mov (%rbx),%rax 0x000055e42af9f581 <+673>: mov %rbx,%rdi 0x000055e42af9f584 <+676>: callq *0x48(%rax) <--------------- HERE ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp: 199 ./obj-x86_64-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp: No such file or directory. => 0x000055e42af9f587 <+679>: mov -0xa8(%rbp),%rdx 0x000055e42af9f58e <+686>: test %rdx,%rdx 0x000055e42af9f591 <+689>: je 0x55e42af9f93d <OSD::ms_fast_dispatch(Message*)+1629>
The issue is happening in the instruction before the instruction pointer at +676 which is common in my experience.
(gdb) info reg rax rax 0x55e42c879cf8 94438487989496 (gdb) p/x 0x55e42c879cf8+0x48 $10 = 0x55e42c879d40 (gdb) x/a 0x55e42c879d40 0x55e42c879d40 <_ZTV12MOSDPGCreate>: 0x0
$ c++filt _ZTV12MOSDPGCreate vtable for MOSDPGCreate
So at least that section of the vtable for the message, which was an MOSDPGCreate, was zeroed out.
It's curious that $rax hints at MOSDPGInfo.
(gdb) x/a $rax 0x55e42c879cf8 <_ZTV10MOSDPGInfo+16>: 0x55e42b747710 <MOSDPGInfo::~MOSDPGInfo()>
Updated by Brad Hubbard about 3 years ago
m->header.type is 85
#define MSG_OSD_PG_INFO 85
So the message actually was an MOSDPGInfo.
Neha pointed out https://github.com/ceph/ceph/pull/40056 which would explain a lot since this was an upgrade test and "2021-03-24T12:12:27.461 INFO:teuthology.task.print:**** done ceph.restart of all mons and 2/3 of osds" is seen before osd.10 which sent the pg_info was octopus.
Updated by Neha Ojha about 3 years ago
/a/sage-2021-03-24_06:13:24-upgrade:octopus-x-wip-sage-testing-2021-03-23-2309-distro-basic-smithi/5993446
roles: - - mon.a - mgr.x - osd.0 - osd.1 - osd.2 - osd.3 - - mon.b - osd.4 - osd.5 - osd.6 - osd.7 - - mon.c - - osd.8 - osd.9 - osd.10 - osd.11 - - client.0
2021-03-24T12:12:27.461 INFO:teuthology.task.print:**** done ceph.restart of all mons and 2/3 of osds 2021-03-24T12:12:27.463 INFO:teuthology.run_tasks:Running task parallel... 2021-03-24T12:12:27.550 INFO:teuthology.task.parallel:starting parallel... ... 2021-03-24T12:13:35.737 INFO:tasks.ceph.osd.2.smithi082.stderr:*** Caught signal (Segmentation fault) ** 2021-03-24T12:13:35.737 INFO:tasks.ceph.osd.2.smithi082.stderr: in thread 7f40f17e1700 thread_name:msgr-worker-2
This means osd.10 falls in the last 1/3rd.
2021-03-24T12:13:35.741+0000 7f40f17e1700 1 -- [v2:172.21.15.82:6818/16256,v1:172.21.15.82:6819/16256] <== osd.10 v2:172.21.15.106:6818/12975 122 ==== pg_info((query:119 sent:119 1.d( v 20'3 (0'0,20'3] local-lis/les=118/119 n=0 ec=118 /15 lis/c=118/112 les/c/f=119/113/0 sis=118) ([112,117] all_participants=7,11 intervals=([112,117] acting 7,11))) epoch 119) v6 ==== 1107+0+0 (crc 0 0 0) 0x55e43026cb00 con 0x55e432445400 2021-03-24T12:13:35.741+0000 7f40f17e1700 -1 *** Caught signal (Segmentation fault) ** in thread 7f40f17e1700 thread_name:msgr-worker-2
Updated by Brad Hubbard about 3 years ago
- Assignee set to Greg Farnum
- Source deleted (
Q/A) - Severity deleted (
3 - minor)
So the issue is we never should have gone this deep into the OSD::ms_fast_dispatch function.
Previously we would have exited here.
case MSG_OSD_PG_INFO: return handle_fast_pg_info(static_cast<MOSDPGInfo*>(m));
That code was removed by https://github.com/ceph/ceph/pull/40056/commits/b37e959238e75690d38873744515fa54413f4393 so we dropped further down the function into uncharted/untested waters.
get_map_epoch() is a pure virtual function in MOSDFastDispatchOp so we will try to access a get_map_epoch in the vtable of the underlying base pointer. While there are message types that implement get_map_epoch() MOSDPGInfo is not one of them so we go off into the weeds looking for it in the vtable.
Updated by Kefu Chai about 3 years ago
- Status changed from Triaged to Fix Under Review
- Assignee changed from Greg Farnum to Kefu Chai
- Regression changed from No to Yes
Updated by Sage Weil about 3 years ago
- Status changed from Fix Under Review to Resolved