Bug #53454
nautilus: MInfoRec in Started/ToDelete/WaitDeleteReseved causes state machine crash
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2021-11-27 18:46:53.337 7fda55a79700 1 -- [v2:172.21.15.134:6802/49787,v1:172.21.15.134:6803/49787] <== osd.9 v2:172.21.15.134:6826/64047 322 ==== osd pg remove(epoch 4160; pg2.18; ) v3 ==== 32+0+0 (crc 0 0 0) 0x55dae4a94d00 con 0x55dae72b2400
...
2021-11-27 18:46:53.337 7fda55a79700 1 -- [v2:172.21.15.134:6802/49787,v1:172.21.15.134:6803/49787] <== osd.9 v2:172.21.15.134:6826/64047 323 ==== pg_info((query:4160 sent:4160 2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159))=([4134,4158] intervals=([4134,4158] acting 8,9)) epoch 4160) v5 ==== 1055+0+0 (crc 0 0 0) 0x55dae4a94d00 con 0x55dae72b2400
2021-11-27 18:46:53.337 7fda55a79700 7 osd.8 4160 handle_fast_pg_info pg_info((query:4160 sent:4160 2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159))=([4134,4158] intervals=([4134,4158] acting 8,9)) epoch 4160) v5 from osd.9
...
2021-11-27 18:46:53.337 7fda3082c700 5 osd.8 pg_epoch: 4160 pg[2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159) [9,6] r=-1 lpr=4159 pi=[4134,4159)/1 crt=3272'1 unknown NOTIFY mbc={}] enter Started/ToDelete/WaitDeleteReseved
...
2021-11-27 18:46:53.337 7fda3082c700 10 osd.8 pg_epoch: 4160 pg[2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159) [9,6] r=-1 lpr=4159 pi=[4134,4159)/1 crt=3272'1 unknown NOTIFY mbc={}] do_peering_event: epoch_sent: 4160 epoch_requested: 4160 MInfoRec from 9 info: 2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159)
...
2021-11-27 18:46:53.337 7fda3082c700 5 osd.8 pg_epoch: 4160 pg[2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159) [9,6] r=-1 lpr=4159 pi=[4134,4159)/1 crt=3272'1 unknown NOTIFY mbc={}] enter Crashed
...
2021-11-27 18:46:53.405 7fda3082c700 -1 *** Caught signal (Aborted) **
 in thread 7fda3082c700 thread_name:tp_osd_tp
 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (()+0x12980) [0x7fda59541980]
 2: (gsignal()+0xc7) [0x7fda581f3fb7]
 3: (abort()+0x141) [0x7fda581f5921]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b0) [0x55dad6811595]
 5: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xd9) [0x55dad696cfa9]
 6: (()+0x622286) [0x55dad698c286]
 7: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x24b) [0x55dad69ceb1b]
 8: (boost::statechart::simple_state<PG::RecoveryState::WaitDeleteReserved, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc1) [0x55dad69c96d1]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55dad699c2eb]
 10: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0x55dad698baf2]
 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55dad68bf004]
 12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55dad6b4e9e0]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55dad68b2835]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55dad6ed23ec]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55dad6ed55b0]
 16: (()+0x76db) [0x7fda595366db]
 17: (clone()+0x3f) [0x7fda582d671f]
/a/yuriw-2021-11-27_16:51:40-upgrade:nautilus-x-pacific-16.2.7_RC1-distro-basic-smithi/6530251
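This seems to be caused by a race: osd.9 sent the pg remove and the pg_info back to back, and the resulting MInfoRec reached osd.8 after the PG had already entered Started/ToDelete/WaitDeleteReseved. Both osd.8 and osd.9 are running 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351), so this is not related to running mixed versions.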
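
For illustration only, here is a minimal sketch of the event path that frames 5 to 8 of the backtrace suggest: the inner state has no reaction for MInfoRec, so the event is forwarded to the outer Started state, whose catch-all reaction enters Crashed. This is a simplified model, not the Ceph code. The enum and function names are made up for the sketch, the DeleteReserved reaction is an assumption about what the inner state normally handles, and the discard_stale_info flag is a hypothetical mitigation, not a confirmed fix.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the events and states named in the log above.
// These are illustrative types, not the real PG::RecoveryState classes.
enum class Event { MInfoRec, DeleteReserved };
enum class State { WaitDeleteReserved, Deleting, Crashed };

// Outer "Started" state: catch-all reaction for events no inner state
// consumed. Models frames 5-7, where Started::react_impl enters Crashed.
inline State started_react(Event) {
  return State::Crashed;
}

// Inner "WaitDeleteReserved" state: consumes only the event it expects
// (assumed here to be DeleteReserved) and forwards everything else outward.
// discard_stale_info is a HYPOTHETICAL mitigation: drop a late MInfoRec
// and stay in the current state instead of forwarding it.
inline State wait_delete_reserved_react(Event ev, bool discard_stale_info) {
  if (ev == Event::DeleteReserved) {
    return State::Deleting;
  }
  if (discard_stale_info && ev == Event::MInfoRec) {
    return State::WaitDeleteReserved;
  }
  return started_react(ev);
}

// Human-readable state name for logging.
inline std::string state_name(State s) {
  switch (s) {
    case State::WaitDeleteReserved: return "WaitDeleteReserved";
    case State::Deleting: return "Deleting";
    case State::Crashed: return "Crashed";
  }
  return "Unknown";
}

// Self-check covering the crash path and the hypothetical mitigation.
inline void demo() {
  // Late MInfoRec with no inner reaction: forwarded outward, ends in Crashed.
  assert(wait_delete_reserved_react(Event::MInfoRec, false) == State::Crashed);
  // With the hypothetical discard, the PG stays where it was.
  assert(wait_delete_reserved_react(Event::MInfoRec, true) ==
         State::WaitDeleteReserved);
}
```

Calling demo() from a main() exercises both paths. Whether discarding the event is actually safe for a PG that is being deleted would need to be checked against the real state machine.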