Bug #53454

nautilus: MInfoRec in Started/ToDelete/WaitDeleteReserved causes state machine crash

Added by Neha Ojha over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-11-27 18:46:53.337 7fda55a79700  1 -- [v2:172.21.15.134:6802/49787,v1:172.21.15.134:6803/49787] <== osd.9 v2:172.21.15.134:6826/64047 322 ==== osd pg remove(epoch 4160; pg2.18; ) v3 ==== 32+0+0 (crc 0 0 0) 0x55dae4a94d00 con 0x55dae72b2400
...
2021-11-27 18:46:53.337 7fda55a79700  1 -- [v2:172.21.15.134:6802/49787,v1:172.21.15.134:6803/49787] <== osd.9 v2:172.21.15.134:6826/64047 323 ==== pg_info((query:4160 sent:4160 2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159))=([4134,4158] intervals=([4134,4158] acting 8,9)) epoch 4160) v5 ==== 1055+0+0 (crc 0 0 0) 0x55dae4a94d00 con 0x55dae72b2400
2021-11-27 18:46:53.337 7fda55a79700  7 osd.8 4160 handle_fast_pg_info pg_info((query:4160 sent:4160 2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159))=([4134,4158] intervals=([4134,4158] acting 8,9)) epoch 4160) v5 from osd.9
...

2021-11-27 18:46:53.337 7fda3082c700  5 osd.8 pg_epoch: 4160 pg[2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159) [9,6] r=-1 lpr=4159 pi=[4134,4159)/1 crt=3272'1 unknown NOTIFY mbc={}] enter Started/ToDelete/WaitDeleteReseved
...
2021-11-27 18:46:53.337 7fda3082c700 10 osd.8 pg_epoch: 4160 pg[2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159) [9,6] r=-1 lpr=4159 pi=[4134,4159)/1 crt=3272'1 unknown NOTIFY mbc={}] do_peering_event: epoch_sent: 4160 epoch_requested: 4160 MInfoRec from 9 info: 2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159)
...
2021-11-27 18:46:53.337 7fda3082c700  5 osd.8 pg_epoch: 4160 pg[2.18( v 3272'1 (0'0,3272'1] local-lis/les=4134/4135 n=1 ec=22/22 lis/c 4134/4134 les/c/f 4135/4135/0 4159/4159/4159) [9,6] r=-1 lpr=4159 pi=[4134,4159)/1 crt=3272'1 unknown NOTIFY mbc={}] enter Crashed
...
2021-11-27 18:46:53.405 7fda3082c700 -1 *** Caught signal (Aborted) **
 in thread 7fda3082c700 thread_name:tp_osd_tp

 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (()+0x12980) [0x7fda59541980]
 2: (gsignal()+0xc7) [0x7fda581f3fb7]
 3: (abort()+0x141) [0x7fda581f5921]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b0) [0x55dad6811595]
 5: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xd9) [0x55dad696cfa9]
 6: (()+0x622286) [0x55dad698c286]
 7: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x24b) [0x55dad69ceb1b]
 8: (boost::statechart::simple_state<PG::RecoveryState::WaitDeleteReserved, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc1) [0x55dad69c96d1]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55dad699c2eb]
 10: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0x55dad698baf2]
 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55dad68bf004]
 12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55dad6b4e9e0]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55dad68b2835]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55dad6ed23ec]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55dad6ed55b0]
 16: (()+0x76db) [0x7fda595366db]
 17: (clone()+0x3f) [0x7fda582d671f]

/a/yuriw-2021-11-27_16:51:40-upgrade:nautilus-x-pacific-16.2.7_RC1-distro-basic-smithi/6530251

This seems to be caused by a race: osd.9 sends the pg_info message (the MInfoRec event) immediately after the pg remove, so osd.8 processes the MInfoRec after the PG has already entered Started/ToDelete/WaitDeleteReserved. Both osd.8 and osd.9 are running 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351), so this is not related to running mixed versions.
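As for why the event aborts the OSD rather than being ignored: frames 8 and 7 of the backtrace show the MInfoRec, unhandled by WaitDeleteReserved (and ToDelete), being forwarded outward to Started, whose catch-all reaction transitions into Crashed; the Crashed constructor then calls ceph_abort() (frames 5 and 4). Below is a minimal, self-contained sketch of that boost::statechart bubbling behaviour. This is not Ceph code; all state and event names are illustrative stand-ins mirroring the PG::RecoveryState layout visible in the backtrace.

#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/custom_reaction.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <cstdlib>
#include <iostream>

namespace sc = boost::statechart;

// Events: illustrative stand-ins for the PG peering events.
struct MInfoRec : sc::event<MInfoRec> {};
struct DeleteReserved : sc::event<DeleteReserved> {};

struct Started;
struct Machine : sc::state_machine<Machine, Started> {};

struct Crashed;
struct ToDelete;

// Outer state: any event no inner state consumed falls through to this
// catch-all transition and lands in Crashed (frame 7 in the backtrace).
struct Started : sc::simple_state<Started, Machine, ToDelete> {
  typedef sc::transition<sc::event_base, Crashed> reactions;
};

struct WaitDeleteReserved;
struct ToDelete : sc::simple_state<ToDelete, Started, WaitDeleteReserved> {};

// Innermost state: reacts only to DeleteReserved. MInfoRec is not in its
// reaction list, so boost::statechart forwards it outward (frame 8).
struct WaitDeleteReserved : sc::simple_state<WaitDeleteReserved, ToDelete> {
  typedef sc::custom_reaction<DeleteReserved> reactions;
  sc::result react(const DeleteReserved &) { return discard_event(); }
};

// The real PG::RecoveryState::Crashed constructor calls ceph_abort()
// (frames 5 and 4); std::abort() stands in for that here.
struct Crashed : sc::simple_state<Crashed, Machine> {
  Crashed() {
    std::cerr << "enter Crashed -> abort" << std::endl;
    std::abort();
  }
};

int main() {
  Machine m;
  m.initiate();                // enters Started/ToDelete/WaitDeleteReserved
  m.process_event(MInfoRec()); // unhandled inside, bubbles up, aborts
  return 0;
}

Running the sketch prints "enter Crashed -> abort" and aborts, matching the "enter Crashed" line in the log. A fix would presumably add an explicit reaction for MInfoRec (e.g. discarding it) to ToDelete or WaitDeleteReserved.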
