Bug #51168
ceph-osd state machine crash during peering process
Description
[root@ceph-128 ~]# ceph crash info 2021-06-10_15:30:01.935970Z_f5d8ab5b-8613-408b-ac22-75425c5cf672
{
    "os_version_id": "7",
    "assert_condition": "abort",
    "utsname_release": "3.10.0-1127.el7.x86_64",
    "os_name": "CentOS Linux",
    "entity_name": "osd.43",
    "assert_file": "/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc",
    "timestamp": "2021-06-10 15:30:01.935970Z",
    "process_name": "ceph-osd",
    "utsname_machine": "x86_64",
    "assert_line": 7274,
    "utsname_sysname": "Linux",
    "os_version": "7 (Core)",
    "os_id": "centos",
    "assert_thread_name": "tp_osd_tp",
    "utsname_version": "#1 SMP Tue Mar 31 23:36:51 UTC 2020",
    "backtrace": [
        "(()+0xf630) [0x7f0101be1630]",
        "(gsignal()+0x37) [0x7f01009d4387]",
        "(abort()+0x148) [0x7f01009d5a78]",
        "(ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x55859f824b98]",
        "(PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x55859f9d5643]",
        "(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x55859fa2b4e6]",
        "(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1d6) [0x55859fa2be26]",
        "(boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd3) [0x55859fa2b483]",
        "(PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x55859f9efe0d]",
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55859f92c2c4]",
        "(PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55859fb94951]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x914) [0x55859f920d84]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55859fedcfe6]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55859fedfb00]",
        "(()+0x7ea5) [0x7f0101bd9ea5]",
        "(clone()+0x6d) [0x7f0100a9c96d]"
    ],
    "utsname_hostname": "ceph-138",
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f00daa4b700 time 2021-06-10 23:30:01.927379\n/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: 7274: ceph_abort_msg(\"we got a bad state machine event\")\n",
    "crash_id": "2021-06-10_15:30:01.935970Z_f5d8ab5b-8613-408b-ac22-75425c5cf672",
    "assert_func": "PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)",
    "ceph_version": "14.2.18"
}
History
#1 Updated by Yao Ning over 2 years ago
- Project changed from Ceph to RADOS
- Category changed from OSD to Peering
- Component(RADOS) OSD added
#2 Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Can you please attach the osd log for this crash?
#3 Updated by Neha Ojha over 2 years ago
- Subject changed from ceph-osd crash during peering process to ceph-osd state machine crash during peering process
#4 Updated by Neha Ojha over 1 year ago
- Priority changed from High to Normal
#5 Updated by Yao Ning about 1 year ago
- File ceph-osd.56.log-20210618.gz added
The ceph-osd log from the crashed OSD has been uploaded.
#6 Updated by Neha Ojha about 1 year ago
- Description updated (diff)
#7 Updated by Radoslaw Zarzynski about 1 year ago
2021-06-17 11:39:15.500 7f9e210a0700 1 osd.56 pg_epoch: 132192 pg[4.3f1cs7( v 132191'66294 (112885'63200,132191'66294] lb 4:38fee09b:::1000018f792.00001704:head (bitwise) local-lis/les=75975/75976 n=65906 ec=1740/1630 lis/c 75975/73795 les/c/f 75976/73837/0 131099/131099/131062) [1320,1151,1202,743,370,699,294,56,172,1035,194,236,980,82,139,299,963,442,226,876]/[2147483647,734,2147483647,17,63,450,689,351,172,28,194,707,295,158,139,373,522,442,226,105]p734(1) r=-1 lpr=131099 pi=[73795,131099)/36 luod=0'0 crt=132191'66294 active+remapped mbc={}] do_scan: Canceling backfill: Full.
2021-06-17 11:39:15.503 7f9e1d098700 -1 /builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f9e1d098700 time 2021-06-17 11:39:15.501327
/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: 7274: ceph_abort_msg("we got a bad state machine event")
ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x55e189ae5ad0]
2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x55e189c96643]
3: (boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x55e189cec4e6]
4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1d6) [0x55e189cece26]
5: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd3) [0x55e189cec483]
6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x55e189cb0e0d]
7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55e189bed2c4]
8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55e189e55951]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x914) [0x55e189be1d84]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e18a19dfe6]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e18a1a0b00]
12: (()+0x7ea5) [0x7f9e44226ea5]
13: (clone()+0x6d) [0x7f9e430e996d]
#8 Updated by Radoslaw Zarzynski about 1 year ago
The PG was in ReplicaActive, so we shouldn't see any backfill activity. A delayed event maybe?
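To make the "delayed event" idea concrete in general terms, here is a minimal, self-contained C++ sketch. It is hypothetical and not Ceph code (QueuedEvent, PgLike and the member names are illustrative only): each queued peering event carries the epoch it was sent in, and the consumer drops anything that predates the PG's last peering reset before it can ever reach the state machine.

// Hypothetical sketch, not Ceph code: every name below is made up for illustration.
#include <cstdint>
#include <iostream>
#include <string>

struct QueuedEvent {
  uint64_t epoch_sent;   // epoch at which the sender queued the event
  std::string name;
};

struct PgLike {
  uint64_t last_peering_reset = 0;   // bumped whenever the interval changes

  void handle(const QueuedEvent& evt) {
    if (evt.epoch_sent < last_peering_reset) {
      // The event was queued in an old interval: discard it instead of
      // feeding it to the state machine.
      std::cout << "dropping stale event " << evt.name << "\n";
      return;
    }
    std::cout << "processing event " << evt.name << "\n";
    // ...here it would be forwarded to the boost::statechart machine...
  }
};

int main() {
  PgLike pg;
  pg.last_peering_reset = 100;
  pg.handle({95,  "BackfillTooFull"});   // queued before the reset: dropped
  pg.handle({105, "RequestBackfill"});   // current interval: processed
}

This kind of epoch guard only helps when the delayed event crosses an interval boundary; whether that applies here depends on the logs requested above.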
#9 Updated by Neha Ojha about 1 year ago
Nautilus is EOL now and it is also possible that we may have fixed such a bug after 14.2.18.
Can you tell me the PR you think fixed it? I will try this on Pacific; I don't think this is fixed now.
#10 Updated by Yao Ning about 1 year ago
Radoslaw Zarzynski wrote:
The PG was in ReplicaActive, so we shouldn't see any backfill activity. A delayed event maybe?
Data was not being backfilled to this OSD; the data on this OSD was complete before it went down. At most it should have done recovery after rebooting.
#11 Updated by Neha Ojha about 1 year ago
Yao Ning wrote:
Radoslaw Zarzynski wrote:
The PG was in ReplicaActive, so we shouldn't see any backfill activity. A delayed event maybe?
Data was not being backfilled to this OSD; the data on this OSD was complete before it went down. At most it should have done recovery after rebooting.
Are you saying that backfill had completed on this OSD before it crashed? That would align with what Radek suggested: that this OSD received a delayed event after backfill was actually over. If you have more logs from this OSD, they would help us validate this theory.
#12 Updated by Yao Ning about 1 year ago
- File ceph-osd.30.zip added
Radoslaw Zarzynski wrote:
The PG was in ReplicaActive, so we shouldn't see any backfill activity. A delayed event maybe?
The newly uploaded log makes it much more straightforward to see what happened. PG 14.206s1 had already entered Started/ReplicaActive/RepNotRecovering, but then do_scan ("Canceling backfill: Full") issued a PeeringState::BackfillTooFull event that the current state cannot handle.
It seems the PG was in the RepRecovering state but received some event that transitioned it to RepNotRecovering; however, the PG recovery thread was still running the recovery ops, and eventually in do_scan a BackfillTooFull event was sent, so the OSD crashed.
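For readers unfamiliar with where "we got a bad state machine event" comes from, below is a minimal, self-contained boost::statechart sketch. It is not Ceph's code, and the state/event names only loosely mirror the ones in the logs; it just shows the mechanism: each state lists the events it reacts to, anything else hits a catch-all transition into a Crashed state, and Crashed's constructor aborts the process.

// Minimal boost::statechart sketch -- NOT Ceph code; names only loosely mirror the log.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/state.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/mpl/list.hpp>
#include <cstdlib>
#include <iostream>

namespace sc = boost::statechart;
namespace mpl = boost::mpl;

struct RequestBackfill : sc::event<RequestBackfill> {};
struct BackfillTooFull : sc::event<BackfillTooFull> {};

struct RepNotRecovering;
struct RepRecovering;
struct Crashed;

struct Machine : sc::state_machine<Machine, RepNotRecovering> {};

// Entering Crashed means an event reached a state that had no reaction for it.
struct Crashed : sc::state<Crashed, Machine> {
  explicit Crashed(my_context ctx) : my_base(ctx) {
    std::cerr << "we got a bad state machine event\n";  // Ceph calls ceph_abort_msg() here
    std::abort();
  }
};

// RepRecovering knows what to do with BackfillTooFull...
struct RepRecovering : sc::simple_state<RepRecovering, Machine> {
  typedef mpl::list<
      sc::transition<BackfillTooFull, RepNotRecovering>,
      sc::transition<sc::event_base, Crashed>            // catch-all, tried last
      > reactions;
};

// ...but RepNotRecovering does not, so the same event falls through to the catch-all.
struct RepNotRecovering : sc::simple_state<RepNotRecovering, Machine> {
  typedef mpl::list<
      sc::transition<RequestBackfill, RepRecovering>,
      sc::transition<sc::event_base, Crashed>            // catch-all, tried last
      > reactions;
};

int main() {
  Machine m;
  m.initiate();                        // starts in RepNotRecovering
  m.process_event(BackfillTooFull());  // no reaction here -> Crashed -> abort()
}

Processing BackfillTooFull while the machine sits in RepNotRecovering falls through to the catch-all and aborts, which is the same shape as the backtraces above, where RepNotRecovering::react_impl ends up deep-constructing PG::RecoveryState::Crashed.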
#13 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from Need More Info to New
- Assignee set to Radoslaw Zarzynski
I plan to work on this one and combine it with implementing backfill cancellation in crimson. However, it's not a terribly high priority.