Bug #51168

open

ceph-osd state machine crash during peering process

Added by Yao Ning almost 3 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Category:
Peering
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[root@ceph-128 ~]# ceph crash info 2021-06-10_15:30:01.935970Z_f5d8ab5b-8613-408b-ac22-75425c5cf672
{
    "os_version_id": "7", 
    "assert_condition": "abort", 
    "utsname_release": "3.10.0-1127.el7.x86_64", 
    "os_name": "CentOS Linux", 
    "entity_name": "osd.43", 
    "assert_file": "/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc", 
    "timestamp": "2021-06-10 15:30:01.935970Z", 
    "process_name": "ceph-osd", 
    "utsname_machine": "x86_64", 
    "assert_line": 7274, 
    "utsname_sysname": "Linux", 
    "os_version": "7 (Core)", 
    "os_id": "centos", 
    "assert_thread_name": "tp_osd_tp", 
    "utsname_version": "#1 SMP Tue Mar 31 23:36:51 UTC 2020", 
    "backtrace": [
        "(()+0xf630) [0x7f0101be1630]", 
        "(gsignal()+0x37) [0x7f01009d4387]", 
        "(abort()+0x148) [0x7f01009d5a78]", 
        "(ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x55859f824b98]", 
        "(PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x55859f9d5643]", 
        "(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x55859fa2b4e6]", 
        "(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1d6) [0x55859fa2be26]", 
        "(boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd3) [0x55859fa2b483]", 
        "(PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x55859f9efe0d]", 
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55859f92c2c4]", 
        "(PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55859fb94951]", 
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x914) [0x55859f920d84]", 
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55859fedcfe6]", 
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55859fedfb00]", 
        "(()+0x7ea5) [0x7f0101bd9ea5]", 
        "(clone()+0x6d) [0x7f0100a9c96d]" 
    ], 
    "utsname_hostname": "ceph-138", 
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f00daa4b700 time 2021-06-10 23:30:01.927379\n/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: 7274: ceph_abort_msg(\"we got a bad state machine event\")\n", 
    "crash_id": "2021-06-10_15:30:01.935970Z_f5d8ab5b-8613-408b-ac22-75425c5cf672", 
    "assert_func": "PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)", 
    "ceph_version": "14.2.18" 
}


Files

ceph-osd.56.log-20210618.gz (319 KB) ceph-osd.56.log-20210618.gz Yao Ning, 07/06/2022 10:47 AM
ceph-osd.30.zip (505 KB) ceph-osd.30.zip Yao Ning, 08/22/2022 06:10 PM
Actions #1

Updated by Yao Ning almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Peering
  • Component(RADOS) OSD added
Actions #2

Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Need More Info

Can you please attach the osd log for this crash?

Actions #3

Updated by Neha Ojha almost 3 years ago

  • Subject changed from ceph-osd crash during peering process to ceph-osd state machine crash during peering process
Actions #4

Updated by Neha Ojha over 2 years ago

  • Priority changed from High to Normal
Actions #5

Updated by Yao Ning almost 2 years ago

ceph-osd log on crashed osd uploaded

Actions #6

Updated by Neha Ojha almost 2 years ago

  • Description updated (diff)
Actions #7

Updated by Radoslaw Zarzynski almost 2 years ago

2021-06-17 11:39:15.500 7f9e210a0700  1 osd.56 pg_epoch: 132192 pg[4.3f1cs7( v 132191'66294 (112885'63200,132191'66294] lb 4:38fee09b:::1000018f792.00001704:head (bitwise) local-lis/les=75975/75976 n=65906 ec=1740/1630 lis/c 75975/73795 les/c/f 75976/73837/0 131099/131099/131062) [1320,1151,1202,743,370,699,294,56,172,1035,194,236,980,82,139,299,963,442,226,876]/[2147483647,734,2147483647,17,63,450,689,351,172,28,194,707,295,158,139,373,522,442,226,105]p734(1) r=-1 lpr=131099 pi=[73795,131099)/36 luod=0'0 crt=132191'66294 active+remapped mbc={}] do_scan: Canceling backfill: Full.
2021-06-17 11:39:15.503 7f9e1d098700 -1 /builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f9e1d098700 time 2021-06-17 11:39:15.501327
/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: 7274: ceph_abort_msg("we got a bad state machine event")

 ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x55e189ae5ad0]
 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x55e189c96643]
 3: (boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x55e189cec4e6]
 4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1d6) [0x55e189cece26]
 5: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd3) [0x55e189cec483]
 6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x55e189cb0e0d]
 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55e189bed2c4]
 8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55e189e55951]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x914) [0x55e189be1d84]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e18a19dfe6]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e18a1a0b00]
 12: (()+0x7ea5) [0x7f9e44226ea5]
 13: (clone()+0x6d) [0x7f9e430e996d]
Actions #8

Updated by Radoslaw Zarzynski almost 2 years ago

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

Actions #9

Updated by Neha Ojha almost 2 years ago

Nautilus is EOL now and it is also possible that we may have fixed such a bug after 14.2.18.

Can you tell me which PR you think fixed it? I will try this on Pacific; I don't think this is fixed yet.

Actions #10

Updated by Yao Ning almost 2 years ago

Radoslaw Zarzynski wrote:

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

Data was not backfilled to this OSD; the data on this OSD was complete before it went down. After rebooting it should at most do recovery.

Actions #11

Updated by Neha Ojha almost 2 years ago

Yao Ning wrote:

Radoslaw Zarzynski wrote:

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

Data was not backfilled to this OSD; the data on this OSD was complete before it went down. After rebooting it should at most do recovery.

Are you saying that backfill had completed on this OSD before it crashed? That would align with what Radek suggested, that this OSD received a delayed event after backfill was actually over. If you have some more logs from this OSD, it will be helpful for us to validate our theory.

Actions #12

Updated by Yao Ning over 1 year ago

Radoslaw Zarzynski wrote:

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

The newly uploaded log makes it much more straightforward to see what happened. PG 14.206s1 had already entered Started/ReplicaActive/RepNotRecovering, but then do_scan ("Canceling backfill: Full") issued a PeeringState::BackfillTooFull event that the current state cannot handle.

It seems the PG was in the RepRecovering state, but received some event that transitioned it to RepNotRecovering. However, the recovery thread was still running recovery ops, and eventually, in do_scan, a BackfillTooFull event was sent, so the OSD crashed.

Actions #13

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Need More Info to New
  • Assignee set to Radoslaw Zarzynski

I plan to work on this one and combine it with implementing backfill cancellation in crimson. However, it's not a terribly high priority.
