Bug #51168

open

ceph-osd state machine crash during peering process

Added by Yao Ning almost 3 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Category:
Peering
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[root@ceph-128 ~]# ceph crash info 2021-06-10_15:30:01.935970Z_f5d8ab5b-8613-408b-ac22-75425c5cf672
{
    "os_version_id": "7", 
    "assert_condition": "abort", 
    "utsname_release": "3.10.0-1127.el7.x86_64", 
    "os_name": "CentOS Linux", 
    "entity_name": "osd.43", 
    "assert_file": "/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc", 
    "timestamp": "2021-06-10 15:30:01.935970Z", 
    "process_name": "ceph-osd", 
    "utsname_machine": "x86_64", 
    "assert_line": 7274, 
    "utsname_sysname": "Linux", 
    "os_version": "7 (Core)", 
    "os_id": "centos", 
    "assert_thread_name": "tp_osd_tp", 
    "utsname_version": "#1 SMP Tue Mar 31 23:36:51 UTC 2020", 
    "backtrace": [
        "(()+0xf630) [0x7f0101be1630]", 
        "(gsignal()+0x37) [0x7f01009d4387]", 
        "(abort()+0x148) [0x7f01009d5a78]", 
        "(ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x55859f824b98]", 
        "(PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x55859f9d5643]", 
        "(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x55859fa2b4e6]", 
        "(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1d6) [0x55859fa2be26]", 
        "(boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd3) [0x55859fa2b483]", 
        "(PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x55859f9efe0d]", 
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55859f92c2c4]", 
        "(PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55859fb94951]", 
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x914) [0x55859f920d84]", 
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55859fedcfe6]", 
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55859fedfb00]", 
        "(()+0x7ea5) [0x7f0101bd9ea5]", 
        "(clone()+0x6d) [0x7f0100a9c96d]" 
    ], 
    "utsname_hostname": "ceph-138", 
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f00daa4b700 time 2021-06-10 23:30:01.927379\n/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: 7274: ceph_abort_msg(\"we got a bad state machine event\")\n", 
    "crash_id": "2021-06-10_15:30:01.935970Z_f5d8ab5b-8613-408b-ac22-75425c5cf672", 
    "assert_func": "PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)", 
    "ceph_version": "14.2.18" 
}


Files

ceph-osd.56.log-20210618.gz (319 KB) ceph-osd.56.log-20210618.gz Yao Ning, 07/06/2022 10:47 AM
ceph-osd.30.zip (505 KB) ceph-osd.30.zip Yao Ning, 08/22/2022 06:10 PM
Actions #1

Updated by Yao Ning almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Peering
  • Component(RADOS) OSD added
Actions #2

Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Need More Info

Can you please attach the osd log for this crash?

Actions #3

Updated by Neha Ojha almost 3 years ago

  • Subject changed from ceph-osd crash during peering process to ceph-osd state machine crash during peering process
Actions #4

Updated by Neha Ojha over 2 years ago

  • Priority changed from High to Normal
Actions #5

Updated by Yao Ning almost 2 years ago

ceph-osd log on crashed osd uploaded

Actions #6

Updated by Neha Ojha almost 2 years ago

  • Description updated (diff)
Actions #7

Updated by Radoslaw Zarzynski almost 2 years ago

2021-06-17 11:39:15.500 7f9e210a0700  1 osd.56 pg_epoch: 132192 pg[4.3f1cs7( v 132191'66294 (112885'63200,132191'66294] lb 4:38fee09b:::1000018f792.00001704:head (bitwise) local-lis/les=75975/75976 n=65906 ec=1740/1630 lis/c 75975/73795 les/c/f 75976/73837/0 131099/131099/131062) [1320,1151,1202,743,370,699,294,56,172,1035,194,236,980,82,139,299,963,442,226,876]/[2147483647,734,2147483647,17,63,450,689,351,172,28,194,707,295,158,139,373,522,442,226,105]p734(1) r=-1 lpr=131099 pi=[73795,131099)/36 luod=0'0 crt=132191'66294 active+remapped mbc={}] do_scan: Canceling backfill: Full.
2021-06-17 11:39:15.503 7f9e1d098700 -1 /builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f9e1d098700 time 2021-06-17 11:39:15.501327
/builddir/build/BUILD/ceph-14.2.18/src/osd/PG.cc: 7274: ceph_abort_msg("we got a bad state machine event")

 ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x55e189ae5ad0]
 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x55e189c96643]
 3: (boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x55e189cec4e6]
 4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1d6) [0x55e189cece26]
 5: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd3) [0x55e189cec483]
 6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x55e189cb0e0d]
 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55e189bed2c4]
 8: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55e189e55951]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x914) [0x55e189be1d84]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e18a19dfe6]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e18a1a0b00]
 12: (()+0x7ea5) [0x7f9e44226ea5]
 13: (clone()+0x6d) [0x7f9e430e996d]
Actions #8

Updated by Radoslaw Zarzynski almost 2 years ago

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

Actions #9

Updated by Neha Ojha almost 2 years ago

Nautilus is EOL now and it is also possible that we may have fixed such a bug after 14.2.18.

Can you tell me which PR you think fixed it? I will try this on Pacific; I don't think this is fixed yet.

Actions #10

Updated by Yao Ning almost 2 years ago

Radoslaw Zarzynski wrote:

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

Data was not backfilled to this OSD; the data on this OSD was complete before it went down. After rebooting it should at most do recovery.

Actions #11

Updated by Neha Ojha almost 2 years ago

Yao Ning wrote:

Radoslaw Zarzynski wrote:

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

Data was not backfilled to this OSD; the data on this OSD was complete before it went down. After rebooting it should at most do recovery.

Are you saying that backfill had completed on this OSD before it crashed? That would align with what Radek suggested, that this OSD received a delayed event after backfill was actually over. If you have some more logs from this OSD, it will be helpful for us to validate our theory.

Actions #12

Updated by Yao Ning over 1 year ago

Radoslaw Zarzynski wrote:

The PG was in ReplicaActive so we shouldn't see any backfill activity. A delayed event maybe?

The newly uploaded log makes it much more straightforward to see what happened. PG 14.206s1 had already entered Started/ReplicaActive/RepNotRecovering, but then do_scan ("Canceling backfill: Full") issued a PeeringState::BackfillTooFull event that the current state cannot handle.

It seems the PG was in the RepRecovering state, but received some event that transitioned it to RepNotRecovering. However, the recovery thread was still running recovery ops, and eventually, in do_scan, a BackfillTooFull event was sent, so the OSD crashed.

Actions #13

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Need More Info to New
  • Assignee set to Radoslaw Zarzynski

I plan to work on this one and combine it with implementing backfill cancellation in crimson. However, it's not a terribly high priority.
