Bug #39145

luminous: jewel-x-singleton: FAILED assert(0 == "we got a bad state machine event")

Added by Neha Ojha about 5 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-04-04T01:31:20.003 INFO:tasks.ceph.osd.3.smithi089.stderr:osd/PG.cc: 5935: FAILED assert(0 == "we got a bad state machine event")
2019-04-04T01:31:20.003 INFO:tasks.ceph.osd.3.smithi089.stderr: ceph version 10.2.11-12-g44538b3 (44538b38d1dcd39a635c6b36ba98bbc3ab51002a)
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x558d04923250]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc7) [0x558d04286757]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 3: (()+0x4e90c6) [0x558d042c10c6]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2d6) [0x558d042f0de6]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 5: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb7) [0x558d042f1c87]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x558d042d1159]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x395) [0x558d042a4bd5]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 8: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2d4) [0x558d041ed994]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 9: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x25) [0x558d042368e5]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x558d04915421]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 11: (ThreadPool::WorkThread::entry()+0x10) [0x558d04916520]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 12: (()+0x76ba) [0x7efe43d836ba]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 13: (clone()+0x6d) [0x7efe41dfb41d]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/yuriw-2019-04-04_00:00:53-rados-luminous-distro-basic-smithi/3806075/
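
For readers unfamiliar with the recovery state machine: frames 2-6 of the backtrace show the event being offered to RepNotRecovering and then ReplicaActive, and finally landing in the Crashed constructor, which is where the assert fires. Below is a minimal, self-contained boost::statechart sketch of that mechanism; the state and event names mirror the ones in this report, but it is toy code, not the actual osd/PG.cc machine.

// Minimal boost::statechart sketch of the failure mode in the backtrace: an
// inner state with no reaction for a delivered event lets it bubble up to a
// catch-all transition on event_base, which enters Crashed, whose
// constructor asserts. Toy code only; the real machine lives in osd/PG.cc.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <cassert>

namespace sc = boost::statechart;

// Events, named after the peering events seen in the logs below.
struct RecoveryDone : sc::event<RecoveryDone> {};
struct RemoteReservationRejected : sc::event<RemoteReservationRejected> {};

struct Started;
struct Crashed;
struct RepNotRecovering;

struct RecoveryMachine : sc::state_machine<RecoveryMachine, Started> {};

// Entering Crashed aborts, mirroring
//   osd/PG.cc: 5935: FAILED assert(0 == "we got a bad state machine event")
struct Crashed : sc::simple_state<Crashed, RecoveryMachine> {
  Crashed() { assert(!"we got a bad state machine event"); }
};

// Outer state with a catch-all: any event not consumed by an inner state
// falls through to this transition into Crashed.
struct Started : sc::simple_state<Started, RecoveryMachine, RepNotRecovering> {
  typedef sc::transition<sc::event_base, Crashed> reactions;
};

// Inner state that only reacts to RecoveryDone; it has no reaction for
// RemoteReservationRejected, like the jewel replica in this report.
struct RepNotRecovering : sc::simple_state<RepNotRecovering, Started> {
  typedef sc::transition<RecoveryDone, RepNotRecovering> reactions;
};

int main() {
  RecoveryMachine m;
  m.initiate();
  m.process_event(RecoveryDone());               // handled: re-enters the state
  m.process_event(RemoteReservationRejected());  // unhandled -> Crashed -> assert
}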

#1

Updated by David Zafman about 5 years ago

From the osd log, the crashing thread's activity just before the assert:

2019-04-04 01:31:19.998750 7efe22bbe700 10 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] handle_peering_event: epoch_sent: 1659 epoch_requested: 1659 RemoteReservationRejected
2019-04-04 01:31:19.998764 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive/RepNotRecovering 1.919486 4 0.000025
2019-04-04 01:31:19.998772 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive 28.242477 0 0.000000
2019-04-04 01:31:19.998780 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started 41.168727 0 0.000000
2019-04-04 01:31:19.998785 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] enter Crashed
2019-04-04 01:31:20.001191 7efe22bbe700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7efe22bbe700 time 2019-04-04 01:31:19.998789
osd/PG.cc: 5935: FAILED assert(0 == "we got a bad state machine event")

#2

Updated by Neha Ojha about 5 years ago

  • Status changed from New to 12

Fails in 1 out of 20 runs http://pulpito.ceph.com/nojha-2019-04-09_17:54:07-rados:upgrade:jewel-x-singleton-luminous-distro-basic-smithi/

This happens when the primary has already upgraded to luminous, while a backfill peer has not.

osd.3, the backfill peer, finishes backfill and sends an ack:

2019-04-04 01:31:18.079173 7efe1dbb4700 10 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] lb 9:cffe4052:::benchmark_data_smithi105_13437_object84095:head (bitwise) local-les=1641 n=2772 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] do_backfill pg_backfill(finish 9.13 e 1659/1659 lb MAX) v3
2019-04-04 01:31:18.079198 7efe1dbb4700  1 -- 172.21.15.89:6809/24369 --> 172.21.15.191:6805/51694 -- pg_backfill(finish_ack 9.13 e 1660/1660 lb MIN) v3 -- ?+0 0x558d10f5f900 con 0x558d12aa0c00

after which it exits RepRecovering and re-enters RepNotRecovering:

2019-04-04 01:31:18.079260 7efe233bf700 10 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] handle_peering_event: epoch_sent: 1660 epoch_requested: 1660 RecoveryDone
2019-04-04 01:31:18.079271 7efe233bf700  5 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive/RepRecovering 7.901447 16 0.000126
2019-04-04 01:31:18.079278 7efe233bf700  5 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] enter Started/ReplicaActive/RepNotRecovering

The primary, osd.0, receives the ack:

2019-04-04 01:31:18.079299 7f24f1fa4700 10 osd.0 1659 dequeue_op 0x562e98ef19c0 prio 3 cost 0 latency 0.000134 pg_backfill(finish_ack 9.13 e 1660/1660 lb MIN) v3 pg pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}]
2019-04-04 01:31:18.079312 7f24f1fa4700 20 osd.0 1659 share_map osd.3 172.21.15.89:6809/24369 1660
2019-04-04 01:31:18.079315 7f24f1fa4700 20 osd.0 1659 should_share_map osd.3 172.21.15.89:6809/24369 1660
2019-04-04 01:31:18.079319 7f24f1fa4700 20 osd.0 pg_epoch: 1659 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}] do_request min 1660, queue on waiting_for_map osd.3
2019-04-04 01:31:19.930217 7f250e937700 10 osd.0 1659 send_pg_stats - 64 pgs updated
2019-04-04 01:31:19.930282 7f250e937700 25 osd.0 1659  sending 9.13 1659:12640

The problem looks to be caused by this REJECT, which the primary sends to its backfill peers while handling DeferBackfill:

2019-04-04 01:31:19.998411 7f24f7fb0700 10 osd.0 pg_epoch: 1659 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}] handle_peering_event: epoch_sent: 1646 epoch_requested: 1646 DeferBackfill: delay 0
2019-04-04 01:31:19.998446 7f24f7fb0700 10 osd.0 pg_epoch: 1659 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}] state<Started/Primary/Active/Backfilling>: defer backfill, retry delay 0
2019-04-04 01:31:19.998471 7f24f7fb0700  1 -- 172.21.15.191:6805/51694 --> 172.21.15.89:6809/24369 -- MBackfillReserve REJECT  pgid: 9.13, query_epoch: 1659 v3 -- 0x562ea11c5c00 con 0
2019-04-04 01:31:19.998501 7f24f7fb0700  1 -- 172.21.15.191:6805/51694 --> 172.21.15.89:6813/4088 -- MBackfillReserve REJECT  pgid: 9.13, query_epoch: 1659 v3 -- 0x562e98e5b400 con 0
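
For context, here is a toy sketch of the primary-side flow these log lines suggest: while in Backfilling, a DeferBackfill event makes osd.0 send an MBackfillReserve REJECT to each backfill target (bft=3,5) and back off its own backfill. ToyPrimary, on_defer_backfill, and the printed strings are made up for illustration only; the real reaction is implemented in the Started/Primary/Active/Backfilling state in osd/PG.cc.

// Toy illustration only; not the actual luminous code.
#include <iostream>
#include <vector>

struct ToyPrimary {
  std::vector<int> backfill_targets;  // bft=3,5 in the log above

  void on_defer_backfill() {
    for (int osd : backfill_targets) {
      // The log shows one "MBackfillReserve REJECT" message per target.
      std::cout << "send MBackfillReserve REJECT -> osd." << osd << "\n";
    }
    // Matches the "defer backfill, retry delay 0" log line: back off
    // locally and retry the backfill reservation later.
    std::cout << "defer backfill locally, retry later\n";
  }
};

int main() {
  ToyPrimary primary{{3, 5}};
  primary.on_defer_backfill();
}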

The backfill peer crashes after receiving it, probably because it was not expecting a REJECT in its current state:

2019-04-04 01:31:19.998694 7efe2abce700  1 -- 172.21.15.89:6809/24369 <== osd.0 172.21.15.191:6805/51694 54109 ==== MBackfillReserve REJECT  pgid: 9.13, query_epoch: 1659 v3 ==== 30+0+0 (565952711 0 0) 0x558d12cace00 con 0x558d12aa0c00

2019-04-04 01:31:19.998750 7efe22bbe700 10 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] handle_peering_event: epoch_sent: 1659 epoch_requested: 1659 RemoteReservationRejected
2019-04-04 01:31:19.998764 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive/RepNotRecovering 1.919486 4 0.000025
2019-04-04 01:31:19.998772 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive 28.242477 0 0.000000
2019-04-04 01:31:19.998780 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started 41.168727 0 0.000000
2019-04-04 01:31:19.998785 7efe22bbe700  5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] enter Crashed

#3

Updated by Neha Ojha about 5 years ago

Ah, that's because a jewel OSD does not know how to handle this REJECT in the Started/ReplicaActive/RepNotRecovering state.
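
Building on the toy machine sketched in the description, one way a replica could tolerate the stray REJECT is to give RepNotRecovering an explicit reaction for RemoteReservationRejected that simply discards the event, so it never reaches the catch-all transition into Crashed. This is only a sketch of the idea under that assumption, not the actual jewel code or any merged fix.

// Toy sketch: consuming the unexpected REJECT instead of crashing.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/custom_reaction.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <cassert>

namespace sc = boost::statechart;

struct RemoteReservationRejected : sc::event<RemoteReservationRejected> {};

struct Started;
struct Crashed;
struct RepNotRecovering;

struct RecoveryMachine : sc::state_machine<RecoveryMachine, Started> {};

struct Crashed : sc::simple_state<Crashed, RecoveryMachine> {
  Crashed() { assert(!"we got a bad state machine event"); }
};

struct Started : sc::simple_state<Started, RecoveryMachine, RepNotRecovering> {
  typedef sc::transition<sc::event_base, Crashed> reactions;  // catch-all
};

struct RepNotRecovering : sc::simple_state<RepNotRecovering, Started> {
  // New: an explicit reaction consumes the REJECT before the catch-all sees it.
  typedef sc::custom_reaction<RemoteReservationRejected> reactions;
  sc::result react(const RemoteReservationRejected&) {
    return discard_event();  // nothing to cancel here; ignore the stray REJECT
  }
};

int main() {
  RecoveryMachine m;
  m.initiate();
  m.process_event(RemoteReservationRejected());  // now harmless: discarded
}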

#4

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New