Bug #39145
luminous: jewel-x-singleton: FAILED assert(0 == "we got a bad state machine event")
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2019-04-04T01:31:20.003 INFO:tasks.ceph.osd.3.smithi089.stderr:osd/PG.cc: 5935: FAILED assert(0 == "we got a bad state machine event")
2019-04-04T01:31:20.003 INFO:tasks.ceph.osd.3.smithi089.stderr: ceph version 10.2.11-12-g44538b3 (44538b38d1dcd39a635c6b36ba98bbc3ab51002a)
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x558d04923250]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc7) [0x558d04286757]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 3: (()+0x4e90c6) [0x558d042c10c6]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2d6) [0x558d042f0de6]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 5: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb7) [0x558d042f1c87]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x558d042d1159]
2019-04-04T01:31:20.004 INFO:tasks.ceph.osd.3.smithi089.stderr: 7: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x395) [0x558d042a4bd5]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 8: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2d4) [0x558d041ed994]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 9: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x25) [0x558d042368e5]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x558d04915421]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 11: (ThreadPool::WorkThread::entry()+0x10) [0x558d04916520]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 12: (()+0x76ba) [0x7efe43d836ba]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: 13: (clone()+0x6d) [0x7efe41dfb41d]
2019-04-04T01:31:20.005 INFO:tasks.ceph.osd.3.smithi089.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
/a/yuriw-2019-04-04_00:00:53-rados-luminous-distro-basic-smithi/3806075/
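The mechanics behind frames 2-6: PG's recovery machinery is a boost::statechart state machine, and an event that no active state has a reaction for falls through to a catch-all transition into the Crashed state, whose constructor fires this assert. A minimal sketch of that behavior (NOT Ceph's actual code; the dispatch table is a simplification, with state and event names taken from the log):

```cpp
// Simplified model of how the PG recovery state machine ends up in Crashed:
// an event with no matching reaction in the active state falls through to a
// catch-all transition into Crashed, whose constructor asserts in the real
// code: assert(0 == "we got a bad state machine event").
#include <cassert>
#include <set>
#include <string>

struct RecoveryMachine {
    std::string state = "Started/ReplicaActive/RepNotRecovering";
    bool crashed = false;

    // Events this (jewel-era) state has reactions for. Note that
    // RemoteReservationRejected is absent, mirroring the bug; the exact
    // reaction set here is illustrative, not copied from osd/PG.cc.
    std::set<std::string> reactions{"RecoveryDone", "RemoteBackfillReserved"};

    void process_event(const std::string& ev) {
        if (reactions.count(ev))
            return;  // a real reaction would transit between Started substates
        // Catch-all: exit RepNotRecovering, then ReplicaActive, then Started,
        // and enter Crashed -- exactly the exit/enter sequence in the log.
        state = "Crashed";
        crashed = true;  // real code asserts here instead of setting a flag
    }
};
```

Feeding this machine a RemoteReservationRejected, as the luminous primary's REJECT message does, lands it in Crashed, while events in the reaction set are absorbed normally.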
Updated by David Zafman about 5 years ago
From the osd log, showing the same thread's activity just before the crash:
2019-04-04 01:31:19.998750 7efe22bbe700 10 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] handle_peering_event: epoch_sent: 1659 epoch_requested: 1659 RemoteReservationRejected
2019-04-04 01:31:19.998764 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive/RepNotRecovering 1.919486 4 0.000025
2019-04-04 01:31:19.998772 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive 28.242477 0 0.000000
2019-04-04 01:31:19.998780 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started 41.168727 0 0.000000
2019-04-04 01:31:19.998785 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] enter Crashed
2019-04-04 01:31:20.001191 7efe22bbe700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7efe22bbe700 time 2019-04-04 01:31:19.998789
osd/PG.cc: 5935: FAILED assert(0 == "we got a bad state machine event")
Updated by Neha Ojha about 5 years ago
- Status changed from New to 12
Fails in 1 out of 20 runs http://pulpito.ceph.com/nojha-2019-04-09_17:54:07-rados:upgrade:jewel-x-singleton-luminous-distro-basic-smithi/
This happens when the primary has already upgraded to luminous, while a backfill peer has not.
osd.3, the backfill peer, finishes backfill and sends an ack:
2019-04-04 01:31:18.079173 7efe1dbb4700 10 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] lb 9:cffe4052:::benchmark_data_smithi105_13437_object84095:head (bitwise) local-les=1641 n=2772 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] do_backfill pg_backfill(finish 9.13 e 1659/1659 lb MAX) v3
2019-04-04 01:31:18.079198 7efe1dbb4700 1 -- 172.21.15.89:6809/24369 --> 172.21.15.191:6805/51694 -- pg_backfill(finish_ack 9.13 e 1660/1660 lb MIN) v3 -- ?+0 0x558d10f5f900 con 0x558d12aa0c00
after which the following happens:
2019-04-04 01:31:18.079260 7efe233bf700 10 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] handle_peering_event: epoch_sent: 1660 epoch_requested: 1660 RecoveryDone
2019-04-04 01:31:18.079271 7efe233bf700 5 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive/RepRecovering 7.901447 16 0.000126
2019-04-04 01:31:18.079278 7efe233bf700 5 osd.3 pg_epoch: 1660 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] enter Started/ReplicaActive/RepNotRecovering
The primary receives the ack:
2019-04-04 01:31:18.079299 7f24f1fa4700 10 osd.0 1659 dequeue_op 0x562e98ef19c0 prio 3 cost 0 latency 0.000134 pg_backfill(finish_ack 9.13 e 1660/1660 lb MIN) v3 pg pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}]
2019-04-04 01:31:18.079312 7f24f1fa4700 20 osd.0 1659 share_map osd.3 172.21.15.89:6809/24369 1660
2019-04-04 01:31:18.079315 7f24f1fa4700 20 osd.0 1659 should_share_map osd.3 172.21.15.89:6809/24369 1660
2019-04-04 01:31:18.079319 7f24f1fa4700 20 osd.0 pg_epoch: 1659 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}] do_request min 1660, queue on waiting_for_map osd.3
2019-04-04 01:31:19.930217 7f250e937700 10 osd.0 1659 send_pg_stats - 64 pgs updated
2019-04-04 01:31:19.930282 7f250e937700 25 osd.0 1659 sending 9.13 1659:12640
Looks like the problem is caused by this REJECT:
2019-04-04 01:31:19.998411 7f24f7fb0700 10 osd.0 pg_epoch: 1659 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}] handle_peering_event: epoch_sent: 1646 epoch_requested: 1646 DeferBackfill: delay 0
2019-04-04 01:31:19.998446 7f24f7fb0700 10 osd.0 pg_epoch: 1659 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-lis/les=1633/1641 n=2784 ec=1608/1608 lis/c 1633/1609 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=0 lpr=1633 pi=[1608,1633)/2 rops=2 bft=3,5 crt=1659'4230 lcod 1659'4229 mlcod 1656'4228 active+remapped+backfilling mbc={255={}}] state<Started/Primary/Active/Backfilling>: defer backfill, retry delay 0
2019-04-04 01:31:19.998471 7f24f7fb0700 1 -- 172.21.15.191:6805/51694 --> 172.21.15.89:6809/24369 -- MBackfillReserve REJECT pgid: 9.13, query_epoch: 1659 v3 -- 0x562ea11c5c00 con 0
2019-04-04 01:31:19.998501 7f24f7fb0700 1 -- 172.21.15.191:6805/51694 --> 172.21.15.89:6813/4088 -- MBackfillReserve REJECT pgid: 9.13, query_epoch: 1659 v3 -- 0x562e98e5b400 con 0
The backfill peer crashes after receiving it, probably because it wasn't expecting it:
2019-04-04 01:31:19.998694 7efe2abce700 1 -- 172.21.15.89:6809/24369 <== osd.0 172.21.15.191:6805/51694 54109 ==== MBackfillReserve REJECT pgid: 9.13, query_epoch: 1659 v3 ==== 30+0+0 (565952711 0 0) 0x558d12cace00 con 0x558d12aa0c00
2019-04-04 01:31:19.998750 7efe22bbe700 10 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] handle_peering_event: epoch_sent: 1659 epoch_requested: 1659 RemoteReservationRejected
2019-04-04 01:31:19.998764 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive/RepNotRecovering 1.919486 4 0.000025
2019-04-04 01:31:19.998772 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started/ReplicaActive 28.242477 0 0.000000
2019-04-04 01:31:19.998780 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] exit Started 41.168727 0 0.000000
2019-04-04 01:31:19.998785 7efe22bbe700 5 osd.3 pg_epoch: 1661 pg[9.13( v 1659'4230 (1606'2719,1659'4230] local-les=1641 n=2784 ec=1608 les/c/f 1641/1609/0 1632/1633/1633) [3,5]/[0,2] r=-1 lpr=1633 pi=1608-1632/2 luod=0'0 crt=1659'4230 active+remapped] enter Crashed
Updated by Neha Ojha about 5 years ago
Ah, that's because a jewel osd does not know how to deal with this REJECT in the Started/ReplicaActive/RepNotRecovering state.
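Two plausible shapes for a fix, sketched below purely as assumptions (this is NOT the actual Ceph patch, and the feature-bit value is illustrative): the luminous primary could suppress the REJECT for peers that do not advertise luminous support, and/or the replica state could gain a harmless reaction for RemoteReservationRejected in RepNotRecovering instead of falling through to Crashed.

```cpp
#include <cstdint>
#include <string>

// Hypothetical feature bit; Ceph's real flag is named
// CEPH_FEATURE_SERVER_LUMINOUS, but its actual bit position is not
// reproduced here.
constexpr uint64_t FEATURE_SERVER_LUMINOUS = 1ull << 40;  // illustrative value

// (a) Primary-side guard: only send MBackfillReserve REJECT to peers that
//     understand it; a pre-luminous (jewel) peer would crash on it in
//     RepNotRecovering, as seen in this ticket.
bool should_send_backfill_reject(uint64_t peer_features) {
    return (peer_features & FEATURE_SERVER_LUMINOUS) != 0;
}

// (b) Replica-side guard: treat RemoteReservationRejected as a no-op in
//     RepNotRecovering; any other unhandled event still goes to Crashed.
enum class Reaction { Ignore, Crash };
Reaction rep_not_recovering_react(const std::string& event) {
    if (event == "RemoteReservationRejected")
        return Reaction::Ignore;  // nothing is reserved; nothing to release
    return Reaction::Crash;
}
```

Either guard alone would avoid this crash; (a) keeps the wire protocol compatible with jewel during the upgrade window, while (b) hardens the receiving state machine.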