Project

General

Profile

Bug #20804

CancelRecovery event in NotRecovering state

Added by Sage Weil 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
Start date:
07/27/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No
Component(RADOS):

Description

   -17> 2017-07-27 19:15:02.980770 7f255c296700 10 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+recovering+degraded+remapped m=745 u=745] handle_peering_event: epoch_sent: 433 epoch_requested: 433 CancelRecovery
   -16> 2017-07-27 19:15:02.982136 7f255c296700  1 -- 172.21.15.97:6805/350272 --> 172.21.15.97:6801/347616 -- MRecoveryReserve(2.2s2 RELEASE e434) v2 -- ?+0 0x557868588f40 con 0x557867b434e0
   -15> 2017-07-27 19:15:02.982151 7f255c296700  1 -- 172.21.15.97:6805/350272 --> 172.21.15.97:6813/354522 -- MRecoveryReserve(2.2s1 RELEASE e434) v2 -- ?+0 0x55786a7a0640 con 0x557869747de0
   -14> 2017-07-27 19:15:02.982165 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+degraded+remapped m=745 u=745] exit Started/Primary/Active/Recovering 40.300073 67 0.307948
   -13> 2017-07-27 19:15:02.983357 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+degraded+remapped m=745 u=745] enter Started/Primary/Active/NotRecovering
   -12> 2017-07-27 19:15:02.985677 7f255c296700 15 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+degraded+remapped m=745 u=745] publish_stats_to_osd 434:1929
   -11> 2017-07-27 19:15:02.986834 7f255c296700 10 log is not dirty
   -10> 2017-07-27 19:15:02.986857 7f255c296700 10 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+degraded+remapped m=745 u=745] handle_peering_event: epoch_sent: 434 epoch_requested: 434 CancelRecovery
    -9> 2017-07-27 19:15:02.988013 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+degraded+remapped m=745 u=745] exit Started/Primary/Active/NotRecovering 0.004655 1 0.004705
    -8> 2017-07-27 19:15:02.989168 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+degraded+remapped m=745 u=745] exit Started/Primary/Active 42.540389 0 0.000000
    -7> 2017-07-27 19:15:02.990340 7f255c296700 20 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745 u=745] agent_stop
    -6> 2017-07-27 19:15:02.991492 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745 u=745] exit Started/Primary 42.745919 0 0.000000
    -5> 2017-07-27 19:15:02.992645 7f255c296700 10 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745 u=745] clear_primary_state
    -4> 2017-07-27 19:15:02.995032 7f255c296700 10 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 luod=0'0 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745] release_backoffs [2:40000000::::head,2:80000000::::head)
    -3> 2017-07-27 19:15:02.995051 7f255c296700 20 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 luod=0'0 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745] agent_stop
    -2> 2017-07-27 19:15:02.995068 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 luod=0'0 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745] exit Started 42.749548 0 0.000000
    -1> 2017-07-27 19:15:02.995079 7f255c296700  5 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 luod=0'0 rops=1 crt=334'1377 mlcod 0'0 active+remapped m=745] enter Crashed
     0> 2017-07-27 19:15:03.001236 7f255c296700 -1 /build/ceph-12.1.1-763-g93d27e3/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f255c296700 time 2017-07-27 19:15:02.995097
/build/ceph-12.1.1-763-g93d27e3/src/osd/PG.cc: 5951: FAILED assert(0 == "we got a bad state machine event")

 ceph version 12.1.1-763-g93d27e3 (93d27e38d4b3159749add990fa1800c51bd74781) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55785c008432]
 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x135) [0x55785bac1165]
 3: (()+0x5d6276) [0x55785bb02276]
 4: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x14d) [0x55785bb4204d]
 5: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x149) [0x55785bb3b4b9]
 6: (boost::statechart::simple_state<PG::RecoveryState::NotRecovering, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x5e) [0x55785bb3c8ee]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x55785bb15f29]
 8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x38d) [0x55785badfdfd]

/a/sage-2017-07-27_18:08:03-rados-wip-sage-testing-distro-basic-smithi/1453461

This seems to be because two CancelRecoery events are in the queue. The first takes us from Recovering to NotRecovering, the second crashes us.

They come from

 -3238> 2017-07-27 19:15:01.597538 7f255aa93700 20 osd.0 pg_epoch: 433 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+recovering+degraded+remapped m=745 u=745] discover_all_missing: osd.5(2): we already have pg_missing_t
 -3237> 2017-07-27 19:15:01.598027 7f255aa93700 10 osd.0 434 do_recovery: no luck, giving up on this pg for now (in recovery)
...
   -21> 2017-07-27 19:15:02.980291 7f255aa93700 20 osd.0 pg_epoch: 434 pg[2.2s0( v 334'1377 lc 109'499 (0'0,334'1377] local-lis/les=410/411 n=1202 ec=17/17 lis/c 410/203 les/c/f 411/247/0 410/410/311) [0,2147483647,1]/[0,2,1] r=0 lpr=410 pi=[203,410)/6 rops=1 crt=334'1377 mlcod 0'0 active+recovering+degraded+remapped m=745 u=745] discover_all_missing: osd.5(2): we already have pg_missing_t
   -20> 2017-07-27 19:15:02.980733 7f255aa93700 10 osd.0 435 do_recovery: no luck, giving up on this pg for now (in recovery)

History

#1 Updated by Sage Weil 5 months ago

Easy fix is to make CancelRecovery from NotRecovering a no-op.

Unsure whether this could happen in other states besides NotRecovering.. I'm thinking not?

#2 Updated by Sage Weil 5 months ago

  • Status changed from Verified to Need Review

#3 Updated by Sage Weil 5 months ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF