Bug #44248

Receiving RemoteBackfillReserved in WaitLocalBackfillReserved can cause the osd to crash

Added by Neha Ojha about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee: Neha Ojha
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID: 33525
Crash signature (v1):
Crash signature (v2):

Description

 -1036> 2020-02-22 06:57:01.317 7fe1c59fd700 10 osd.1 4194 promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
 -1035> 2020-02-22 06:57:01.296 7fe1adfc8700 -1 *** Caught signal (Aborted) **
 in thread 7fe1adfc8700 thread_name:tp_osd_tp

 ceph version 14.2.7-761-ga1c7b3f (a1c7b3fefab0e35f18ac472c35fa69a7aa0edc16) nautilus (stable)
 1: (()+0xf5f0) [0x7fe1ce9b45f0]
 2: (gsignal()+0x37) [0x7fe1cd7a8337]
 3: (abort()+0x148) [0x7fe1cd7a9a28]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x5606370a4194]
 5: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc3) [0x5606372523f3]
 6: (boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::deep_construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>&)+0x36) [0x5606372a81e6]
 7: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x169) [0x5606372a8d59]
 8: (boost::statechart::simple_state<PG::RecoveryState::WaitLocalBackfillReserved, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x30) [0x56063729e0e0]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x56063726c4ed]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x5606371a8cb4]
 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x560637411961]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x56063719d98f]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x560637742056]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560637744b70]
 15: (()+0x7e65) [0x7fe1ce9ace65]
 16: (clone()+0x6d) [0x7fe1cd87088d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/nojha-2020-02-22_02:37:54-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4789209/


Related issues

Copied to RADOS - Backport #44324: nautilus: Receiving RemoteBackfillReserved in WaitLocalBackfillReserved can cause the osd to crash Resolved

History

#1 Updated by Neha Ojha about 4 years ago

2020-02-22 06:40:21.130 7fe1adfc8700 10 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] do_peering_event: epoch_sent: 4030 epoch_requested: 4030 RequestBackfill
2020-02-22 06:40:21.130 7fe1adfc8700  5 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] exit Started/Primary/Active/NotBackfilling 30.000102 2 0.000354
2020-02-22 06:40:21.130 7fe1adfc8700  5 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] enter Started/Primary/Active/WaitLocalBackfillReserved
2020-02-22 06:40:21.130 7fe1adfc8700 20 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=137] get_backfill_priority backfill priority is 110
.
.
2020-02-22 06:56:41.102 7fe1adfc8700 10 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=137] do_peering_event: epoch_sent: 4184 epoch_requested: 4184 RemoteBackfillReserved
2020-02-22 06:56:41.102 7fe1adfc8700  5 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=137] exit Started/Primary/Active/WaitLocalBackfillReserved 979.972263 200 0.148866
2020-02-22 06:56:41.102 7fe1adfc8700  5 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=137] exit Started/Primary/Active 4144.796728 0 0.000000
2020-02-22 06:56:41.102 7fe1adfc8700 20 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] agent_stop
2020-02-22 06:56:41.102 7fe1adfc8700  5 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] exit Started/Primary 4180.777693 0 0.000000
2020-02-22 06:56:41.102 7fe1adfc8700 10 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] clear_primary_state
2020-02-22 06:56:41.102 7fe1adfc8700 10 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 luod=0'0 crt=3470'437 lcod 3470'436 mlcod 0'0 active+remapped mbc={}] release_backoffs [43:84000000::::0,43:88000000::::head)
2020-02-22 06:56:41.102 7fe1adfc8700 20 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 luod=0'0 crt=3470'437 lcod 3470'436 mlcod 0'0 active+remapped mbc={}] agent_stop
2020-02-22 06:56:41.102 7fe1adfc8700  5 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 luod=0'0 crt=3470'437 lcod 3470'436 mlcod 0'0 active+remapped mbc={}] exit Started 4180.777799 0 0.000000
2020-02-22 06:56:41.103 7fe1adfc8700  5 osd.1 pg_epoch: 4194 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 luod=0'0 crt=3470'437 lcod 3470'436 mlcod 0'0 active+remapped mbc={}] enter Crashed

The problem seems to be that we start processing a RemoteBackfillReserved event while in Started/Primary/Active/WaitLocalBackfillReserved, which has no reaction for it.
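
To make the crash mechanics concrete: in boost::statechart, an event that the innermost active state has no reaction for escalates to its outer states, and frames 8 down to 5 of the trace show exactly this: WaitLocalBackfillReserved::react_impl finds no reaction, Primary's react_impl fires a catch-all transition, and Crashed's constructor calls ceph_abort(). Below is a minimal, self-contained sketch of that pattern; the Machine type and the abort message are hypothetical stand-ins for the real PG::RecoveryState machinery, and only the state/event names mirror the trace.

#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <cstdlib>
#include <iostream>

namespace sc = boost::statechart;

struct RemoteBackfillReserved : sc::event<RemoteBackfillReserved> {};

struct Primary;
struct WaitLocalBackfillReserved;
struct Crashed;

struct Machine : sc::state_machine<Machine, Primary> {};

// Outer state: a catch-all transition fires for any event no inner state
// handled, mirroring the transition<event_base, Crashed> seen in the trace.
struct Primary : sc::simple_state<Primary, Machine, WaitLocalBackfillReserved> {
  typedef sc::transition<sc::event_base, Crashed> reactions;
};

// Inner state with no reaction for RemoteBackfillReserved: the event
// escalates to Primary and triggers the catch-all.
struct WaitLocalBackfillReserved
  : sc::simple_state<WaitLocalBackfillReserved, Primary> {};

// The constructor aborts, like PG::RecoveryState::Crashed::Crashed does
// via ceph_abort() (frames 4-5 of the trace).
struct Crashed : sc::state<Crashed, Machine> {
  explicit Crashed(my_context ctx) : my_base(ctx) {
    std::cerr << "unhandled event -> Crashed, aborting" << std::endl;
    std::abort();
  }
};

int main() {
  Machine m;
  m.initiate();
  m.process_event(RemoteBackfillReserved());  // never returns
}

Compiling and running this prints the message and aborts, just as the OSD did when the stale RemoteBackfillReserved arrived.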

#2 Updated by Neha Ojha about 4 years ago

This is when the epoch 4184 RemoteBackfillReserved was enqueued:

2020-02-22 06:37:13.927 7fe1a9fc0700 10 osd.1 pg_epoch: 3990 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=129] state<Started/Primary/Active/WaitRemoteBackfillReserved>: react num_bytes 4574262
2020-02-22 06:37:13.927 7fe1a9fc0700 20 osd.1 pg_epoch: 3990 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=129] get_backfill_priority backfill priority is 110
2020-02-22 06:37:13.927 7fe1cb287700  1 -- [v2:172.21.15.136:6818/353732,v1:172.21.15.136:6820/353732] <== osd.8 v1:172.21.15.70:6813/164622 1600 ==== MBackfillReserve(43.21 GRANT e4184/4184) v4 ==== 30+0+0 (unknown 3284015754 0 0) 0x560665343860 con 0x560645c48000
2020-02-22 06:37:13.927 7fe1cb287700 15 osd.1 3990 enqueue_peering_evt 43.21 epoch_sent: 4184 epoch_requested: 4184 RemoteBackfillReserved
.
.
2020-02-22 06:37:13.994 7fe1adfc8700 20 osd.1 op_wq(1) _process OpQueueItem(43.21 PGPeeringEvent(epoch_sent: 4184 epoch_requested: 4184 RemoteBackfillReserved) prio 255 cost 10 e4184) queued
2020-02-22 06:37:13.994 7fe1adfc8700 20 osd.1 op_wq(1) _process 43.21 to_process <OpQueueItem(43.21 PGPeeringEvent(epoch_sent: 4184 epoch_requested: 4184 RemoteBackfillReserved) prio 255 cost 10 e4184)> waiting <> waiting_peering {4030=<OpQueueItem(43.21 PGPeeringEvent(epoch_sent: 4030 epoch_requested: 4030 RemoteReservationRevoked) prio 255 cost 10 e4030)>}
2020-02-22 06:37:13.994 7fe1adfc8700 20 osd.1 op_wq(1) _process OpQueueItem(43.21 PGPeeringEvent(epoch_sent: 4184 epoch_requested: 4184 RemoteBackfillReserved) prio 255 cost 10 e4184) pg 0x5606458c8000
2020-02-22 06:37:13.994 7fe1adfc8700 20 osd.1 op_wq _add_slot_waiter 43.21 peering, item epoch is 4184, will wait on OpQueueItem(43.21 PGPeeringEvent(epoch_sent: 4184 epoch_requested: 4184 RemoteBackfillReserved) prio 255 cost 10 e4184)

Then we process the epoch 4030 RemoteReservationRevoked:

2020-02-22 06:39:51.128 7fe1a9fc0700 10 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped+backfill_wait mbc={} trimq=137] do_peering_event: epoch_sent: 4030 epoch_requested: 4030 RemoteReservationRevoked
2020-02-22 06:39:51.128 7fe1a9fc0700  1 -- [v2:172.21.15.136:6818/353732,v1:172.21.15.136:6820/353732] --> v1:172.21.15.62:6809/69320 -- MBackfillReserve(43.21 RELEASE e4030/4030) v5 -- 0x5606678005a0 con 0x56064dceb800

2020-02-22 06:39:51.128 7fe1a9fc0700  1 -- [v2:172.21.15.136:6818/353732,v1:172.21.15.136:6820/353732] --> v1:172.21.15.70:6813/164622 -- MBackfillReserve(43.21 RELEASE e4030/4030) v5 -- 0x560664cad4a0 con 0x560645c48000
2020-02-22 06:39:51.128 7fe1a9fc0700  5 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] exit Started/Primary/Active/WaitRemoteBackfillReserved 779.230461 211 0.043093

2020-02-22 06:39:51.128 7fe1a9fc0700  5 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] enter Started/Primary/Active/NotBackfilling

This should have released all local and remote reservations.

I'm pretty sure that we entered the following as part of schedule_backfill_retry(), but shouldn't the epoch 4184 RemoteBackfillReserved have been cleared by now?

2020-02-22 06:40:21.130 7fe1adfc8700  5 osd.1 pg_epoch: 4030 pg[43.21( v 3470'437 (2494'370,3470'437] local-lis/les=2993/3009 n=3 ec=2313/2055 lis/c 2993/2638 les/c/f 3009/2667/0 2987/2993/2993) [8,7]/[1,10] backfill=[7,8] r=0 lpr=2993 pi=[2638,2993)/2 crt=3470'437 lcod 3470'436 mlcod 3470'436 active+remapped mbc={} trimq=137] enter Started/Primary/Active/WaitLocalBackfillReserved

#3 Updated by Neha Ojha about 4 years ago

The problem is that even though osd.1 sent a RELEASE to osd.8, we still ended up dequeuing the epoch 4184 RemoteBackfillReserved, which had been received much earlier as a reply to:

2020-02-22 06:37:13.931 7f18b3a92700  1 -- 172.21.15.70:6813/164622 --> 172.21.15.136:6820/353732 -- MBackfillReserve(43.21 GRANT e4184/4184) v4 -- 0x55863a5c6800 con 0

Currently, we do not have a RELEASE_ACK mechanism that makes the primary wait before starting a new backfill in schedule_backfill_retry(). A simple band-aid fix would be to treat RemoteBackfillReserved as a no-op while we are still in WaitLocalBackfillReserved; a sketch follows.
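
For illustration only, here is a simplified, hypothetical sketch of such a no-op reaction (the state and event names mirror PG::RecoveryState, but this is not the actual change; see the pull request for that):

struct WaitLocalBackfillReserved
  : boost::statechart::state<WaitLocalBackfillReserved, Active> {
  typedef boost::mpl::list<
    // existing behavior: once the local reservation is granted, move on
    boost::statechart::transition<LocalBackfillReserved,
                                  WaitRemoteBackfillReserved>,
    // band-aid: explicitly react to a stale remote GRANT
    boost::statechart::custom_reaction<RemoteBackfillReserved>
  > reactions;

  explicit WaitLocalBackfillReserved(my_context ctx);

  boost::statechart::result react(const RemoteBackfillReserved&) {
    // This GRANT belongs to a reservation cycle we already RELEASEd, so it
    // is moot; discard it instead of crashing on an unhandled event.
    return discard_event();
  }
};

Because discard_event() consumes the event, it can no longer escalate to the catch-all transition into Crashed.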

#4 Updated by Neha Ojha about 4 years ago

  • Subject changed from nautilus: upgrade/mimic-x/stress-split - state machine crash to Receiving RemoteBackfillReserved in WaitLocalBackfillReserved can cause the osd to crash
  • Assignee set to Neha Ojha
  • Backport set to nautilus

#5 Updated by Neha Ojha about 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 33525

#6 Updated by Sage Weil about 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #44324: nautilus: Receiving RemoteBackfillReserved in WaitLocalBackfillReserved can cause the osd to crash added

#8 Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
