Bug #58496

osd/PeeringState: FAILED ceph_assert(!acting_recovery_backfill.empty())

Added by Laura Flores over 1 year ago. Updated about 1 year ago.

Status: Pending Backport
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: reef
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2023-01-12_20:11:41-rados-main-distro-default-smithi/7138659

2023-01-13T14:56:19.348 INFO:tasks.ceph.osd.4.smithi120.stderr:/build/ceph-18.0.0-1762-gcb17f286/src/osd/PeeringState.cc: In function 'void PeeringState::update_calc_stats()' thread 7fc997d11700 time 2023-01-13T14:56:19.348816+0000
2023-01-13T14:56:19.349 INFO:tasks.ceph.osd.4.smithi120.stderr:/build/ceph-18.0.0-1762-gcb17f286/src/osd/PeeringState.cc: 3553: FAILED ceph_assert(!acting_recovery_backfill.empty())
2023-01-13T14:56:19.364 INFO:tasks.ceph.osd.4.smithi120.stderr: ceph version 18.0.0-1762-gcb17f286 (cb17f286272f7ae9dbdf8117ca7b077b0a5cf650) reef (dev)
2023-01-13T14:56:19.364 INFO:tasks.ceph.osd.4.smithi120.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x128) [0x55d542083f11]
2023-01-13T14:56:19.364 INFO:tasks.ceph.osd.4.smithi120.stderr: 2: ceph-osd(+0xcd10cd) [0x55d5420840cd]
2023-01-13T14:56:19.365 INFO:tasks.ceph.osd.4.smithi120.stderr: 3: (PeeringState::update_calc_stats()+0x1c42) [0x55d54248ad32]
2023-01-13T14:56:19.365 INFO:tasks.ceph.osd.4.smithi120.stderr: 4: (PeeringState::prepare_stats_for_publish(std::optional<pg_stat_t> const&, object_stat_collection_t const&)+0xf8) [0x55d54248b2f8]
2023-01-13T14:56:19.365 INFO:tasks.ceph.osd.4.smithi120.stderr: 5: (PG::publish_stats_to_osd()+0xf7) [0x55d5421f3287]
2023-01-13T14:56:19.365 INFO:tasks.ceph.osd.4.smithi120.stderr: 6: (PgScrubber::clear_pgscrub_state()+0x99) [0x55d54237f4c9]
2023-01-13T14:56:19.365 INFO:tasks.ceph.osd.4.smithi120.stderr: 7: (PgScrubber::scrub_clear_state()+0x42) [0x55d54237f632]
2023-01-13T14:56:19.366 INFO:tasks.ceph.osd.4.smithi120.stderr: 8: (PgScrubber::on_primary_change(std::basic_string_view<char, std::char_traits<char> >, requested_scrub_t const&)+0x207) [0x55d542380f77]
2023-01-13T14:56:19.366 INFO:tasks.ceph.osd.4.smithi120.stderr: 9: (PeeringState::on_new_interval()+0x257) [0x55d54247af67]
2023-01-13T14:56:19.366 INFO:tasks.ceph.osd.4.smithi120.stderr: 10: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0x59f) [0x55d54247c67f]
2023-01-13T14:56:19.367 INFO:tasks.ceph.osd.4.smithi120.stderr: 11: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x26b) [0x55d54248eeeb]
2023-01-13T14:56:19.367 INFO:tasks.ceph.osd.4.smithi120.stderr: 12: (boost::statechart::simple_state<PeeringState::Reset, PeeringState::PeeringMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1e1) [0x55d5424db091]
2023-01-13T14:56:19.367 INFO:tasks.ceph.osd.4.smithi120.stderr: 13: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xd3) [0x55d5424c9db3]
2023-01-13T14:56:19.367 INFO:tasks.ceph.osd.4.smithi120.stderr: 14: (PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x29c) [0x55d54247996c]
2023-01-13T14:56:19.367 INFO:tasks.ceph.osd.4.smithi120.stderr: 15: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0xfd) [0x55d5422042ed]
2023-01-13T14:56:19.368 INFO:tasks.ceph.osd.4.smithi120.stderr: 16: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x364) [0x55d5421653f4]
2023-01-13T14:56:19.368 INFO:tasks.ceph.osd.4.smithi120.stderr: 17: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x299) [0x55d5421678f9]
2023-01-13T14:56:19.368 INFO:tasks.ceph.osd.4.smithi120.stderr: 18: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x59) [0x55d54244d519]
2023-01-13T14:56:19.368 INFO:tasks.ceph.osd.4.smithi120.stderr: 19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xb28) [0x55d54215f3c8]
2023-01-13T14:56:19.369 INFO:tasks.ceph.osd.4.smithi120.stderr: 20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x434) [0x55d5428a1934]
2023-01-13T14:56:19.369 INFO:tasks.ceph.osd.4.smithi120.stderr: 21: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d5428a4a24]
2023-01-13T14:56:19.369 INFO:tasks.ceph.osd.4.smithi120.stderr: 22: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fc9bc9a1609]
2023-01-13T14:56:19.369 INFO:tasks.ceph.osd.4.smithi120.stderr: 23: clone()

The teuthology log for this failure was huge, but I managed to view it by copying it to a new file (cp teuthology.log ~/teuthology.log).

There is a coredump available at /a/yuriw-2023-01-12_20:11:41-rados-main-distro-default-smithi/7138659/remote/smithi120/coredump.


Related issues: 1 (0 open, 1 closed)

Copied to RADOS - Backport #59005: reef: osd/PeeringState: FAILED ceph_assert(!acting_recovery_backfill.empty()) (Resolved, assigned to Ronen Friedman)
Actions #1

Updated by Neha Ojha over 1 year ago

  • Assignee set to Ronen Friedman
  • Priority changed from Normal to Urgent
 -14> 2023-01-13T14:56:19.343+0000 7fc997d11700 20 osd.4 pg_epoch: 191 pg[3.1a( v 174'400 (0'0,174'400] local-lis/les=133/134 n=26 ec=133/19 lis/c=133/133 les/c/f=134/134/0 sis=191 pruub=15.550573349s) [4,1] r=0 lpr=191 pi=[133,191)/1 luod=0'0 lua=132'370 crt=174'400 lcod 174'398 mlcod 0'0 active pruub 215.163299561s@ mbc={} ps=[66~2,69~6,70~2,75~3,7f~3,84~2,87~3,8e~1,94~1]] on_new_interval scrubbing flags:
   -13> 2023-01-13T14:56:19.343+0000 7fc997d11700 15 osd.4 scrub-queue::determine_scrub_time suggested: 2023-01-13T14:56:19.348 hist: 2023-01-13T14:53:15.278 v:true/true must:y pool-min:0 (plnd:)
   -12> 2023-01-13T14:56:19.343+0000 7fc997d11700 20 osd.4 scrub-queue::register_with_osd pg[3.1a] state at entry: <not registered>
   -11> 2023-01-13T14:56:19.343+0000 7fc997d11700 10 osd.4 update_schedule: pg[3.1a] adjusted: 2023-01-13T14:56:19.348 (not-queued)
   -10> 2023-01-13T14:56:19.343+0000 7fc997d11700 10 osd.4 scrub-queue::register_with_osd pg[3.1a] sched-state changed from <not registered> to <registered> (@2023-01-13T14:56:19.348)
    -9> 2023-01-13T14:56:19.343+0000 7fc997d11700 10 osd.4 pg_epoch: 191 pg[3.1a( v 174'400 (0'0,174'400] local-lis/les=133/134 n=26 ec=133/19 lis/c=133/133 les/c/f=134/134/0 sis=191 pruub=15.550573349s) [4,1] r=0 lpr=191 pi=[133,191)/1 luod=0'0 lua=132'370 crt=174'400 lcod 174'398 mlcod 0'0 active pruub 215.163299561s@ mbc={} ps=[66~2,69~6,70~2,75~3,7f~3,84~2,87~3,8e~1,94~1]] scrubber<ActiveReplica>: on_primary_change: interval changed (133 -> 191). Aborting active scrub.
    -8> 2023-01-13T14:56:19.343+0000 7fc997d11700 10 osd.4 pg_epoch: 191 pg[3.1a( v 174'400 (0'0,174'400] local-lis/les=133/134 n=26 ec=133/19 lis/c=133/133 les/c/f=134/134/0 sis=191 pruub=15.550573349s) [4,1] r=0 lpr=191 pi=[133,191)/1 luod=0'0 lua=132'370 crt=174'400 lcod 174'398 mlcod 0'0 active pruub 215.163299561s@ mbc={} ps=[66~2,69~6,70~2,75~3,7f~3,84~2,87~3,8e~1,94~1]] scrubber<ActiveReplica>: scrub_clear_state
    -7> 2023-01-13T14:56:19.343+0000 7fc997d11700 10 osd.4 pg_epoch: 191 pg[3.1a( v 174'400 (0'0,174'400] local-lis/les=133/134 n=26 ec=133/19 lis/c=133/133 les/c/f=134/134/0 sis=191 pruub=15.550573349s) [4,1] r=0 lpr=191 pi=[133,191)/1 luod=0'0 lua=132'370 crt=174'400 lcod 174'398 mlcod 0'0 active pruub 215.163299561s@ mbc={} ps=[66~2,69~6,70~2,75~3,7f~3,84~2,87~3,8e~1,94~1]] scrubber<ActiveReplica>: clear_pgscrub_state
    -6> 2023-01-13T14:56:19.343+0000 7fc997d11700 10 osd.4 pg_epoch: 191 pg[3.1a( v 174'400 (0'0,174'400] local-lis/les=133/134 n=26 ec=133/19 lis/c=133/133 les/c/f=134/134/0 sis=191 pruub=15.550573349s) [4,1] r=0 lpr=191 pi=[133,191)/1 luod=0'0 lua=132'370 crt=174'400 lcod 174'398 mlcod 0'0 active pruub 215.163299561s@ mbc={} ps=[66~2,69~6,70~2,75~3,7f~3,84~2,87~3,8e~1,94~1]] scrubber<ActiveReplica>: clear_scrub_reservations
    -5> 2023-01-13T14:56:19.343+0000 7fc997d11700 20 osd.4 scrub-queue::dec_scrubs_remote : 1 -> 0 (max 3, local 0)
    -4> 2023-01-13T14:56:19.343+0000 7fc997d11700 20 osd.4 pg_epoch: 191 pg[3.1a( v 174'400 (0'0,174'400] local-lis/les=133/134 n=26 ec=133/19 lis/c=133/133 les/c/f=134/134/0 sis=191 pruub=15.550573349s) [4,1] r=0 lpr=191 pi=[133,191)/1 luod=0'0 lua=132'370 crt=174'400 lcod 174'398 mlcod 0'0 active pruub 215.163299561s@ mbc={} ps=[66~2,69~6,70~2,75~3,7f~3,84~2,87~3,8e~1,94~1]] update_calc_stats actingset 1,4 upset 1,4 acting_recovery_backfill
    -3> 2023-01-13T14:56:19.359+0000 7fc997d11700 -1 /build/ceph-18.0.0-1762-gcb17f286/src/osd/PeeringState.cc: In function 'void PeeringState::update_calc_stats()' thread 7fc997d11700 time 2023-01-13T14:56:19.348816+0000
/build/ceph-18.0.0-1762-gcb17f286/src/osd/PeeringState.cc: 3553: FAILED ceph_assert(!acting_recovery_backfill.empty())

 ceph version 18.0.0-1762-gcb17f286 (cb17f286272f7ae9dbdf8117ca7b077b0a5cf650) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x128) [0x55d542083f11]
 2: ceph-osd(+0xcd10cd) [0x55d5420840cd]
 3: (PeeringState::update_calc_stats()+0x1c42) [0x55d54248ad32]
 4: (PeeringState::prepare_stats_for_publish(std::optional<pg_stat_t> const&, object_stat_collection_t const&)+0xf8) [0x55d54248b2f8]
 5: (PG::publish_stats_to_osd()+0xf7) [0x55d5421f3287]
 6: (PgScrubber::clear_pgscrub_state()+0x99) [0x55d54237f4c9]
 7: (PgScrubber::scrub_clear_state()+0x42) [0x55d54237f632]
 8: (PgScrubber::on_primary_change(std::basic_string_view<char, std::char_traits<char> >, requested_scrub_t const&)+0x207) [0x55d542380f77]
 9: (PeeringState::on_new_interval()+0x257) [0x55d54247af67]
 10: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0x59f) [0x55d54247c67f]
 11: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x26b) [0x55d54248eeeb]

start_peering_interval() clears acting_recovery_backfill, but with https://github.com/ceph/ceph/pull/48545/commits/1295b3752c8305cb57f69676beb22de07d575e01 we added a call to scrub_clear_state, which in turn results in calling publish_stats_to_osd and update_calc_stats, and we end up hitting this assert.
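
To make that ordering concrete, here is a minimal self-contained toy model of the call chain (simplified stand-in types and members of my own, not actual Ceph code): the interval change runs on_new_interval() while acting_recovery_backfill is empty, the scrubber abort path recomputes stats, and the same kind of assert fires.

  // Toy model of the failing call chain; all names are simplified stand-ins.
  #include <cassert>
  #include <set>

  struct ToyPG {
    std::set<int> acting_recovery_backfill;  // empty: never populated on a replica/stray
    bool primary = false;
    bool peered = true;        // PEERED/ACTIVE not cleared until after on_new_interval()
    bool scrub_active = true;  // a replica scrub was in progress

    void update_calc_stats() {
      // mirrors the guarded section of PeeringState::update_calc_stats():
      // reached because the PG still counts as peered going into the new interval
      if (peered)
        assert(!acting_recovery_backfill.empty());   // <-- fires here
    }
    void publish_stats_to_osd() { update_calc_stats(); }
    void scrub_clear_state()    { publish_stats_to_osd(); }

    void on_primary_change() {
      // is_primary() already reflects the *new* interval's primary, but peering
      // has not populated acting_recovery_backfill yet.
      if (primary && scrub_active)
        scrub_clear_state();
    }

    void start_peering_interval(bool new_primary) {
      primary = new_primary;   // init_primary_up_acting() comes first
      on_primary_change();     // reached via on_new_interval()
    }
  };

  int main() {
    ToyPG pg;
    pg.start_peering_interval(/*new_primary=*/true);  // aborts on the assert
  }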

Actions #2

Updated by Laura Flores over 1 year ago

  • Assignee deleted (Ronen Friedman)
  • Priority changed from Urgent to Normal
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000055d54284b1ba in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:88
#2  handle_oneshot_fatal_signal (signum=6) at ./src/global/signal_handler.cc:363
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007fc9bc60d859 in __GI_abort () at abort.c:79
#6  0x000055d542083f6b in ceph::__ceph_assert_fail (assertion=0x7fc997d0be20 "", file=0x7fc997d0c0c0 "h\f\260C\325U", line=<optimized out>, func=<optimized out>) at ./src/common/assert.cc:75
#7  0x000055d5420840cd in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:80
#8  0x000055d54248ad32 in PeeringState::update_calc_stats (this=0x55d54d309f78) at ./src/log/Entry.h:35
#9  0x000055d54248b2f8 in PeeringState::prepare_stats_for_publish (this=this@entry=0x55d54d309f78, pg_stats_publish=std::optional<pg_stat_t> [no contained value], unstable_stats=...) at ./src/osd/PeeringState.cc:3835
#10 0x000055d5421f3287 in PG::publish_stats_to_osd (this=0x55d54d309000) at ./src/osd/PG.cc:825
#11 0x000055d54237f4c9 in PgScrubber::clear_pgscrub_state (this=0x55d5486bf800) at ./src/osd/scrubber/pg_scrubber.cc:2318
#12 0x000055d54237f632 in PgScrubber::scrub_clear_state (this=0x55d5486bf800) at ./src/osd/scrubber/pg_scrubber.cc:2300
#13 0x000055d542380f77 in PgScrubber::on_primary_change (this=0x55d5486bf800, caller="on_new_interval", request_flags=...) at ./src/osd/scrubber/pg_scrubber.cc:527
#14 0x000055d54247af67 in PeeringState::on_new_interval (this=0x55d54d309f78) at ./src/osd/PeeringState.cc:807
#15 0x000055d54247c67f in PeeringState::start_peering_interval (this=0x55d54d309f78, lastmap=std::shared_ptr<const OSDMap> (use count 11, weak count 1) = {...}, newup=..., new_up_primary=4, newacting=..., new_acting_primary=4, t=...)
    at ./src/osd/PeeringState.cc:669
#16 0x000055d54248eeeb in PeeringState::Reset::react (this=0x55d549ba9d10, advmap=...) at /usr/include/c++/11/ext/atomicity.h:109
#17 0x000055d5424db091 in boost::statechart::custom_reaction<PeeringState::AdvMap>::react<PeeringState::Reset, boost::statechart::event_base, void const*> (eventType=<synthetic pointer>: <optimized out>, evt=..., stt=...)
    at ./obj-x86_64-linux-gnu/boost/include/boost/polymorphic_cast.hpp:98

... etc.
(gdb) f 10
#10 0x000055d5421f3287 in PG::publish_stats_to_osd (this=0x55d54d309000) at ./src/osd/PG.cc:825
825      auto stats =
(gdb) info args
this = 0x55d54d309000
(gdb) p * this
..... lots of output .....
{ .... acting_recovery_backfill = std::set with 0 elements ... }

I'm not sure at which point `acting_recovery_backfill` should have been filled, but I can see in all the relevant frames (frame 10 shown here) that `acting_recovery_backfill` is empty.

Actions #3

Updated by Laura Flores over 1 year ago

  • Assignee set to Ronen Friedman
  • Priority changed from Normal to Urgent

Didn't mean to change those fields.

Actions #4

Updated by Nitzan Mordechai about 1 year ago

/a/yuriw-2023-01-23_17:16:25-rados-wip-yuri6-testing-2023-01-22-0750-distro-default-smithi/7134021

Actions #5

Updated by Ronen Friedman about 1 year ago

I think I know where the bug is. Will update.

Actions #6

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to In Progress

Per the previous comment.

Actions #7

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-01-27_01:09:38-rados-wip-yuri2-testing-2023-01-26-1532-distro-default-smithi/7140509

Actions #9

Updated by Laura Flores about 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 49959
Actions #10

Updated by Neha Ojha about 1 year ago

  • Pull request ID changed from 49959 to 50088
Actions #11

Updated by Samuel Just about 1 year ago

2023-01-13T14:56:19.349 INFO:tasks.ceph.osd.4.smithi120.stderr:/build/ceph-18.0.0-1762-gcb17f286/src/osd/PeeringState.cc: 3553: FAILED ceph_assert(!acting_recovery_backfill.empty())

Below is what I consider the relevant code expanded around the callchain from
the backtrace:

- PeeringState::start_peering_interval

  ...
  pg_shard_t old_acting_primary = get_primary();
  pg_shard_t old_up_primary = up_primary;
  bool was_old_primary = is_primary();
  bool was_old_nonprimary = is_nonprimary();

  acting.swap(oldacting);
  up.swap(oldup);
  init_primary_up_acting(
    newup,
    newacting,
    new_up_primary,
    new_acting_primary);
  ...
  if (up != acting)
    state_set(PG_STATE_REMAPPED);
  else
    state_clear(PG_STATE_REMAPPED);
  ...
  // did acting, up, primary|acker change?
  if (!lastmap) {
    ...
  } else {
    ...
    if (new_interval) {
      ...
      info.history.same_interval_since = osdmap->get_epoch();
      ...
      }
    }
  }

  ...
  on_new_interval();
  ...
  state_clear(PG_STATE_ACTIVE);
  state_clear(PG_STATE_PEERED);
  state_clear(PG_STATE_PREMERGE);
  state_clear(PG_STATE_DOWN);
  state_clear(PG_STATE_RECOVERY_WAIT);
  state_clear(PG_STATE_RECOVERY_TOOFULL);
  state_clear(PG_STATE_RECOVERING);

  peer_purged.clear();
  acting_recovery_backfill.clear();

- PeeringState::on_new_interval
- PG::on_new_interval
- PgScrubber::on_primary_change

  ...
  // is there an interval change we should respond to?
  if (is_primary() && is_scrub_active()) {
    if (m_interval_start < m_pg->get_same_interval_since()) {
      dout(10) << fmt::format(
        "{}: interval changed ({} -> {}). Aborting active scrub.",
        __func__, m_interval_start, m_pg->get_same_interval_since())
         << dendl;
      scrub_clear_state();
    }
  ...

- PgScrubber::scrub_clear_state()
- PgScrubber::clear_pgscrub_state()
- PG::publish_stats_to_osd()
- PeeringState::prepare_stats_for_publish()
- PeeringState::update_calc_stats()
  ...
  if ((is_remapped() || is_undersized() || !is_clean()) &&
      (is_peered() || is_activating())) {
    psdout(20) << __func__ << " actingset " << actingset << " upset "
               << upset << " acting_recovery_backfill "
               << acting_recovery_backfill << dendl;

    ceph_assert(!acting_recovery_backfill.empty());
  ...

Other relevant methods:

PgScrubber:

  [[nodiscard]] bool is_primary() const final
  {
    return m_pg->recovery_state.is_primary();
  }

  [[nodiscard]] bool is_scrub_active() const final { return m_active; }

PeeringState:
  bool is_primary() const {
    return pg_whoami == primary;
  }

  bool is_peered() const {
    return state_test(PG_STATE_ACTIVE) || state_test(PG_STATE_PEERED);
  }

void PeeringState::Active::exit()
{
  ...
  ps->state_clear(PG_STATE_ACTIVATING);
  ...
  ps->state_clear(PG_STATE_UNDERSIZED);
  ...
}

void PeeringState::Clean::exit()
{
  ...
  ps->state_clear(PG_STATE_CLEAN);
  ...
}

PG:

  epoch_t get_same_interval_since() const {
    return info.history.same_interval_since;
  }

============ Why we shouldn't use publish_stats_to_osd here ============

The relevant constraint on publish_stats_to_osd here is implicit -- the caller
must be the current primary for the PG after the first few steps of peering.
The purpose, after all, is to update the stats structure to be sent to the
monitor, and only the primary does that. Even then, it needs to have completed
peering far enough to have a valid info to send. The acting_recovery_backfill
assert is because the portion after that needs to calculate degraded/misplaced
object counts which requires comparing stats on the PG's current replicas --
this is state only the primary has. Incidentally, acting_recovery_backfill is
always populated on the primary once peering has gotten far enough -- when not
doing async recovery or backfill it just matches active (See
PeeringState::choose_acting).

The core problem, therefore, is that PgScrubber is invoking
publish_stats_to_osd() in the scrub_clear_state() path, which gets called from
PeeringState::on_new_interval via PgScrubber::on_primary_change as well as
directly from PrimaryLogPG::on_change. The pg instance may no longer be
primary, may not have been a primary to begin with, and certainly won't be
active, so this isn't a valid place to call publish_stats_to_osd().

Beyond this specific crash, invoking publish_stats_to_osd from an OSD with
a non-authoritative info is wrong -- it will result in the monitor receiving
outdated or incorrect stats for the PG tagged with the current osdmap epoch.
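
Stated as code, the implicit precondition looks roughly like this (a sketch with toy stand-in types and fields, not the actual PeeringState interface and not the fix in the pull requests referenced below):

  // Sketch of the implicit precondition on publish_stats_to_osd(); toy types.
  #include <set>

  struct ToyPeeringState {
    bool primary = false;             // only the primary reports to the monitor
    bool has_valid_info = false;      // peering progressed far enough to have info to send
    std::set<int> acting_recovery_backfill;  // populated by choose_acting() on the primary
  };

  // The crash in this tracker is the last of these conditions being violated.
  bool can_publish_stats(const ToyPeeringState& ps) {
    return ps.primary && ps.has_valid_info &&
           !ps.acting_recovery_backfill.empty();
  }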

========== Why we don't need to use publish_stats_to_osd here ==========

Next, we need to answer the question of whether we need to be invoking
publish_stats_to_osd here. I claim that there isn't any reason to invoke
publish_stats_to_osd in scrub_clear_state(). AFAICT, scrub has two things
to inform users of via stats:
1) changes to scheduling properties
2) stats changes resulting from scrub (progress, repaired object counts)

Neither of the above should be occurring as a result of clearing scrub
state, so we don't need this call.
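
As a toy illustration of that argument (simplified stand-ins of my own, not the actual change made in the pull requests referenced below):

  // Toy sketch: clearing scrub state only resets scrub-local fields; stat
  // publication stays with the events that actually change scheduling or
  // scrub results.
  struct ToyScrubber {
    bool active = false;
    int repaired_objects = 0;

    void scrub_clear_state() {
      active = false;               // reset scrub-local state only
      // no stats publication needed here
    }

    void on_repair(int n) {
      repaired_objects += n;        // a real stats change ...
      publish_stats();              // ... is published where it happens
    }

    void publish_stats() { /* hand updated counters to the OSD stats path */ }
  };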

=============== Explanation of this specific crash =====================

First, let's make some observations about the state necessary to reach that
crash:

1. It was a primary in the new interval. The is_primary() check in
   PgScrubber::on_primary_change would have happened against the newly reset
   PeeringState::primary (see above: up, acting, primary, up_primary are
   reset prior to the call into on_new_interval()).
2. There was an active scrub at the time of the interval change
   (is_scrub_active(), so m_active was true). This may be true of a replica.
3. !is_clean() && is_peered() -- see below, we can rule out the others:
   a. is_undersized() must be false -- we clear it in Active::exit() and
      PeeringState::start_peering_interval is called from Reset.
   b. is_clean() must be false. We clear it in Clean::exit() and it is only
      set by Clean::Clean() via try_mark_clean(). There are other calls to
      state_clear(PG_STATE_CLEAN), but they appear to be dead code.
   c. is_remapped() may be true depending entirely on whether up=acting in
      the new interval (see above). This is independent of whether the
      instance is on the primary, a replica, or a stray. Because is_clean()
      is false, this one is irrelevant.
   d. is_activating() must be false -- we clear it in Active::exit() and
      PeeringState::start_peering_interval is called from Reset.
   e. is_peered() may be true if the pg in the prior instance was either
      active or peered. Either is possible for either a primary or a replica.
      We don't clear them until after the call to on_new_interval().
4. It was not a primary in the previous interval. Because
   acting_recovery_backfill is empty, we know that either it wasn't a primary
   or it didn't get far enough into peering to invoke choose_acting. From 2,
   we know that it must have activated as a primary, a stray, or a replica.
   Thus, we know that it must have been a stray or replica.
5. m_interval_start < m_pg->get_same_interval_since(). This should be true
   for any interval change since info.history.same_interval_since is updated
   prior to the call to on_new_interval.

1 && 3 && 4 && 5 would be any active/peered replica or recovery target stray
which becomes the primary in the next interval -- this isn't particularly
uncommon. The intersection with 2, requiring an active replica scrub at the
time of the interval change, is probably why the crash is relatively rare.
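
The same reachability conditions can be written as a single predicate (descriptive toy names of my own, not actual Ceph state flags):

  // Toy summary of conditions 1-5 above; parameter names are stand-ins.
  bool crash_window(bool primary_in_new_interval,       // (1)
                    bool replica_scrub_was_active,      // (2) m_active
                    bool peered_and_not_clean,          // (3) is_peered() && !is_clean()
                    bool not_primary_in_prev_interval,  // (4) acting_recovery_backfill empty
                    bool interval_changed) {            // (5) same_interval_since advanced
    return primary_in_new_interval && replica_scrub_was_active &&
           peered_and_not_clean && not_primary_in_prev_interval &&
           interval_changed;
  }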

Actions #12

Updated by Laura Flores about 1 year ago

  • Regression changed from No to Yes
Actions #13

Updated by Radoslaw Zarzynski about 1 year ago

  • Pull request ID changed from 50088 to 50283
Actions #14

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-03_17:39:09-rados-reef-distro-default-smithi/7193142

Actions #15

Updated by Laura Flores about 1 year ago

  • Backport set to reef
Actions #16

Updated by Laura Flores about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #17

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-08_23:00:31-rados-wip-yuri11-testing-2023-03-08-1220-distro-default-smithi/7199112

Actions #18

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59005: reef: osd/PeeringState: FAILED ceph_assert(!acting_recovery_backfill.empty()) added
Actions #19

Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed