Bug #56030

open

Frequently marking an OSD down and up may cause recovery to not be asynchronous

Added by zhouyue zhou almost 2 years ago. Updated almost 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
Peering
Target version:
-
% Done:
0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version: Octopus 15.2.13

My test cluster has 6 OSDs: 3 for the bucket index pool and 3 for the other pools. The cluster already holds many objects, and 0-byte objects are written continuously in the background. When I run ceph osd down against a bucket index OSD every 30 seconds, I can sometimes observe a PG that does not go into async recovery.

I tried modifying PeeringState.h/PeeringState.cc to fix this problem:
In PeeringState.cc, PeeringState::Active::react(const MInfoRec& infoevt): when the last peer's activation message comes back, call choose_acting again to make sure missing objects are identified. Then, if want_acting != acting, post a NeedActingChange event so that the PG peers again.
In PeeringState.h, struct Primary: add a <NeedActingChange, WaitActingChange> transition; without it the state machine will crash.

Question 1: Is my Ceph version too old?
Question 2: If my Ceph version is OK, are the above modifications reasonable and safe?

PeeringState.cc
boost::statechart::result PeeringState::Active::react(const MInfoRec& infoevt)
{
  DECLARE_LOCALS;
  ceph_assert(ps->is_primary());

  ceph_assert(!ps->acting_recovery_backfill.empty());
  if (infoevt.lease_ack) {
    ps->proc_lease_ack(infoevt.from.osd, *infoevt.lease_ack);
  }
  // don't update history (yet) if we are active and primary; the replica
  // may be telling us they have activated (and committed) but we can't
  // share that until _everyone_ does the same.
  if (ps->is_acting_recovery_backfill(infoevt.from) &&
      ps->peer_activated.count(infoevt.from) == 0) {
    psdout(10) << " peer osd." << infoevt.from
           << " activated and committed" << dendl;
    ps->peer_activated.insert(infoevt.from);
    ps->blocked_by.erase(infoevt.from.shard);
    pl->publish_stats_to_osd();
    // Original behavior, replaced by the choose_acting check below:
    /*
    if (ps->peer_activated.size() == ps->acting_recovery_backfill.size()) {
      all_activated_and_committed();
    }
    */
    psdout(10) << "Active: pg->peer_activated " << ps->peer_activated << " acting_recovery_backfill " << ps->acting_recovery_backfill << dendl;
    if (ps->peer_activated.size() == ps->acting_recovery_backfill.size()) {
      psdout(10) << "Active: peer osd." << infoevt.from << " choose_acting again" << dendl;
      pg_shard_t auth_log_shard;
      bool history_les_bound = false;
      ps->choose_acting(auth_log_shard, false, &history_les_bound);
      psdout(10) << "Active: peer osd." << infoevt.from << " choose_acting want_acting " << ps->want_acting << " ,acting " << ps->acting << dendl;
      if (!ps->want_acting.empty() && ps->want_acting != ps->acting) {
        psdout(10) << "Active: got MInfoRec from osd." << infoevt.from << ", requesting pg_temp change" << dendl;
        post_event(NeedActingChange());
      } else {
        psdout(10) << "Active: peer osd." << infoevt.from << " all_activated_and_committed begin" << dendl;
        all_activated_and_committed();
        psdout(10) << "Active: peer osd." << infoevt.from << " all_activated_and_committed end" << dendl;
      }
    }
  }
  return discard_event();
}
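
To make the intended control flow easier to review, here is a minimal standalone sketch of the decision made once the last peer reports activation. It is a toy model, not Ceph code: ToyPeering, Outcome and on_peer_activated are hypothetical names, and the recomputed want_acting set is passed in by the caller instead of being produced by the real choose_acting.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model only; every name here is a hypothetical stand-in.
enum class Outcome { Waiting, NeedActingChange, AllActivated };

struct ToyPeering {
  std::vector<int> acting;          // current acting set
  std::vector<int> want_acting;     // empty means "no change wanted"
  std::set<int> recovery_backfill;  // shards that must report activation
  std::set<int> peer_activated;     // shards that have reported so far

  // Mirrors the shape of the patched handler: act only when the last
  // expected shard reports in, then compare want_acting with acting.
  Outcome on_peer_activated(int osd, const std::vector<int>& recomputed_want) {
    assert(!recovery_backfill.empty());
    if (!recovery_backfill.count(osd) || peer_activated.count(osd)) {
      return Outcome::Waiting;  // unknown shard or duplicate report
    }
    peer_activated.insert(osd);
    if (peer_activated.size() != recovery_backfill.size()) {
      return Outcome::Waiting;  // still waiting for other shards
    }
    want_acting = recomputed_want;  // stand-in for calling choose_acting again
    if (!want_acting.empty() && want_acting != acting) {
      return Outcome::NeedActingChange;  // would trigger peering again
    }
    return Outcome::AllActivated;  // would call all_activated_and_committed()
  }
};
```

In this model, a PG with acting set {4, 5, 6} whose recomputed want_acting is {4, 5} ends in NeedActingChange when the third shard reports, while an unchanged or empty want_acting ends in AllActivated.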

PeeringState.h
struct Primary : boost::statechart::state< Primary, Started, Peering >, NamedState {
  explicit Primary(my_context ctx);
  void exit();

  typedef boost::mpl::list <
    boost::statechart::custom_reaction< ActMap >,
    boost::statechart::custom_reaction< MNotifyRec >,
    // Added: handles NeedActingChange posted from Active::react(const MInfoRec&)
    boost::statechart::transition< NeedActingChange, WaitActingChange >,
    boost::statechart::custom_reaction<SetForceRecovery>,
    boost::statechart::custom_reaction<UnsetForceRecovery>,
    boost::statechart::custom_reaction<SetForceBackfill>,
    boost::statechart::custom_reaction<UnsetForceBackfill>,
    boost::statechart::custom_reaction<RequestScrub>
    > reactions;
  boost::statechart::result react(const ActMap&);
  boost::statechart::result react(const MNotifyRec&);
  boost::statechart::result react(const SetForceRecovery&);
  boost::statechart::result react(const UnsetForceRecovery&);
  boost::statechart::result react(const SetForceBackfill&);
  boost::statechart::result react(const UnsetForceBackfill&);
  boost::statechart::result react(const RequestScrub&);
};
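
The reason the extra transition is needed can be illustrated with a small standalone model of hierarchical event dispatch. This is a sketch under assumptions, not Boost.Statechart or Ceph code: ToyMachine and its methods are hypothetical, the state hierarchy Active -> Primary -> Started is reduced to a parent map, and the rule "an event that no state handles ends in Crashed" is assumed here to match the crash I observed without the transition.

```cpp
#include <map>
#include <string>
#include <utility>

// Toy model only; names and the catch-all rule are assumptions.
struct ToyMachine {
  std::string state = "Active";
  // Child state -> enclosing state.
  std::map<std::string, std::string> parent{
      {"Active", "Primary"}, {"Primary", "Started"}};
  // (state, event) -> target state.
  std::map<std::pair<std::string, std::string>, std::string> transitions;

  void add(const std::string& from, const std::string& event,
           const std::string& to) {
    transitions[std::make_pair(from, event)] = to;
  }

  // Look for a reaction in the current state, then in each enclosing
  // state; if none is found, fall through to Crashed.
  void post(const std::string& event) {
    std::string s = state;
    while (true) {
      auto it = transitions.find(std::make_pair(s, event));
      if (it != transitions.end()) {
        state = it->second;
        return;
      }
      auto p = parent.find(s);
      if (p == parent.end()) {
        break;
      }
      s = p->second;
    }
    state = "Crashed";
  }
};
```

Posting NeedActingChange to a machine with no registered reaction leaves it in Crashed; after add("Primary", "NeedActingChange", "WaitActingChange") the same event moves it to WaitActingChange.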

Files

osd.4.log.pg6.11.recovering.gz (128 KB) zhouyue zhou, 06/14/2022 03:43 AM
osd.4.log.pg6.11-2.gz (942 KB) part2 zhouyue zhou, 06/20/2022 06:47 AM
osd.4.log.pg6.11-1.gz (952 KB) part1 zhouyue zhou, 06/20/2022 06:47 AM
osd.4.log.pg6.11-3.gz (643 KB) part3 zhouyue zhou, 06/20/2022 06:47 AM
Actions #1

Updated by zhouyue zhou almost 2 years ago

I set osd_async_recovery_min_cost = 0, hoping that async recovery would be used in every case.
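
For context, this is how I understand the role of that option, written as a simplified standalone sketch. It is not the actual Ceph function: ToyPeer and choose_async are hypothetical names, and the rules shown (cost = missing objects plus log version gap, candidate only if cost > min_cost, stop removing peers once the acting set would drop to min_size) are my assumptions from reading the code and may not match every version.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Toy model only; names and rules are assumptions, not Ceph's API.
struct ToyPeer {
  int osd;
  uint64_t num_missing;  // objects known to be missing on this peer
  uint64_t last_update;  // simplified log version
};

// Removes async recovery targets from `want` and returns them.
std::set<int> choose_async(std::vector<int>& want, int primary,
                           const std::vector<ToyPeer>& peers,
                           uint64_t auth_last_update, uint64_t min_cost,
                           std::size_t min_size) {
  std::multimap<uint64_t, int> by_cost;
  for (const auto& p : peers) {
    if (p.osd == primary) {
      continue;  // the primary is never an async target in this model
    }
    uint64_t gap = auth_last_update > p.last_update
                       ? auth_last_update - p.last_update
                       : p.last_update - auth_last_update;
    uint64_t cost = p.num_missing + gap;
    if (cost > min_cost) {
      by_cost.emplace(cost, p.osd);
    }
  }
  std::set<int> async_targets;
  // Most expensive peers first; keep at least min_size members in `want`.
  for (auto it = by_cost.rbegin(); it != by_cost.rend(); ++it) {
    if (want.size() <= min_size) {
      break;
    }
    auto pos = std::find(want.begin(), want.end(), it->second);
    if (pos != want.end()) {
      want.erase(pos);
      async_targets.insert(it->second);
    }
  }
  return async_targets;
}
```

Under these assumptions, min_cost = 0 makes any peer with a nonzero cost a candidate, but a peer whose computed cost is 0 is still not selected, and nothing is selected once the acting set is already at min_size.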

Actions #3

Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46708
Actions #4

Updated by Ilya Dryomov almost 2 years ago

  • Affected Versions v15.2.13 added
  • Affected Versions deleted ()
Actions #5

Updated by Ilya Dryomov almost 2 years ago

  • Target version deleted (v15.2.16)