Bug #56030
Frequently marking an OSD down and up may cause recovery not to be asynchronous
Description
ceph version: octopus 15.2.13
In my test cluster there are 6 OSDs: 3 for the bucket index pool and 3 for the other pools. The cluster already contains many objects. With continuous writes of 0-byte objects running in the background, I execute "ceph osd down" against a bucket index OSD every 30 s, and can sometimes observe a PG that does not use async recovery.
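The flapping described above can be scripted roughly as follows (a sketch, not my exact test harness; the OSD id, iteration count, and 30 s interval are parameters, and the OSD daemon is assumed to keep running so it re-registers itself as up after each "ceph osd down"):

```shell
#!/bin/sh
# Sketch of the reproduction loop: repeatedly mark one bucket-index OSD down.
# The daemon itself stays running, so it reports itself back up shortly after.
flap_osd() {
    osd_id=$1
    iterations=$2
    interval=$3
    i=0
    while [ "$i" -lt "$iterations" ]; do
        ceph osd down "$osd_id"   # mark the OSD down in the osdmap
        sleep "$interval"         # let it come back up, then repeat
        i=$((i + 1))
    done
}

# Example: flap osd.3 (a bucket index OSD) every 30 seconds, 20 times:
# flap_osd 3 20 30
```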
I tried modifying PeeringState.h/PeeringState.cc to fix this problem:
In PeeringState.cc, in PeeringState::Active::react(const MInfoRec& infoevt), when the last peer's activate acknowledgement comes back, call choose_acting again to make sure missing objects are identified; then, if want_acting != acting, post a NeedActingChange event so the PG goes through peering again.
In PeeringState.h, in struct Primary, add a transition< NeedActingChange, WaitActingChange >; otherwise the state machine will crash.
Question 1: is my Ceph version too old?
Question 2: if my Ceph version is OK, is the modification above reasonable and safe?
PeeringState.cc

boost::statechart::result PeeringState::Active::react(const MInfoRec& infoevt)
{
  DECLARE_LOCALS;
  ceph_assert(ps->is_primary());
  ceph_assert(!ps->acting_recovery_backfill.empty());

  if (infoevt.lease_ack) {
    ps->proc_lease_ack(infoevt.from.osd, *infoevt.lease_ack);
  }
  // don't update history (yet) if we are active and primary; the replica
  // may be telling us they have activated (and committed) but we can't
  // share that until _everyone_ does the same.
  if (ps->is_acting_recovery_backfill(infoevt.from) &&
      ps->peer_activated.count(infoevt.from) == 0) {
    psdout(10) << " peer osd." << infoevt.from
               << " activated and committed" << dendl;
    ps->peer_activated.insert(infoevt.from);
    ps->blocked_by.erase(infoevt.from.shard);
    pl->publish_stats_to_osd();
    /*
    if (ps->peer_activated.size() == ps->acting_recovery_backfill.size()) {
      all_activated_and_committed();
    }
    */
    psdout(10) << "Active: pg->peer_activated " << ps->peer_activated
               << " acting_recovery_backfill " << ps->acting_recovery_backfill
               << dendl;
    if (ps->peer_activated.size() == ps->acting_recovery_backfill.size()) {
      psdout(10) << "Active: peer osd." << infoevt.from
                 << " choose_acting again" << dendl;
      pg_shard_t auth_log_shard;
      bool history_les_bound = false;
      ps->choose_acting(auth_log_shard, false, &history_les_bound);
      psdout(10) << "Active: peer osd." << infoevt.from
                 << " choose_acting want_acting " << ps->want_acting
                 << " ,acting " << ps->acting << dendl;
      if (!ps->want_acting.empty() && ps->want_acting != ps->acting) {
        psdout(10) << "Active: got MInfoRec from osd." << infoevt.from
                   << ", requesting pg_temp change" << dendl;
        post_event(NeedActingChange());
      } else {
        psdout(10) << "Active: peer osd." << infoevt.from
                   << " all_activated_and_committed begin" << dendl;
        all_activated_and_committed();
        psdout(10) << "Active: peer osd." << infoevt.from
                   << " all_activated_and_committed end" << dendl;
      }
    }
  }
  return discard_event();
}

PeeringState.h

struct Primary : boost::statechart::state< Primary, Started, Peering >,
                 NamedState {
  explicit Primary(my_context ctx);
  void exit();

  typedef boost::mpl::list <
    boost::statechart::custom_reaction< ActMap >,
    boost::statechart::custom_reaction< MNotifyRec >,
    boost::statechart::transition< NeedActingChange, WaitActingChange >,
    boost::statechart::custom_reaction<SetForceRecovery>,
    boost::statechart::custom_reaction<UnsetForceRecovery>,
    boost::statechart::custom_reaction<SetForceBackfill>,
    boost::statechart::custom_reaction<UnsetForceBackfill>,
    boost::statechart::custom_reaction<RequestScrub>
    > reactions;
  boost::statechart::result react(const ActMap&);
  boost::statechart::result react(const MNotifyRec&);
  boost::statechart::result react(const SetForceRecovery&);
  boost::statechart::result react(const UnsetForceRecovery&);
  boost::statechart::result react(const SetForceBackfill&);
  boost::statechart::result react(const UnsetForceBackfill&);
  boost::statechart::result react(const RequestScrub&);
};
Updated by zhouyue zhou almost 2 years ago
I set osd_async_recovery_min_cost = 0, hoping async recovery would be chosen in every case.
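The same setting can be made persistent in ceph.conf instead of being injected at runtime (e.g. via "ceph config set osd osd_async_recovery_min_cost 0"); a minimal fragment, assuming it should apply to all OSDs (the option defaults to 100, and 0 makes every candidate peer eligible for async recovery regardless of its cost estimate):

```ini
[osd]
# Default is 100; peers whose estimated recovery cost falls below this
# threshold are recovered synchronously. 0 removes the threshold so
# async recovery is always considered.
osd_async_recovery_min_cost = 0
```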
Updated by zhouyue zhou almost 2 years ago
- File osd.4.log.pg6.11-1.gz osd.4.log.pg6.11-1.gz added
- File osd.4.log.pg6.11-2.gz osd.4.log.pg6.11-2.gz added
- File osd.4.log.pg6.11-3.gz osd.4.log.pg6.11-3.gz added
Added more logs.
Updated by Neha Ojha almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 46708
Updated by Ilya Dryomov almost 2 years ago
- Affected Versions v15.2.13 added
- Affected Versions deleted ()