Feature #18053

Minimize log-based recovery in the acting set

Added by Samuel Just over 7 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Neha Ojha
Target version:
v13.0.0
% Done:

0%


Description

In general, we maintain fairly long logs (3k entries by default) to ensure that we detect duplicate ops correctly. This has an unfortunate side effect: if an OSD is down long enough for 3k ops to be processed by the other replicas for a PG, then when it comes back it will have to recover each of those (potentially) 3k objects before they can be written. This can create unacceptable write latencies by promoting a 4k write to a 4MB recovery. Generally, we don't want that. If it had been 3k + 1 ops, we'd have done backfill instead. Anecdotally, users have reported that backfill has a far smaller impact on latency, simply because we can write to objects that haven't yet been backfilled: backfill targets are not in the acting set. This suggests a rather natural solution: remove replicas with many degraded objects from the acting set and do log-based recovery without blocking writes on those objects.
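The log-length bound that forces the switch to backfill can be sketched as a simple check. This is a minimal illustration with hypothetical names and plain integer versions, not Ceph's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a peer can be recovered from the log only if its
// last-applied entry is still covered by the primary's retained log window.
// Otherwise the set of missed ops is unknown and backfill (a full object
// scan) is required.
enum class RecoveryKind { LogBased, Backfill };

RecoveryKind pick_recovery(uint64_t primary_log_tail,   // oldest retained entry
                           uint64_t peer_last_version)  // last entry peer applied
{
    // If the peer fell behind the log tail, the log cannot enumerate what
    // it missed -- fall back to backfill.
    return peer_last_version >= primary_log_tail ? RecoveryKind::LogBased
                                                 : RecoveryKind::Backfill;
}
```

With a 3k-entry log, a peer that missed exactly 3k ops still recovers from the log; one more op and it would backfill instead, which (as noted above) users observe to be gentler on client latency.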

So, are there still times when we want to do synchronous recovery (I suggest that synchronous/asynchronous be the new vocabulary for in-acting-set vs out-of-acting-set)? Yeah, there are. I claim that we never want to exclude from the acting set any up osd which was active in the most recent interval which could have accepted writes (i.e., the interval containing the max found history.last_epoch_started -- see doc/dev/osd_internals/last_epoch_started.rst). That is: we want to recover objects degraded due to divergent writes as quickly as possible (explanation of divergent writes below).

We also clearly don't want to exclude an otherwise useful peer if it would put us below min_size.

Beyond that, the line is fuzzy. There is overhead associated with requesting a pg_temp change (both before going active and once recovery is complete) and it could potentially impact latency more than recovering the degraded objects if the number is small (this is ameliorated somewhat by the pg_temp priming behavior in the mons for small clusters since that logic will tend to prime the pg_temp for us in those cases -- we just currently flip it back if the excluded osd doesn't actually require backfill). At first, I think we want a simple per-pool threshold. Testing may suggest a smarter heuristic -- but that can be a later task.
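Putting the rules above together (peers from the last rw interval stay, min_size is preserved, a per-pool threshold decides the rest), a choose_acting heuristic might look like the following. All names are illustrative; this is not Ceph's actual choose_acting:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct PeerInfo {
    int osd;
    uint64_t num_missing;      // degraded objects this peer would need to recover
    bool in_last_rw_interval;  // active in the interval with max last_epoch_started
};

// Keep a peer in the acting set (synchronous recovery) if it was active in
// the most recent interval that could have accepted writes, or if its
// missing-object count is below the per-pool threshold; otherwise recover
// it asynchronously -- unless that would drop us below min_size.
std::vector<int> choose_acting(std::vector<PeerInfo> up,
                               size_t min_size,
                               uint64_t async_threshold)
{
    std::vector<int> acting;
    std::vector<PeerInfo> async_candidates;
    for (const auto& p : up) {
        if (p.in_last_rw_interval || p.num_missing < async_threshold)
            acting.push_back(p.osd);
        else
            async_candidates.push_back(p);
    }
    // Never go below min_size: pull the least-degraded candidates back in.
    std::sort(async_candidates.begin(), async_candidates.end(),
              [](const PeerInfo& a, const PeerInfo& b) {
                  return a.num_missing < b.num_missing;
              });
    for (const auto& p : async_candidates) {
        if (acting.size() >= min_size) break;
        acting.push_back(p.osd);
    }
    return acting;
}
```

Note that the min_size backstop reflects the rule above: an otherwise useful peer is never excluded if excluding it would leave the PG unwritable.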

There are a bunch of specific issues that will need to be addressed (in no particular order):
0) Add a document to doc/dev/osd_internals explaining the surrounding concepts and the design. At a minimum, make sure we have explanations of the difference between log-based recovery and backfill, the new distinction between synchronous and asynchronous recovery, and the choose_acting behavior (what and why).
1) Add pg_pool_t param setting the threshold and determine sane default.
2) Update choose_acting to actually request the pg_temp change based on the rules/heuristic.
3) Update the recovery machinery to not block writes on objects which are only missing on non-acting-set OSDs (see PG::is_degraded_or_backfilling_object). As part of this, figure out what is_degraded_or_backfilling_object actually implements and rename it and the associated lists and methods to fit (is_writable_object, wait_for_writeable_object, waiting_for_writeable?).
4) Update the backends to appropriately avoid sending populated transactions to shards missing the object (just as we currently do with IOs to objects past the last_backfill line).
5) Handle updating the missing set when we get a write on an object we are missing (because of 4, the shard would receive log entries only).
6) Fix the resulting bugs I haven't thought of
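For item 3, the intended check can be illustrated as follows. The names and data layout here are hypothetical (the real code tracks per-shard missing sets rather than a map keyed by object name):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Illustrative stand-in for the renamed is_degraded_or_backfilling_object
// check from item 3: a write must wait only when the object is missing on
// a shard that is actually in the acting set. Shards recovering
// asynchronously (outside the acting set) do not block the write.
bool must_wait_for_object(const std::string& oid,
                          const std::set<int>& acting,
                          const std::map<std::string, std::set<int>>& missing_on)
{
    auto it = missing_on.find(oid);
    if (it == missing_on.end())
        return false;  // not missing anywhere: writable
    for (int osd : it->second)
        if (acting.count(osd))
            return true;  // degraded within the acting set: block the write
    return false;  // missing only on async-recovery shards: proceed
}
```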

Divergent objects:
When an OSD receives a new map that results in a new interval for a PG (up or acting changed, among other things; see pg_interval_t::is_new_interval), we discard any new messages marked with an epoch in a prior interval (client ops to a primary which didn't change are a special case). In practice, this means that on an interval change a replica might discard subops which the primary applied, so the primary and the replicas can disagree about the end of the log even when no OSD actually failed. This seems bad (and it kind of is, but that's a different task), but it parallels the case where an OSD actually restarted. In general, we tolerate the divergent log entries in order to avoid a two-phase commit for replicated pools (EC pools will always roll back to the same version, so there is no problem there). Thus, we want to resolve partially applied transactions from the most recent rw interval synchronously, since we are probably degraded below min_size.
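A toy illustration of the divergence, assuming plain integer versions instead of Ceph's (epoch, version) eversion_t:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// On an interval change, the primary may have applied entries a replica
// discarded (or vice versa). Peering picks an authoritative last_update;
// a peer's log entries past that point are divergent, and the objects
// they touched are the ones we want to recover synchronously when they
// come from the most recent rw interval.
std::vector<uint64_t> find_divergent(const std::vector<uint64_t>& peer_log,
                                     uint64_t auth_last_update)
{
    std::vector<uint64_t> divergent;
    for (uint64_t v : peer_log)
        if (v > auth_last_update)
            divergent.push_back(v);  // entry not in the authoritative log
    return divergent;
}
```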

History

#1 Updated by Samuel Just over 7 years ago

  • Description updated (diff)

#3 Updated by Vikhyat Umrao almost 6 years ago

  • Status changed from New to Resolved
  • Assignee changed from Josh Durgin to Neha Ojha
  • Target version set to v13.0.0

Neha Ojha wrote:

Async recovery: https://github.com/ceph/ceph/pull/19811
Doc: https://github.com/ceph/ceph/pull/21051

Thanks Neha for the update.
