Feature #8195

shorten window of highest risk during recovery

Added by Alexandre Oliva almost 10 years ago. Updated almost 10 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Say a size-3 PG experienced the failure of two OSDs, the second one failing when the first replacement was part-way through recovery, like this:

1st: 0123456789
2nd: 012_______
3rd: __________

AFAICT, backfilling of 3rd will start at 0 until it catches up with backfilling of 2nd, and then both will proceed concurrently.

This means objects 3 to 9 remain longer without a second replica, while objects 0 to 2, which already have two replicas in the cluster, are further replicated.

I believe it would be wiser to start backfilling 3rd (along with 2nd) at 3 all the way to 9, and then, once 2nd is done, backfilling of 3rd wraps around and finishes 0 to 2.

OSDs might fail and come back during multi-OSD backfilling. Maintaining per-OSD info on recovery windows to start/skip backfilling might make sense, but a much simpler strategy would amount to resetting the backfill start/end point to the current backfill point every time an OSD needing backfilling joins the PG. This might go over already-backfilled portions of some OSDs more than once, until all OSDs remain up throughout a complete cycle. For example, consider that 2nd fails after joint backfill of objects 3 and 4, and rejoins when 3rd has already got objects 5 and 6:

1st: 0123456789
2nd: 01234_____
3rd: ___3456___

In this simplified backfilling proposal, we'd set the begin/end point between objects 6 and 7, so that joint backfilling starts at 7, and both 2nd and 3rd would be regarded as fully backfilled if both remain up after iterating over 7 to 9 and then 0 to 6. Objects that are already up to date (say, 0 to 4 on 2nd and 3 to 6 on 3rd) will be quickly skipped, in the same way they are when backfilling an OSD rolled back to an old snapshot.
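
Below is a minimal sketch, in C++ and with entirely hypothetical names (not actual OSD code), of the scan loop this proposal implies: when a target needing backfill joins the PG, the shared start point is reset to the current backfill position, the scan proceeds from there in hash order, wraps around the end of the object list, and objects a target already holds are skipped for that target.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

// Toy model: each object is just its hash, and objects are kept in hash order.
using ObjectSet = std::set<uint32_t>;

// One full backfill cycle: scan from start_idx, wrap around the end of the
// object list, and push each object to every target that does not hold it yet.
void backfill_cycle(const std::vector<uint32_t> &objects,
                    std::size_t start_idx,
                    std::vector<ObjectSet> &targets)
{
  std::size_t n = objects.size();
  for (std::size_t i = 0; i < n; ++i) {
    uint32_t obj = objects[(start_idx + i) % n];  // wraparound scan
    for (auto &t : targets)
      if (!t.count(obj))
        t.insert(obj);  // "push" the object; already-present objects are skipped
  }
}

int main()
{
  // The second example above: 1st holds 0..9, 2nd holds 0..4, 3rd holds 3..6.
  std::vector<uint32_t> objects = {0,1,2,3,4,5,6,7,8,9};
  std::vector<ObjectSet> targets = {{0,1,2,3,4}, {3,4,5,6}};
  // 3rd last received object 6, so the reset start point is object 7.
  backfill_cycle(objects, 7, targets);
  std::cout << "2nd now holds " << targets[0].size() << " objects\n";  // 10
  std::cout << "3rd now holds " << targets[1].size() << " objects\n";  // 10
}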

History

#1 Updated by David Zafman almost 10 years ago

In the current scheme, since the primary runs through the objects in hashed order, new writes before or after the window currently being worked on are allowed to proceed. This is critical for client responsiveness during recovery. Also, if there are, say, 1 billion objects, the recovery mechanism examines only a small region of the object hash space at a time as it performs recovery. The on-disk objects in the object store are the persistent state that recovery uses.

Your proposal is actually more complicated. If I understand your proposal, recovery would have to create a list of completed regions within the hash space for all backfill targets. If it pre-scanned the entire hash space, it could do this without requiring state that would have to be persisted to disk on the acting set to protect against primary crashes losing this information. A pre-scan of the object space would require a potentially large in-memory footprint with a large object store and would delay recovery, which increases the data risk. An incoming client write that precedes the current last_backfill would have to check the object hash against the recovery regions to determine whether that object has already been backfilled or whether recovery will circle back to handle it.
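
For illustration, here is a minimal sketch, with hypothetical types and names rather than Ceph's actual data structures, of the bookkeeping this comment argues would be required: each backfill target carries a list of already-backfilled hash regions, and a write whose hash precedes last_backfill has to be checked against that list for each target.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// An inclusive range of object hashes already backfilled on a given target.
struct HashRegion {
  uint32_t begin;
  uint32_t end;
};

// Hypothetical per-target bookkeeping: OSD id -> completed hash regions.
using BackfillRegions = std::map<int, std::vector<HashRegion>>;

// Decide whether a write at `hash` must also be sent to backfill target `osd`,
// i.e. whether that object has already been backfilled there.
bool target_has_object(const BackfillRegions &regions, int osd, uint32_t hash)
{
  auto it = regions.find(osd);
  if (it == regions.end())
    return false;                     // target has no completed regions yet
  for (const auto &r : it->second)
    if (hash >= r.begin && hash <= r.end)
      return true;                    // already backfilled: write must go there too
  return false;                       // recovery will circle back to this object
}

int main()
{
  BackfillRegions regions = { {5, {{0x00000000u, 0x3fffffffu}}} };
  std::cout << target_has_object(regions, 5, 0x10000000u) << "\n";  // 1
  std::cout << target_has_object(regions, 5, 0x80000000u) << "\n";  // 0
}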

#2 Updated by Alexandre Oliva almost 10 years ago

My proposal is much simpler than that, actually. In the simplest implementation possible, we'd just change the starting point from hash 00000000 to whatever we last pushed to any replicas. You can then tell whether all replicas already have an object by testing whether its hash is in the range current < starting_point ? [starting_point,0xffffffff]|[0x00000000,current] : [starting_point,current].
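
A minimal sketch of that wraparound test, using hypothetical names rather than anything from the Ceph tree:

#include <cstdint>
#include <iostream>

// starting_point: the hash at which the current backfill round began.
// current:        the hash last pushed to the replicas.
// An object is on all replicas if its hash lies between starting_point and
// current, taking wraparound of the 32-bit hash space into account.
bool already_backfilled(uint32_t hash, uint32_t starting_point, uint32_t current)
{
  if (current < starting_point)
    // Wrapped: covered range is [starting_point,0xffffffff] | [0x00000000,current].
    return hash >= starting_point || hash <= current;
  // Not wrapped yet: covered range is [starting_point,current].
  return hash >= starting_point && hash <= current;
}

int main()
{
  // Backfill started at 0xc0000000 and has wrapped around to 0x10000000.
  std::cout << already_backfilled(0xd0000000u, 0xc0000000u, 0x10000000u) << "\n";  // 1
  std::cout << already_backfilled(0x20000000u, 0xc0000000u, 0x10000000u) << "\n";  // 0
}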

Sorry that the formatting of the per-osd states didn't come out nicely. I didn't realize underline was formatting markup. Also, sorry that I didn't mention the objects were in hash order; I remember thinking of pointing that out, but I ended up forgetting it.
