Feature #8195
open
shorten window of highest risk during recovery
0%
Description
Say a PG of size 3 lost two OSDs, the second one failing while the replacement for the first was part-way through recovery, like:
1st: 0123456789
2nd: 012_______
3rd: __________
AFAICT, backfilling of 3rd will start at 0 until it catches up with backfilling of 2nd, and then both will proceed concurrently.
This means objects 3 to 9 remain longer without a second replica, while objects 0 to 2, that already have two replicas in the cluster, are further replicated.
I believe it would be wiser to start backfilling 3rd (along with 2nd) at 3 all the way to 9, and then, once 2nd is done, backfilling of 3rd wraps around and finishes 0 to 2.
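To make the difference concrete, here is a toy model of the two orderings, not Ceph code. It assumes one object is copied per step at equal cost, and that the current behaviour is as I described it above. All names (`schedule_current`, `schedule_proposed`, `single_replica_exposure`) are made up for illustration. "Exposure" counts, for every step, how many objects still have only one replica while that step runs.

```python
def schedule_current(n, done_2nd):
    """Order as I understand the current behaviour: 3rd catches up alone
    on the objects 2nd already has, then both targets proceed jointly."""
    catch_up = [(obj, ("3rd",)) for obj in range(done_2nd)]
    joint = [(obj, ("2nd", "3rd")) for obj in range(done_2nd, n)]
    return catch_up + joint


def schedule_proposed(n, done_2nd):
    """Proposed order: both targets start where 2nd stopped, then 3rd
    wraps around to pick up the objects 2nd already had."""
    joint = [(obj, ("2nd", "3rd")) for obj in range(done_2nd, n)]
    wrap = [(obj, ("3rd",)) for obj in range(done_2nd)]
    return joint + wrap


def single_replica_exposure(n, done_2nd, schedule):
    """Sum, over all steps, of the number of objects that have only one
    replica while that step runs (toy metric, one object per step)."""
    at_risk = set(range(done_2nd, n))  # objects 2nd does not hold yet
    exposure = 0
    for obj, _targets in schedule:
        exposure += len(at_risk)
        at_risk.discard(obj)  # after this step the object has a second copy
    return exposure


current = single_replica_exposure(10, 3, schedule_current(10, 3))
proposed = single_replica_exposure(10, 3, schedule_proposed(10, 3))
print(current, proposed)  # 49 28
```

Both schedules take 10 steps in total. The proposed order only moves the steps that re-replicate objects 0 to 2 to the end, when nothing is left with a single replica.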
OSDs might fail and come back during multi-OSD backfilling. Maintaining per-OSD info on recovery windows to start/skip backfilling might make sense, but a much simpler strategy would amount to resetting the backfill start/end point to the current backfill point every time an OSD needing backfilling joins the PG. This might go over already-backfilled portions of some OSDs more than once until all OSDs remain up throughout a complete cycle. For example, consider that 2nd fails after joint backfill of objects 3 and 4, and rejoins when 3rd has already got objects 5 and 6:
1st: 0123456789
2nd: 01234_____
3rd: ___3456___
In this simplified backfilling proposal, we'd set the begin/end point between objects 6 and 7, so that joint backfilling starts at 7, and both 2nd and 3rd will be regarded as fully-backfilled if both remain up after iterating over 7 to 9 and then 0 to 6. Objects that are already up-to-date (say 0 to 4 in 2nd and 3 to 6 in 3rd) will be quickly skipped, in the same way they are when backfilling an OSD rolled back to an old snapshot.
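A minimal sketch of that single pass, again a toy model rather than Ceph code: `joint_backfill` and the `have` mapping are hypothetical names, objects are plain integers, and "up-to-date" is reduced to simple set membership. The scan starts at the reset point, wraps around once, and copies each object only to the targets that lack it.

```python
def joint_backfill(n, start, have):
    """Walk objects start, start+1, ..., wrapping around once.

    `have` maps a backfill target name to the set of objects it already
    holds; it is updated in place. Returns the list of copies performed
    as (object, [targets that needed it]). Objects every target already
    holds are skipped and do not appear in the plan.
    """
    plan = []
    for i in range(n):
        obj = (start + i) % n
        missing = sorted(name for name, objs in have.items() if obj not in objs)
        if missing:
            plan.append((obj, missing))
            for name in missing:
                have[name].add(obj)
    return plan


# State from the example: 2nd holds 0 to 4, 3rd holds 3 to 6,
# and the begin/end point is reset between objects 6 and 7.
have = {"2nd": set(range(0, 5)), "3rd": set(range(3, 7))}
for obj, targets in joint_backfill(10, 7, have):
    print(obj, targets)
```

In this model objects 7 to 9 go to both targets first, 0 to 2 go to 3rd only, 3 and 4 are skipped entirely, and 5 and 6 go to 2nd only. If another OSD joined mid-pass, the same function would simply be called again with `start` set to the current position.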