Feature #39339

prioritize backfill of metadata pools, automatically

Added by Ben England 5 months ago. Updated about 2 months ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
Due date:
% Done:
0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Neha Ojha suggested filing this feature request.

One relatively easy way to minimize damage in a double-failure scenario (loss of 2 devices or nodes at different times) is to prioritize repair of metadata pools; CephFS and RGW are examples. Crucially, in most cases the user should not have to specify this prioritization: Ceph should be able to come up with a reasonable prioritization automatically, one that works in almost all use cases, since a sysadmin may be unavailable when the problem occurs.

Motivation: Ceph is now being considered for small-footprint configurations with only 3 nodes, where backfilling is impossible with traditional replica-3 pools, and for economic reasons there is pressure to consider replica-2 pools (e.g. with NVM SSDs). In such pools, it is critical that backfilling minimize the possible damage if a second failure occurs. But even with 3-way replication, if a PG loses 2 of its OSDs it becomes unwritable and hence unavailable (not lost), so we still want to minimize the probability of metadata unavailability.

CephFS has 1 metadata pool per filesystem, and this pool is orders of magnitude smaller than the data pool(s). So in a backfill situation, it's really important that the metadata pool be repaired before the data pool. If the reverse were to happen and the metadata pool were lost, the data pool would effectively also be lost (i.e. the directory structure and file attributes would be gone).

RGW has many pools, but most of them are tiny, and typically there is one large data pool (at least in my limited experience). Certainly the bucket index pool is orders of magnitude smaller than the data pool, but it is vital for navigating to the data.

One other possible optimization that would have a similar effect is to prioritize pools in reverse order of size; this does not require classifying any pool as metadata. If size is difficult to determine, PG count might serve as a proxy for size, since the largest pools typically have higher PG counts.
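For instance (a rough illustration, not a prescription), the relative size and PG count of each pool can already be compared from the command line; exact output columns vary by release:

# per-pool stored bytes and object counts
ceph df detail
# per-pool pg_num
ceph osd pool ls detail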


Related issues

Related to RADOS - Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order Resolved 03/28/2019
Related to RADOS - Bug #39099: Give recovery for inactive PGs a higher priority Resolved 04/03/2019
Related to RADOS - Documentation #23999: osd_recovery_priority is not documented (but osd_recovery_op_priority is) Resolved 05/03/2018

History

#1 Updated by Sage Weil 5 months ago

ceph osd pool set <pool> recovery_priority <value>

I think a value of 1 or 2 makes sense (default if unset is 0).
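For example, a minimal sketch assuming a CephFS deployment with pools named cephfs_metadata and cephfs_data (substitute your own pool names):

# raise the recovery priority of the (typically tiny) metadata pool
ceph osd pool set cephfs_metadata recovery_priority 2
# the data pool keeps the default of 0; verify with:
ceph osd pool get cephfs_data recovery_priority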

#2 Updated by Ben England 5 months ago

Is backfill priority any different from recovery priority? If not, should it be? By "backfill" I mean the emergency situation where you lose replicas of an object, whereas by "recovery" I mean that you restore an OSD to an operational state and bring data back onto it, but the data is already at the proper level of replication.

#3 Updated by Ben England 5 months ago

Also, this ceph command requires the operator to run it; the point of this tracker is that this should be the default behavior. Does anyone disagree with that? If people agree, where does this get implemented? For example, rook.io seems like the wrong place, because anything that isn't a Kubernetes cluster won't benefit, and this default has nothing to do with Kubernetes.

#4 Updated by David Zafman 5 months ago

  • Related to Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order added

#5 Updated by David Zafman 5 months ago

Recovery is also about restoring objects to the right level of replication. Because the log is known to represent a complete picture of the contents, it is used to identify the objects that need recovery. Backfill is considered another form of recovery; in that case the log isn't enough, and we must iterate over all objects on all replicas to find the objects to be restored.

In the code, PG::get_recovery_priority() and PG::get_backfill_priority() compute the value based on multiple factors. A basic recovery is prioritized over backfill, presumably because it can get PGs to active+clean the quickest. In the case where objects are below min_size, client I/O is blocked and the data is more at risk than simply degraded, so the priority is even higher.

It isn't totally clear how all these factors should interact with pools that store metadata. I understand that metadata pools should have priority, but how much? Should they override all other considerations? Should they boost priority the same way the pool recovery priority does? Since the code adds priority for how many missing replicas there are, what priority should be used for a data pool which is down more replicas than a metadata pool?

#6 Updated by David Zafman 5 months ago

  • Related to Bug #39099: Give recovery for inactive PGs a higher priority added

#7 Updated by David Zafman 5 months ago

  • Related to Documentation #23999: osd_recovery_priority is not documented (but osd_recovery_op_priority is) added

#8 Updated by David Zafman 5 months ago

I forgot that backfill/recovery could be moving data around for several reasons. In those cases the lowest priority is appropriate, without needing a boost for metadata pools.

#10 Updated by Neha Ojha about 2 months ago

  • Assignee set to Sage Weil
  • Backport set to nautilus

#11 Updated by Neha Ojha about 2 months ago

  • Status changed from Need Review to Pending Backport

One backport for nautilus: https://github.com/ceph/ceph/pull/29275

#12 Updated by Neha Ojha about 2 months ago

  • Status changed from Pending Backport to In Progress

#13 Updated by Nathan Cutler about 2 months ago

  • Backport deleted (nautilus)

Since this is only going to be backported to nautilus, since there are two PRs involved, and since one of those PRs already has a backport PR open, I suggest we handle the backporting right here in the master issue. I.e., let's not set the status to Pending Backport, because that will cause a backport issue to be opened, which won't add any value in this case and will just muddy the waters.
