Feature #39339


prioritize backfill of metadata pools automatically

Added by Ben England about 5 years ago. Updated about 3 years ago.

Status: In Progress
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Neha Ojha suggested filing this feature request.

One relatively easy way to minimize damage in a double-failure scenario (loss of 2 devices or nodes at different times) is to prioritize repair of metadata pools; CephFS and RGW both have such pools. Crucially, in most cases the user should not have to specify this prioritization: Ceph should be able to come up automatically with a reasonable prioritization that works in almost all use cases, since a sysadmin may be unavailable when the problem occurs.

Motivation: Ceph is now being considered for small-footprint configurations with only 3 nodes, where backfilling is impossible with traditional replica-3 pools, and where for economic reasons there is pressure to consider replica-2 pools (e.g. with NVM SSD). In such pools it is critical that backfilling minimize the possible damage if a second failure occurs. But even with 3-way replication, if a PG loses 2 of its OSDs it becomes unwritable and hence unavailable (not lost), so we still want to minimize the probability of metadata unavailability.

CephFS has one metadata pool per filesystem, and this pool is orders of magnitude smaller than the data pool(s). So in a backfill situation it is really important that the metadata pool be repaired before the data pool. If the reverse were to happen and the metadata pool were lost, the data pool would effectively also be lost (i.e. the directory structure and file attributes would be gone).

RGW has many pools, but most of them are tiny, and typically there is one large data pool (at least in my limited experience). The bucket index pool is certainly orders of magnitude smaller than the data pool, but it is vital for navigating to the data.
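For illustration, the size disparity between metadata and data pools can be inspected with standard CLI commands (no specific pool names are assumed here):

ceph df detail
ceph osd pool ls detail

ceph df detail reports per-pool stored bytes and object counts, while ceph osd pool ls detail shows each pool's settings, including pg_num.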

One other possible optimization that would have a similar effect: prioritize pools in reverse order of size. This does not require classifying any pools as metadata. If size is difficult to determine, PG count might serve as a proxy for size, since typically the largest pools have the highest PG counts (see the sketch below).
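A minimal operator-side sketch of that idea (not the automatic in-OSD behavior this ticket asks for), assuming jq is available and that the JSON output of ceph osd pool ls detail carries pool_name and pg_num fields:

#!/bin/bash
# Sketch: give pools with fewer PGs (a proxy for smaller size) a higher
# recovery_priority. The pool_name/pg_num field names are assumed from the
# JSON output of "ceph osd pool ls detail"; verify them on your release.
prio=10
ceph osd pool ls detail -f json |
  jq -r 'sort_by(.pg_num) | .[].pool_name' |
while read -r pool; do
  ceph osd pool set "$pool" recovery_priority "$prio"
  prio=$(( prio > 1 ? prio - 1 : 1 ))   # smaller pools keep the higher values
done

Higher recovery_priority values are recovered first; the sketch stays within a small positive range, since recent releases clamp recovery_priority to a narrow interval around zero.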


Related issues (3): 0 open, 3 closed

Related to RADOS - Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order (Resolved, David Zafman, 03/28/2019)

Related to RADOS - Bug #39099: Give recovery for inactive PGs a higher priority (Resolved, David Zafman, 04/03/2019)

Related to RADOS - Documentation #23999: osd_recovery_priority is not documented (but osd_recovery_op_priority is) (Resolved, David Zafman, 05/03/2018)

Actions #1

Updated by Sage Weil about 5 years ago

ceph osd pool set <pool> recovery_priority <value>

I think a value of 1 or 2 makes sense (default if unset is 0).
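For illustration (the pool names below are hypothetical), this could be applied and verified per pool with:

ceph osd pool set cephfs_metadata recovery_priority 2
ceph osd pool set default.rgw.buckets.index recovery_priority 2
ceph osd pool get cephfs_metadata recovery_priority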

Actions #2

Updated by Ben England about 5 years ago

Is backfill priority any different from recovery priority? If not, should it be? By "backfill" I mean the emergency situation where you lose replicas of an object, whereas by "recovery" I mean that you restore an OSD to an operational state and bring data back onto it, while the data is already at the proper level of replication.

Actions #3

Updated by Ben England about 5 years ago

Also, this ceph command requires the operator to run it; the point of this tracker is that this should be the default behavior. Does anyone disagree with that? If people agree, where does this get implemented? For example, rook.io seems like the wrong place, because anything that isn't a Kubernetes cluster won't benefit, and this default has nothing to do with Kubernetes.

Actions #4

Updated by David Zafman about 5 years ago

  • Related to Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order added
Actions #5

Updated by David Zafman about 5 years ago

Recovery is also about restoring objects to the right level of replication. Because the log is known to represent a complete picture of the contents, it is used to identify the objects that need recovery. Backfill is considered another form of recovery; in that case the log isn't enough, so we must iterate over all objects on all replicas to find the objects to be restored.

In the code, PG::get_recovery_priority() and PG::get_backfill_priority() compute the value based on multiple factors. Basic recovery is prioritized over backfill, presumably because it gets PGs back to active+clean the quickest. In the case where objects are below min_size, client I/O is blocked and the data is more at risk than simply degraded, so the priority is even higher.

It isn't totally clear how all these factors should interact with pools that store metadata. I understand that metadata pools should have priority, but how much? Should they override all other considerations? Should they boost priority the same way the pool recovery_priority does? Since the code adds priority based on how many replicas are missing, what priority should be used for a data pool that is down more replicas than a metadata pool?

Actions #6

Updated by David Zafman about 5 years ago

  • Related to Bug #39099: Give recovery for inactive PGs a higher priority added
Actions #7

Updated by David Zafman about 5 years ago

  • Related to Documentation #23999: osd_recovery_priority is not documented (but osd_recovery_op_priority is) added
Actions #8

Updated by David Zafman about 5 years ago

I forgot that it is possible that backfill/recovery could be moving data around for several reasons. In those cases the lowest priority is appropriate without needing a boost for metadata pools.

Actions #10

Updated by Neha Ojha over 4 years ago

  • Assignee set to Sage Weil
  • Backport set to nautilus
Actions #11

Updated by Neha Ojha over 4 years ago

  • Status changed from Fix Under Review to Pending Backport

One backport for nautilus: https://github.com/ceph/ceph/pull/29275

Actions #12

Updated by Neha Ojha over 4 years ago

  • Status changed from Pending Backport to In Progress
Actions #13

Updated by Nathan Cutler over 4 years ago

  • Backport deleted (nautilus)

Since this is only going to be backported to nautilus, there are two PRs involved, and one of those PRs already has a backport PR open, I suggest we handle the backporting right here in the master issue. That is, let's not set the status to Pending Backport, because that would cause a backport issue to be opened, which wouldn't add any value in this case and would just muddy the water.

Actions #14

Updated by David Zafman about 3 years ago

I think this tracker can be marked resolved since pull request 29181 merged.
