Feature #39339 (open): prioritize backfill of metadata pools, automatically

Added by Ben England about 5 years ago. Updated about 3 years ago.

Status: In Progress
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Neha Ojha suggested filing this feature request.

One relatively easy way to minimize damage in a double-failure scenario (loss of 2 devices or nodes at different times) is to prioritize repair of metadata pools; CephFS and RGW both have such pools. Crucially, in most cases the user should not have to specify this prioritization: Ceph should be able to come up automatically with a reasonable prioritization that works in almost all use cases, since a sysadmin may be unavailable when the problem occurs.

Motivation: Ceph is now being considered for small-footprint configurations with only 3 nodes, where backfilling after a node failure is impossible with traditional replica-3 pools, and where for economic reasons there is pressure to consider replica-2 pools (e.g. with NVMe SSDs). In such pools it is critical that backfilling minimize the possible damage if a second failure occurs. But even with 3-way replication, if a PG loses 2 of its OSDs it becomes unwritable and hence unavailable (not lost), so we still want to minimize the probability of metadata unavailability.

CephFS has 1 metadata pool per filesystem, and this pool is orders of magnitude smaller than the data pool(s). So in a backfill situation it is really important that the metadata pool be repaired before the data pool. If the reverse were to happen and the metadata pool were lost, the data pool would effectively be lost as well (i.e. the directory structure and file attributes would be gone).

RGW has many pools, but most of them are tiny, and typically there is one large data pool (at least in my limited experience). Certainly the bucket index pool is orders of magnitude smaller than the data pool, but it is vital for navigating to the data.
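
For comparison, an operator can already approximate this prioritization by hand with the existing per-pool recovery_priority setting (see the related documentation issues below). The following is only a minimal Python sketch of what an automatic policy might do, assuming the ceph CLI is available, that RGW pools follow the default ".rgw." naming convention, and that a priority value of 5 is "higher than default"; none of this is an actual Ceph mechanism.

```python
#!/usr/bin/env python3
# Sketch: raise recovery_priority on metadata-like pools so they backfill first.
import json
import subprocess


def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)


def metadata_like_pools():
    pools = set()
    # CephFS: `ceph fs ls` reports the metadata pool of each filesystem.
    for fs in ceph_json("fs", "ls"):
        pools.add(fs["metadata_pool"])
    # RGW: heuristic only -- assume the default ".rgw." naming convention and
    # treat every RGW pool except the bucket data pool as metadata-like
    # (bucket index, meta, log, control, ...).
    for pool in ceph_json("osd", "dump")["pools"]:
        name = pool["pool_name"]
        if ".rgw." in name and not name.endswith(".rgw.buckets.data"):
            pools.add(name)
    return sorted(pools)


if __name__ == "__main__":
    for name in metadata_like_pools():
        # recovery_priority is an existing per-pool setting; 5 is just an
        # arbitrary "higher than the default of 0" value for this sketch.
        subprocess.check_call(
            ["ceph", "osd", "pool", "set", name, "recovery_priority", "5"])
        print("raised recovery_priority on", name)
```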

One other possible optimization that would have a similar effect is to prioritize pools in reverse order of size; this does not require classifying any pool as metadata.
If size is difficult to determine, PG count might serve as a proxy for it: the largest pools typically have higher PG counts.
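
As a rough illustration of that heuristic, here is a sketch that ranks pools by pg_num (as reported by `ceph osd dump`) and gives the smallest pools the highest per-pool recovery_priority. The 0..10 priority scale is an arbitrary choice for the example, not something Ceph prescribes.

```python
#!/usr/bin/env python3
# Sketch: use pg_num as a size proxy and give smaller pools higher priority.
import json
import subprocess


def pools_by_pg_count():
    """Return (pool_name, pg_num) pairs, smallest pool first."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    pools = json.loads(out)["pools"]
    return sorted(((p["pool_name"], p["pg_num"]) for p in pools),
                  key=lambda item: item[1])


if __name__ == "__main__":
    # Smallest pool gets the highest recovery_priority. The 0..10 scale is an
    # arbitrary choice for this sketch; recent Ceph releases clamp the value
    # to a small range anyway.
    for rank, (name, pg_num) in enumerate(pools_by_pg_count()):
        priority = max(10 - rank, 0)
        subprocess.check_call(
            ["ceph", "osd", "pool", "set", name, "recovery_priority",
             str(priority)])
        print("%s: pg_num=%d -> recovery_priority=%d" % (name, pg_num, priority))
```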


Related issues (3): 0 open, 3 closed

Related to RADOS - Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order (Resolved, David Zafman, 03/28/2019)
Related to RADOS - Bug #39099: Give recovery for inactive PGs a higher priority (Resolved, David Zafman, 04/03/2019)
Related to RADOS - Documentation #23999: osd_recovery_priority is not documented (but osd_recovery_op_priority is) (Resolved, David Zafman, 05/03/2018)
