prioritize backfill of metadata pools, automatically
Neha Ojna suggested filing this feature request.
One relatively easy way to minimize damage in a double-failure scenario (loss of 2 devices or nodes at different times) is to prioritize repair of metadata pools; CephFS and RGW are examples of services that depend on such pools. Crucially, in most cases the user should not have to specify this prioritization: Ceph should automatically come up with a reasonable prioritization that works in almost all use cases, since a sysadmin may be unavailable when the problem occurs.
Motivation: Ceph is now being considered for small-footprint configurations with only 3 nodes, where backfilling is impossible with traditional replica-3 pools, and for economic reasons there is pressure to consider replica-2 pools (e.g. with NVMe SSDs). In such pools, it is critical that backfilling minimize the possible damage if a second failure occurs. But even with 3-way replication, if a PG loses 2 of its OSDs it becomes unwritable and hence unavailable (not lost), so we still want to minimize the probability of metadata unavailability.
CephFS has 1 metadata pool per filesystem, and this pool is orders of magnitude smaller than the data pool(s). So in a backfill situation, it is really important that the metadata pool be repaired before the data pool. If the reverse were to happen and the metadata pool were lost, the data pool would effectively also be lost (i.e. the directory structure and file attributes would be gone).
RGW has many pools, but most of them are tiny, and typically there is one large data pool (at least in my limited experience). The bucket index pool is certainly orders of magnitude smaller than the data pool, but it is vital for navigating to the data.
One other possible optimization that would have a similar effect is to prioritize pools in reverse order of size. This does not require any classification of pools as metadata.
If size is difficult to determine, PG count might serve as a proxy for size: typically the largest pools have the highest PG counts.
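The size-proxy idea above can be sketched in a few lines. This is an illustrative sketch, not Ceph code: the pool names and PG counts are made up, and the function simply orders pools smallest-first so that small (likely metadata) pools would be repaired before large data pools.

```python
def backfill_order(pools):
    """Return pool names ordered by PG count, smallest first, so small
    (metadata-like) pools are repaired before large data pools.
    Illustrative only; not a real Ceph API."""
    return [name for name, pg_num in sorted(pools.items(), key=lambda kv: kv[1])]

# Hypothetical cluster: PG counts chosen only for illustration.
pools = {
    "cephfs_metadata": 64,     # small metadata pool
    "cephfs_data": 4096,       # large data pool
    "rgw.buckets.index": 32,   # bucket index pool
    "rgw.buckets.data": 8192,  # large object-data pool
}
print(backfill_order(pools))
# -> ['rgw.buckets.index', 'cephfs_metadata', 'cephfs_data', 'rgw.buckets.data']
```

Note that this heuristic needs no metadata flag at all: the bucket index and CephFS metadata pools naturally sort first because of their small PG counts.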
#2 Updated by Ben England 5 months ago
Is backfill any different from recovery priority? If not, should it be? By "backfill" I mean the emergency situation where you lose replicas of an object, whereas by "recovery" I mean that you restore an OSD to an operational state and bring data back onto it, but the data is already at the proper level of replication.
#3 Updated by Ben England 5 months ago
Also, this ceph command requires the operator to run it; the point of this tracker is that this should be the default behavior. Does anyone disagree with that? If people agree, where does this get implemented? For example, rook.io seems like the wrong place, because anything that isn't a Kubernetes cluster won't benefit, and this default has nothing to do with Kubernetes.
#5 Updated by David Zafman 5 months ago
Recovery is also about restoring objects to the right level of replication. Because the PG log is known to represent a complete picture of the contents, it is used to identify the objects that need recovery. Backfill is considered another form of recovery: in that case the log isn't enough, and we must iterate over all objects on all replicas to find the objects to be restored.
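The distinction above can be illustrated with a toy sketch (not Ceph code; the version numbers and the function are hypothetical): if the replica's last-complete version is still covered by the retained log, only the logged objects need replay (recovery); once the divergence falls off the log, a full object scan (backfill) is required.

```python
def plan_repair(log_versions, last_complete_on_replica):
    """Toy model of the recovery-vs-backfill decision.
    log_versions: versions still retained in the PG log.
    Returns ("recovery", n_missing) if the log covers the gap,
    else ("backfill", None) meaning a full scan of all replicas."""
    if last_complete_on_replica >= min(log_versions):
        missing = [v for v in log_versions if v > last_complete_on_replica]
        return ("recovery", len(missing))
    return ("backfill", None)  # log trimmed too far; must iterate all objects

# Replica only slightly behind: the log identifies exactly what to copy.
print(plan_repair(range(100, 200), 150))   # -> ('recovery', 49)
# Replica far behind the retained log: full backfill is needed.
print(plan_repair(range(100, 200), 50))    # -> ('backfill', None)
```

The cost asymmetry is the point: recovery work is proportional to the log gap, while backfill is proportional to the whole pool, which is why pool size matters so much for repair ordering.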
In the code, PG::get_recovery_priority() and PG::get_backfill_priority() compute the value based on multiple factors. Basic recovery is prioritized over backfill, presumably because it gets PGs back to active+clean the quickest. In the case where objects are below min_size, client I/O is blocked and data is more at risk than simply degraded, so the priority is even higher.
It isn't totally clear how all these factors should interact with pools that store metadata. I understand that metadata pools should have priority, but how much? Should they override all other considerations? Should they boost priority the same way the per-pool recovery priority does? Since the code adds priority for the number of missing replicas, what priority should be used for a data pool that is down more replicas than a metadata pool?
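To make the trade-off concrete, here is a hedged sketch of how an automatic metadata boost *might* combine with the existing additive factors (base priority, missing-replica count, per-pool recovery priority). The constants, the clamp, and the boost value are all assumptions for illustration; they are not the values used in PG::get_backfill_priority().

```python
BACKFILL_PRIORITY_BASE = 100  # assumed base value, not Ceph's actual constant
MAX_PRIORITY = 254            # assumed clamp

def backfill_priority(missing_replicas, pool_recovery_priority=0,
                      is_metadata=False, metadata_boost=20):
    """Toy additive priority model. Each factor raises urgency;
    metadata_boost is the hypothetical automatic boost proposed
    in this tracker."""
    p = BACKFILL_PRIORITY_BASE
    p += missing_replicas        # more missing copies -> more urgent
    p += pool_recovery_priority  # operator-set per-pool boost
    if is_metadata:
        p += metadata_boost      # automatic boost for metadata pools
    return min(p, MAX_PRIORITY)

# The question from the comment above, in numbers: a metadata pool down
# 1 replica (100+1+20=121) vs a data pool down 2 replicas (100+2=102).
# With a boost of 20 the metadata pool still wins; with a boost of 1
# the data pool would win instead.
print(backfill_priority(1, is_metadata=True))  # -> 121
print(backfill_priority(2))                    # -> 102
```

Whether the boost should dominate the missing-replica term, as it does here, is exactly the open design question in this comment.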
#13 Updated by Nathan Cutler about 2 months ago
- Backport deleted (
Since this is only going to be backported to nautilus, since there are two PRs involved, and since one of those PRs already has a backport PR open, I suggest we handle the backporting right here in the master issue. That is, let's not set the status to Pending Backport, because that would cause a backport issue to be opened, which wouldn't add any value in this case and would just muddy the water.