Feature #10865

open

Handle delete log entries in merge_log out of band without blocking peering

Added by Samuel Just about 9 years ago. Updated about 9 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

If we get a log with 2k deletes, we send 2k deletes in a single transaction to the filestore. That transaction must complete before peering can progress. Worse, it's likely that most of the other pgs on the osd have the same problem and also pushed a similarly sized transaction onto the queue. This is not a good thing. Instead, we need to delete these objects lazily without flooding the queue or blocking peering.

pg_log_t:
/// deletes > lazy_deletes_completed_to may not be reflected in the store
eversion_t lazy_deletes_completed_to;

We need to maintain a record of how far our lazy deletion has gotten. Of course, once it catches up to head, we simply update it with each additional op.
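To make the watermark idea concrete, here is a minimal self-contained sketch (simplified stand-ins for eversion_t and pg_log_t, not the real Ceph types): a background pass walks delete entries in log order and advances lazy_deletes_completed_to only as each delete is actually applied to the store.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Simplified stand-in for Ceph's eversion_t (epoch, version).
using eversion_t = std::pair<unsigned, unsigned>;

// Simplified stand-in for the relevant slice of pg_log_t.
struct pg_log_t {
  // log-ordered delete entries: version -> object name
  std::map<eversion_t, std::string> deletes;
  /// deletes > lazy_deletes_completed_to may not be reflected in the store
  eversion_t lazy_deletes_completed_to{0, 0};
};

// Apply up to n pending deletes in log order, advancing the watermark
// with each applied delete ("apply" stands in for the store delete).
template <typename Apply>
void lazy_delete_batch(pg_log_t &log, unsigned n, Apply apply) {
  auto it = log.deletes.upper_bound(log.lazy_deletes_completed_to);
  for (; n > 0 && it != log.deletes.end(); --n, ++it) {
    apply(it->second);                          // delete the object in the store
    log.lazy_deletes_completed_to = it->first;  // then advance the watermark
  }
}
```

On restart, anything above the watermark is conservatively re-queued; re-deleting an already-deleted object is harmless, which is what makes the single watermark safe (if conservative).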

Is it better to have a look-aside set of pending lazy deletes than a single lazy_deletes_completed_to line persisted along with the pg log? That would avoid, after peering, treating deletes which we did in fact complete as still pending simply because they fall after the line. It could instead be a flag on pg_log_entry_t, but we don't want to send it on the wire (it's irrelevant to the recipient, and might erroneously not be wiped before the recipient persists it).

We also need to be careful about trimming log entries which represent uncompleted pending lazy deletes. That would be a virtue of the look-aside set, we can trim the log entry without removing the object from the lazy delete set. Or, we could atomically delete the object along with the log entry trim (assuming such transactions only trim a small enough bounded set -- which I think they don't).
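The look-aside variant can be sketched like this (simplified stand-ins and hypothetical names, not actual Ceph structures): the pending set survives log trimming, so trimming an entry does not lose its delete, and completing the store delete simply erases the entry.

```cpp
#include <cassert>
#include <set>
#include <string>

// Simplified stand-in for hobject_t.
using hobject_t = std::string;

// Hypothetical look-aside set of pending lazy deletes, persisted
// alongside the pg log instead of a single watermark line.
struct lazy_delete_set {
  std::set<hobject_t> pending;

  // Recording a delete log entry adds the object to the set.
  void note_delete(const hobject_t &h) { pending.insert(h); }

  // Trimming the log entry does NOT touch the set: the object still
  // needs to be removed from the store eventually.
  void note_trim(const hobject_t &) {}

  // Completing the store delete clears the entry.
  void note_applied(const hobject_t &h) { pending.erase(h); }

  bool is_pending(const hobject_t &h) const { return pending.count(h); }
};
```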

PG:
virtual ObjectContextRef create_pin_lazy_delete(const hobject_t &hoid) = 0;
virtual void release_pinned_lazy_delete(const hobject_t &hoid) = 0;

ObjectContext:
bool marked_for_lazy_delete;

ReplicatedPG:
map<hobject_t, ObjectContextRef> pending_lazy_deletes;

During activation, we need to prepopulate the above with state for each object which needs to be lazily deleted. When we get a write on such an object (marked_for_lazy_delete = true), we prepend a delete to the resulting operation buffer (setting marked_for_lazy_delete to false if it gets submitted).
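The write-path behavior above can be sketched as follows (simplified stand-ins for ObjectContext and the transaction ops, not the real ReplicatedPG write path): a write against a still-pending object gets a delete prepended, and the mark is cleared because the delete now rides in that transaction.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for Ceph's ObjectContext.
struct ObjectContext {
  std::string oid;
  bool marked_for_lazy_delete = false;
};

enum class op_t { DELETE, WRITE };

// Build the transaction ops for a write, prepending a delete when the
// object is still pending lazy deletion, and clearing the mark since
// the delete is now carried by this transaction.
std::vector<op_t> prepare_write(ObjectContext &obc) {
  std::vector<op_t> ops;
  if (obc.marked_for_lazy_delete) {
    ops.push_back(op_t::DELETE);
    obc.marked_for_lazy_delete = false;  // delete submitted with this txn
  }
  ops.push_back(op_t::WRITE);
  return ops;
}
```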

While lazy_deletes_completed_to != last_update (even for non-active pgs!), we also lazily (using the unified op queue) work through the log in log order, N deletes at a time, starting at lazy_deletes_completed_to and atomically updating it as we go. If the osd is active for the pg, then as each lazy deletion is submitted to the filestore, we update marked_for_lazy_delete to false, unpin the obc, and take a write lock.
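The N-at-a-time pass can be driven from the op queue roughly like this (a scheduling sketch with hypothetical names, not the actual unified op queue API): each work item applies at most N deletes and requeues itself until everything up to last_update has been applied, so no single queue item carries an unbounded transaction.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the unified op queue: just a FIFO of closures.
struct op_queue {
  std::deque<std::function<void()>> q;
  void enqueue(std::function<void()> f) { q.push_back(std::move(f)); }
  void drain() {
    while (!q.empty()) {
      auto f = std::move(q.front());
      q.pop_front();
      f();
    }
  }
};

// Work through `*remaining` pending deletes, at most n per queue item,
// requeueing until the watermark would catch up to last_update.
void queue_lazy_deletes(op_queue &q, unsigned *remaining, unsigned n,
                        unsigned *batches) {
  q.enqueue([&q, remaining, n, batches] {
    ++*batches;
    *remaining -= (*remaining < n) ? *remaining : n;
    if (*remaining > 0)  // watermark != last_update: keep going
      queue_lazy_deletes(q, remaining, n, batches);
  });
}
```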

Note, the distinction between active and non-active above requires some care around peering and activation to avoid queueing a lazy delete before we are tracking object contexts, but after the usual activation flush has been requested.

In general, the above framework differs from the usual background machinery in one annoying way: it's basically independent of interval. Even if we go through an interval change and are no longer active, we should still finish the lazy deletes (at a lower priority?).

Another challenge is that the in-memory state above is necessary for active replicas as well since the primary neither knows nor cares about the replica's lazy deletion status. The replica needs to be sure to prepend primary writes with a delete in the event that the object has not yet been lazily deleted. This seems to overlap with the requirements for replica side locking for replica reads and should probably use the same structure (distinguish part of the object context as valid/required on the replica?).

Since, again, the primary neither knows nor cares about the replica's lazy deletion state, it makes sense to go clean independent of lazy deletion. Scrub, therefore, needs to be smart about filtering out objects which it sees which are in the lazy deletion state.
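The scrub-side filtering can be as simple as dropping pending-lazy-delete objects from the scrub listing before comparison (a sketch with hypothetical names, not the actual scrub machinery):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using hobject_t = std::string;

// Drop objects that are still pending lazy deletion from a scrub
// listing, so they are not reported as inconsistencies even though
// they still exist in the store.
std::vector<hobject_t> filter_scrub_listing(
    const std::vector<hobject_t> &listed,
    const std::set<hobject_t> &pending_lazy_deletes) {
  std::vector<hobject_t> out;
  for (const auto &h : listed)
    if (!pending_lazy_deletes.count(h))
      out.push_back(h);
  return out;
}
```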


Related issues 2 (1 open, 1 closed)

  • Blocked by Ceph - Feature #10866: replicas need to track unstable objects to properly support replica reads (New, 02/12/2015)
  • Blocked by Ceph - Feature #8635: add scrub, snap trimming, should be items in the OpWQ with cost/priority (Resolved, Samuel Just, 06/20/2014)
#1 Updated by Samuel Just about 9 years ago

  • Description updated (diff)

#2 Updated by Samuel Just about 9 years ago

  • Target version deleted (v0.94)
