Project

General

Profile

OSD - erasure coding pool overwrite support

Summary

Currently, objects in an EC pool are append-only. This allows us to perform EC object updates with one round trip + commit for each replica. Allowing mutations other than append requires some additional work. The main questions are:
1) Do we actually want to support non-append writes on EC pools?
2) If so, how do we do it?

Owners

  • Sam Just (Red Hat)
  • Name (Affiliation)
  • Name

Interested Parties

  • Loic Dachary (Red Hat)
  • Name (Affiliation)
  • Name

Current Status

Detailed Description

There seem to be three main approaches.

Rollback Log

The current ECBackend maintains what is essentially a rollback log via the extra information stashed in the pg log entries. This works well at the moment because the primary can get the rollback information (stashed object name, old xattrs, and old object size) without needing to ask the replicas to send back data. If we allow mutations of existing data within an object, we must also include the old value of the overwritten extent in the rollback log. The rollback extent wouldn't actually be included in the pg log entry itself; we'd probably want to write it aside to a dedicated rollback log object. From here, I see two basic paths:
1) The primary reads the required extent from the replicas (1 round trip + seek?) and encodes the rollback entry in the repop sent back to the replicas. This also lets us support partial stripe overwrites, since we need to read in part of the object anyway.
2) The primary indicates in the pg log entry that a rollback entry should be stored on the replica for the changed extent. When each replica gets the message, it reads the on-disk extent (seek?) and writes its own rollback entry to its own rollback log atomically with the update. Because the old data is never sent to the primary, this path cannot support partial stripe overwrites.
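The per-shard rollback mechanism both paths share can be sketched as follows. This is a toy model, not Ceph's actual types; ShardStore, its bytearray backing, and the version numbering are all illustrative assumptions. The key property is that the old extent is stashed atomically with the overwrite, so divergent updates can be undone during peering:

```python
# Hypothetical sketch of a per-shard rollback log for extent overwrites.
# ShardStore and its fields are illustrative, not Ceph's real data structures.

class ShardStore:
    """Toy object store for one shard: a byte buffer plus a rollback log."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.rollback_log = []   # (version, offset, old_bytes) for unstable updates

    def overwrite(self, version, offset, new_bytes):
        # Stash the old extent *before* applying the write; in a real
        # implementation this pair must be written atomically.
        old = bytes(self.data[offset:offset + len(new_bytes)])
        self.rollback_log.append((version, offset, old))
        self.data[offset:offset + len(new_bytes)] = new_bytes

    def rollback_to(self, version):
        # Undo entries newer than `version`, most recent first.
        while self.rollback_log and self.rollback_log[-1][0] > version:
            _, offset, old = self.rollback_log.pop()
            self.data[offset:offset + len(old)] = old

    def trim(self, version):
        # Once an update is stable on all shards, its entry can be discarded.
        self.rollback_log = [e for e in self.rollback_log if e[0] > version]
```

The two paths differ only in who produces the `old` bytes: in path 1 the primary collects them and ships them in the repop; in path 2 each replica reads its own on-disk extent locally.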

Both of these approaches encounter a difficulty with pipelined writes. Where previously we could simply issue pipelined writes to the backend in order, we now need to be able to read the logical current object state prior to each write. In either case, we'd need to keep unstable extents in memory and require the objectstore implementation to have well-defined semantics (read old or new, don't tear) for reads on objects with pending writes.

Both approaches also require a double write (on top of whatever write amplification the objectstore is doing).

2PC

Another option would be to write a new backend which performs commits via 2PC. Each replica has a write-ahead log, and the primary maintains last_update_prepared and last_update_committed. Peering would have to be extended to handle the slightly more nuanced pg state. We can respond to the client that the write is persisted once the prepare portion commits, and possibly batch the commit message in with a pending prepare if one happens to be handy. This approach also does not require a read prior to each write, so that's a small win. A catch is that the extent won't be readable until the commit clears, but we can compensate by allowing reads of unstable objects as above and buffering unstable extents.

As above, this requires a double write.

A significant piece of uncertainty here is that I haven't gamed out the required changes to peering to deal with the last_update_prepared vs last_update_committed addition.

This approach still requires writes to be full-stripe aligned to avoid an RMW.
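The alignment condition is simple to state: with a k+m profile and a fixed chunk size, the user-visible stripe is k * chunk_size bytes, and any write that doesn't start and end on a stripe boundary forces a read-modify-write of the partial stripes. A small check, with the parameter names as assumptions:

```python
# With a k+m EC profile and chunk_size bytes per chunk, a write avoids
# read-modify-write only if it covers whole stripes.
def needs_rmw(offset, length, k, chunk_size):
    stripe = k * chunk_size           # user-visible bytes per stripe
    return offset % stripe != 0 or length % stripe != 0
```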

Don't support overwrite at all

An issue for both approaches is that EC pools basically cannot be deep scrubbed by comparing the shards. The existing implementation gets around that by maintaining a running crc on each object as it grows, verifying each shard's contents against that crc during deep scrub, and reporting to the primary only whether the shard is ok. Allowing random mutations means either keeping more granular crcs (and always rewriting at least that much of the object at a time) or giving up on deep scrub. The double write is also a bummer.
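What "more granular crcs" would look like can be sketched concretely: one crc per fixed-size block instead of a single whole-object running crc, with the invariant that an overwrite must cover at least a whole block so its crc can be recomputed without reading neighbours. The class is a toy, assuming crc32 as the checksum:

```python
# Sketch of per-block checksums for a mutable shard: any overwrite rewrites
# at least one whole block, so that block's crc can be recomputed in place.
import zlib

class BlockCrcs:
    def __init__(self, block_size):
        self.block_size = block_size
        self.crcs = {}   # block index -> crc32 of that block's contents

    def update(self, index, block):
        # Enforce the whole-block-rewrite invariant described above.
        assert len(block) == self.block_size
        self.crcs[index] = zlib.crc32(block)

    def verify(self, index, block):
        # Deep scrub: each shard checks its own blocks and reports only ok/not ok.
        return self.crcs.get(index) == zlib.crc32(block)
```

The cost is the metadata (one crc per block per shard) and the write inflation from rounding every mutation up to block boundaries.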

Generally, it seems like adding mutation is complicated, so can we get by without it? The main user for such an interface seems to be rbd. However, both approaches above increase update latency relative to a replicated backend, so it's not clear that the latency would be tolerable. If rbd random writes are turned into EC backend appends, though, we might have a fighting chance.

If rbd implements a 4k write by appending a mutation description to the end of the block, we can lazily coalesce the block at a later time (hopefully amortizing the block rewrite over a whole bunch of random writes). Doing this efficiently probably requires the primary to handle coalescing the block via an object class, which would require adapting the object class machinery to work with async reads.
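The append-then-coalesce idea can be sketched in a few lines. The function names and record format are assumptions; the point is that the hot path is a pure append (EC-friendly), and the full-block rewrite happens once, lazily, replaying the log:

```python
# Sketch: a random 4k write becomes an appended mutation record; the block
# is lazily rebuilt (coalesced) later, amortizing one full rewrite over
# many appends. Hypothetical helpers, not an rbd or object class API.

def append_write(log, offset, data):
    log.append((offset, bytes(data)))   # pure append: no read, no RMW

def coalesce(base_block, log):
    # Replay the mutation records in order over the base block to rebuild it.
    block = bytearray(base_block)
    for offset, data in log:
        block[offset:offset + len(data)] = data
    log.clear()
    return bytes(block)
```

Reads before coalescing would have to do the same replay, which is why pushing this into a primary-side object class (rather than making every client replay the log) matters for efficiency.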

The obvious problem with this path is that it complicates all rados users, but we may be able to push the complexity into a library + object class which both cephfs and rbd can use.

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3