Osd - Transactions


Multi-object transactions would be nice.


  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Current Status

Detailed Description

a transaction is essentially:

struct MultiObjectTransaction {
  map<hobject_t, ObjectWriteOperation> object_ops;  // write ops, keyed by target object
  hobject_t master;                                 // object that designates the txn master
};

each osd/pg has a way to persist in-progress transactions that does not touch the actual objects in question. only once we know that the txn is persisted, and can always be rolled forward in the event of peering or failure, do we commit and modify the real objects.
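The persist-on-the-side idea can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: `SideStore`, `ObjectId` and `TxnId` are not Ceph types, plain strings replace `hobject_t` and `ObjectWriteOperation`, and in-memory maps replace whatever on-disk store is eventually chosen.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins; a real OSD would key on hobject_t and write to disk.
using ObjectId = std::string;
using TxnId = unsigned long;

struct SideStore {
  std::map<ObjectId, std::string> objects;                   // the "real" objects
  std::map<TxnId, std::map<ObjectId, std::string>> pending;  // txns persisted on the side

  // Step 1: record the intended writes without touching the objects.
  void persist(TxnId id, std::map<ObjectId, std::string> writes) {
    pending[id] = std::move(writes);
  }

  // Step 2: roll forward. Only now are the real objects modified.
  bool commit(TxnId id) {
    auto it = pending.find(id);
    if (it == pending.end()) return false;
    for (const auto& kv : it->second) objects[kv.first] = kv.second;
    pending.erase(it);
    return true;
  }

  // Other outcome: drop the record. The objects were never touched.
  bool rollback(TxnId id) { return pending.erase(id) > 0; }
};
```

Because `persist` never writes to `objects`, a rollback is just deleting the side record, and a roll forward can be replayed from the side record after a failure.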

deadlock detection or avoidance? rgw doesn’t need either, but other users will.
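One common avoidance approach (not something these notes commit to) is to take object locks in a single global order, so two txns can never wait on each other in a cycle. A minimal sketch, assuming string object ids in place of `hobject_t`; `lock_order` is a hypothetical helper:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Return the objects of a txn in the order their locks should be taken:
// sorted by id, with duplicates removed. Every txn using the same order
// cannot form a lock cycle with another txn.
std::vector<std::string> lock_order(std::vector<std::string> objects) {
  std::sort(objects.begin(), objects.end());
  objects.erase(std::unique(objects.begin(), objects.end()), objects.end());
  return objects;
}
```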

message flow (C = client, M = master, S = slave):

txns: C -> M -> S (disk) -> M (disk) [-> C … -> S (disk) -> M (disk) ]
now: C -> S (disk) -> C -> M (disk) -> C [ -> S (disk) -> C]

model 2
- client sends full txn to master
- master holds txn in memory, sends PREPAREs to slaves
- slaves persist PREPARE on the side, send PREPARE_ACK
- master collects all PREPARE_ACKs and applies the txn and marks txn COMMITTING
- once persisted, master sends COMMITs
- master replies to client
- slaves get COMMIT and apply, reply with COMMIT_ACK
- master collects all COMMIT_ACKs and closes out txn record
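The master's side of model 2 can be read as a small state machine. The sketch below is one possible reading, not an implementation: only COMMITTING is named in the notes, while `PREPARING`, `COMMITS_SENT`, `CLOSED`, the `MasterTxn` type and its handler names are hypothetical, slave pgs are identified by plain strings, and at least one slave is assumed.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Only COMMITTING comes from the notes; the other state names are assumptions.
enum class TxnState { PREPARING, COMMITTING, COMMITS_SENT, CLOSED };

struct MasterTxn {
  std::set<std::string> slaves;        // slave pgs taking part in the txn
  std::set<std::string> prepare_acks;  // slaves that persisted PREPARE
  std::set<std::string> commit_acks;   // slaves that applied COMMIT
  TxnState state = TxnState::PREPARING;
  bool replied_to_client = false;

  explicit MasterTxn(std::set<std::string> s) : slaves(std::move(s)) {}

  // A slave has persisted its PREPARE on the side.
  void on_prepare_ack(const std::string& slave) {
    if (state != TxnState::PREPARING || !slaves.count(slave)) return;
    prepare_acks.insert(slave);
    // All slaves prepared: master applies the txn and marks it COMMITTING.
    if (prepare_acks == slaves) state = TxnState::COMMITTING;
  }

  // The local apply and COMMITTING marker are on disk:
  // send COMMITs to the slaves and reply to the client.
  void on_local_persisted() {
    if (state != TxnState::COMMITTING) return;
    state = TxnState::COMMITS_SENT;
    replied_to_client = true;
  }

  // A slave has applied the txn; close the record once all have.
  void on_commit_ack(const std::string& slave) {
    if (state != TxnState::COMMITS_SENT || !slaves.count(slave)) return;
    commit_acks.insert(slave);
    if (commit_acks == slaves) state = TxnState::CLOSED;
  }
};
```

Messages that arrive in the wrong state are ignored here, which keeps the sketch short; a real implementation would also have to handle resends and timeouts.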

- on pg active:
  - send NOTIFY to txn masters for fate of prepared txns
  - master replies with COMMIT or ROLLBACK, perhaps with delay

- resend PREPARE if the slave pg changes
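The recovery exchange on pg active could look roughly like the following. This is a sketch under assumptions: `PreparedTxn`, `build_notifies` and `resolve` are hypothetical names, pgs are plain strings, and the master is assumed to answer COMMIT only for txns it has marked COMMITTING and ROLLBACK for anything else (including txns it no longer knows about). The optional delay is left out.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Fate { COMMIT, ROLLBACK };

struct PreparedTxn {
  std::string master;  // pg that holds the txn record
};

// Slave side: group prepared txn ids by master, one NOTIFY per master.
std::map<std::string, std::vector<unsigned long>>
build_notifies(const std::map<unsigned long, PreparedTxn>& prepared) {
  std::map<std::string, std::vector<unsigned long>> notifies;
  for (const auto& kv : prepared) notifies[kv.second.master].push_back(kv.first);
  return notifies;
}

// Master side: decide the fate of one txn named in a NOTIFY.
// Assumption: only txns marked COMMITTING may roll forward.
Fate resolve(const std::set<unsigned long>& committing, unsigned long id) {
  return committing.count(id) ? Fate::COMMIT : Fate::ROLLBACK;
}
```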

clients should make the osd with the largest write the master, so that we avoid the prepare cost of writing that data twice (once for the prepare, once to the object)
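A client-side helper for this choice might look like the sketch below. `pick_master` is a hypothetical name, objects are keyed by plain strings rather than `hobject_t`, and the write size per object is assumed to be known up front; ties go to the first id in map order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// write_bytes: object id -> bytes this txn writes to that object.
// Returns the object with the largest write, or "" if the txn is empty.
std::string pick_master(const std::map<std::string, std::size_t>& write_bytes) {
  std::string best;
  std::size_t best_bytes = 0;
  bool first = true;
  for (const auto& kv : write_bytes) {
    if (first || kv.second > best_bytes) {
      best = kv.first;
      best_bytes = kv.second;
      first = false;
    }
  }
  return best;
}
```

The master applies its own part of the txn directly, so placing the largest write there means only the smaller writes pay the double-write cost of PREPARE.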

it might make sense to have the primary delay the ROLLBACK message with the expectation that the client will retry the transaction soon.

the transactions are referenced in the pg metadata on both master and slave so they are pulled into memory on osd restart, and the ObjectContext lock state is always in place

big pieces

- pg metadata gets an index of in-flight txns
- we add somewhere to persist them
- peering needs to exchange the list of in-flight txns and their state
- some simple logic to roll forward/back
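As a rough illustration of the peering piece, the sketch below merges the in-flight txn indexes reported by several peers and keeps the most advanced state seen for each txn. The types and names (`InFlight`, `TxnIndex`, `merge_indexes`, `can_roll_forward`) are hypothetical, as is the rule that a txn can roll forward once any peer has it recorded as COMMITTING; the notes do not specify the actual logic.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Ordered so that a later state compares greater than an earlier one.
enum class InFlight { PREPARED = 0, COMMITTING = 1 };

// Per-pg index of in-flight txns: txn id -> state.
using TxnIndex = std::map<unsigned long, InFlight>;

// Merge the indexes exchanged during peering, keeping the most advanced state.
TxnIndex merge_indexes(const std::vector<TxnIndex>& from_peers) {
  TxnIndex merged;
  for (const auto& idx : from_peers) {
    for (const auto& kv : idx) {
      auto it = merged.find(kv.first);
      if (it == merged.end() || it->second < kv.second) merged[kv.first] = kv.second;
    }
  }
  return merged;
}

// Assumed rule: roll forward only if some peer saw the txn as COMMITTING.
bool can_roll_forward(const TxnIndex& merged, unsigned long id) {
  auto it = merged.find(id);
  return it != merged.end() && it->second == InFlight::COMMITTING;
}
```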

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3