Rados - multi-object transaction support » History » Version 25

Patrick McGarry, 06/17/2015 05:49 PM

h1. Multi-object transaction support

*Summary*
This blueprint proposes multi-object transaction support for RADOS.
(previous blueprint: https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Transactions )

*Owners*
Li Wang (Ubuntukylin)
Yunchuan Wen (Ubuntukylin)
Name

*Interested Parties*
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*
Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?

*Detailed Description*

Algorithm

(1) The client defines the following struct:
struct MultiObjectTransaction {
> map<hobject_t, ObjectWriteOperation> object_ops;
> hobject_t master;
};
Based on this, the client sends a group of ops through MOSDOp to the PG corresponding to the master object; call this PG MASTER.

(2) MASTER receives the MOSDOp from the client, groups the ObjectWriteOperations by PG, and sends each group to its corresponding PG; call these PGs SLAVEs.

(3) SLAVE receives the MOSDOp from MASTER. If there are pending conventional single-object transactions with operations on the same object, it waits until they finish. If there are pending multi-object transactions with operations on the same object, it returns EAGAIN (we cannot wait here, otherwise a deadlock may occur, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/23783). Otherwise, it constructs the transaction in the conventional way and returns PREPARE_ACK to MASTER. If an error occurs, for example a 'read-then-compare' check fails, it returns ERROR to MASTER.

(4) If MASTER receives ERROR from a SLAVE, it sends ROLLBACK to all SLAVEs, cancels the transaction, and returns ERROR to the client.

(5) MASTER collects all PREPARE_ACKs, then sends PREPARE_COMMIT to the SLAVEs.

(6) SLAVE receives PREPARE_COMMIT from MASTER, writes the pending transaction into its PG metadata, and returns PREPARE_COMMIT_ACK to MASTER.

(7) MASTER collects all PREPARE_COMMIT_ACKs and constructs a transaction that includes a write to PG metadata marking the transaction as COMMITTING, together with the write operations on its own data objects. It submits this transaction to ObjectStore, then returns ACK to the client, which lets the client return from the write operation, and sends COMMIT to the SLAVEs.

(8) SLAVE receives COMMIT and submits the pending transaction to ObjectStore. After the transaction has finished (written to its ultimate location), it deletes the transaction from PG metadata and returns COMMIT_ACK to MASTER.

(9) MASTER collects all COMMIT_ACKs, returns COMMIT to the client to indicate that the transaction has really finished, which allows the client to read, and deletes the transaction from its PG metadata.

Error Process

(1) SLAVE down
MASTER can learn from the OSDMAP that a SLAVE is down. After the SLAVE PG has returned to active-and-clean through conventional peering, MASTER starts a super-peering (inter-PG peering, in contrast to intra-PG peering) to ask for the SLAVE's state, and proceeds according to the following cases:
> (a) No transaction found in SLAVE's PG metadata
> > (1) If MASTER is not in COMMITTING, MASTER resends the MOSDOp to SLAVE, so that SLAVE restarts from Step (3) as described above;
> > (2) If MASTER is in COMMITTING, this implies that SLAVE has finished Step (6) and has finished submitting the transaction from PG metadata to its ultimate location, so MASTER does nothing
> (b) Transaction found in SLAVE's PG metadata
> > (1) MASTER sends COMMIT to SLAVE, so that SLAVE restarts from Step (8)
On the other hand, when SLAVE recovers from down and finds a transaction in its PG metadata, it also starts a super-peering to ask for MASTER's state, and proceeds according to the following cases:
> (a) MASTER knows nothing about the transaction, which implies that MASTER has also just recovered from down and lost all its in-memory transaction information; SLAVE then rolls back;
> (b) MASTER knows this transaction; MASTER then directs SLAVE to finish the transaction as described above
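The case analysis for a SLAVE failure can be written as two small decision functions, one per side of the super-peering. This is a sketch only: @MasterAction@, @SlaveAction@, @master_on_slave_state@ and @slave_on_master_state@ are hypothetical names introduced for illustration, and the inputs are reduced to booleans that would in practice come from PG metadata and from MASTER's transaction state. Case (b) on the MASTER side is taken literally from the text, which does not distinguish whether MASTER is already in COMMITTING.

```cpp
// What MASTER does after learning a recovered SLAVE's state.
enum class MasterAction { RESEND_MOSDOP, DO_NOTHING, SEND_COMMIT };

// What a recovered SLAVE does after learning MASTER's state.
enum class SlaveAction { ROLLBACK, FOLLOW_MASTER };

// MASTER side: cases (a)(1), (a)(2) and (b)(1).
MasterAction master_on_slave_state(bool master_committing, bool slave_has_txn) {
    if (slave_has_txn) return MasterAction::SEND_COMMIT;     // (b)(1): SLAVE restarts from Step 8
    if (master_committing) return MasterAction::DO_NOTHING;  // (a)(2): SLAVE already applied it
    return MasterAction::RESEND_MOSDOP;                      // (a)(1): SLAVE restarts from Step 3
}

// SLAVE side: only relevant when SLAVE found a transaction in its PG metadata.
SlaveAction slave_on_master_state(bool master_knows_txn) {
    return master_knows_txn ? SlaveAction::FOLLOW_MASTER  // (b): MASTER drives the completion
                            : SlaveAction::ROLLBACK;      // (a): MASTER lost its in-memory state
}
```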

(2) MASTER down
SLAVE can learn from the OSDMAP that MASTER is down. After the MASTER PG has returned to active-and-clean, SLAVE starts a super-peering to ask for MASTER's state, and proceeds according to the following cases:
> (a) MASTER knows nothing about the transaction, which implies that MASTER has lost all its in-memory transaction information; SLAVE rolls back;
> (b) MASTER is in COMMITTING; SLAVE restarts from Step (8)
When MASTER recovers from down and finds a transaction in COMMITTING, it starts a super-peering with all SLAVEs to ask for their state, and proceeds according to the following cases:
> (a) No transaction found in SLAVE's PG metadata, which implies that SLAVE is done; MASTER does nothing;
> (b) Transaction found in SLAVE's PG metadata; MASTER sends COMMIT to SLAVE, so that SLAVE restarts from Step (8)
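MASTER's recovery pass over its SLAVEs amounts to selecting the ones that still hold the transaction in PG metadata. A minimal sketch, assuming a hypothetical @SlaveState@ record and a hypothetical helper @slaves_needing_commit@; in a real implementation the states would be gathered through super-peering messages rather than passed in as a vector, and the PG identifiers shown are purely illustrative strings.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of what one SLAVE reports during super-peering.
struct SlaveState {
    std::string pg;           // illustrative PG identifier
    bool txn_in_pg_metadata;  // true if the pending transaction is still recorded
};

// Called when MASTER recovers and finds a transaction in COMMITTING.
// Returns the PGs that must receive COMMIT again (case b); SLAVEs without
// the transaction in PG metadata have already finished (case a).
std::vector<std::string> slaves_needing_commit(const std::vector<SlaveState>& states) {
    std::vector<std::string> resend;
    for (const auto& s : states)
        if (s.txn_in_pg_metadata) resend.push_back(s.pg);
    return resend;
}
```

Each PG returned by the helper would then be sent COMMIT so that it restarts from Step (8); an empty result means the transaction can be dropped from MASTER's own PG metadata, as in Step (9).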

Others

(a) A flag could be introduced to decide whether a special PREPARE_COMMIT step (Step 3) is needed. This step is meant to do a fast deadlock and read-and-compare check. If it is not desirable, Steps 3 and 6 could be coalesced, and Steps 5 and 7 could be coalesced; this has no impact on the error process described above.

(b) For Step 8, we could make use of the metadata-only journal mode (http://tracker.ceph.com/projects/ceph/wiki/Rados_-_metadata-only_journal_mode) to speed up the write from PG metadata to the ultimate location.

*Work items*
This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*

(1) Revise MOSDOp to allow multi-object transaction ops (nearly done);
(2) Revise the call chain from do_op to prepare_transaction to allow pending transactions (ongoing);
(3) Operate on PG metadata (done);
(4) Super-peering (ongoing)

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3