Revision 12 - History - Rados - metadata-only journal mode - Ceph - Ceph

Rados - metadata-only journal mode » History » Revision 12

Revision 11 (Li Wang, 06/30/2015 09:42 AM) → Revision 12/16 (Li Wang, 07/01/2015 10:04 AM)

h1. Rados - metadata-only journal mode 

 *Summary* 
 Currently the Ceph community is looking at ways to eliminate the double-write
 penalty of write-ahead logging. Newstore is a great design that implements
 create and append operations in a copy-on-write way while maintaining all
 the original semantics. This makes newstore a general-purpose optimization,
 especially suitable for write-once scenarios. Metadata-only journal mode
 takes a more aggressive approach: it does not journal object data at all.
 This applies to two major kinds of situations. The first is where atomicity of
 object data modification may not be needed, for example, RBD simulating a disk
 on a cloud platform. The second is double-journaling situations, for example,
 cache tiering: since the cache pool already provides durability, dirty
 objects being written back theoretically need not go through the journaling
 process of the base pool, because the flusher can always replay the write operation.
 Metadata-only journal mode, to some extent, resembles the data=ordered journal
 mode in ext4. With this journal mode on, object data are written directly to
 their ultimate location, and once the data write has finished, the metadata are written into the
 journal. This guarantees consistency of the RADOS namespace and data
 consistency among object copies. However, the object data may not be correct.
 Later we will demonstrate that this rarely happens.
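
 The ordering described above (data first, metadata journal entry second) can be illustrated with a minimal sketch. It uses plain files rather than any Ceph code; the function @ordered_write@, the file names and the journal record format are all assumptions made for illustration.

```python
import os
import tempfile


def ordered_write(data_path, journal_path, offset, payload):
    """Write payload in place, then journal only the metadata for it."""
    # Object data go straight to their final location, bypassing the journal.
    mode = "r+b" if os.path.exists(data_path) else "w+b"
    with open(data_path, mode) as obj:
        obj.seek(offset)
        obj.write(payload)
        obj.flush()
        os.fsync(obj.fileno())  # data must be durable before the metadata
    # Only after the data write has finished is the metadata journaled.
    record = f"write offset={offset} length={len(payload)}\n"
    with open(journal_path, "a") as journal:
        journal.write(record)
        journal.flush()
        os.fsync(journal.fileno())
    return record


# Usage: one 4-byte write at offset 4 of a fresh (hypothetical) object file.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "object.bin")
journal_path = os.path.join(workdir, "journal.log")
record = ordered_write(data_path, journal_path, 4, b"abcd")
```

 A crash between the two steps leaves data on disk that no journal entry describes, which is why the object data may be incorrect even though the namespace stays consistent.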

 *Owners* 

 Li Wang (liwang@ubuntukylin.com) 
 Yunchuan Wen (yunchuanwen@ubuntukylin.com) 
 Name 

 *Interested Parties* 
 If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here. 
 Name (Affiliation) 
 Name (Affiliation) 
 Name 

 *Current Status* 
 Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?

 *Detailed Description* 
 We have two options.
 The first option:
 The only change is that object data are not journaled: when the transaction is committed, the data are written into the object directly. In most cases this introduces no problem, thanks to the powerful peering and client resend mechanisms. The only problematic situation is when the PG goes down as a whole and the client goes down as well. In that case, the guest file system in the VM will possibly recover to a consistent state through fsck and journal replay. So this option simply leaves it to scrub to find and fix the inconsistency by synchronizing one randomly chosen copy to the others.
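
 As a rough illustration of the scrub repair in the first option, the sketch below models each copy of an object as a byte string and overwrites all copies with one chosen at random. The function @scrub_repair@ and the OSD names are hypothetical; this is not how Ceph scrub is implemented.

```python
import random


def scrub_repair(copies, rng=random):
    """Overwrite every copy with one randomly chosen copy; return its OSD id."""
    if len(set(copies.values())) <= 1:
        return None  # copies already agree, nothing to fix
    source = rng.choice(sorted(copies))
    chosen = copies[source]
    for osd in copies:
        copies[osd] = chosen
    return source


# Usage: three copies that diverged after the whole PG went down.
copies = {"osd.0": b"new!", "osd.1": b"old.", "osd.2": b"new!"}
source = scrub_repair(copies, random.Random(0))
```

 The result is consistent across copies, but because the source is random it may be either the old or the new data.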

 The second option:

 1 Submit transaction A into the journal, adding a record of the <offset, length> of the
 non-journaled data write in omap
 2 Write the data to the object
 3 Submit transaction B into the journal, to update the metadata as well as the
 pglog as usual, and revert the operations of transaction A
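
 The three steps above can be sketched with a toy in-memory model. The class @ToyOSD@, its fields and the @stop_after@ parameter (used to simulate a crash after a given step) are assumptions for illustration only and do not correspond to Ceph classes.

```python
class ToyOSD:
    """Toy model of one OSD holding a single object (illustrative only)."""

    def __init__(self, size):
        self.data = bytearray(size)  # object data, written in place
        self.omap = set()            # pending <offset, length> records
        self.journal = []            # metadata-only journal
        self.pglog = []

    def write(self, name, offset, payload, stop_after=3):
        record = (name, offset, len(payload))
        # Step 1: transaction A journals the pending-write record into omap.
        self.journal.append(("A", record))
        self.omap.add(record)
        if stop_after == 1:
            return 1
        # Step 2: the data go directly into the object, not the journal.
        self.data[offset:offset + len(payload)] = payload
        if stop_after == 2:
            return 2
        # Step 3: transaction B updates metadata and pglog, and reverts A.
        self.journal.append(("B", record))
        self.pglog.append(record)
        self.omap.discard(record)
        return 3


# Usage: one complete write, and one interrupted after step 1.
done = ToyOSD(8)
done_step = done.write("obj", 2, b"xy")
crashed = ToyOSD(8)
crashed_step = crashed.write("obj", 2, b"xy", stop_after=1)
```

 A record left behind in omap marks exactly the range whose content cannot be trusted after a crash.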

 As long as one OSD in the PG has succeeded, the PG will be recovered to
 a consistent and correct state by peering. If the PG goes down as a whole,
 there are the following situations:
 (1) None of the OSDs has finished step 1: nothing happens;
 (2) At least one of the OSDs has finished step 3: journaling and
 peering will recover the PG to a consistent and correct state;
 (3) None of the OSDs has finished step 3, and at least one of the OSDs has
 finished step 1: this is the only potentially problematic situation.
 In this case, peering will synchronize the omap record to the other OSDs in the PG.
 For object read and write, if the record is found, the operation is made to wait while
 a recovery is started that randomly chooses one OSD and synchronizes its
 content of the written area to the other copies. During scrub, the record is also checked
 and the recovery is done. Since scrub is done asynchronously and may be scheduled to run late,
 the client's resend may already have restored the content to a consistent state during this period.
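
 The three situations and the range-limited recovery can be sketched as follows. This is a simplified model under stated assumptions: @classify@ and @recover_range@ are hypothetical helpers, each OSD is reduced to the last step it finished, and each copy is a @bytearray@.

```python
import random


def classify(steps_done):
    """steps_done: per OSD, the last step (0 to 3) finished before the PG went down."""
    last = max(steps_done)
    if last == 0:
        return "nothing happened"            # situation (1)
    if last == 3:
        return "journal replay and peering"  # situation (2)
    return "synchronize recorded range"      # situation (3)


def recover_range(copies, offset, length, rng=random):
    """Copy the recorded range from one random OSD to all copies."""
    source = rng.choice(sorted(copies))
    chunk = bytes(copies[source][offset:offset + length])
    for data in copies.values():
        data[offset:offset + length] = chunk
    return source


# Usage: only osd.0 reached step 2, so the recorded range <2, 2> differs.
copies = {"osd.0": bytearray(b"..xy...."), "osd.1": bytearray(b"........")}
situation = classify([2, 1])
source = recover_range(copies, 2, 2, random.Random(1))
```

 Only the recorded range is synchronized, so data outside the pending write are left untouched.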

 *Work items* 
 This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

 *Coding tasks* 
 Task 1 
 Task 2 
 Task 3 

 *Build / release tasks* 
 Task 1 
 Task 2 
 Task 3 

 *Documentation tasks* 
 Task 1 
 Task 2 
 Task 3 

 *Deprecation tasks* 
 Task 1 
 Task 2 
 Task 3
