Revision 5 (Li Wang, 06/23/2015 08:17 AM)

h1. Rados - metadata-only journal mode 

 *Summary* 
 This blueprint proposes a metadata-only journal mode. An important usage of Ceph is to integrate with a cloud computing platform to provide storage for VM images and instances. In such a scenario, qemu maps RBD images as virtual block devices, i.e., disks, to a VM, and the guest operating system formats the disks and creates file systems on them. In this case, RBD mostly resembles a 'dumb' disk. In other words, it is enough for RBD to implement exactly the semantics of a disk controller driver. Typically, the disk controller itself does not provide a transactional mechanism to ensure that a write operation is done atomically. Instead, it is up to the file system that manages the disk to adopt techniques such as journaling to prevent inconsistency, if necessary. Consequently, RBD does not need to provide an atomic mechanism to ensure that a data write is done atomically, since the guest file system will keep its writes to RBD consistent by using journaling if needed.

 Another scenario is cache tiering: since the cache pool already provides durability, dirty objects being written back theoretically need not go through the journaling process of the base pool, because the flusher could replay the write operation.

 These scenarios motivate us to implement a new journal mode, the metadata-only journal mode, which resembles the data=ordered journal mode in ext4. With this journal mode on, object data are written directly to their final location; once the data write finishes, the metadata are written into the journal, and then the write returns to the caller. This avoids the double-write penalty of object data due to WRITE-AHEAD-LOGGING, potentially greatly improving RBD and cache tiering performance. 

 *Owners* 

 Li Wang (liwang@ubuntukylin.com) 
 Yunchuan Wen (yunchuanwen@ubuntukylin.com) 

 *Interested Parties* 
 If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here. 
 Name (Affiliation) 
 Name (Affiliation) 
 Name 

 *Current Status* 
 Please describe the current status of Ceph as it relates to this blueprint.    Is there something that this replaces?    Are there current features that are related? 

 *Detailed Description* 
 1. Submit transaction A into the journal, marking a <offset, length> non-journaling data write in the pglog (for peering) or in omap/xattrs (for scrub) 
 2. Write the data to the object 
 3. Submit transaction B into the journal, to update the metadata as well as the pglog as usual 

 As long as one OSD in the PG has succeeded, the PG will be recovered to a consistent and correct state by peering. The only potentially problematic situation is the PG going down as a whole while none of the OSDs has finished Step (3) and at least one of the OSDs has finished Step (1). In that case, we revise peering or scrub to make them realize the semantics of transaction A, and randomly choose one OSD to synchronize the content of its written area to the other copies. We prefer to leave this as scrub's job: since scrub runs asynchronously and may be scheduled to run late, the client's resend may already have recovered the content to a consistent state during this period. 

 *Work items* 
 This section should contain a list of work tasks created by this blueprint.    Please include engineering tasks as well as related build/release and documentation work.    If this blueprint requires cleanup of deprecated features, please list those tasks as well. 

 *Coding tasks* 
 Task 1 
 Task 2 
 Task 3 

 *Build / release tasks* 
 Task 1 
 Task 2 
 Task 3 

 *Documentation tasks* 
 Task 1 
 Task 2 
 Task 3 

 *Deprecation tasks* 
 Task 1 
 Task 2 
 Task 3