Rados - metadata-only journal mode » History » Revision 7
Revision 6 (Li Wang, 06/23/2015 08:22 AM) → Revision 7/16 (Li Wang, 06/30/2015 02:10 AM)
h1. Rados - metadata-only journal mode

*Summary*

This blueprint proposes a metadata-only journal mode. In this mode, for a write operation, the OSD journals only the metadata, without journaling the written data. An important use of Ceph is integration with cloud computing platforms, providing storage for VM images and instances. In such a scenario, qemu maps RBD images as virtual block devices, i.e., disks, to a VM, and the guest operating system formats the disks and creates file systems on them. In this case, RBD mostly resembles a 'dumb' disk; in other words, it is enough for RBD to implement exactly the semantics of a disk controller. Typically, the disk controller itself does not provide a transactional mechanism to ensure that a write operation is done atomically. Instead, it is up to the file system that manages the disk to adopt techniques such as journaling to prevent inconsistency where necessary. Consequently, RBD does not need to provide an atomic mechanism for data writes, since the guest file system will keep its writes to RBD consistent by using journaling if it needs to.
Another scenario is double journaling, for example in cache tiering: the cache pool has already provided durability, so when dirty objects are written back, they theoretically need not go through the journaling process of the base pool, since the flusher could always replay the write operation. These situations motivate us to implement a new journal mode, the metadata-only journal mode, which resembles the data=ordered journal mode in ext4. With this journal mode on, object data are written directly to their final location; when the data write has finished, the metadata are written into the journal, and then the write returns to the caller. This guarantees consistency of the RADOS namespace and consistency of the object data among object copies. However, the object data itself may not be correct after a crash; later we will demonstrate that this rarely happens. Avoiding the double-write penalty of write-ahead logging could greatly improve RBD and cache tiering performance.

*Owners*

* Li Wang (liwang@ubuntukylin.com)
* Yunchuan Wen (yunchuanwen@ubuntukylin.com)

*Interested Parties*

If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.

* Name (Affiliation)
* Name (Affiliation)

*Current Status*

Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?
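The ordering difference between the current write-ahead logging and the proposed metadata-only mode can be illustrated with a small sketch. This is purely conceptual Python, not Ceph code; the `Store` class and its method names are hypothetical.

```python
# Conceptual sketch of the two journal modes; all names are illustrative,
# not actual Ceph APIs.

class Store:
    def __init__(self):
        self.journal = []   # sequential journal
        self.objects = {}   # object data at its final location

    def write_ahead_logging(self, name, data, meta):
        # Current mode: both data and metadata enter the journal before
        # the data reaches its final location -- the double-write penalty.
        self.journal.append(("data", name, data))
        self.journal.append(("meta", name, meta))
        self.objects[name] = data  # replayed to the final location later

    def metadata_only(self, name, data, meta):
        # Proposed mode (like ext4 data=ordered):
        self.objects[name] = data                  # 1. write data in place
        self.journal.append(("meta", name, meta))  # 2. journal metadata only
        # 3. the write is acknowledged to the caller here

s = Store()
s.metadata_only("rbd_obj.0", b"block", {"size": 5})
print(len(s.journal))  # -> 1: the payload itself never enters the journal
```

In the metadata-only path, the data write must complete before the metadata transaction is journaled, mirroring ext4's data=ordered guarantee that journaled metadata never points at unwritten data.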
*Detailed Description*

# Submit transaction A into the journal, marking a <offset, length> non-journaled data write in the pglog (for peering) or in omap/xattrs (for scrub)
# Write the data to the object
# Submit transaction B into the journal, updating the metadata as well as the pglog as usual

As long as one OSD in the PG has succeeded, the PG will be recovered to a consistent and correct state by peering. The only potentially problematic situation is when the PG goes down as a whole, none of the OSDs has finished step 3, and at least one of the OSDs has finished step 1. In that case, we revise peering or scrub to make them aware of the semantics of transaction A, and randomly choose one OSD to synchronize the content of the written area to the other copies. We prefer to leave this to scrub: since scrub runs asynchronously and may be scheduled late, the client's resend may already have restored the content to a consistent state during that interval.

*Work items*

This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*

# Task 1
# Task 2
# Task 3

*Build / release tasks*

# Task 1
# Task 2
# Task 3

*Documentation tasks*

# Task 1
# Task 2
# Task 3

*Deprecation tasks*

# Task 1
# Task 2
# Task 3
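The three-step write protocol and the recovery decision from the Detailed Description can be sketched as follows. This is an illustrative simulation in Python, not Ceph code; the data structures and function names are all hypothetical.

```python
# Illustrative sketch of the metadata-only write path and the recovery
# check. An "OSD" is modeled as a dict; all names are hypothetical.

def metadata_only_write(osd, obj, offset, data, crashed_after=None):
    # Step 1: journal transaction A, marking a non-journaled data write
    # of <offset, length> (recorded in pglog for peering, or omap/xattrs
    # for scrub).
    osd["journal"].append(("A", obj, offset, len(data)))
    if crashed_after == 1:
        return
    # Step 2: write the data directly to the object's final location.
    osd["store"][obj] = (offset, data)
    if crashed_after == 2:
        return
    # Step 3: journal transaction B, updating metadata and pglog as usual.
    osd["journal"].append(("B", obj))

def pg_needs_scrub_repair(osds, obj):
    # The only problematic case: the whole PG went down, no OSD finished
    # step 3, and at least one OSD finished step 1. Scrub then picks one
    # copy of the written range and synchronizes the other replicas to it.
    finished_b = any(("B", obj) in o["journal"] for o in osds)
    started_a = any(any(e[0] == "A" and e[1] == obj for e in o["journal"])
                    for o in osds)
    return started_a and not finished_b

def new_osd():
    return {"journal": [], "store": {}}

osds = [new_osd() for _ in range(3)]
# One replica crashed right after journaling transaction A:
metadata_only_write(osds[0], "obj1", 0, b"data", crashed_after=1)
print(pg_needs_scrub_repair(osds, "obj1"))  # -> True
```

The sketch also shows why step 3 completing on any one OSD removes the problem: once transaction B exists somewhere, peering can recover the PG from that copy without scrub's involvement.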