Rados - metadata-only journal mode » History » Version 7

Li Wang, 06/30/2015 02:10 AM


Rados - metadata-only journal mode

Summary
The Ceph community is currently looking at ways to eliminate the double-write
penalty of write-ahead logging. Newstore is a great design that implements
create and append operations in a copy-on-write way while maintaining all the
original semantics. This makes newstore a general-purpose optimization that is
especially suitable for write-once scenarios.

Metadata-only journal mode takes a more aggressive approach: it does not journal
object data at all. This applies to two major kinds of situations. The first is
where atomicity of object data modification may not be needed, for example when
RBD simulates a disk on a cloud platform. The second is double journaling, for
example cache tiering: the cache pool already provides durability, so when dirty
objects are written back they theoretically need not go through the journaling
process of the base pool, since the flusher can always replay the write
operation.

Metadata-only journal mode resembles, to some extent, the data=ordered journal
mode in ext4. With this journal mode on, object data are written directly to
their ultimate location, and once the data write has finished, the metadata are
written into the journal. This guarantees consistency of the RADOS namespace and
data consistency among object copies. However, the object data may not be
correct. Later we will demonstrate that this rarely happens.
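
To make the double-write penalty concrete, here is a rough back-of-the-envelope
sketch. It assumes a simplified cost model in which the data is always written
once to its final location and the journal additionally receives either data
plus metadata (full journaling) or metadata alone (metadata-only mode). The
function names and the example sizes are hypothetical, and replication and file
system overhead are ignored; these are not measurements.

```python
def journal_bytes(data_len: int, meta_len: int, metadata_only: bool) -> int:
    """Bytes sent to the journal for one write under the simplified model."""
    if metadata_only:
        return meta_len              # object data bypasses the journal
    return data_len + meta_len       # full journaling logs the data as well


def total_bytes(data_len: int, meta_len: int, metadata_only: bool) -> int:
    """Journal traffic plus the single write of data to its final location."""
    return data_len + journal_bytes(data_len, meta_len, metadata_only)


if __name__ == "__main__":
    data_len = 4 * 1024 * 1024       # illustrative 4 MiB object write
    meta_len = 512                   # illustrative metadata size
    print("full journaling:", total_bytes(data_len, meta_len, False))
    print("metadata-only:  ", total_bytes(data_len, meta_len, True))
```

Under this model the full-journaling case writes the payload twice, while the
metadata-only case writes it once, which is the saving the blueprint is after.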

Owners

Li Wang ()
Yunchuan Wen ()

Interested Parties
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.

Current Status
Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?

Detailed Description
A write in metadata-only journal mode proceeds in three steps:
1 Submit transaction A to the journal, marking a non-journaled data write of
<offset, length> in the pglog (for peering) or in omap/xattrs (for scrub)
2 Write the data to the object
3 Submit transaction B to the journal, to update the metadata as well as
the pglog as usual
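
The three steps above can be sketched with a toy in-memory model. This is only
an illustration of the ordering, not Ceph code: the ToyOSD class, its fields,
and the crash_after parameter are all hypothetical names invented for this
example, and a real OSD journal and pglog are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """In-memory stand-in for one OSD; every name here is hypothetical."""
    journal: list = field(default_factory=list)   # journaled transactions
    pglog: list = field(default_factory=list)     # simplified pglog entries
    data: bytearray = field(default_factory=lambda: bytearray(16))  # object data
    version: int = 0                              # stand-in for object metadata

    def write(self, offset: int, payload: bytes, crash_after: int = 3) -> int:
        """Run the three steps, stopping after step `crash_after`.

        Returns the number of the last step that finished.
        """
        length = len(payload)

        # Step 1: journal transaction A, marking the pending extent.
        # Only <offset, length> is journaled, never the payload itself.
        self.journal.append(("A", offset, length))
        self.pglog.append(("pending", offset, length))
        if crash_after == 1:
            return 1

        # Step 2: write the data directly to its final location.
        self.data[offset:offset + length] = payload
        if crash_after == 2:
            return 2

        # Step 3: journal transaction B, updating metadata and pglog as usual.
        self.version += 1
        self.journal.append(("B", self.version))
        self.pglog.append(("committed", offset, length, self.version))
        return 3


if __name__ == "__main__":
    osd = ToyOSD()
    osd.write(4, b"abcd")
    print(osd.journal)   # the journal holds extents and versions, not payload
```

Passing crash_after=1 or crash_after=2 leaves a "pending" pglog entry without a
matching "committed" one, which is the state the recovery discussion below is
concerned with.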

As long as one OSD in the PG has succeeded, peering will recover the PG to
a consistent and correct state. The only potentially problematic situation is
when the PG goes down as a whole, none of the OSDs has finished Step (3), and
at least one of the OSDs has finished Step (1). In that case, we revise peering
or scrub so that they understand the semantics of transaction A and randomly
choose one OSD to synchronize the content of its written area to the other
copies. We prefer to leave this to scrub. Since scrub runs asynchronously and
may be scheduled late, the client's resend may already have restored the
content to a consistent state in the meantime.
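
The recovery rule described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the peering or scrub implementation:
needs_resync and resync_extent are hypothetical helper names, each OSD is
reduced to the number of the last step it finished (0 meaning none), and object
copies are plain byte arrays.

```python
import random
from typing import List


def needs_resync(last_steps: List[int]) -> bool:
    """Detect the problematic case after the whole PG went down.

    last_steps[i] is the last step (0..3) that OSD i finished.
    """
    none_finished_step3 = all(step < 3 for step in last_steps)
    some_finished_step1 = any(step >= 1 for step in last_steps)
    return none_finished_step3 and some_finished_step1


def resync_extent(copies: List[bytearray], offset: int, length: int,
                  rng: random.Random) -> int:
    """Pick one copy at random and push its marked extent to the others.

    Returns the index of the copy chosen as the source.
    """
    source = rng.randrange(len(copies))
    chunk = bytes(copies[source][offset:offset + length])
    for i, copy in enumerate(copies):
        if i != source:
            copy[offset:offset + length] = chunk
    return source


if __name__ == "__main__":
    # One OSD got as far as the data write; nobody committed transaction B.
    copies = [bytearray(b"AAAAAAAA"), bytearray(b"AAxxAAAA"), bytearray(b"AAAAAAAA")]
    if needs_resync([1, 2, 0]):
        resync_extent(copies, offset=2, length=2, rng=random.Random())
    print(copies)   # all copies now agree on the marked extent
```

Whichever copy is chosen, the replicas end up identical over the marked extent.
That matches the guarantee stated in the summary: the copies are consistent
with each other, although the data itself may be the old or the new content.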

Work items
This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

Coding tasks
Task 1
Task 2
Task 3

Build / release tasks
Task 1
Task 2
Task 3

Documentation tasks
Task 1
Task 2
Task 3

Deprecation tasks
Task 1
Task 2
Task 3