Support #24953

Bad performance when enabling the journaling feature for rbd

Added by liuzhong chen over 5 years ago. Updated almost 5 years ago.

Status:
New
Priority:
Low
Assignee:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Phenomenon:
I tested ceph-12.2.5 using fio.
With bs=1M, enabling the journaling feature leads to a 50% performance degradation.
With bs=4k, enabling the journaling feature leads to a 70% performance degradation.
Analysis:
In my cluster, the journal pool and the data pool are the same, and both perform at SSD speed.
With journaling enabled, each IO writes to the journal first and then to the data object, so a double write is unavoidable, which leads to the 50% degradation.
And for bs=4k the per-event metadata can no longer be ignored, so the 4k result drops more than the 1M one.
Question:
Is there any method to reduce the performance decline?
Thank you!
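The double-write arithmetic in the analysis can be sketched with a toy model (the function and all numbers below are illustrative assumptions, not measurements from this cluster):

```python
# Toy model of the double-write penalty described above (illustrative only).
# With journaling enabled, every client write becomes two backend writes:
# one journal append plus one data-object write on the same-speed pool.

def journaled_throughput(baseline, bs, event_overhead=4096):
    """Rough throughput with journaling: half the backend bandwidth,
    further reduced by per-event journal metadata (assumed ~4K here)."""
    payload_fraction = bs / (bs + event_overhead)  # metadata dilutes small IO
    return baseline / 2 * payload_fraction

base = 1000.0  # hypothetical MB/s without journaling
big = journaled_throughput(base, bs=1 << 20)  # bs=1M
small = journaled_throughput(base, bs=4096)   # bs=4k

print(round(1 - big / base, 2))    # 0.5  -> ~50% degradation for 1M
print(round(1 - small / base, 2))  # 0.75 -> well beyond 50% for 4k
```

With a 4K per-event overhead assumed, the model lands near the reported 50% (1M) and ~70% (4k) degradations; the real overhead depends on the journal event encoding.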

Actions #1

Updated by Jason Dillaman over 5 years ago

  • Tracker changed from Bug to Support
  • Priority changed from Normal to Low
Actions #2

Updated by liuzhong chen over 5 years ago

@Jason Borden Dillaman I came up with the idea of using the rbd cache. But the journal just appends events without flushing them before replying to the client, which may lead to data loss. So this rbd cache scheme would require changing the current architecture.
Before trying this, I wonder whether there is any known issue that prevented this from being done in the initial design, or any other suggestions.
Thank you!

Actions #3

Updated by Jason Dillaman over 5 years ago

liuzhong chen wrote:

@Jason Borden Dillaman I came up with the idea of using the rbd cache. But the journal just appends events without flushing them before replying to the client, which may lead to data loss. So this rbd cache scheme would require changing the current architecture.

I don't think I am following. The current architecture is already tied into the in-memory cache. In writethrough mode, there is no chance for data loss and in writeback mode, there is no chance for data loss after a sync(flush) is issued -- which is the expected behavior for a writeback cache.

Before trying this, I wonder whether there is any known issue that prevented this from being done in the initial design, or any other suggestions.
Thank you!

Actions #4

Updated by Yang Dongsheng over 5 years ago

Jason Dillaman wrote:

liuzhong chen wrote:

@Jason Borden Dillaman I came up with the idea of using the rbd cache. But the journal just appends events without flushing them before replying to the client, which may lead to data loss. So this rbd cache scheme would require changing the current architecture.

I don't think I am following. The current architecture is already tied into the in-memory cache. In writethrough mode, there is no chance for data loss and in writeback mode, there is no chance for data loss after a sync(flush) is issued -- which is the expected behavior for a writeback cache.

Hi Jason, for example:

We have the following config:
rbd cache = true

(1) We issue a write with O_DIRECT in qemu, which calls aio_write in librbd.
(2) append_io_event appends this event to the rbd journal, but it is not yet flushed to disk.
(3) ObjectCacherObjectDispatch<I>::write() calls writex and sets dispatch_result to io::DISPATCH_RESULT_COMPLETE.
(4) Then the request completes in aio_completion, and the user process in qemu considers this direct write completed.

If we get a power cut across the whole cluster, including the qemu host and all ceph hosts, then we lose the data of this write operation.
Jason, is that right?
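The window described in steps (1)-(4) can be mimicked with a minimal writeback model (ToyJournal and every other name here are invented for illustration; this is not librbd code):

```python
# Minimal model of the ack-before-flush window described in steps (1)-(4).
# All names here are invented for illustration; this is not librbd code.

class ToyJournal:
    def __init__(self):
        self.pending = []   # events appended but not yet durable
        self.durable = []   # events that would survive a "power cut"

    def append(self, event):
        self.pending.append(event)  # step (2): append, no flush yet

    def flush(self):
        self.durable.extend(self.pending)  # make pending events durable
        self.pending.clear()

def aio_write(journal, data):
    journal.append(data)  # journal append is queued, not flushed
    return 0              # step (4): caller is ACKed immediately

j = ToyJournal()
aio_write(j, "block-A")  # ACKed, but the event lives only in memory
lost_without_flush = bool(j.pending) and not j.durable  # power cut loses block-A

j2 = ToyJournal()
aio_write(j2, "block-B")
j2.flush()               # an explicit flush/sync closes the window
survives = "block-B" in j2.durable
```

The point of contention in the thread is exactly that window: whether the ACK in `aio_write` should be delayed until the flush has happened.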

Maybe we can introduce an option named rbd_cache_complete_io_until_journal_flushed, defaulting to false. If it is true, then we can guarantee the data will never be lost once we return 0 to the aio_write() caller.

At the same time, we can improve latency with journaling enabled, because we only write once, to the rbd journal, before completing the request; the other write, to the data object, is done in the background by rbd cache writeback.

Before trying this, I wonder whether there is any known issue that prevented this from being done in the initial design, or any other suggestions.
Thank you!

Actions #5

Updated by Jason Dillaman over 5 years ago

Hi Jason, for example:

We have the following config:
rbd cache = true

(1) We issue a write with O_DIRECT in qemu, which calls aio_write in librbd.
(2) append_io_event appends this event to the rbd journal, but it is not yet flushed to disk.
(3) ObjectCacherObjectDispatch<I>::write() calls writex and sets dispatch_result to io::DISPATCH_RESULT_COMPLETE.
(4) Then the request completes in aio_completion, and the user process in qemu considers this direct write completed.

That's how a writeback cache works -- it "immediately" ACKs the caller and it will, in the background, write the data to the backing store when it feels like it. A caller can force the writeback cache to ensure the data is safely written by forcing a flush/sync.

If we get a power cut across the whole cluster, including the qemu host and all ceph hosts, then we lose the data of this write operation.
Jason, is that right?

Not necessarily since QEMU will advertise the disk as having a writeback cache, so modern OSs will take advantage of that fact and issue sync/flush calls when needed to ensure data crash consistency. The flush will be ACKed only after the journal events are safely recorded to maintain crash consistency.

Maybe we can introduce an option named rbd_cache_complete_io_until_journal_flushed, defaulting to false. If it is true, then we can guarantee the data will never be lost once we return 0 to the aio_write() caller.

At the same time, we can improve latency with journaling enabled, because we only write once, to the rbd journal, before completing the request; the other write, to the data object, is done in the background by rbd cache writeback.

That would practically be the same as disabling the cache since in a benchmark your in-memory cache will quickly fill and once again become the bottleneck. I think you will find that you won't gain any performance. Feel free to test it out.
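Jason's cache-fill argument can be illustrated with a toy latency model (every number is made up; `ack_latency` stands for a journal-only ACK, `flush_latency` for waiting on the backing store):

```python
# Toy model of the cache-fill bottleneck: once a bounded writeback cache
# fills, every new write must wait for a backing-store flush, so sustained
# latency converges to the backing store's latency. Numbers are illustrative.

def write_latencies(n_writes, cache_slots, ack_latency=0.1, flush_latency=1.0):
    """Latency per write: cheap ACK while the cache has room, full flush after."""
    free = cache_slots
    latencies = []
    for _ in range(n_writes):
        if free > 0:
            free -= 1
            latencies.append(ack_latency)   # cache absorbs the write
        else:
            latencies.append(flush_latency) # must wait for writeback
    return latencies

lat = write_latencies(n_writes=100, cache_slots=10)
# Only the first 10 writes are fast; the steady state is flush_latency,
# which is what a sustained benchmark would measure.
```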

Actions #6

Updated by Yang Dongsheng over 5 years ago

Jason Dillaman wrote:

Hi Jason, for example:

We have the following config:
rbd cache = true

(1) We issue a write with O_DIRECT in qemu, which calls aio_write in librbd.
(2) append_io_event appends this event to the rbd journal, but it is not yet flushed to disk.
(3) ObjectCacherObjectDispatch<I>::write() calls writex and sets dispatch_result to io::DISPATCH_RESULT_COMPLETE.
(4) Then the request completes in aio_completion, and the user process in qemu considers this direct write completed.

That's how a writeback cache works -- it "immediately" ACKs the caller and it will, in the background, write the data to the backing store when it feels like it. A caller can force the writeback cache to ensure the data is safely written by forcing a flush/sync.

Agree, that's what "writeback" means...

If we get a power cut across the whole cluster, including the qemu host and all ceph hosts, then we lose the data of this write operation.
Jason, is that right?

Not necessarily since QEMU will advertise the disk as having a writeback cache, so modern OSs will take advantage of that fact and issue sync/flush calls when needed to ensure data crash consistency. The flush will be ACKed only after the journal events are safely recorded to maintain crash consistency.

... but I want to free the upper layer from worrying about the writeback cache. That means that although we use a writeback cache in the lower layer, we want to make it transparent to the user and tell them it is the same as "rbd cache = false", but faster. :)

Maybe we can introduce an option named rbd_cache_complete_io_until_journal_flushed, defaulting to false. If it is true, then we can guarantee the data will never be lost once we return 0 to the aio_write() caller.

At the same time, we can improve latency with journaling enabled, because we only write once, to the rbd journal, before completing the request; the other write, to the data object, is done in the background by rbd cache writeback.

That would practically be the same as disabling the cache since in a benchmark your in-memory cache will quickly fill and once again become the bottleneck. I think you will find that you won't gain any performance. Feel free to test it out.

Actually, this idea was inspired by filestore. Filestore does not lose data, but it ACKs once the write reaches the page cache and the filestore journal; the page cache then flushes the data back to the data disk in the background. So we get good latency, and we can get stable IOPS if we have an SSD journal disk and an SSD data disk.

Yes, I am also concerned about the cache size problem, but since we have two SSD pools, one for data objects and the other for journal objects, maybe it will not be bad, because we have similar performance when writing the journal and when writing data back to the data objects.

Anyway, we will test it and paste the results in this issue, I think.

Actions #7

Updated by Yang Dongsheng almost 5 years ago

Hi Jason,
As mentioned, we cooked a patchset to implement this idea. In general, it contains the following patches:
(1) introduce an option to make the cache write happen after the journal IO is safe; that is, the IO is ACKed only after the journal append and the rbd cache write have finished.
(2) make each event 4K-aligned when we add a header to the IO.
(3) make all librbd instances on a host share the cache size.
(4) flush the cache when the total cache size on the host exceeds a threshold.

Besides, we use 50G of memory per host for all librbd instances.
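Items (3) and (4) could be sketched, under heavy assumptions, as a per-host budget with a threshold flush (`HostCacheBudget` and all the numbers are hypothetical, not the actual patchset):

```python
# Hypothetical sketch of items (3) and (4): librbd instances on one host
# share a cache-size budget, and a flush is triggered once the total
# crosses a threshold. Names and numbers are invented for illustration.

class HostCacheBudget:
    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.used = 0      # bytes currently held across all instances
        self.flushes = 0   # how many threshold-triggered flushes occurred

    def charge(self, nbytes):
        """Account a cached write; flush everything when over threshold."""
        self.used += nbytes
        if self.used > self.threshold:
            self.flushes += 1
            self.used = 0  # model: a flush drains the shared cache

budget = HostCacheBudget(threshold_bytes=50 << 30)  # e.g. the 50G mentioned
for _ in range(1000):
    budget.charge(100 << 20)  # 1000 writes of 100M each (~97.7G total)
# One flush fires when the budget first exceeds 50G; the remainder stays cached.
```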

From the testing, we get almost the same IOPS and latency as an image without journaling (while the librbd cache is not full). But I really don't think this is a general-purpose way to improve performance, especially since it needs a lot of memory, and it makes snapshots and shutdown take a long time to flush.

So I have decided to keep what we did in this special version without contributing it to the community; it is really not an acceptable design for most cases.
