<p><strong>Ceph bluestore - Feature #38816: Deferred writes do not work for random writes</strong><br /><a class="external" href="https://tracker.ceph.com/issues/38816">https://tracker.ceph.com/issues/38816</a></p>
<p><strong>Mark Korenberg</strong> (2019-03-21)</p>
<p>I want BlueStore to be able to buffer (defer), say, 30 seconds of random writes in RocksDB at SSD speed, and then write the data to the HDD in the background over, say, 5 minutes, without throttling any incoming write requests. This is roughly what FileStore is able to do.</p>
<p><strong>Igor Fedotov</strong> (2019-03-25)</p>
<p>Mark, I'm not sure your root cause analysis is 100% valid. To avoid speculation, I'd prefer to arrange the benchmark as follows and collect the corresponding info for analysis:</p>
<ol>
<li>Before each benchmark, remove the rbd image and restart the OSD</li>
<li>Collect a perf counter dump after each benchmark</li>
<li>Collect the fio report for each benchmark</li>
<li>Set debug_bluestore to 10 and repeat steps 1-3</li>
</ol>
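<p>For reference, these steps map to commands roughly like the following (a sketch only; the OSD id, pool and image names are placeholders, and the commands assume admin access to a running cluster):</p>

```shell
OSD=11                      # placeholder OSD id
IMAGE=rbdpool/testimage     # placeholder pool/image name

# 1) remove the rbd image and restart the OSD before each run
rbd rm "$IMAGE"
systemctl restart "ceph-osd@$OSD"

# 2) collect a perf counter dump after each benchmark
ceph daemon "osd.$OSD" perf dump > after.json

# 4) raise BlueStore debug logging for the repeated runs
ceph daemon "osd.$OSD" config set debug_bluestore 10
```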
<p>And a question: to what value did you increase bluestore_max_deferred_txc? A very simplified calculation shows it should be about 1300 × 30 = 39000 to cover your 30-second burst interval. I doubt anybody has tested such a threshold, so this is just a hypothesis to try. Many other factors might also come into play and (negatively?) impact the process.</p>
<p><strong>Vitaliy Filippov</strong> (2019-03-25)</p>
<p>I tried a similar thing when Mark asked me. In summary, you <strong>can</strong> enlarge your deferred queue a bit, but you can't make it give you better results on average.</p>
<p>The options required to do so are:</p>
<pre>
bluestore_max_deferred_txc = 50000
bluestore_deferred_batch_ops = 10000
bluestore_throttle_cost_per_io_hdd = 100
</pre>
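<p>With a running cluster, the same values can be applied at runtime via the CLI instead of ceph.conf (osd.11 here is just an example target):</p>

```shell
ceph config set osd.11 bluestore_max_deferred_txc 50000
ceph config set osd.11 bluestore_deferred_batch_ops 10000
ceph config set osd.11 bluestore_throttle_cost_per_io_hdd 100
```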
<p>But there are several problems that will stop you from wanting to do it:</p>
<p>1) BlueStore doesn't honor the max_deferred_txc parameter: it starts to flush operations as soon as bluestore_deferred_batch_ops operations have accumulated. This is the biggest problem, because you either wait for 10000 operations to accumulate and then flush them all at once, which makes the OSD hang for 30 seconds, or you flush when you have 32-64 operations, which makes latency inconsistent like in FileStore (700-0-700-0-700-0 iops) but does not provide any sort of buffer.</p>
<p>2) No kind of background flush is implemented in BlueStore, so once the deferred queue fills up, nothing is flushed until the OSD is restarted.</p>
<p>3) Deferred writes live in RocksDB, so if you have a lot of them they'll migrate to the next levels. I don't know how that will affect performance, though; it may be fine.</p>
<p><strong>Neha Ojha</strong> (2019-03-28)</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Need More Info</i></li></ul>
<p><strong>Mark Korenberg</strong> (2019-04-07)</p>
<ul><li><strong>File</strong> <a href="/attachments/download/4066/results.tar.xz">results.tar.xz</a> added</li></ul><p>Igor, I have done what you asked. When inspecting the OSD log, take the mtime of the attached JSON files into account.</p>
<p><strong>Mark Korenberg</strong> (2019-04-07)</p>
<p><code>ceph config dump</code> fragment:</p>
<pre>
osd.11 advanced bluestore_max_deferred_txc 1000000
osd.11 advanced bluestore_throttle_deferred_bytes 13421772800
</pre>
<p><strong>Mark Korenberg</strong> (2019-04-07)</p>
<p>Compare:<br />before1.json with after1.json (debug turned off)<br />before2.json with after2.json (debug turned on)</p>
<p><strong>Igor Fedotov</strong> (2019-04-09)</p>
<p>@Mark, thanks for the update.<br />From your perf counter dumps (after?.json) one can see the following small (~4K) write statistics:</p>
<pre><code>"bluestore_write_small": 13945,           - total number of 'small write' requests
"bluestore_write_small_bytes": 52751806,
"bluestore_write_small_unused": 992,      - write requests that hit an unused block in an existing extent
"bluestore_write_small_deferred": 3469,   - write requests that were deferred
"bluestore_write_small_new": 9484,        - write requests that were immediately written to a new location</code></pre>
<p>So just 3469 of the 13945 small writes were deferred. The rest were written to the HDD immediately.</p>
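<p>As an illustration, the deferred fraction follows directly from those counters (the dictionary below simply restates the perf dump values quoted above):</p>

```python
# Relevant fragment of the perf counter dump quoted above.
counters = {
    "bluestore_write_small": 13945,
    "bluestore_write_small_unused": 992,
    "bluestore_write_small_deferred": 3469,
    "bluestore_write_small_new": 9484,
}

deferred = counters["bluestore_write_small_deferred"]
total = counters["bluestore_write_small"]
print(f"deferred: {deferred}/{total} = {deferred / total:.1%}")  # about a quarter
```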
<p>That's exactly how deferred writing is supposed to work in BlueStore: <strong>unaligned overwrites</strong> (including partial overwrites) are the primary targets for deferred writing, where "unaligned" means not aligned to the disk block size (4K).</p>
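<p>That alignment criterion can be sketched as follows (illustrative only, not the actual BlueStore code; a 4K block size is assumed):</p>

```python
BLOCK_SIZE = 4096  # disk block size assumed above

def is_unaligned(offset: int, length: int, block_size: int = BLOCK_SIZE) -> bool:
    """True if the write does not both start and end on block boundaries."""
    return offset % block_size != 0 or length % block_size != 0

# An unaligned (partial) overwrite is a deferred-write candidate:
print(is_unaligned(512, 4096))   # True: offset is mid-block
print(is_unaligned(4096, 4096))  # False: fully block-aligned
```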
<p>As a general comment, I doubt it's a good idea to use this mechanism as a caching means. IMO that should be achieved by OS or HW means instead. Maybe try dm-cache or lvmcache?</p>
<p><strong>Mark Korenberg</strong> (2019-04-11)</p>
<p>Igor, what about RBD? Those writes are always aligned to 4K, so, as far as I understand, they will never be deferred. At the very least, it should be documented which types of write requests can be deferred.</p>
<p>Regarding lvmcache/dm-cache/bcache: yes, it is possible, but it makes OSD setup more complex. It would be nice if the OSD could also implement this type of caching itself. Also, scrubbing, and especially deep scrubbing, will evict hot data from such caches and pull cold data into its place.</p>
<p><strong>Vitaliy Filippov</strong> (2019-04-11)</p>
<p>I think we are beginning to discuss something different from the original question, but ...</p>
<p>I checked the code and yes, it seems "write_small_new" is immediately written to a new location. If that's always the case, and that code path isn't intercepted by any earlier "chunk-aligned deferred overwrite" path or similar, it's a problem. Does that mean that all small writes into unallocated space are written directly?</p>
<p>What's the point of implementing it like that?</p>
<p><strong>Sage Weil</strong> (2019-04-25)</p>
<p>Igor Fedotov wrote:</p>
<blockquote>
<p>@Mark, thanks for the update.<br />From your perf counter dumps (after?.json) one can see the following small (~4K) write statistics:</p>
<p>"bluestore_write_small": 13945, - total amount of 'small write' requests<br />"bluestore_write_small_bytes": 52751806,<br />"bluestore_write_small_unused": 992, - amount of write requests that hit unused block in an existing extent<br />"bluestore_write_small_deferred": 3469, - amount of write requests that were deferred<br />"bluestore_write_small_new": 9484, - amount of write requests that were immediately written to new location</p>
</blockquote>
<p>This metric is misleading. It indicates we are writing into a new blob... but that new write is <strong>still</strong> deferred if it is smaller than the deferred ratio. I'll open a PR to fix that. <a class="external" href="http://tracker.ceph.com/issues/38816">http://tracker.ceph.com/issues/38816</a></p>
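<p>So the effective decision is roughly the following (a simplified sketch of the idea, not the actual BlueStore code; the threshold corresponds to the bluestore_prefer_deferred_size option, and 32768 is only an example value):</p>

```python
def is_deferred(write_len: int, prefer_deferred_size: int) -> bool:
    """Simplified: a write is deferred when it is smaller than the
    threshold, even when it lands in a newly allocated blob."""
    return write_len < prefer_deferred_size

# Example: with a 32 KiB threshold, a 4K write to a new blob is still deferred.
print(is_deferred(4096, 32768))   # True
print(is_deferred(65536, 32768))  # False
```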
<p><strong>Sage Weil</strong> (2019-04-25)</p>
<p>Vitaliy Filippov wrote:</p>
<blockquote>
<p>1) Bluestore doesn't honor the max_deferred_txc parameter. It starts to flush operations as soon as there are bluestore_deferred_batch_ops operations available. This is the biggest problem because you either wait for 10000 operations to accumulate and then flush them all at once which makes OSD just hang for 30 seconds, or you flush when you have 32-64 operations which just makes latency inconsistent like in filestore (700-0-700-0-700-0 iops), but does not provide any sort of a buffer.</p>
</blockquote>
<p>IIUC, the suggestion here is that we should defer more IO for longer, and when we do eventually perform the IO, only queue part of what is deferred instead of <strong>everything</strong>, so that we can defer for a longer period without causing bigger spikes?</p>
<p><strong>Konstantin Shalygin</strong> (2019-04-26)</p>
<p>master: <a class="external" href="https://github.com/ceph/ceph/pull/27789">https://github.com/ceph/ceph/pull/27789</a><br />nautilus: <a class="external" href="https://github.com/ceph/ceph/pull/27819">https://github.com/ceph/ceph/pull/27819</a> merged</p>
<p><strong>Nathan Cutler</strong> (2019-05-20)</p>
<ul><li><strong>Status</strong> changed from <i>Need More Info</i> to <i>In Progress</i></li></ul>
<p><strong>Vitaliy Filippov</strong> (2019-05-20)</p>
<p>Sage Weil wrote:</p>
<blockquote>
<p>Vitaliy Filippov wrote:</p>
<blockquote>
<p>1) Bluestore doesn't honor the max_deferred_txc parameter. It starts to flush operations as soon as there are bluestore_deferred_batch_ops operations available. This is the biggest problem because you either wait for 10000 operations to accumulate and then flush them all at once which makes OSD just hang for 30 seconds, or you flush when you have 32-64 operations which just makes latency inconsistent like in filestore (700-0-700-0-700-0 iops), but does not provide any sort of a buffer.</p>
</blockquote>
<p>IIUC, the suggestion here is that we should defer more IO for longer, and when we do eventually perform the IO, instead of queueing <strong>everything</strong> that is deferred, only queue part of it? that way we can defer for a longer period without causing bigger spikes?</p>
</blockquote>
<p>Yes, I think so, and if in addition to that it has some kind of a background flush thread it will eventually clear the queue on idle. It can become a new Bluestore feature :) even though the current deferred write mechanism is also not useless. It's really optimal for small journaled writes to HDDs, it just doesn't provide buffering.</p> bluestore - Feature #38816: Deferred writes do not work for random writeshttps://tracker.ceph.com/issues/38816?journal_id=1375422019-05-30T14:21:13ZJosh Durgin
<p><strong>Josh Durgin</strong> (2019-05-30)</p>
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Feature</i></li></ul>