Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=13702010-11-01T12:08:35ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>I forgot to mention a bit about the setup.</p>
<p>4 x OSD all with journals on separate drives. Each OSD is on a separate system.<br />Btrfs used as the filesystem for all the OSDs.<br />Each OSD is a backup MDS.<br />There is a dedicated MDS server.<br />3 x Dedicated Monitors.</p>
<p>4 x Client machines all connected over librados using 16MB object block sizes.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=13782010-11-01T22:22:31ZSage Weilsage@newdream.net
<ul><li><strong>Assignee</strong> set to <i>Sage Weil</i></li><li><strong>Priority</strong> changed from <i>High</i> to <i>Urgent</i></li><li><strong>Target version</strong> set to <i>v0.22.3</i></li></ul><p>It's expected that you'll get extremely slow performance without the journal.</p>
<p>I'll work on replicating this in our environment tomorrow. Thanks!</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=13802010-11-01T22:29:53ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Yeah,</p>
<p>I figured not running with journals wouldn't work right. As long as the block size of the writes is very large it performs fine, but it doesn't like filesystem operations.</p>
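<p>For reference, the block-size sensitivity is easy to demonstrate locally with dd; the path and sizes below are placeholders, not from this cluster. The same payload written with small blocks incurs many more write() calls and, on a networked filesystem, many more round trips.</p>
<pre><code># Same 16 MiB payload, two block sizes; conv=fdatasync makes dd wait
# for the data to reach stable storage before reporting a speed.
dd if=/dev/zero of=/tmp/bstest bs=4k count=4096 conv=fdatasync 2>&1 | tail -n 1
dd if=/dev/zero of=/tmp/bstest bs=1M count=16 conv=fdatasync 2>&1 | tail -n 1
rm -f /tmp/bstest</code></pre>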
<p>If you would like access to our cluster just let me know. I would be happy to help you out. We are still just rearranging and testing.</p>
<p>I forgot to note we are using the latest released debs from the website and have also run from source in the past.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14052010-11-03T21:47:02ZSage Weilsage@newdream.net
<ul><li><strong>Assignee</strong> changed from <i>Sage Weil</i> to <i>Greg Farnum</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14162010-11-04T10:25:19ZGreg Farnumgfarnum@redhat.com
<ul></ul><p>Bryan, can you give us a few more details about your cluster and the performance drop you're seeing here? Specific questions I have:<br />1) What's your baseline performance? You said it's 300Mb/s (37.5MB/s) -- did you mean 300MB/s? If that's the case you must not be running over GigE?<br />2) What does "The write stalls when it hits the limit of the journals and hangs for up to 15 minutes while the journals slowly commit to disk. Write speed drops to below 40MB/s." mean? Do you mean no progress is made for 15 minutes and average write speeds over the whole time drop below 40MB/s from a significantly higher starting point? Or is there just a sudden precipitous drop in dynamic write speed?<br />3) Are you sure your sustainable performance is better than 40MB/s? Depending on your cluster size, disk speed, and replication level that might be as good as it gets, except<br />4) You're indicating that running with the journal off gives you higher performance in large writes than with it on? What test is demonstrating this to you?<br />I ask because we do have a journal throttler that slows down the speed of journal commits when it fills up and outpaces the data drive, and there could be issues with this but we don't have any SSDs available for testing right now so reproducing it locally is going to be difficult. Feel free to email me (<a class="email" href="mailto:gregf@hq.newdream.net">gregf@hq.newdream.net</a>) if you prefer. :)</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14292010-11-05T10:49:30ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Hello,</p>
<p>1) Correct we are running transparent 10GbE</p>
<p>2) From what I can tell, monitoring dstat across the cluster and watching the data transfer, the write basically hangs completely once the journal fills up. The 40MB/s was the diluted average.</p>
<p>3) Our theoretical write speed per storage brick is ~500MB/s writing straight to disk. There are two other bricks in the cluster that can sustain ~900MB/s. Combined with 10GbE, we can sustain writes over iSCSI of 500MB/s.</p>
<p>4) When writing a very large file I used the same test for both.<br />a - with journal) dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=50k - ~40-60MB/s<br />b - without journal) dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=50k - ~275MB/s</p>
<p>I have been through a hundred different configurations on the cluster and I have also done hardware benchmarking from the ground up to rule out bottlenecks. At this point I am sure I have narrowed it down.</p>
<p>I will shoot you an email.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14702010-11-08T13:24:20ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Assignee</strong> changed from <i>Greg Farnum</i> to <i>Sage Weil</i></li></ul><p>We've looked at this a bit more but decided today that Sage is taking it over since he's a lot more familiar with the MDS journaling!</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14752010-11-08T15:32:50ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Target version</strong> changed from <i>v0.22.3</i> to <i>v0.23</i></li><li><strong>Estimated time</strong> set to <i>4.00 h</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14902010-11-08T16:09:24ZSage Weilsage@newdream.net
<ul><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>Immediate</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=14982010-11-08T16:09:48ZSage Weilsage@newdream.net
<ul><li><strong>Priority</strong> changed from <i>Immediate</i> to <i>Normal</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=15042010-11-08T16:10:02ZSage Weilsage@newdream.net
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Immediate</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=15192010-11-08T16:40:00ZSage Weilsage@newdream.net
<ul><li><strong>Source</strong> set to <i>2</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16112010-11-10T07:24:50ZSage Weilsage@newdream.net
<ul></ul><p>What would be helpful in diagnosing this problem is:</p>
<p>- turn up osd logging, in the [osd] section:</p>
<pre><code>debug osd = 20
debug filestore = 20
debug journal = 20
debug ms = 1</code></pre>
<p>- reproduce the hang. Take note of the time that IO starts, when things appear to hang, and when things recover, so that I can correlate the observed behavior with the logs.</p>
<p>- bundle up the OSD logs from all the OSD machines.</p>
<p>Also, I'm curious if this is something you can reproduce using 'rados -p casdata bench 600 write'. If so, include the output from that as well?</p>
<p>Thanks!</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16272010-11-11T14:26:49ZSage Weilsage@newdream.net
<ul><li><strong>Target version</strong> changed from <i>v0.23</i> to <i>v0.23.1</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16372010-11-11T17:47:35ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Sorry, I haven't been able to get the debug output yet. We have spent the last few days working with our production systems. I should be able to get these posted tomorrow.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16782010-11-17T12:56:37ZSage Weilsage@newdream.net
<ul></ul><p>Just pinging you on this one. If you can send the logs I'd like to sort this out. Thanks!</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16962010-11-19T14:37:28ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>I am working on getting the output now. We are having to work on several projects at once right now. Sorry for the delay.</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16972010-11-19T14:49:46ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Okay,</p>
<p>I just started the deployment of 12 VMs on a new cephfs with 3 OSDs and SSDs for journals on all the systems.</p>
<p>I also wanted to note that the SSDs we are using are enterprise grade, so they have a write speed of about 220MB/s.</p>
<p>Once these VM deployments complete I will move on to writing a very large file from the client.</p>
<p>I will keep you updated.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=16982010-11-19T15:52:26ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Okay,</p>
<p>More updates.</p>
<p>1) All the VMs deployed okay but it looks like towards the end of the deployments I hit the journal limit. Its just real hard to tell when there is so much random IO going on.</p>
<p>2) I have shut down all the VMs and am waiting for the cluster to settle.</p>
<p>3) Starting a 400GB write to the ceph client using the following.<br />dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=400k</p>
<p>4) Monitoring the data throughput through dstat I see steady streaming.</p>
<p>5) At<br />root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:07:00 CST 2010</p>
<p>cephdisk03 had a 15 second glitch in writes</p>
<p>5) As expected at near<br />root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:09:56 CST 2010</p>
<p>Everything locked up and the write came to a halt.</p>
<pre><code>2010-11-19 17:10:26.812088 pg v475: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:33.129148 pg v476: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:38.372873 pg v477: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:43.240149 pg v478: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail</code></pre>
<p>If the data from the VMs is figured in, this is very near the size of the journals, which is 3 x 32GB, or 96GB of journal space.</p>
<p>After about 45 seconds the write started streaming again.</p>
<p>Another system flutter here</p>
<p>root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:29:38 CST 2010</p>
<p>Another here</p>
<p>root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:34:12 CST 2010</p>
<p>Intermittent</p>
<p>root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:35:35 CST 2010</p>
<p>More intermittence</p>
<p>root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:37:13 CST 2010</p>
<p>At this point it's almost completely halted.</p>
<p>root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:37:28 CST 2010</p>
<p>6) The average write speed judging by dstat was between 2.5 and 4 Gbps or 250 - 400MB/s</p>
<p>7) Here is the result of the dd</p>
<pre><code>root@cephnode01:/mnt# dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=400k
409600+0 records in
409600+0 records out
429496729600 bytes (429 GB) copied, 2309.2 s, 186 MB/s</code></pre>
<p>8) Here is a smaller write</p>
<pre><code>root@cephnode01:/mnt# dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 62.5441 s, 172 MB/s</code></pre>
<p>It seems like, before the write returns as complete, there is a period where no data is actually being written. I am not sure how that works.</p>
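<p>For what it's worth, that gap is consistent with client-side caching: a plain dd can return once the kernel has the data in its page cache rather than once it reaches the OSDs, so the final flush lands after the last write call. A quick local illustration (the path and sizes are placeholders, not from this cluster):</p>
<pre><code># Without an explicit flush, dd may only report the time needed to hand
# the data to the page cache, overstating the real throughput.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 2>&1 | tail -n 1

# conv=fdatasync forces the data to stable storage before dd exits, so
# the reported time includes the final flush.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1

rm -f /tmp/ddtest</code></pre>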
<p>All the logs are on the cluster and all the needed logging is turned on.</p>
<p>Running the rados bench command:</p>
<p>root@cephnode02:~/cephutils# date<br />Fri Nov 19 17:44:11 CST 2010</p>
<p>Before this the command was not writing at all and the system was hung completely.</p>
<pre><code>91 16 33 17 0.746616 0 - 79.8428<br /> 92 16 33 17 0.7385 0 - 79.8428<br /> 93 16 33 17 0.73056 0 - 79.8428<br /> 94 16 33 17 0.722787 0 - 79.8428<br /> 95 16 36 20 0.841388 1 18.6564 70.6427<br /> 96 16 50 34 1.41546 56 6.88076 50.6385<br /> 97 16 88 72 2.96654 152 2.41285 25.1971<br /> 98 16 151 135 5.50548 252 3.12239 14.6442<br /> 99 16 207 191 7.71049 224 2.51849 11.1606<br />min lat: 1.19794 max lat: 84.9395 avg lat: 9.25169<br /> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat<br /> 100 16 264 248 9.91131 228 3.18548 9.25169<br /> 101 16 319 303 11.9895 220 2.47626 8.12353<br /> 102 16 376 360 14.1052 228 2.50121 7.28927<br /> 103 16 435 419 16.2574 236 3.01149 6.6238<br /> 104 16 487 471 18.0993 208 4.31959 6.18961<br /> 105 16 543 527 20.0582 224 2.60889 5.88472<br /> 106 16 600 584 22.018 228 2.73522 5.57708<br /> 107 16 663 647 24.1651 252 2.79579 5.28485<br /> 108 16 723 707 26.1615 240 2.09517 5.07365<br /> 109 16 784 768 28.1577 244 2.50826 4.86907<br /> 110 16 839 823 29.8998 220 2.31899 4.74817<br /> 111 16 868 852 30.6743 116 6.17081 4.70212<br /> 112 16 899 883 31.5065 124 5.01973 4.73841<br /> 113 16 947 931 32.9251 192 2.21323 4.69338<br /> 114 16 1011 995 34.8797 256 2.81231 4.56228<br /> 115 16 1069 1053 36.5918 232 1.98472 4.46632<br /> 116 16 1129 1113 38.3432 240 2.91814 4.36779<br /> 117 16 1190 1174 40.0988 244 2.5139 4.28434<br /> 118 16 1229 1213 41.0797 156 3.73419 4.27287<br /> 119 16 1286 1270 42.6484 228 2.05218 4.20385</code></pre>
<p>Then out of nowhere it picked up and started writing.</p>
<pre><code>min lat: 0.88051 max lat: 84.9395 avg lat: 3.3814
 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
 160 16 3533 3517 87.8297 220 2.98483 3.3814
 161 16 3589 3573 88.674 224 2.59933 3.37463
 162 15 3644 3629 89.5076 224 3.70755 3.36497
 163 16 3695 3679 90.184 200 4.06077 3.36353
 164 16 3737 3721 90.6571 168 3.22226 3.36508
 165 16 3791 3775 91.415 216 3.11739 3.36397
 166 16 3850 3834 92.284 236 2.37901 3.35491
 167 16 3903 3887 92.9994 212 3.30706 3.34597
min lat: 0.85431 max lat: 84.9395 avg lat: 3.19795
 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
 600 15 24906 24891 165.765 196 1.82349 3.19795
Total time run: 600.184263
Total writes made: 24906
Write size: 4194304
Bandwidth (MB/sec): 165.532
Average Latency: 3.1976
Max latency: 84.9395
Min latency: 0.85431</code></pre>
<p>It seems pretty clear that the journal was flushing during the beginning of this command, and once it finished flushing the write started.</p>
<p>I have emailed you with a download link for the logs.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=17022010-11-20T17:06:52ZSage Weilsage@newdream.net
<ul></ul><p>Please try out the patches in the filestore_throttle branch, commit:b28c0bf82ac28ded4fe85573d32fdc111c66e50b</p>
<p>It looks to me like this is triggering due to the sync in the underlying filesystem (btrfs in this case). When a btrfs commit happens, writes block temporarily while a bunch of data is flushed out to disk. The osd is throttling incoming writes based on the amount of data queued for the fs, and the default (100MB) isn't enough to absorb the amount of time the commit is taking (~7 seconds in the case I checked). The tunables are filestore_queue_max_ops and _bytes. Bumping these way way up might have worked in your case since writes appear to be limited by the journal.</p>
<p>I also added a filestore_queue_commiting_max_ops/bytes so that an alternate (and higher) limit will kick in while the underlying fs is doing a commit to absorb the extra work that gets queued up. Please try that tunable, and turn it way up (maybe avg throughput * 10 seconds?) and see if that helps.</p>
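<p>For illustration only, a bumped-up [osd] section might look something like the following in ceph.conf. The numbers below are guesses sized for roughly 250MB/s of sustained throughput, not values recommended in this ticket, and the option spelling follows the comment above.</p>
<pre><code>[osd]
    ; default allows ~100MB queued for the fs; values below are illustrative
    filestore queue max ops = 5000
    filestore queue max bytes = 1073741824            ; 1 GB
    ; higher limits that apply while the underlying fs commit is in progress
    ; (roughly avg throughput * 10 seconds, per the suggestion above)
    filestore queue commiting max ops = 10000
    filestore queue commiting max bytes = 2684354560  ; 2.5 GB</code></pre>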
<p>I also added some instrumentation so we can watch the amount of data flowing through and the size of the various journal and fs queues involved (/var/log/ceph/stat/osd.#.fs.log).</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=17382010-11-23T09:53:02ZSage Weilsage@newdream.net
<ul><li><strong>Target version</strong> changed from <i>v0.23.1</i> to <i>v0.24</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=17562010-11-23T15:00:31ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Awesome, thanks for the help. I will give these patches a shot towards the end of the week.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=19362010-12-07T22:01:13ZSage Weilsage@newdream.net
<ul><li><strong>Target version</strong> changed from <i>v0.24</i> to <i>v0.25</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=20042010-12-17T11:41:56ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Just wanted to make an update.</p>
<p>I just got a chance to get back on the cluster today.</p>
<p>I have upgraded the cluster to the latest stable version of ceph as well as the OS's the machines are running on.</p>
<p>I am going to attempt to regenerate the issue and if it still exists I will test out the patches above.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=20052010-12-17T12:27:32ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>Okay,</p>
<p>I got everything updated, and it appears the cluster hung right at about 180GB of data written, which was double the journal size. So I would assume the issue is still present.</p>
<p>I will begin trying out the patches.</p>
<p>Thanks</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=20072010-12-17T16:43:00ZBryan Tongbryan.tong@gigenet.com
<ul></ul><p>This is strange.</p>
<p>I tried a very large write.</p>
<p>root@cephnode01:/mnt# dd if=/dev/zero of=cephtest01 bs=1M count=1M</p>
<pre><code>2010-12-17 16:32:51.135613 pg v4253: 1056 pgs: 267 active, 513 active+clean, 265 peering, 11 active+degraded; 413 GB data, 565 GB used, 27364 GB / 27929 GB avail; 70932/402328 degraded (17.630%)</code></pre>
<p>And finally</p>
<pre><code>2010-12-17 17:00:10.305307 pg v4328: 1056 pgs: 788 active+clean+degraded, 268 degraded+peering; 426 GB data, 290 GB used, 9950 GB / 10240 GB avail; 206016/412032 degraded (50.000%)
2010-12-17 17:22:09.173125 mds e26: 1/1/1 up {0=up:replay}
2010-12-17 17:47:19.317031 mds e27: 1/1/1 up {0=up:replay(laggy or crashed)}</code></pre>
<p>It appears during the write the cluster crashed completely.</p> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=25562011-03-01T16:37:32ZSage Weilsage@newdream.net
<ul><li><strong>Target version</strong> changed from <i>v0.25</i> to <i>v0.25.1</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=27012011-03-11T13:10:45ZSage Weilsage@newdream.net
<ul><li><strong>Target version</strong> changed from <i>v0.25.1</i> to <i>v0.25.2</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=27532011-03-13T14:40:43ZSage Weilsage@newdream.net
<ul><li><strong>Position</strong> deleted (<del><i>351</i></del>)</li><li><strong>Position</strong> set to <i>1</i></li><li><strong>Position</strong> changed from <i>1</i> to <i>545</i></li></ul> Ceph - Bug #531: Journaling Causes System Hanghttps://tracker.ceph.com/issues/531?journal_id=28012011-03-15T12:41:42ZSage Weilsage@newdream.net
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li></ul><p><a class="changeset" title="filestore: instrument filestore, journal throughput and throttling Signed-off-by: Sage Weil <sag..." href="https://tracker.ceph.com/projects/ceph/repository/revisions/24342a71919e50e81a8c05682a5687b2e54d649b">24342a71919e50e81a8c05682a5687b2e54d649b</a></p>