Project

General

Profile

Bug #531

Journaling Causes System Hang

Added by Bryan Tong almost 9 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
OSD
Target version:
Start date:
11/01/2010
Due date:
% Done:

0%

Estimated time:
4.00 h
Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Hello,

It seems that when doing a large write once the journal fills up the system goes into a state of lock and has trouble deciding how to proceed.

Steps to reproduce

1) Create a Cluster using separate drives as the Journal. In our case 32GB SSD x 4.
2) Write a file from the client large enough to fill the journals. 60GB will do since replication will add another 60GB to the write.

Expected results

1) The write will continue to write through until completed at the speed of a smaller write. In our case 300Mb/s+

Actual results

1) The write stalls when it hits the limit of the journals and hangs for up to 15 minutes while the journals slowly commit to disk. Write speed drops to below 40MB/s.

Workaround

1) Running without journals causes the problem to go away.

Caveats

1) When running without journals smaller block sizes will not write correctly. For example, creating a filesystem on an image file takes hours for a simple 40GB file.

Theoretical Solution

The journaler needs to start committing before it fills up and should throttle the incoming data before the system locks up completely.

Thanks

History

#1 Updated by Bryan Tong almost 9 years ago

I forgot a bit about the setup.

4 x OSD all with journals on separate drives. Each OSD is on a separate system.
Btrfs used as the filesystem for all the OSDs.
Each OSD is a backup MDS.
There is a dedicated MDS server.
3 x Dedicated Monitors.

4 x Client machines all connected over librados using 16MB object block sizes.

Thanks

#2 Updated by Sage Weil almost 9 years ago

  • Assignee set to Sage Weil
  • Priority changed from High to Urgent
  • Target version set to v0.22.3

It's expected that you'll get extremely slow performance without the journal.

I'll work on replicating this in our environment tomorrow. Thanks!

#3 Updated by Bryan Tong almost 9 years ago

Yeah,

I figured not running with journals wouldn't work right. As long as the block size of the writes is very large it performs fine but file-system operations it doesn't like.

If you would like access to our cluster just let me know. I would be happy to help you out. We are still just rearranging and testing.

I forgot to note we are using the latest released debs from the website and have also run from sources in the passed.

Thanks

#4 Updated by Sage Weil almost 9 years ago

  • Assignee changed from Sage Weil to Greg Farnum

#5 Updated by Greg Farnum almost 9 years ago

Brian, can you give us a few more details about your cluster and the performance drop you're seeing here? Specific questions I have:
1) What's your baseline performance? You said it's 300Mb/s (37.5MB/s) -- did you mean 300MB/s? If that's the case you must not be running over GigE?
2) What does "The write stalls when it hits the limit of the journals and hangs for up to 15 minutes while the journals slowly commit to disk. Write speed drops to below 40MB/s." mean? Do you mean no progress is made for 15 minutes and average write speeds over the whole time drop below 40MB/s from a significantly higher starting point? Or is there just a sudden precipitous drop in dynamic write speed?
3) Are you sure your sustainable performance is better than 40MB/s? Depending on your cluster size, disk speed, and replication level that might be as good as it gets, except
4) You're indicating that running with the journal off gives you higher performance in large writes than with it on? What test is demonstrating this to you?
I ask because we do have a journal throttler that slows down the speed of journal commits when it fills up and outpaces the data drive, and there could be issues with this but we don't have any SSDs available for testing right now so reproducing it locally is going to be difficult. Feel free to email me () if you prefer. :)

#6 Updated by Bryan Tong almost 9 years ago

Hello,

1) Correct we are running transparent 10GbE

2) From what I can tell monitoring dstat across the cluster and watching the data transfer that the write basically hangs completely once the journal fills up. The 40MB/s was the diluted average.

3) Our theoretical write speed per storage brick is ~500MB/s writing straight to disk. There are two other bricks in the cluster than can sustain ~900MB/s. That combined with 10GbE we can sustain writes over iSCSI of 500MB/s sustained.

4) When writing a very large file I used the same test for both.
a - with journal) dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=50k - ~40-60MB/s
b - without journal) dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=50k - ~275MB/s

I have been through a hundred different configurations on the cluster and I have also done hardware benchmarking from the ground up to rule out bottlenecks. At this point I am sure I have narrowed it down.

I will shoot you an email.

Thanks

#7 Updated by Greg Farnum almost 9 years ago

  • Assignee changed from Greg Farnum to Sage Weil

We've looked at this a bit more but decided today that Sage is taking it over since he's a lot more familiar with the MDS journaling!

#8 Updated by Greg Farnum almost 9 years ago

  • Target version changed from v0.22.3 to v0.23
  • Estimated time set to 4.00 h

#9 Updated by Sage Weil almost 9 years ago

  • Priority changed from Urgent to Immediate

#10 Updated by Sage Weil almost 9 years ago

  • Priority changed from Immediate to Normal

#11 Updated by Sage Weil almost 9 years ago

  • Priority changed from Normal to Immediate

#12 Updated by Sage Weil almost 9 years ago

  • Source set to 2

#13 Updated by Sage Weil almost 9 years ago

What would be helpful in diagnosing this problem is:

- turn up osd logging, in [osd] section:
debug osd = 20
debug filestore = 20
debug journal = 20
debug ms = 1
- reproduce the hang. take note of the time that io starts, when things appear to hang, and when things recover, so that i can correlate the observed behavior with the logs
- bundle up the osd logs from all the osds machines.

Also, I'm curious if this is something you can reproduce using 'rados -p casdata bench 600 write'. If so, include the output from that as well?

Thanks!

#14 Updated by Sage Weil almost 9 years ago

  • Target version changed from v0.23 to v0.23.1

#15 Updated by Bryan Tong almost 9 years ago

Sorry I have been able to get the debug output yet. We have spent the last few days working with our production systems. I should be able to get these posted tomorrow.

Thanks

#16 Updated by Sage Weil almost 9 years ago

Just pinging you on this one. If you can send the logs I'd like to sort this out. Thanks!

#17 Updated by Bryan Tong almost 9 years ago

I am working on getting the output now. We are having to work on several projects at once right now. Sorry for the delay.

#18 Updated by Bryan Tong almost 9 years ago

Okay,

I just started the deployment of 12 vms on a new cephfs with 3 osds in and ssd's for journals on all the systems.

I also wanted to note that the SSDs we are using are enterprise grade so they have a write speed of about 220MB/s

Once these VM deployments complete I will move on to writing a very large file from the client.

I will keep you updated.

Thanks

#19 Updated by Bryan Tong almost 9 years ago

Okay,

More updates.

1) All the VMs deployed okay but it looks like towards the end of the deployments I hit the journal limit. Its just real hard to tell when there is so much random IO going on.

2) I have shutdown all the vms and waiting for the cluster to settle.

3) Starting a 400GB write to the ceph client using the following.
dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=400k

4) Monitoring the data throughput through dstat I see steady streaming.

5) At
root@cephnode02:~/cephutils# date
Fri Nov 19 17:07:00 CST 2010

cephdisk03 had a 15 second glitch in writes

5) As expected at near
root@cephnode02:~/cephutils# date
Fri Nov 19 17:09:56 CST 2010

Everything locked up and the write came to a halt.

2010-11-19 17:10:26.812088 pg v475: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:33.129148 pg v476: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:38.372873 pg v477: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:43.240149 pg v478: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail

If the data from the VMs is figured in this is very near the size of the journals which is 3 x 32GB or 96GB of journal space.

After about 45 seconds the write started streaming again.

Another system flutter here

root@cephnode02:~/cephutils# date
Fri Nov 19 17:29:38 CST 2010

Another here

root@cephnode02:~/cephutils# date
Fri Nov 19 17:34:12 CST 2010

Intermittent

root@cephnode02:~/cephutils# date
Fri Nov 19 17:35:35 CST 2010

More intermittence

root@cephnode02:~/cephutils# date
Fri Nov 19 17:37:13 CST 2010

At this point its almost completely halted

root@cephnode02:~/cephutils# date
Fri Nov 19 17:37:28 CST 2010

6) The average write speed judging by dstat was between 2.5 and 4 Gbps or 250 - 400MB/s

7) Here is the result of the dd

root@cephnode01:/mnt# dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=400k
409600+0 records in
409600+0 records out
429496729600 bytes (429 GB) copied, 2309.2 s, 186 MB/s

8) Here is a smaller write

root@cephnode01:/mnt# dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 62.5441 s, 172 MB/s

It seems like before the write returns complete there is a time where no data is actually being written. I am not sure how that works.

All the logs are on the cluster and all the needed logging is turned on.

Doing the rbd command

root@cephnode02:~/cephutils# date
Fri Nov 19 17:44:11 CST 2010

Before this the command was not writing at all and the system was hung completely.

91      16        33        17  0.746616         0         -   79.8428
92 16 33 17 0.7385 0 - 79.8428
93 16 33 17 0.73056 0 - 79.8428
94 16 33 17 0.722787 0 - 79.8428
95 16 36 20 0.841388 1 18.6564 70.6427
96 16 50 34 1.41546 56 6.88076 50.6385
97 16 88 72 2.96654 152 2.41285 25.1971
98 16 151 135 5.50548 252 3.12239 14.6442
99 16 207 191 7.71049 224 2.51849 11.1606
min lat: 1.19794 max lat: 84.9395 avg lat: 9.25169
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
100 16 264 248 9.91131 228 3.18548 9.25169
101 16 319 303 11.9895 220 2.47626 8.12353
102 16 376 360 14.1052 228 2.50121 7.28927
103 16 435 419 16.2574 236 3.01149 6.6238
104 16 487 471 18.0993 208 4.31959 6.18961
105 16 543 527 20.0582 224 2.60889 5.88472
106 16 600 584 22.018 228 2.73522 5.57708
107 16 663 647 24.1651 252 2.79579 5.28485
108 16 723 707 26.1615 240 2.09517 5.07365
109 16 784 768 28.1577 244 2.50826 4.86907
110 16 839 823 29.8998 220 2.31899 4.74817
111 16 868 852 30.6743 116 6.17081 4.70212
112 16 899 883 31.5065 124 5.01973 4.73841
113 16 947 931 32.9251 192 2.21323 4.69338
114 16 1011 995 34.8797 256 2.81231 4.56228
115 16 1069 1053 36.5918 232 1.98472 4.46632
116 16 1129 1113 38.3432 240 2.91814 4.36779
117 16 1190 1174 40.0988 244 2.5139 4.28434
118 16 1229 1213 41.0797 156 3.73419 4.27287
119 16 1286 1270 42.6484 228 2.05218 4.20385

Then out of no where it picked up and started writing.

min lat: 0.88051 max lat: 84.9395 avg lat: 3.3814
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
160 16 3533 3517 87.8297 220 2.98483 3.3814
161 16 3589 3573 88.674 224 2.59933 3.37463
162 15 3644 3629 89.5076 224 3.70755 3.36497
163 16 3695 3679 90.184 200 4.06077 3.36353
164 16 3737 3721 90.6571 168 3.22226 3.36508
165 16 3791 3775 91.415 216 3.11739 3.36397
166 16 3850 3834 92.284 236 2.37901 3.35491
167 16 3903 3887 92.9994 212 3.30706 3.34597
min lat: 0.85431 max lat: 84.9395 avg lat: 3.19795
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
600 15 24906 24891 165.765 196 1.82349 3.19795
Total time run: 600.184263
Total writes made: 24906
Write size: 4194304
Bandwidth (MB/sec): 165.532

Average Latency: 3.1976
Max latency: 84.9395
Min latency: 0.85431

It seems like its pretty clear that the journal was flushing during the beginning of this command and when it finished flushing the write started.

I have emailed you with a download link for the logs.

Thanks

#20 Updated by Sage Weil almost 9 years ago

Please try out the patches in the filestore_throttle branch, commit:b28c0bf82ac28ded4fe85573d32fdc111c66e50b

It looks ot me like this is triggering due to the sync in the underlying filesystem (btrfs in this case). When a btrfs commit happens, writes block
temporarily while a bunch of data is flushed out to disk. The osd is throttling incoming writes based on the amount of data queued for the fs, and the default (100MB) isn't enough to absorb the amount of time the commit is taking (~7 seconds in the case I checked). The tunable is filestore_queue_max_ops and _bytes. Bumping these way way up might have worked in your case since writes appear to be limited by the journal.

I also added a filestore_queue_commiting_max_ops/bytes so that an alternate (and higher) limit will kick in while the underlying fs is doing a commit to absorb the extra work that gets queued up. Please try that tunable, and turn it way up (maybe avg throughput * 10 seconds?) and see if that helps.

I also added some instrumentation so we can watch the amount of data flowing through and the size of the various journal and fs queues involved (/var/log/ceph/stat/osd.#.fs.log).

#21 Updated by Sage Weil almost 9 years ago

  • Target version changed from v0.23.1 to v0.24

#22 Updated by Bryan Tong almost 9 years ago

Awesome, thanks for the help. I will give these patches a shot towards the end of the week.

Thanks

#23 Updated by Sage Weil almost 9 years ago

  • Target version changed from v0.24 to v0.25

#24 Updated by Bryan Tong over 8 years ago

Just wanted to make an update.

I just got a chance to get back on the cluster today.

I have upgraded the cluster to the latest stable version of ceph as well as the OS's the machines are running on.

I am going to attempt to regenerate the issue and if it still exists I will test out the patches above.

Thanks

#25 Updated by Bryan Tong over 8 years ago

Okay,

I got everything updated and it appears the cluster hung right about 180GB of data written. Which was double the journal size. So I would assume the issue is still present.

I will begin trying out the patches.

Thanks

#26 Updated by Bryan Tong over 8 years ago

This is strange.

I tried a very large write.

root@cephnode01:/mnt# dd if=/dev/zero of=cephtest01 bs=1M count=1M

2010-12-17 16:32:51.135613 pg v4253: 1056 pgs: 267 active, 513 active+clean, 265 peering, 11 active+degraded; 413 GB data, 565 GB used, 27364 GB / 27929 GB avail; 70932/402328 degraded (17.630%)

And finally

2010-12-17 17:00:10.305307 pg v4328: 1056 pgs: 788 active+clean+degraded, 268 degraded+peering; 426 GB data, 290 GB used, 9950 GB / 10240 GB avail; 206016/412032 degraded (50.000%)
2010-12-17 17:22:09.173125 mds e26: 1/1/1 up {0=up:replay}
2010-12-17 17:47:19.317031 mds e27: 1/1/1 up {0=up:replay(laggy or crashed)}

It appears during the write the cluster crashed completely.

#27 Updated by Sage Weil over 8 years ago

  • Target version changed from v0.25 to v0.25.1

#28 Updated by Sage Weil over 8 years ago

  • Target version changed from v0.25.1 to v0.25.2

#29 Updated by Sage Weil over 8 years ago

  • translation missing: en.field_position deleted (351)
  • translation missing: en.field_position set to 1
  • translation missing: en.field_position changed from 1 to 545

#30 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved

Also available in: Atom PDF