Bug #531: Journaling Causes System Hang - Ceph - Ceph

Actions

Copy link

Bug #531

closed

Journaling Causes System Hang

Added by Bryan Tong over 13 years ago. Updated about 13 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

Sage Weil

Category:

OSD

Target version:

v0.25.2

% Done:

Spent time:

2:00 h

Source:

Tags:

Backport:

Regression:

Severity:

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Hello,

It seems that when doing a large write once the journal fills up the system goes into a state of lock and has trouble deciding how to proceed.

Steps to reproduce

1) Create a Cluster using separate drives as the Journal. In our case 32GB SSD x 4.
2) Write a file from the client large enough to fill the journals. 60GB will do since replication will add another 60GB to the write.

Expected results

1) The write will continue to write through until completed at the speed of a smaller write. In our case 300Mb/s+

Actual results

1) The write stalls when it hits the limit of the journals and hangs for up to 15 minutes while the journals slowly commit to disk. Write speed drops to below 40MB/s.

Workaround

1) Running without journals causes the problem to go away.

Caveats

1) When running without journals smaller block sizes will not write correctly. For example, creating a filesystem on an image file takes hours for a simple 40GB file.

Theoretical Solution

The journaler needs to start committing before it fills up and should throttle the incoming data before the system locks up completely.

Thanks

Actions

Copy link

Updated by Bryan Tong over 13 years ago

I forgot a bit about the setup.

4 x OSD all with journals on separate drives. Each OSD is on a separate system.
Btrfs used as the filesystem for all the OSDs.
Each OSD is a backup MDS.
There is a dedicated MDS server.
3 x Dedicated Monitors.

4 x Client machines all connected over librados using 16MB object block sizes.

Thanks

Actions

Copy link

Updated by Sage Weil over 13 years ago

Assignee set to Sage Weil
Priority changed from High to Urgent
Target version set to v0.22.3

It's expected that you'll get extremely slow performance without the journal.

I'll work on replicating this in our environment tomorrow. Thanks!

Actions

Copy link

Updated by Bryan Tong over 13 years ago

Yeah,

I figured not running with journals wouldn't work right. As long as the block size of the writes is very large it performs fine but file-system operations it doesn't like.

If you would like access to our cluster just let me know. I would be happy to help you out. We are still just rearranging and testing.

I forgot to note we are using the latest released debs from the website and have also run from sources in the passed.

Thanks

Actions

Copy link

Updated by Sage Weil over 13 years ago

Assignee changed from Sage Weil to Greg Farnum

Actions

Copy link

Updated by Greg Farnum over 13 years ago

Brian, can you give us a few more details about your cluster and the performance drop you're seeing here? Specific questions I have:
1) What's your baseline performance? You said it's 300Mb/s (37.5MB/s) -- did you mean 300MB/s? If that's the case you must not be running over GigE?
2) What does "The write stalls when it hits the limit of the journals and hangs for up to 15 minutes while the journals slowly commit to disk. Write speed drops to below 40MB/s." mean? Do you mean no progress is made for 15 minutes and average write speeds over the whole time drop below 40MB/s from a significantly higher starting point? Or is there just a sudden precipitous drop in dynamic write speed?
3) Are you sure your sustainable performance is better than 40MB/s? Depending on your cluster size, disk speed, and replication level that might be as good as it gets, except
4) You're indicating that running with the journal off gives you higher performance in large writes than with it on? What test is demonstrating this to you?
I ask because we do have a journal throttler that slows down the speed of journal commits when it fills up and outpaces the data drive, and there could be issues with this but we don't have any SSDs available for testing right now so reproducing it locally is going to be difficult. Feel free to email me (gregf@hq.newdream.net) if you prefer. :)

Actions

Copy link

Updated by Bryan Tong over 13 years ago

Hello,

1) Correct we are running transparent 10GbE

2) From what I can tell monitoring dstat across the cluster and watching the data transfer that the write basically hangs completely once the journal fills up. The 40MB/s was the diluted average.

3) Our theoretical write speed per storage brick is ~500MB/s writing straight to disk. There are two other bricks in the cluster than can sustain ~900MB/s. That combined with 10GbE we can sustain writes over iSCSI of 500MB/s sustained.

4) When writing a very large file I used the same test for both.
a - with journal) dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=50k - ~40-60MB/s
b - without journal) dd if=/dev/zero of=/mnt/ceph/ddtest bs=1M count=50k - ~275MB/s

I have been through a hundred different configurations on the cluster and I have also done hardware benchmarking from the ground up to rule out bottlenecks. At this point I am sure I have narrowed it down.

I will shoot you an email.

Thanks

Actions

Copy link

Updated by Greg Farnum over 13 years ago

Assignee changed from Greg Farnum to Sage Weil

We've looked at this a bit more but decided today that Sage is taking it over since he's a lot more familiar with the MDS journaling!

Actions

Copy link

Updated by Greg Farnum over 13 years ago

Target version changed from v0.22.3 to v0.23
Estimated time set to 4:00 h

Actions

Copy link

Updated by Sage Weil over 13 years ago

Priority changed from Urgent to Immediate

Actions

Copy link

#10

Updated by Sage Weil over 13 years ago

Priority changed from Immediate to Normal

Actions

Copy link

#11

Updated by Sage Weil over 13 years ago

Priority changed from Normal to Immediate

Actions

Copy link

#12

Updated by Sage Weil over 13 years ago

Source set to 2

Actions

Copy link

#13

Updated by Sage Weil over 13 years ago

What would be helpful in diagnosing this problem is:

- turn up osd logging, in [osd] section:
debug osd = 20
debug filestore = 20
debug journal = 20
debug ms = 1
- reproduce the hang. take note of the time that io starts, when things appear to hang, and when things recover, so that i can correlate the observed behavior with the logs
- bundle up the osd logs from all the osds machines.

Also, I'm curious if this is something you can reproduce using 'rados -p casdata bench 600 write'. If so, include the output from that as well?

Thanks!

Actions

Copy link

#14

Updated by Sage Weil over 13 years ago

Target version changed from v0.23 to v0.23.1

Actions

Copy link

#15

Updated by Bryan Tong over 13 years ago

Sorry I have been able to get the debug output yet. We have spent the last few days working with our production systems. I should be able to get these posted tomorrow.

Thanks

Actions

Copy link

#16

Updated by Sage Weil over 13 years ago

Just pinging you on this one. If you can send the logs I'd like to sort this out. Thanks!

Actions

Copy link

#17

Updated by Bryan Tong over 13 years ago

I am working on getting the output now. We are having to work on several projects at once right now. Sorry for the delay.

Actions

Copy link

#18

Updated by Bryan Tong over 13 years ago

Okay,

I just started the deployment of 12 vms on a new cephfs with 3 osds in and ssd's for journals on all the systems.

I also wanted to note that the SSDs we are using are enterprise grade so they have a write speed of about 220MB/s

Once these VM deployments complete I will move on to writing a very large file from the client.

I will keep you updated.

Thanks

Actions

Copy link

#19

Updated by Bryan Tong over 13 years ago

Okay,

More updates.

1) All the VMs deployed okay but it looks like towards the end of the deployments I hit the journal limit. Its just real hard to tell when there is so much random IO going on.

2) I have shutdown all the vms and waiting for the cluster to settle.

3) Starting a 400GB write to the ceph client using the following.
dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=400k

4) Monitoring the data throughput through dstat I see steady streaming.

5) At
root@cephnode02:~/cephutils# date
Fri Nov 19 17:07:00 CST 2010

cephdisk03 had a 15 second glitch in writes

5) As expected at near
root@cephnode02:~/cephutils# date
Fri Nov 19 17:09:56 CST 2010

Everything locked up and the write came to a halt.

2010-11-19 17:10:26.812088 pg v475: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:33.129148 pg v476: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:38.372873 pg v477: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail
2010-11-19 17:10:43.240149 pg v478: 1056 pgs: 1056 active+clean; 129 GB data, 256 GB used, 37914 GB / 38170 GB avail

If the data from the VMs is figured in this is very near the size of the journals which is 3 x 32GB or 96GB of journal space.

After about 45 seconds the write started streaming again.

Another system flutter here

root@cephnode02:~/cephutils# date
Fri Nov 19 17:29:38 CST 2010

Another here

root@cephnode02:~/cephutils# date
Fri Nov 19 17:34:12 CST 2010

Intermittent

root@cephnode02:~/cephutils# date
Fri Nov 19 17:35:35 CST 2010

More intermittence

root@cephnode02:~/cephutils# date
Fri Nov 19 17:37:13 CST 2010

At this point its almost completely halted

root@cephnode02:~/cephutils# date
Fri Nov 19 17:37:28 CST 2010

6) The average write speed judging by dstat was between 2.5 and 4 Gbps or 250 - 400MB/s

7) Here is the result of the dd

root@cephnode01:/mnt# dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=400k
409600+0 records in
409600+0 records out
429496729600 bytes (429 GB) copied, 2309.2 s, 186 MB/s

8) Here is a smaller write

root@cephnode01:/mnt# dd if=/dev/zero of=/mnt/cephtest01 bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 62.5441 s, 172 MB/s

It seems like before the write returns complete there is a time where no data is actually being written. I am not sure how that works.

All the logs are on the cluster and all the needed logging is turned on.

Doing the rbd command

root@cephnode02:~/cephutils# date
Fri Nov 19 17:44:11 CST 2010

Before this the command was not writing at all and the system was hung completely.

91      16        33        17  0.746616         0         -   79.8428
   92      16        33        17    0.7385         0         -   79.8428
   93      16        33        17   0.73056         0         -   79.8428
   94      16        33        17  0.722787         0         -   79.8428
   95      16        36        20  0.841388         1   18.6564   70.6427
   96      16        50        34   1.41546        56   6.88076   50.6385
   97      16        88        72   2.96654       152   2.41285   25.1971
   98      16       151       135   5.50548       252   3.12239   14.6442
   99      16       207       191   7.71049       224   2.51849   11.1606
min lat: 1.19794 max lat: 84.9395 avg lat: 9.25169
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  100      16       264       248   9.91131       228   3.18548   9.25169
  101      16       319       303   11.9895       220   2.47626   8.12353
  102      16       376       360   14.1052       228   2.50121   7.28927
  103      16       435       419   16.2574       236   3.01149    6.6238
  104      16       487       471   18.0993       208   4.31959   6.18961
  105      16       543       527   20.0582       224   2.60889   5.88472
  106      16       600       584    22.018       228   2.73522   5.57708
  107      16       663       647   24.1651       252   2.79579   5.28485
  108      16       723       707   26.1615       240   2.09517   5.07365
  109      16       784       768   28.1577       244   2.50826   4.86907
  110      16       839       823   29.8998       220   2.31899   4.74817
  111      16       868       852   30.6743       116   6.17081   4.70212
  112      16       899       883   31.5065       124   5.01973   4.73841
  113      16       947       931   32.9251       192   2.21323   4.69338
  114      16      1011       995   34.8797       256   2.81231   4.56228
  115      16      1069      1053   36.5918       232   1.98472   4.46632
  116      16      1129      1113   38.3432       240   2.91814   4.36779
  117      16      1190      1174   40.0988       244    2.5139   4.28434
  118      16      1229      1213   41.0797       156   3.73419   4.27287
  119      16      1286      1270   42.6484       228   2.05218   4.20385

Then out of no where it picked up and started writing.

min lat: 0.88051 max lat: 84.9395 avg lat: 3.3814
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
160 16 3533 3517 87.8297 220 2.98483 3.3814
161 16 3589 3573 88.674 224 2.59933 3.37463
162 15 3644 3629 89.5076 224 3.70755 3.36497
163 16 3695 3679 90.184 200 4.06077 3.36353
164 16 3737 3721 90.6571 168 3.22226 3.36508
165 16 3791 3775 91.415 216 3.11739 3.36397
166 16 3850 3834 92.284 236 2.37901 3.35491
167 16 3903 3887 92.9994 212 3.30706 3.34597
min lat: 0.85431 max lat: 84.9395 avg lat: 3.19795
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
600 15 24906 24891 165.765 196 1.82349 3.19795
Total time run: 600.184263
Total writes made: 24906
Write size: 4194304
Bandwidth (MB/sec): 165.532

Average Latency: 3.1976
Max latency: 84.9395
Min latency: 0.85431

It seems like its pretty clear that the journal was flushing during the beginning of this command and when it finished flushing the write started.

I have emailed you with a download link for the logs.

Thanks

Actions

Copy link

#20

Updated by Sage Weil over 13 years ago

Please try out the patches in the filestore_throttle branch, commit:b28c0bf82ac28ded4fe85573d32fdc111c66e50b

It looks ot me like this is triggering due to the sync in the underlying filesystem (btrfs in this case). When a btrfs commit happens, writes block
temporarily while a bunch of data is flushed out to disk. The osd is throttling incoming writes based on the amount of data queued for the fs, and the default (100MB) isn't enough to absorb the amount of time the commit is taking (~7 seconds in the case I checked). The tunable is filestore_queue_max_ops and _bytes. Bumping these way way up might have worked in your case since writes appear to be limited by the journal.

I also added a filestore_queue_commiting_max_ops/bytes so that an alternate (and higher) limit will kick in while the underlying fs is doing a commit to absorb the extra work that gets queued up. Please try that tunable, and turn it way up (maybe avg throughput * 10 seconds?) and see if that helps.

I also added some instrumentation so we can watch the amount of data flowing through and the size of the various journal and fs queues involved (/var/log/ceph/stat/osd.#.fs.log).

Actions

Copy link

#21

Updated by Sage Weil over 13 years ago

Target version changed from v0.23.1 to v0.24

Actions

Copy link

#22

Updated by Bryan Tong over 13 years ago

Awesome, thanks for the help. I will give these patches a shot towards the end of the week.

Thanks

Actions

Copy link

#23

Updated by Sage Weil over 13 years ago

Target version changed from v0.24 to v0.25

Actions

Copy link

#24

Updated by Bryan Tong over 13 years ago

Just wanted to make an update.

I just got a chance to get back on the cluster today.

I have upgraded the cluster to the latest stable version of ceph as well as the OS's the machines are running on.

I am going to attempt to regenerate the issue and if it still exists I will test out the patches above.

Thanks

Actions

Copy link

#25

Updated by Bryan Tong over 13 years ago

Okay,

I got everything updated and it appears the cluster hung right about 180GB of data written. Which was double the journal size. So I would assume the issue is still present.

I will begin trying out the patches.

Thanks

Actions

Copy link

#26

Updated by Bryan Tong over 13 years ago

This is strange.

I tried a very large write.

root@cephnode01:/mnt# dd if=/dev/zero of=cephtest01 bs=1M count=1M

2010-12-17 16:32:51.135613 pg v4253: 1056 pgs: 267 active, 513 active+clean, 265 peering, 11 active+degraded; 413 GB data, 565 GB used, 27364 GB / 27929 GB avail; 70932/402328 degraded (17.630%)

And finally

2010-12-17 17:00:10.305307 pg v4328: 1056 pgs: 788 active+clean+degraded, 268 degraded+peering; 426 GB data, 290 GB used, 9950 GB / 10240 GB avail; 206016/412032 degraded (50.000%)
2010-12-17 17:22:09.173125 mds e26: 1/1/1 up {0=up:replay}
2010-12-17 17:47:19.317031 mds e27: 1/1/1 up {0=up:replay(laggy or crashed)}

It appears during the write the cluster crashed completely.

Actions

Copy link

#27