Bug #538

Write performance does not scale over multiple computers

Added by Ed Burnette over 13 years ago. Updated over 12 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Description

I have Ceph 0.22.1 installed on a cluster of 208 lightly loaded 64-bit Linux nodes (RHEL 5.5, ext3). The configuration is pretty much out of the box (no changes to replication or CRUSH options). I've tested write performance with a couple of different benchmarks, and they indicate that total write throughput does not scale with the number of computers doing the writes. With 1 computer I can get about 63 MB/sec, but with 4 I get as little as 33 MB/sec per node, with 8 I get 10 MB/sec, with 16 I get 6 MB/sec, and so on up to 208 nodes, where I get under 1 MB/sec per node. CPU usage on the machine running cmon and cmds stays relatively low.

I tried it with high-level cfuse file system writes and with low-level rados writes, with similar results. For example, here are some results using the "rados bench" command on one machine and on multiple machines. (The mpirun program launches a program simultaneously on multiple computers; the number of copies is specified by the -np option.)

> rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    62.686
> mpirun -np 4 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    33.410
Bandwidth (MB/sec):    37.618
Bandwidth (MB/sec):    38.106
Bandwidth (MB/sec):    59.590
> mpirun -np 8 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    9.931
Bandwidth (MB/sec):    15.875
...
Bandwidth (MB/sec):    27.801
Bandwidth (MB/sec):    34.115
> mpirun -np 16 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    6.240
Bandwidth (MB/sec):    7.860
...
Bandwidth (MB/sec):    12.906
Bandwidth (MB/sec):    15.725
> mpirun -np 32 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    4.035
Bandwidth (MB/sec):    4.168
...
Bandwidth (MB/sec):    8.686
Bandwidth (MB/sec):    8.821
> mpirun -np 64 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    2.435
Bandwidth (MB/sec):    2.537
...
Bandwidth (MB/sec):    5.009
Bandwidth (MB/sec):    6.364
> mpirun -np 128 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    1.452
Bandwidth (MB/sec):    1.479
...
Bandwidth (MB/sec):    3.152
Bandwidth (MB/sec):    3.733
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    0.930
Bandwidth (MB/sec):    0.950
...
Bandwidth (MB/sec):    2.006
Bandwidth (MB/sec):    2.136

I would expect bandwidth to drop slightly going from the 1-computer to the 4-computer case, but then remain steady until it hit a network bottleneck. The disk drives on these machines give about 50 MB/sec of write throughput, and each node has a 1 Gbps Ethernet connection to a fast switch rated for up to 40 Gbps of bisection bandwidth.

I tried longer tests (same results) and I tried using -t 1 (worse results).

Actions #1

Updated by Sage Weil over 13 years ago

If benchpool is a new pool you created, the problem is likely that it is too small. By default, new pools have only 8 PGs (placement groups), which means you're only using 8 OSDs randomly chosen out of the cluster. You can confirm this by running 'ceph osd dump -o - | grep benchpool' and checking the pg_num value.

You can expand the number of placement groups using something like:

fatty:src 10:21 AM $ ./ceph osd pool set foo pg_num 100
2010-11-02 10:21:17.485039 mon <- [osd,pool,set,foo,pg_num,100]
2010-11-02 10:21:18.208009 mon0 -> 'set pool 4 pg_num to 100' (0)

...then wait for the cluster to settle (watch ceph -w or ceph -s and make sure all the creating PGs become active+clean), and then

fatty:src 10:21 AM $ ./ceph osd pool set foo pgp_num 100
2010-11-02 10:21:35.392986 mon <- [osd,pool,set,foo,pgp_num,100]
2010-11-02 10:21:35.906984 mon2 -> 'set pool 4 pgp_num to 100' (0)

to spread them around the cluster. The target PG count is probably something like 20-50 * the number of OSDs.
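
For this 208-OSD cluster, that rule of thumb works out to the following (my arithmetic, not part of the original comment):

 20 * 208 =  4,160 PGs (low end)
 50 * 208 = 10,400 PGs (high end)

which is in the same ballpark as the 10000 figure suggested in #4 below.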

Looks like this isn't in the wiki yet, adding a todo item!

Actions #2

Updated by Ed Burnette over 13 years ago

I set the target PG count to 400 and tried again. It helped some, up to 2x, but it's still slower than I expected:

> rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    73.729
> mpirun -np 4 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    55.654
Bandwidth (MB/sec):    60.220
Bandwidth (MB/sec):    65.504
Bandwidth (MB/sec):    72.513
> mpirun -np 8 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    34.325
Bandwidth (MB/sec):    39.030
...
Bandwidth (MB/sec):    57.714
Bandwidth (MB/sec):    59.773
> mpirun -np 16 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    10.609
Bandwidth (MB/sec):    11.276
...
Bandwidth (MB/sec):    43.041
Bandwidth (MB/sec):    48.061
> mpirun -np 32 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    8.833
Bandwidth (MB/sec):    9.179
...
Bandwidth (MB/sec):    26.500
Bandwidth (MB/sec):    27.031
> mpirun -np 64 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    4.221
Bandwidth (MB/sec):    4.266
...
Bandwidth (MB/sec):    16.718
Bandwidth (MB/sec):    16.935
> mpirun -np 128 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    2.628
Bandwidth (MB/sec):    2.628
...
Bandwidth (MB/sec):    5.848
Bandwidth (MB/sec):    8.312
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    0.955
Bandwidth (MB/sec):    0.979
...
Bandwidth (MB/sec):    3.995
Bandwidth (MB/sec):    4.816

Also, I noticed it greatly increased the CPU usage of the cmds process. Now it's pegged at about 180-220% all the time (out of 8 CPUs), even when I'm not doing anything. ceph -s says:

2010-11-02 14:17:02.476577    pg v2858: 55736 pgs: 55735 active+clean, 1 active+degraded; 11606 MB data, 15126 GB used, 26018 GB / 42444 GB avail; 1/5870 degraded (0.017%)
2010-11-02 14:17:02.648263   mds e37: 1/1/1 up {0=up:active}
2010-11-02 14:17:02.648308   osd e656: 208 osds: 207 up, 207 in
2010-11-02 14:17:02.648523   log 2010-11-02 14:05:06.004896 mon0 10.29.10.11:6789/0 8 : [INF] mds0 10.29.10.11:6800/26314 up:active
2010-11-02 14:17:02.648618   mon e1: 1 mons at {0=10.29.10.11:6789/0}

Is there a way to make writes go to the local drive first and then replicate later?

Actions #3

Updated by Ed Burnette over 13 years ago

I also tried setting the target PG count to 4,000 and got about the same numbers as with 400, maybe slightly faster. (It's hard to tell because of network variability.)

Actions #4

Updated by Greg Farnum over 13 years ago

Just to be clear, do you have all 208 nodes running server daemons and the client? What's your configuration look like?

The thing with the PGs is that objects are pseudo-randomly distributed into placement groups, and placement groups are pseudo-randomly distributed across the OSDs, which means that with a low PG count many writes can end up going to the same node at the same time just by chance. If you're really running 208 OSDs, I'd try 10000 or so PGs.
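
To make that concrete, here is a tiny standalone C++ sketch (not from the ticket; the PG-to-OSD mapping below is a crude stand-in for CRUSH, and all names in it are made up). It assigns one in-flight write per client to a random PG, maps PGs onto OSDs, and counts how many distinct OSDs actually receive traffic. With the default 8 PGs it can never touch more than 8 of the 208 OSDs, no matter how many clients write at once:

#include <cstdio>
#include <cstdlib>
#include <set>

int main(int argc, char **argv)
{
  int osds   = 208;                              // cluster size from this report
  int pgs    = argc > 1 ? atoi(argv[1]) : 8;     // 8 = default pool pg_num mentioned above
  int writes = argc > 2 ? atoi(argv[2]) : 208;   // one in-flight write per client node

  srandom(12345);
  std::set<int> used_osds;
  for (int i = 0; i < writes; ++i) {
    int pg  = random() % pgs;   // object -> PG is pseudo-random
    int osd = pg % osds;        // PG -> primary OSD (stand-in for the CRUSH mapping)
    used_osds.insert(osd);
  }
  printf("%d writes over %d PGs landed on only %zu of %d OSDs\n",
         writes, pgs, (size_t)used_osds.size(), osds);
  return 0;
}

Running it with larger PG counts (e.g. 400 or 10000 as the first argument) shows far more of the 208 OSDs taking writes, which is the effect that raising pg_num/pgp_num is meant to have.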

Actions #5

Updated by Sage Weil over 13 years ago

Oh, there is another issue: the rados bench command always writes to objects named "Object %d". So all of your nodes are writing to the same objects on the same OSDs.

I just pushed a fix for that to the unstable branch. You can cherry-pick it at e304a2451a80f11117cb01031734e72077c88ce0.

Or:

diff --git a/src/osdc/rados_bencher.h b/src/osdc/rados_bencher.h
index 9d4f34f..dfd11c6 100644
--- a/src/osdc/rados_bencher.h
+++ b/src/osdc/rados_bencher.h
@@ -47,6 +47,14 @@ struct bench_data {
   char *object_contents; //pointer to the contents written to each object
 };

+void generate_object_name(char *s, int objnum)
+{
+  char hostname[30];
+  gethostname(hostname, sizeof(hostname)-1);
+  hostname[sizeof(hostname)-1] = 0;
+  sprintf(s, "%s_%d_object%d", hostname, getpid(), objnum);
+}
+
 int write_bench(Rados& rados, rados_pool_t pool,
                 int secondsToRun, int concurrentios, bench_data *data);
 int seq_read_bench(Rados& rados, rados_pool_t pool,
@@ -132,7 +140,7 @@ int write_bench(Rados& rados, rados_pool_t pool,
   for (int i = 0; i<concurrentios; ++i) {
     name[i] = new char[128];
     contents[i] = new bufferlist();
-    snprintf(name[i], 128, "Object %d", i);
+    generate_object_name(name[i], i);
     snprintf(data->object_contents, data->object_size, "I'm the %dth object!", i);
     contents[i]->append(data->object_contents, data->object_size);
   }
@@ -173,7 +181,7 @@ int write_bench(Rados& rados, rados_pool_t pool,
     //create new contents and name on the heap, and fill them
     newContents = new bufferlist();
     newName = new char[128];
-    snprintf(newName, 128, "Object %d", data->started);
+    generate_object_name(newName, data->started);
     snprintf(data->object_contents, data->object_size, "I'm the %dth object!", data->started);
     newContents->append(data->object_contents, data->object_size);
     completions[slot]->wait_for_safe();
@@ -297,7 +305,7 @@ int seq_read_bench(Rados& rados, rados_pool_t pool, int seconds_to_run,
   //set up initial reads
   for (int i = 0; i < concurrentios; ++i) {
     name[i] = new char[128];
-    snprintf(name[i], 128, "Object %d", i);
+    generate_object_name(name[i], i);
     contents[i] = new bufferlist();
   }

@@ -334,7 +342,7 @@ int seq_read_bench(Rados& rados, rados_pool_t pool, int seconds_to_run,
        ++i) {
     slot = data->finished % concurrentios;
     newName = new char[128];
-    snprintf(newName, 128, "Object %d", data->started);
+    generate_object_name(newName, data->started);
     completions[slot]->wait_for_complete();
     dataLock.Lock();
     r = completions[slot]->get_return_value();
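
(For anyone following along, applying the fix locally would look roughly like the following. This is a sketch only: the commit id comes from Sage's comment above, while the remote name and the autotools build steps are generic assumptions about a ceph source checkout of that era.)

git fetch origin
git cherry-pick e304a2451a80f11117cb01031734e72077c88ce0
./autogen.sh && ./configure && make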

Actions #6

Updated by Ed Burnette over 13 years ago

Greg Farnum wrote:

Just to be clear, do you have all 208 nodes running server daemons and the client? What's your configuration look like?

Yes. 208 16GB, 8-core x86_64 nodes, all alike, all running cosd and all running the client code. One node is also running cmds and cmon. (I tried removing the cosd from that node, but it didn't make much difference.)

The thing with the PGs is that objects are pseudo-randomly distributed into placement groups, and placement groups are pseudo-randomly distributed across the OSDs, which means that with a low PG count many writes can end up going to the same node at the same time just by chance. If you're really running 208 OSDs, I'd try 10000 or so PGs.

I'll try that if I can get the servers to stay up long enough. ceph -w is swamped with chatter about OSDs going up and down and about degraded nodes when I know they're not really bouncing. :)

What I'd really like is to make the write go first to the node running the client (i.e., the OSD running on the same node as the client), in order to reduce network traffic. I can guarantee that the writes will be balanced (i.e., each client writes about the same amount of data). I'm not using Hadoop, but the idea of co-locating code and data is similar in the system I'm building. (After that, I'd like to usually put the replica copy on a nearby node, where nearby means same subnet, switch, and rack. But that's less important than writing first to the local disk.)

Actions #7

Updated by Sage Weil over 13 years ago

Ed Burnette wrote:

I'll try that if I can get the servers to stay up long enough. ceph -w is swamped with chatter about OSDs going up and down and about degraded nodes when I know they're not really bouncing. :)

That is contributing to the problem. Can you try setting 'osd heartbeat grace = 60', or possibly even higher, in your [osd] section? (20 seconds is the default.) The bouncing itself adds load, which can interfere with the heartbeats. We haven't had access to a large enough cluster to tune these things.
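
For reference, in ceph.conf that would look something like this (a minimal sketch of just the relevant section; only the 'osd heartbeat grace' option itself comes from Sage's suggestion):

[osd]
        osd heartbeat grace = 60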

Actions #8

Updated by Ed Burnette over 13 years ago

I set 'osd heartbeat grace=120' and that got rid of the chatter. My performance is now:

> rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    62.628
> mpirun -np 4 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    44.434
Bandwidth (MB/sec):    66.421
Bandwidth (MB/sec):    67.449
> mpirun -np 8 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    37.737
Bandwidth (MB/sec):    38.411
...
Bandwidth (MB/sec):    51.655
Bandwidth (MB/sec):    58.610
> mpirun -np 16 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    15.549
Bandwidth (MB/sec):    15.888
...
Bandwidth (MB/sec):    48.211
Bandwidth (MB/sec):    48.350
> mpirun -np 32 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    10.682
Bandwidth (MB/sec):    10.992
...
Bandwidth (MB/sec):    22.931
Bandwidth (MB/sec):    26.331
> mpirun -np 64 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    3.929
Bandwidth (MB/sec):    4.487
...
Bandwidth (MB/sec):    17.497
Bandwidth (MB/sec):    18.321
> mpirun -np 128 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    1.415
Bandwidth (MB/sec):    1.419
...
Bandwidth (MB/sec):    5.676
Bandwidth (MB/sec):    5.745
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
couldn't initialize rados!
...
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec):    0.650
Bandwidth (MB/sec):    0.655
...
Bandwidth (MB/sec):    0.871
Bandwidth (MB/sec):    0.879

We've decided to go with a home-grown system that can achieve better scaling.

Actions #9

Updated by Greg Farnum over 13 years ago

Did you update your installed version of the rados tool as Sage said? If you did and are still getting poor performance across your cluster, something is very wrong and we'd like to learn more about what's going on! 208 machines shouldn't be anywhere near RADOS' scaling limitations.

Actions #10

Updated by Sage Weil over 12 years ago

  • Status changed from New to Closed