Bug #538
Write performance does not scale over multiple computers
Status: Closed
Description
I have ceph 0.22.1 installed on a cluster of 208 lightly loaded 64-bit Linux nodes (RHEL 5.5, ext3). The configuration is pretty much out of the box (no changes to replication or crush options). I've tested write performance with a couple of different benchmarks, and they indicate that total write throughput does not scale with the number of computers doing the writes. With 1 computer I get about 63 MB/sec, but with 4 I get as little as 33 MB/sec per node, with 8 I get 10 MB/sec, with 16 I get 6 MB/sec, and so on up to 208 nodes, where I get under 1 MB/sec per node. CPU usage on the machine running cmon and cmds stays relatively low.
I tried it with high-level cfuse file system writes and with low-level rados writes, with similar results. For example, here are some results using the "rados bench" command on one machine and on multiple machines. (The mpirun program runs a program simultaneously on multiple computers; the number is specified by the -np option.)
> rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 62.686
> mpirun -np 4 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 33.410
Bandwidth (MB/sec): 37.618
Bandwidth (MB/sec): 38.106
Bandwidth (MB/sec): 59.590
> mpirun -np 8 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 9.931
Bandwidth (MB/sec): 15.875
...
Bandwidth (MB/sec): 27.801
Bandwidth (MB/sec): 34.115
> mpirun -np 16 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 6.240
Bandwidth (MB/sec): 7.860
...
Bandwidth (MB/sec): 12.906
Bandwidth (MB/sec): 15.725
> mpirun -np 32 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 4.035
Bandwidth (MB/sec): 4.168
...
Bandwidth (MB/sec): 8.686
Bandwidth (MB/sec): 8.821
> mpirun -np 64 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 2.435
Bandwidth (MB/sec): 2.537
...
Bandwidth (MB/sec): 5.009
Bandwidth (MB/sec): 6.364
> mpirun -np 128 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 1.452
Bandwidth (MB/sec): 1.479
...
Bandwidth (MB/sec): 3.152
Bandwidth (MB/sec): 3.733
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 0.930
Bandwidth (MB/sec): 0.950
...
Bandwidth (MB/sec): 2.006
Bandwidth (MB/sec): 2.136
I would expect bandwidth to drop slightly going from 1 computer to 4, but then remain steady until it hit a network bottleneck. The disk drives on these machines give about 50 MB/sec of write throughput, and each node has a 1 Gbps Ethernet connection to a fast switch that should do up to 40 Gbps of bisection bandwidth.
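As a rough sanity check on that expectation, here is a back-of-envelope sketch of where the per-node plateau should sit. It assumes 2x replication (an assumption; the report only says replication options were left at their defaults), under which every client byte is written twice somewhere in the cluster, halving the disk and network budgets:

```python
# Back-of-envelope per-node write ceiling for this cluster.
# Assumptions (not confirmed in the report): 2x replication,
# NIC and disk numbers taken from the description above.
NODES = 208
NIC_MB_S = 1000 / 8      # ~125 MB/s usable on a 1 Gbps link (optimistic)
DISK_MB_S = 50           # measured local write throughput
REPLICAS = 2             # each client byte is stored twice

per_node_disk_ceiling = DISK_MB_S / REPLICAS   # disks absorb 2x the client bytes
per_node_net_ceiling = NIC_MB_S / REPLICAS     # NICs carry write + replica traffic
per_node_ceiling = min(per_node_disk_ceiling, per_node_net_ceiling)
print(f"expected steady-state per-node write ceiling: ~{per_node_ceiling:.0f} MB/s")
```

Under those assumptions the plateau should be around 25 MB/sec per node, far above the sub-1 MB/sec observed at 208 nodes.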
I tried longer tests (same results) and I tried using -t 1 (worse results).
Updated by Sage Weil over 13 years ago
If the benchpool is a new pool you created, the problem is likely that it is too small. By default, new pools have only 8 PGs (placement groups), which means you're only using 8 osds randomly chosen out of the cluster. You can confirm this by looking at 'ceph osd dump -o - | grep benchpool' and looking at the pg_num value.
You can expand the number of placement groups using something like:
fatty:src 10:21 AM $ ./ceph osd pool set foo pg_num 100
2010-11-02 10:21:17.485039 mon <- [osd,pool,set,foo,pg_num,100]
2010-11-02 10:21:18.208009 mon0 -> 'set pool 4 pg_num to 100' (0)
...then wait for the cluster to settle (watch ceph -w or ceph -s and make sure all the creating pgs become active+clean), and then
fatty:src 10:21 AM $ ./ceph osd pool set foo pgp_num 100
2010-11-02 10:21:35.392986 mon <- [osd,pool,set,foo,pgp_num,100]
2010-11-02 10:21:35.906984 mon2 -> 'set pool 4 pgp_num to 100' (0)
to spread them around the cluster. The target pg count is probably something like 20-50 * the number of osds.
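Applied to this cluster, that rule of thumb suggests far more than the 8 PGs a new pool gets by default (a quick sketch; the 208 OSD count comes from the report above):

```python
# Rule of thumb from the comment above: roughly 20-50 PGs per OSD.
num_osds = 208
low, high = 20 * num_osds, 50 * num_osds
print(f"suggested pg_num for {num_osds} OSDs: roughly {low} to {high}")
```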
Looks like this isn't in the wiki yet, adding a todo item!
Updated by Ed Burnette over 13 years ago
I set the target pg count to 400 and tried again. It helped some, up to 2x, but is still slower than I expected:
> rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 73.729
> mpirun -np 4 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 55.654
Bandwidth (MB/sec): 60.220
Bandwidth (MB/sec): 65.504
Bandwidth (MB/sec): 72.513
> mpirun -np 8 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 34.325
Bandwidth (MB/sec): 39.030
...
Bandwidth (MB/sec): 57.714
Bandwidth (MB/sec): 59.773
> mpirun -np 16 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 10.609
Bandwidth (MB/sec): 11.276
...
Bandwidth (MB/sec): 43.041
Bandwidth (MB/sec): 48.061
> mpirun -np 32 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 8.833
Bandwidth (MB/sec): 9.179
...
Bandwidth (MB/sec): 26.500
Bandwidth (MB/sec): 27.031
> mpirun -np 64 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 4.221
Bandwidth (MB/sec): 4.266
...
Bandwidth (MB/sec): 16.718
Bandwidth (MB/sec): 16.935
> mpirun -np 128 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 2.628
Bandwidth (MB/sec): 2.628
...
Bandwidth (MB/sec): 5.848
Bandwidth (MB/sec): 8.312
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 0.955
Bandwidth (MB/sec): 0.979
...
Bandwidth (MB/sec): 3.995
Bandwidth (MB/sec): 4.816
Also, I noticed it greatly increased the CPU usage of the cmds process. It's now pegged at about 180-220% (out of 8 CPUs) all the time, even when I'm not doing anything. ceph -s says:
2010-11-02 14:17:02.476577 pg v2858: 55736 pgs: 55735 active+clean, 1 active+degraded; 11606 MB data, 15126 GB used, 26018 GB / 42444 GB avail; 1/5870 degraded (0.017%)
2010-11-02 14:17:02.648263 mds e37: 1/1/1 up {0=up:active}
2010-11-02 14:17:02.648308 osd e656: 208 osds: 207 up, 207 in
2010-11-02 14:17:02.648523 log 2010-11-02 14:05:06.004896 mon0 10.29.10.11:6789/0 8 : [INF] mds0 10.29.10.11:6800/26314 up:active
2010-11-02 14:17:02.648618 mon e1: 1 mons at {0=10.29.10.11:6789/0}
Is there a way to make writes go to the local drive first and then replicate later?
Updated by Ed Burnette over 13 years ago
I also tried setting the target pg count to 4,000 and got about the same numbers as 400, maybe a small amount faster. (It's hard to tell because of network variability).
Updated by Greg Farnum over 13 years ago
Just to be clear, do you have all 208 nodes running server daemons and the client? What's your configuration look like?
The thing with the PGs is that objects are pseudo-randomly distributed into placement groups, and placement groups are pseudo-randomly distributed across the OSDs, which means with lower numbers you can get a number of writes simultaneously going to the same node if your sample size is too low. If you're really running 208 OSDs, I'd try 10000 or so PGs.
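That sample-size effect can be illustrated with a toy simulation (a hypothetical sketch only; real CRUSH placement is deterministic and much smarter than random assignment, but the statistics of small PG counts are similar):

```python
import random

def simulate(num_pgs, num_osds, num_writers, seed=1):
    """Toy model: each PG maps to one pseudo-randomly chosen OSD, and each
    writer's object lands in a pseudo-random PG. Returns how many OSDs
    receive any writes, and how many writers hit the hottest OSD."""
    rng = random.Random(seed)
    pg_to_osd = [rng.randrange(num_osds) for _ in range(num_pgs)]
    load = [0] * num_osds
    for _ in range(num_writers):
        load[pg_to_osd[rng.randrange(num_pgs)]] += 1
    busy = sum(1 for n in load if n)
    return busy, max(load)

# 208 simultaneous writers against a default 8-PG pool vs. a 10000-PG pool.
busy_small, peak_small = simulate(num_pgs=8, num_osds=208, num_writers=208)
busy_big, peak_big = simulate(num_pgs=10000, num_osds=208, num_writers=208)
print(f"8 PGs:     {busy_small} OSDs busy, hottest OSD serves {peak_small} writers")
print(f"10000 PGs: {busy_big} OSDs busy, hottest OSD serves {peak_big} writers")
```

With only 8 PGs, at most 8 of the 208 OSDs ever see traffic and some OSD absorbs dozens of concurrent writers; with ~10000 PGs the load spreads across most of the cluster.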
Updated by Sage Weil over 13 years ago
Oh, there is another issue: the rados bench command always names its objects "Object %d". So all of your nodes are writing to the same objects on the same OSDs.
I just pushed a fix for that to the unstable branch. You can cherry-pick it at e304a2451a80f11117cb01031734e72077c88ce0.
Or:
diff --git a/src/osdc/rados_bencher.h b/src/osdc/rados_bencher.h
index 9d4f34f..dfd11c6 100644
--- a/src/osdc/rados_bencher.h
+++ b/src/osdc/rados_bencher.h
@@ -47,6 +47,14 @@ struct bench_data {
   char *object_contents; //pointer to the contents written to each object
 };
 
+void generate_object_name(char *s, int objnum)
+{
+  char hostname[30];
+  gethostname(hostname, sizeof(hostname)-1);
+  hostname[sizeof(hostname)-1] = 0;
+  sprintf(s, "%s_%d_object%d", hostname, getpid(), objnum);
+}
+
 int write_bench(Rados& rados, rados_pool_t pool,
                 int secondsToRun, int concurrentios, bench_data *data);
 int seq_read_bench(Rados& rados, rados_pool_t pool,
@@ -132,7 +140,7 @@ int write_bench(Rados& rados, rados_pool_t pool,
   for (int i = 0; i<concurrentios; ++i) {
     name[i] = new char[128];
     contents[i] = new bufferlist();
-    snprintf(name[i], 128, "Object %d", i);
+    generate_object_name(name[i], i);
     snprintf(data->object_contents, data->object_size, "I'm the %dth object!", i);
     contents[i]->append(data->object_contents, data->object_size);
   }
@@ -173,7 +181,7 @@ int write_bench(Rados& rados, rados_pool_t pool,
     //create new contents and name on the heap, and fill them
     newContents = new bufferlist();
     newName = new char[128];
-    snprintf(newName, 128, "Object %d", data->started);
+    generate_object_name(newName, data->started);
     snprintf(data->object_contents, data->object_size, "I'm the %dth object!", data->started);
     newContents->append(data->object_contents, data->object_size);
     completions[slot]->wait_for_safe();
@@ -297,7 +305,7 @@ int seq_read_bench(Rados& rados, rados_pool_t pool, int seconds_to_run,
   //set up initial reads
   for (int i = 0; i < concurrentios; ++i) {
     name[i] = new char[128];
-    snprintf(name[i], 128, "Object %d", i);
+    generate_object_name(name[i], i);
     contents[i] = new bufferlist();
   }
@@ -334,7 +342,7 @@ int seq_read_bench(Rados& rados, rados_pool_t pool, int seconds_to_run,
        ++i) {
     slot = data->finished % concurrentios;
     newName = new char[128];
-    snprintf(newName, 128, "Object %d", data->started);
+    generate_object_name(newName, data->started);
     completions[slot]->wait_for_complete();
     dataLock.Lock();
     r = completions[slot]->get_return_value();
Updated by Ed Burnette over 13 years ago
Greg Farnum wrote:
Just to be clear, do you have all 208 nodes running server daemons and the client? What's your configuration look like?
Yes. 208 16GB, 8-core x86_64 nodes, all alike, all running cosd and all running the client code. One node is also running cmds and cmon. (I tried removing the cosd from that node but it didn't make much difference.)
The thing with the PGs is that objects are pseudo-randomly distributed into placement groups, and placement groups are pseudo-randomly distributed across the OSDs, which means with lower numbers you can get a number of writes simultaneously going to the same node if your sample size is too low. If you're really running 208 OSDs, I'd try 10000 or so PGs.
I'll try that if I can get the servers to stay up long enough. ceph -w is swamped with chatter about osds going up and down and about degraded nodes when I know they're not really bouncing. :)
What I'd really like is to make the write go first to the node running the client (i.e., the osd running on the same node as the client), in order to reduce network traffic. I can guarantee that the writes will be balanced (i.e., each client writes about the same amount of data). I'm not using Hadoop, but the idea of co-locating code and data is similar in the system I'm building. (After that I'd like to usually put the replica copy on a nearby node, where nearby means same subnet, switch, and rack. But that's less important than writing first to the local disk.)
Updated by Sage Weil over 13 years ago
Ed Burnette wrote:
I'll try that if I can get the servers to stay up long enough. ceph -w is swamped with chatter about osds going up and down and about degraded nodes when I know they're not really bouncing. :)
That is contributing to the problem. Can you try setting 'osd heartbeat grace = 60' (or possibly even higher; 20 seconds is the default) in your [osd] section? The bouncing itself adds load, which can interfere with the heartbeats. We haven't had access to a large enough cluster to tune these things.
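In ceph.conf that would look something like this (a minimal sketch; the exact layout of your config file may differ):

```ini
[osd]
    ; default grace period is 20 seconds; raise it so transient load
    ; spikes don't get OSDs marked down
    osd heartbeat grace = 60
```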
Updated by Ed Burnette over 13 years ago
I set 'osd heartbeat grace=120' and that got rid of the chatter. My performance is now:
> rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 62.628
> mpirun -np 4 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 44.434
Bandwidth (MB/sec): 66.421
Bandwidth (MB/sec): 67.449
> mpirun -np 8 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 37.737
Bandwidth (MB/sec): 38.411
...
Bandwidth (MB/sec): 51.655
Bandwidth (MB/sec): 58.610
> mpirun -np 16 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 15.549
Bandwidth (MB/sec): 15.888
...
Bandwidth (MB/sec): 48.211
Bandwidth (MB/sec): 48.350
> mpirun -np 32 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 10.682
Bandwidth (MB/sec): 10.992
...
Bandwidth (MB/sec): 22.931
Bandwidth (MB/sec): 26.331
> mpirun -np 64 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 3.929
Bandwidth (MB/sec): 4.487
...
Bandwidth (MB/sec): 17.497
Bandwidth (MB/sec): 18.321
> mpirun -np 128 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 1.415
Bandwidth (MB/sec): 1.419
...
Bandwidth (MB/sec): 5.676
Bandwidth (MB/sec): 5.745
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
couldn't initialize rados!
...
> mpirun -np 208 rados bench 10 write -p benchpool | grep Bandwidth | sort -n -k3
Bandwidth (MB/sec): 0.650
Bandwidth (MB/sec): 0.655
...
Bandwidth (MB/sec): 0.871
Bandwidth (MB/sec): 0.879
We've decided to go with a home-grown system that can get better scaling.
Updated by Greg Farnum over 13 years ago
Did you update your installed version of the rados tool as Sage said? If you did and are still getting poor performance across your cluster, something is very wrong and we'd like to learn more about what's going on! 208 machines shouldn't be anywhere near RADOS' scaling limitations.