Bug #644
rsync can be sloooow
Status: Closed (0% done)
Description
This is probably due to metadata ops being fairly expensive, but we should inspect an rsync run over Ceph to make sure, and look for optimizations.
I rsynced a copy of my ceph-client tree from inside UML: it took 3:40 onto a local filesystem and 31:40 onto Ceph.
Updated by Greg Farnum over 13 years ago
Copy of my notes file:
FS->FS
gregf@kai:~/ceph/src$ time rsync -r /btrfs/gregf/ceph-client/ mnt/ceph-client
skipping non-regular file "arch/microblaze/boot/dts/system.dts"
real 1m22.007s
user 0m30.946s
sys 0m19.701s
FS->uclient --- gave up
gregf@kai:~/ceph/src$ sudo time rsync -r /btrfs/gregf/ceph-client/ mnt/ceph-client
skipping non-regular file "arch/microblaze/boot/dts/system.dts"
^Crsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(543) [sender=3.0.7]
rsync: writefd_unbuffered failed to write 97 bytes to socket [generator]: Broken pipe (32)
Command exited with non-zero status 20
13.69user 7.27system 47:04.56elapsed 0%CPU (0avgtext+0avgdata 29696maxresident)k
7526528inputs+0outputs (3major+2224minor)pagefaults 0swaps
FS (via UML networking)->UML local
uml:~# time rsync -r gregf@kai:/btrfs/gregf/ceph-client /host/home/gregf/ceph-client/
gregf@kai's password:
skipping non-regular file "ceph-client/arch/microblaze/boot/dts/system.dts"
real 3m39.621s
user 1m17.250s
sys 0m33.940s
FS (via UML networking)->UML ceph
uml:~# time rsync -r gregf@kai:/btrfs/gregf/ceph-client mnt/ceph-client/
gregf@kai's password:
[ 1306.190000] libceph: tid 9586 timed out on osd0, will reset osd
skipping non-regular file "ceph-client/arch/microblaze/boot/dts/system.dts"
[ 1473.000000] libceph: tid 12934 timed out on osd0, will reset osd
[ 1619.590000] libceph: tid 15582 timed out on osd0, will reset osd
real 31m39.795s
user 2m1.700s
sys 1m13.860s
Updated by Wido den Hollander over 13 years ago
Test hardware:
- Dual-Core Intel E5310 1.6Ghz
- 6GB DDR2
- 4x WD Greenpower 2TB
In my first test, I simply made a btrfs filesystem on one of the Greenpower disks and ran rsync, this gave me:
Number of files: 179899
Number of files transferred: 149100
Total file size: 77759728862 bytes
Total transferred file size: 77759727466 bytes
Literal data: 77759972268 bytes
Matched data: 0 bytes
File list size: 4486135
File list generation time: 19.884 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 3465014
Total bytes received: 77780895925

sent 3465014 bytes  received 77780895925 bytes  3746747.96 bytes/sec
total size is 77759728862  speedup is 1.00

real 346m0.698s
user 31m55.460s
sys 12m13.790s
I then used my Ceph filesystem (RC branch, commit: 89d5c91e7d207d646651f8959ee37a15ea199d1b), with the 2.6.37-rc5 kernel. My journal was on a tmpfs and my replication level was set to 3 for both the 'data' and 'metadata' pool.
Number of files: 180014
Number of files transferred: 149213
Total file size: 78254366970 bytes
Total transferred file size: 78254365574 bytes
Literal data: 78255506613 bytes
Matched data: 0 bytes
File list size: 4490275
File list generation time: 23.995 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 3467512
Total bytes received: 78276499964

sent 3467512 bytes  received 78276499964 bytes  3313928.73 bytes/sec
total size is 78254366970  speedup is 1.00

real 394m31.248s
user 31m54.510s
sys 12m31.080s
As you can see, the rsync to Ceph wasn't dramatically slower, though that may be masked by the large dataset and by my download speed at the office, which tops out at 100Mbit.
I then tried removing the files. On the btrfs filesystem this took 23 seconds; on the Ceph filesystem the rm has already been running for 15 minutes and looks like it will need a lot more time to finish.
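The delete comparison is easy to reproduce; here is a sketch (the tree size and paths are mine, not Wido's). Run the same rm against a tree on a Ceph mount to see the per-unlink metadata cost:

```shell
# Build a small throwaway tree and time its removal. On a local
# filesystem this is nearly instant; on a Ceph mount each unlink
# involves an MDS round trip.
d=$(mktemp -d)
for i in $(seq 1 50); do
  mkdir "$d/dir$i"
  touch "$d/dir$i/file"
done
time rm -rf "$d"
```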
Updated by Greg Farnum over 13 years ago
- Assignee set to Greg Farnum
Started taking a look at this again.
Updated by Greg Farnum about 13 years ago
Discovered that the kernel client is clearing its I_COMPLETE flag on every newly-created directory due to how the caps code handles the addition of the shared bit. I have a patch but want to get it reviewed.
Will have to re-run the tests to see how much improvement this provides, since I think my test data has changed since I originally ran them.
Updated by Greg Farnum about 13 years ago
All right, ran into a wall unfortunately. :(
Because rsync does a transfer-to-temp-file-and-rename, and because renames necessarily clear the I_COMPLETE flag in the kclient (due to obscure ordering issues caused by d_move's behavior), this is going to stay fairly slow: the directory can never retain the I_COMPLETE flag. Sigh.
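For context, rsync's install step follows roughly this pattern (a sketch of the pattern, not rsync's actual code): each file is written under a dotted temporary name in the destination directory and then renamed into place, and it is that rename which forces the kclient to drop I_COMPLETE.

```shell
# rsync-style install: write to a temporary name, then rename into
# place. The temp name below is illustrative.
dest=$(mktemp -d)
printf 'payload' > "$dest/.file.abc123"   # write to a dotted temp name
mv "$dest/.file.abc123" "$dest/file"      # rename(2) into place
```

(rsync's real --inplace option writes directly to the final name and skips the rename, at the cost of leaving partially written files behind on interruption.)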
I may put some effort into testing rsyncs after the initial one to see if there are other behaviors that slow it down which we can modify, but it's not going to be a priority.
Updated by Sage Weil about 13 years ago
Try this?
From 28257f973e309b4e0989cbefb0245a70a1aa4d4b Mon Sep 17 00:00:00 2001
From: Sage Weil <sage@newdream.net>
Date: Fri, 4 Feb 2011 21:38:47 -0800
Subject: [PATCH] ceph: preserve I_COMPLETE across rename

d_move puts the renamed dentry at the end of d_subdirs, screwing with
our cached dentry directory offsets. We were just clearing I_COMPLETE
to avoid any possibility of trouble. However, assigning the renamed
dentry an offset at the end of the directory (to match its new
d_subdirs position) is sufficient to maintain correct behavior and
hold onto I_COMPLETE.

This is especially important for workloads like rsync, which renames
files into place. Before, we would lose I_COMPLETE and do MDS lookups
for each file. With this patch we only talk to the MDS on create and
rename.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/ceph/inode.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 50001de..83f65f9 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1021,9 +1021,6 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req,
 		dout("fill_trace doing d_move %p -> %p\n",
 		     req->r_old_dentry, dn);
-		/* d_move screws up d_subdirs order */
-		ceph_i_clear(dir, CEPH_I_COMPLETE);
-
 		d_move(req->r_old_dentry, dn);
 		dout(" src %p '%.*s' dst %p '%.*s'\n",
 		     req->r_old_dentry,
@@ -1035,12 +1032,15 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req,
 		   rehashing bug in vfs_rename_dir */
 		ceph_invalidate_dentry_lease(dn);
-		/* take overwritten dentry's readdir offset */
-		dout("dn %p gets %p offset %lld (old offset %lld)\n",
-		     req->r_old_dentry, dn, ceph_dentry(dn)->offset,
+		/*
+		 * d_move() puts the renamed dentry at the end of
+		 * d_subdirs.  We need to assign it an appropriate
+		 * directory offset so we can behave when holding
+		 * I_COMPLETE.
+		 */
+		ceph_set_dentry_offset(req->r_old_dentry);
+		dout("dn %p gets new offset %lld\n", req->r_old_dentry,
 		     ceph_dentry(req->r_old_dentry)->offset);
-		ceph_dentry(req->r_old_dentry)->offset =
-			ceph_dentry(dn)->offset;
 		dn = req->r_old_dentry;  /* use old_dentry */
 		in = dn->d_inode;
--
1.7.0
Updated by Greg Farnum about 13 years ago
- Target version changed from v0.25 to 12
Well, Sage's above patch makes it some better, but there are still issues with losing the I_COMPLETE flag due to preemptively dropping caps we aren't using. Since that's something we want to continue doing in general, there's not a lot we can do about it.
I've considered the idea of holding on to caps when we created the node and would like to test it out, but a change like that should wait until we have a suite of performance tests since it could pretty easily hurt other kinds of workloads.
So, putting this on hold for now.
Updated by Sage Weil about 13 years ago
- Story points set to 3
- Position set to 1
- Position changed from 1 to 561