Documentation #57062

Document access patterns that have good/pathological performance on CephFS

Added by Niklas Hambuechen over 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:
0%

Tags:
Backport:
Reviewed:
Affected Versions:
Labels (FS):
Pull request ID:

Description

I have a CephFS 16.2.7 cluster with 200 M small files (between 1 KB and 100 KB; there are a few larger ones up to 200 MB) and am slowly discovering by experimentation that some access patterns are dramatically faster than others.

For example, `stat()`ing files in directory order (as returned by `find`) is much faster than statting them in any other order (e.g. sorted order is slow).

Consider this example, where I stat 10k files on CephFS:

$ tail -n+6000000 cephfs-find-output | head -n1000000 | sort | head -n10000 | time strace -fwce statx xargs stat > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00  134.939687       13493     10000           statx
$ tail -n+7000000 cephfs-find-output | head -n1000000 | head -n10000 | time strace -fwce statx xargs stat > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    7.201124         720     10000           statx

Statting in directory order (without the inserted `| sort`) is roughly 20x faster in this case, and I've seen other cases where the factor is even larger.

(I'm using different `tail -n+` offsets to avoid hitting cached results on repeated runs.)

When looking at the `strace -T` output, which shows the duration of each individual statx() call, I see many of them take > 300 milliseconds, even though my CephFS is on 10 Gbit/s Ethernet with 0.2 ms latency and an SSD metadata pool. Why each call takes this long is a question in itself (the 200 M files are in a single directory, which may be relevant), but the key observation is that many of the sorted stats are slow, while stats in unsorted (directory) order usually take around 1 ms each.
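For reference, per-call timings can be collected roughly like this (the line counts here are placeholders, not my exact invocation):

$ # strace -T appends each call's wall time to the line as "<seconds>".
$ head -n10000 cephfs-find-output | strace -f -T -e statx xargs stat > /dev/null 2> statx-timings.txt
$ # Show the slowest statx() calls first.
$ grep 'statx(' statx-timings.txt | sort -t'<' -k2 -rn | head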

I suspect that there is some ceph-mds cache that fetches RADOS metadata objects, each of which contains stat information for multiple files in directory order; statting files out of this order would mean fetching a metadata object per file, as opposed to fetching one object and then serving e.g. 1000 stats from its data.
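If it helps whoever documents this: as far as I understand, directory entries are stored as omap keys on dirfrag objects named <inode-hex>.<frag> in the metadata pool, so the layout can be inspected directly. A sketch (the pool name cephfs_metadata and the inode number 10000000000 are placeholders):

$ # List the dirfrag objects belonging to a directory inode (placeholder inode number).
$ rados -p cephfs_metadata ls | grep '^10000000000\.'
$ # Dump the dentry keys stored on one dirfrag; inode/stat info for these entries lives in the omap values.
$ rados -p cephfs_metadata listomapkeys 10000000000.00000000 | head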

I ran into this when trying to `rsync` files from my CephFS mount to another machine; using `find` + parallel `rsync --files-from` resulted in absolutely unusable performance (50 stats per second -> 40 days of copying) when `sort` was used, despite the metadata pool being on NVMe SSDs.
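A sketch of the kind of copy pipeline this bites, kept in readdir order and only split into contiguous chunks instead of being sorted (the paths, host and chunk size below are placeholders):

$ # find emits entries in directory (readdir) order; do not sort the list.
$ find /mnt/cephfs/bigdir -type f -printf '%P\n' > /tmp/filelist
$ # Split into contiguous chunks so each parallel rsync still walks in directory order.
$ split -l 100000 /tmp/filelist /tmp/filelist.
$ for chunk in /tmp/filelist.*; do rsync -a --files-from="$chunk" /mnt/cephfs/bigdir/ otherhost:/backup/bigdir/ & done; wait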

Thus, I think that this kind of key knowledge about fast/pathological CephFS access patterns should really be documented in the CephFS documentation.

I believe that somebody familiar with CephFS internals can compile a more thorough list of what should be written down, but as a constructive starting point I propose covering at least:

  • What data (and how large) is transferred from RADOS -> MDS -> kclient/fuse when stat()ing or open()ing a file?
  • Which parts of this data are cached where, and at what granularity?
  • What exactly determines directory order? Is large-dir fragmentation involved? When are locks involved in reading?
  • Thus, concrete recommendations like: "Always stat()/open() files in directory order if you can" (see the example after this list).
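As a concrete illustration of that last recommendation (the directory path is a placeholder): both `ls -U`/`ls -f` and `find` print entries in readdir order, so statting their output directly stays in directory order:

$ # ls -U disables sorting and prints names in directory (readdir) order.
$ ls -U /mnt/cephfs/bigdir | head -n10000 | (cd /mnt/cephfs/bigdir && xargs stat) > /dev/null
$ # find likewise emits entries in readdir order.
$ find /mnt/cephfs/bigdir -maxdepth 1 -type f | head -n10000 | xargs stat > /dev/null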
#1

Updated by Niklas Hambuechen over 1 year ago

I think that a good place for this info to be added would be https://docs.ceph.com/en/quincy/cephfs/app-best-practices/

It currently has some sections that go slightly in this direction, so I think it's the right place. But it's not complete yet; e.g. the section on `ls` focuses on the userspace cost of sorting the output rather than explaining which access patterns are fast given how CephFS works.

#2

Updated by Venky Shankar over 1 year ago

Hi Niklas,

Do you see this behavior with both the user-space client and the kclient?

#3

Updated by Niklas Hambuechen over 1 year ago

Hi Venky, I'm using the kclient on Linux 5.10.88 in this cluster.

#4

Updated by Venky Shankar over 1 year ago

Niklas Hambuechen wrote:

Hi Venky, I'm using the kclient on Linux 5.10.88 in this cluster.

Thanks, Niklas. I'll try this out and post an update.

#5

Updated by hongsong wu about 1 year ago

Venky Shankar wrote:

Niklas Hambuechen wrote:

Hi Venky, I'm using the kclient on Linux 5.10.88 in this cluster.

Thanks, Niklas. I'll try this out and post an update.

Good job, looking forward to it.
