Bug #44744 (closed): Slow file creation/sync on kernel cephfs

Added by Jan Fajerski about 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

We're seeing weird behaviour of the sync tool with a kernel cephfs mount. To trigger this it's enough to run the following snippet on a cephfs mount; it writes a 1MB file and calls sync immediately afterwards, in a loop (it also works without the loop, but the runtimes are much shorter):

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 1m" $i.txt ; sync; done

This takes 90 seconds or longer in various test setups. On a ceph-fuse mount this takes more like 1.5 seconds.
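For anyone reproducing the comparison, a minimal sketch of mounting the same filesystem both ways (the monitor address, mount points and credentials below are assumptions, adjust them for your cluster):

# kernel client mount (assumed monitor address and admin credentials)
sudo mount -t ceph 192.168.1.1:6789:/ /mnt/cephfs-kernel -o name=admin,secret=$(ceph auth get-key client.admin)
# ceph-fuse mount of the same filesystem
sudo ceph-fuse --name client.admin /mnt/cephfs-fuse

Running the loop above in each mount point shows the difference.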

Things get stranger when the file size changes or the sync tool is called with arguments:
- The kernel mount runtime goes down to fuse level when sync is called with the filename like so:

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 1m" $i.txt ; sync $i.txt; done

- The kernel mount runtime goes down to fuse level when a 4MB file is written:

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 4m" $i.txt ; sync; done

I also noticed that the 4m write is only fast if the argument to xfs_io is just the file name.
Calling it with any path component (either absolute or relative) triggers the slow behaviour again.
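For example, a sketch of the relative-path variant of the 4m loop (the only change is the "./" prefix on the file name):

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 4m" ./$i.txt ; sync; done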

Luis looked at what messages are exchanged:

when doing a full sync (i.e. when ->sync_fs() is executed), the client will wait for the MDS to send the CEPH_CAP_OP_FLUSH_ACK, acknowledging that everything is safe (in the journal, I believe). And this message is handled in Locker::handle_client_caps() (called from MDSRank::handle_deferrable_message), where it is queued for processing. Finally, it is later sent out to the client from Locker::file_update_finish(). And the time between being queued and sent corresponds to the delay experienced in the kernel client.

The fuse client does not wait for this FLUSH_ACK message from the MDS when 'sync' is executed. In fact, it even allows the filesystem to be umounted before receiving this ACK.
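One rough way to watch this exchange is to bump messenger debugging on the MDS while the loop runs, so the cap messages show up in its log (noisy, only suitable for test setups):

ceph tell mds.* injectargs '--debug_ms 1'
# ... run the slow xfs_io/sync loop on the kernel mount ...
ceph tell mds.* injectargs '--debug_ms 0'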

All this was tested with a recent Nautilus build and a kernel with backports roughly corresponding to a 5.0 kernel, IIRC. Numbers for a master build and a mainline kernel are coming up.


Related issues 2 (0 open, 2 closed)

Has duplicate: Linux kernel client - Bug #45153: fsync locking up in certain conditions (Duplicate, Jeff Layton)

Copied to: CephFS - Bug #44850: sync on libcephfs and wait for CEPH_CAP_OP_FLUSH_ACK (Rejected)

Actions #1

Updated by Jeff Layton about 4 years ago

  • Assignee set to Jeff Layton
Actions #2

Updated by Jeff Layton about 4 years ago

Jan Fajerski wrote:

Luis looked at what messages are exchanged:

when doing a full sync (i.e. when ->sync_fs() is executed), the client will wait for the MDS to send the CEPH_CAP_OP_FLUSH_ACK, acknowledging that everything is safe (in the journal, I believe). And this message is handled in Locker::handle_client_caps() (called from MDSRank::handle_deferrable_message), where it is queued for processing. Finally, it is later sent out to the client from Locker::file_update_finish(). And the time between being queued and sent corresponds to the delay experienced in the kernel client.

I had a patchset that fixes some similar issues for fsync back in 2017 (see commit 1e4ef0c6332bd). That set did it in try_flush_caps, but maybe we need to do something similar in the sync_fs codepath. I'll have a look and see.

The fuse client does not wait for this FLUSH_ACK message from the MDS when 'sync' is executed. In fact, it even allows the filesystem to be umounted before receiving this ACK.

That sounds broken. It might be good to open a separate libcephfs bug for that.

Actions #3

Updated by Jan Fajerski about 4 years ago

Some more info on this:

There is an MDS daemon option 'mds_tick_interval', described as "time in seconds between upkeep tasks". In simple words: every 5 seconds (by default) there is a timer tick, which kicks mdslog to periodically flush the metadata. It is exactly this flush that generates OP_FLUSH_ACK, which the kernel side waits for (as mentioned by David). If this value is reduced to decrease flush latency, say to 0.5 seconds, the runtime decreases significantly.
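For reference, a sketch of lowering that interval on a test cluster (assuming the centralized config store of recent releases; injectargs also works):

# persistently lower the MDS upkeep tick from the default 5s to 0.5s
ceph config set mds mds_tick_interval 0.5
# or inject it into the running MDS daemons without persisting it
ceph tell mds.* injectargs '--mds_tick_interval=0.5'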

So when I run the MDS with the default 5-second tick interval:

$ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done

real    1m35.463s
user    0m0.026s
sys     0m0.064s

And now with mds_tick_interval=0.5

time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done

real    0m9.847s
user    0m0.035s
sys     0m0.045s

Actions #5

Updated by Jeff Layton about 4 years ago

Ok, this patch seems to fix it for me. I think it's probably correct, but I need to look over the code a little more, and flesh out the patch description before posting. Jan, can you test it out too and get back to me?

Actions #6

Updated by Jeff Layton about 4 years ago

Jan Fajerski wrote:

Some more info on this:

There is an MDS daemon option 'mds_tick_interval', described as "time in seconds between upkeep tasks". In simple words: every 5 seconds (by default) there is a timer tick, which kicks mdslog to periodically flush the metadata. It is exactly this flush that generates OP_FLUSH_ACK, which the kernel side waits for (as mentioned by David). If this value is reduced to decrease flush latency, say to 0.5 seconds, the runtime decreases significantly.
[...]

Reducing that is probably not good for performance in general. What we really want to do is clue the MDS into the fact that we're (synchronously) waiting on the replies to these updates. That should help it expedite the flushes on the client's behalf.

Actions #7

Updated by David Disseldorp about 4 years ago

Jeff Layton wrote:

Ok, this patch seems to fix it for me. I think it's probably correct, but I need to look over the code a little more, and flesh out the patch description before posting. Jan, can you test it out too and get back to me?

Hi Jeff, I'd like to give this a spin. Which branch did you base it on? __prep_cap() doesn't appear to be present in mainline or ceph-client/testing.

Actions #8

Updated by Jeff Layton about 4 years ago

This is based on ceph-client/testing. This patch refactors some of the code into __prep_cap:

https://github.com/ceph/ceph-client/commit/0c4b255369bcfc48f46b8828dddc7434d39508a0
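For anyone else who wants to try it, a sketch of grabbing that branch and confirming the refactor commit is present (local paths are assumptions):

git clone -b testing https://github.com/ceph/ceph-client.git
cd ceph-client
# confirm the __prep_cap refactor referenced above is in this branch
git merge-base --is-ancestor 0c4b255369bcfc48f46b8828dddc7434d39508a0 HEAD && echo "__prep_cap refactor present"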

Actions #9

Updated by David Disseldorp about 4 years ago

Jeff Layton wrote:

This is based on ceph-client/testing. This patch refactors some of the code into __prep_cap:

https://github.com/ceph/ceph-client/commit/0c4b255369bcfc48f46b8828dddc7434d39508a0

Thanks. It seems I was looking at a stale ceph-client testing. The patch works nicely for me - FS sync time goes from 1.330s to 0.028s :).

Actions #10

Updated by Jan Fajerski about 4 years ago

Thx Jeff! I also gave it a quick spin and it does improve things quite a bit.

I guess I should still open a libcephfs tracker to look into whether this is working correctly or not?

Actions #11

Updated by Jeff Layton about 4 years ago

Jan Fajerski wrote:

Thx Jeff! I also gave it a quick spin and it does improve things quite a bit.

I guess I should still open a libcephfs tracker to look into whether this is working correctly or not?

Yes, please. That sounds like it may be a data-integrity issue.

It's not 100% clear to me that it's required that the kernel write back all of the data before a non-MNT_DETACH umount2() returns, but it does seem like that's what userland would expect.

Actions #12

Updated by Jeff Layton about 4 years ago

The other bit I'm not sure about is whether we want to set this every time CHECK_CAPS_FLUSH is set. I wonder if we could somehow get away with just setting this flag on the last cap flush to be sent.

OTOH, the MDS is pretty simple minded about the ordering of incoming independent requests, so it may not make much difference.

Actions #13

Updated by Jan Fajerski about 4 years ago

  • Project changed from CephFS to Linux kernel client
Actions #14

Updated by Jan Fajerski about 4 years ago

  • Copied to Bug #44850: sync on libcephfs and wait for CEPH_CAP_OP_FLUSH_ACK added
Actions #15

Updated by Greg Farnum about 4 years ago

Jeff Layton wrote:

The other bit I'm not sure about is whether we want to set this every time CHECK_CAPS_FLUSH is set. I wonder if we could somehow get away with just setting this flag on the last cap flush to be sent.

OTOH, the MDS is pretty simple minded about the ordering of incoming independent requests, so it may not make much difference.

It'll make a BIG difference — if you issue that flag only on the last flush, it will force a single journal commit with all the cap updates. If you issue it for each one, every one of those caps will be a separate journal update the MDS has to prepare and send out to RADOS.

Actions #16

Updated by Jeff Layton about 4 years ago

  • Status changed from New to In Progress

Zheng shot down the initial draft, since we'd be setting the CLIENT_CAPS_SYNC flag too often. What we probably need to do is set it on the last cap making the dirty->flush transition on a session.

That's not simple to determine though. We'll need to add new infrastructure for it.

Actions #17

Updated by Jeff Layton about 4 years ago

I dropped a dump_stack() into __mark_caps_flushing during one of the xfs_io+sync tests. I see this:

[ 5864.771799] CPU: 3 PID: 5222 Comm: kworker/3:0 Kdump: loaded Tainted: G           OE     5.6.0+ #32
[ 5864.773114] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 5864.774890] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 5864.775607] Call Trace:
[ 5864.775958]  dump_stack+0x66/0x90
[ 5864.776396]  __mark_caps_flushing+0xb7/0xe0 [ceph]
[ 5864.777029]  ? __cap_is_valid+0x1c/0xb0 [ceph]
[ 5864.777602]  ? __ceph_caps_issued+0x6d/0xc0 [ceph]
[ 5864.778234]  ceph_check_caps+0x683/0x9c0 [ceph]
[ 5864.778840]  ? sched_clock+0x5/0x10
[ 5864.779292]  ? sched_clock_cpu+0xc/0xa0
[ 5864.779806]  ? try_to_wake_up+0x218/0x670
[ 5864.780318]  ? apic_timer_interrupt+0xa/0x20
[ 5864.780875]  ? release_pages+0x3c0/0x470
[ 5864.781387]  ceph_put_wrbuffer_cap_refs+0x1ac/0x300 [ceph]
[ 5864.782103]  writepages_finish+0x2d8/0x420 [ceph]
[ 5864.782718]  __complete_request+0x22/0x70 [libceph]
[ 5864.783350]  dispatch+0x5a/0xb20 [libceph]
[ 5864.783894]  ? calc_signature+0xdb/0x100 [libceph]
[ 5864.784516]  ceph_con_workfn+0xd40/0x2790 [libceph]
[ 5864.785154]  ? __switch_to_asm+0x40/0x70
[ 5864.785662]  ? __switch_to_asm+0x34/0x70
[ 5864.786180]  ? __switch_to_asm+0x40/0x70
[ 5864.786687]  ? __switch_to_asm+0x34/0x70
[ 5864.787199]  ? __switch_to_asm+0x40/0x70
[ 5864.787703]  ? __switch_to_asm+0x40/0x70
[ 5864.788223]  ? __switch_to+0x152/0x420
[ 5864.788716]  ? __switch_to_asm+0x34/0x70
[ 5864.789229]  process_one_work+0x1b5/0x360
[ 5864.790161]  worker_thread+0x50/0x3c0
[ 5864.791020]  kthread+0xf9/0x130
[ 5864.791814]  ? process_one_work+0x360/0x360
[ 5864.792733]  ? kthread_park+0x90/0x90
[ 5864.793577]  ret_from_fork+0x22/0x40

We create the file and write to the pagecache, and then the sync occurs and writeback is done. The cap flush occurs as the wrbuffer refs are put.

I think we probably have to do what Zheng suggests and keep track of the dirty inodes on a per-session basis, but I'd like to avoid adding yet another single-use list_head to the inode.

It might even be nice to consolidate i_dirty_item and i_flushing_item as they don't appear to be used at the same time.

Actions #18

Updated by Jeff Layton about 4 years ago

Fixed in the ceph/testing branch by moving flushing inodes to be tracked per-session. That fix should make v5.7 or v5.8.

Actions #19

Updated by Jeff Layton over 3 years ago

  • Status changed from In Progress to Resolved

This went into v5.8 (commit d67c72e6cce99eab5ab9d62c599e33e5141ff8b4).
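A quick way to check that the fix is in a given release, as a sketch (the path to the local kernel clone is an assumption):

# exits 0 if the fix commit is an ancestor of the v5.8 tag
git -C ~/src/linux merge-base --is-ancestor d67c72e6cce99eab5ab9d62c599e33e5141ff8b4 v5.8 && echo "fix is in v5.8"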

Actions #20

Updated by Patrick Donnelly about 2 years ago

  • Has duplicate Bug #45153: fsync locking up in certain conditions added