Bug #44744 (closed): Slow file creation/sync on kernel cephfs

Added by Jan Fajerski about 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

We're seeing weird behaviour of the sync tool with a kernel cephfs mount. To trigger this it's enough to run the following snippet on a cephfs mount; it writes a 1MB file and calls sync immediately afterwards, in a loop (it also works without the loop, but the runtimes are much shorter):

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 1m" $i.txt ; sync; done

This takes 90 seconds or longer in various test setups. On a ceph-fuse mount this takes more like 1.5 seconds.
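For anyone reproducing the comparison, a minimal sketch of mounting the same filesystem both ways (the monitor address, mount points and credentials below are assumptions, adjust them for your cluster):

# kernel client mount (assumed monitor address and admin credentials)
sudo mount -t ceph 192.168.1.1:6789:/ /mnt/cephfs-kernel -o name=admin,secret=$(ceph auth get-key client.admin)
# ceph-fuse mount of the same filesystem
sudo ceph-fuse --name client.admin /mnt/cephfs-fuse

Running the loop above in each mount point shows the difference.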

Things get stranger when the file size changes or the sync tool is called with arguments:
- The kernel mount runtime goes down to fuse level when sync is called with the filename like so:

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 1m" $i.txt ; sync $i.txt; done

- The kernel mount runtime goes down to fuse level when a 4MB file is written:

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 4m" $i.txt ; sync; done

I also noticed that the 4m write is only fast if the argument to xfs_io is just the file name.
Calling it with any path component (either absolute or relative) triggers the slow behaviour again.
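For example, a sketch of the relative-path variant of the 4m loop (the only change is the "./" prefix on the file name):

time for i in $(seq -w 11 30); do xfs_io -f -c "pwrite 0 4m" ./$i.txt ; sync; done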

Luis looked at what messages are exchanged:

when doing a full sync (i.e. when ->sync_fs() is executed), the client will wait for the MDS to send the CEPH_CAP_OP_FLUSH_ACK, acknowledging that everything is safe (in the journal, I believe). And this message is handled in Locker::handle_client_caps() (called from MDSRank::handle_deferrable_message), where it is queued for processing. Finally, it is later sent out to the client from Locker::file_update_finish(). And the time between being queued and sent corresponds to the delay experienced in the kernel client.

The fuse client does not wait for this FLUSH_ACK message from the MDS when 'sync' is executed. In fact, it even allows the filesystem to be umounted before receiving this ACK.
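One rough way to watch this exchange is to bump messenger debugging on the MDS while the loop runs, so the cap messages show up in its log (noisy, only suitable for test setups):

ceph tell mds.* injectargs '--debug_ms 1'
# ... run the slow xfs_io/sync loop on the kernel mount ...
ceph tell mds.* injectargs '--debug_ms 0'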

All this was tested with a recent Nautilus build and a kernel with backports roughly corresponding to a 5.0 kernel, IIRC. Numbers for a master build and a mainline kernel are coming up.


Related issues 2 (0 open, 2 closed)

Has duplicate: Linux kernel client - Bug #45153: fsync locking up in certain conditions (Duplicate, Jeff Layton)

Copied to: CephFS - Bug #44850: sync on libcephfs and wait for CEPH_CAP_OP_FLUSH_ACK (Rejected)

Actions #1

Updated by Jeff Layton about 4 years ago

  • Assignee set to Jeff Layton
Actions #2

Updated by Jeff Layton about 4 years ago

Jan Fajerski wrote:

Luis looked at what messages are exchanged:

when doing a full sync (i.e. when ->sync_fs() is executed), the client will wait for the MDS to send the CEPH_CAP_OP_FLUSH_ACK, acknowledging that everything is safe (in the journal, I believe). And this message is handled in Locker::handle_client_caps() (called from MDSRank::handle_deferrable_message), where it is queued for processing. Finally, it is later sent out to the client from Locker::file_update_finish(). And the time between being queued and sent corresponds to the delay experienced in the kernel client.

I had a patchset that fixes some similar issues for fsync back in 2017 (see commit 1e4ef0c6332bd). That set did it in try_flush_caps, but maybe we need to do something similar in the sync_fs codepath. I'll have a look and see.

The fuse client does not wait for this FLUSH_ACK message from the MDS when 'sync' is executed. In fact, it even allows the filesystem to be umounted before receiving this ACK.

That sounds broken. It might be good to open a separate libcephfs bug for that.

Actions #3

Updated by Jan Fajerski about 4 years ago

Some more info on this:

There is an MDS daemon option 'mds_tick_interval', described as "time in seconds between upkeep tasks". In simple words: every 5 seconds (by default) there is a timer tick, which kicks mdslog to periodically flush the metadata. It is exactly this flush that generates OP_FLUSH_ACK, which the kernel side waits for (as mentioned by David). If this value is reduced to decrease flush latency, say to 0.5 seconds, the runtime decreases significantly.
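For reference, a sketch of lowering that interval on a test cluster (assuming the centralized config store of recent releases; injectargs also works):

# persistently lower the MDS upkeep tick from the default 5s to 0.5s
ceph config set mds mds_tick_interval 0.5
# or inject it into the running MDS daemons without persisting it
ceph tell mds.* injectargs '--mds_tick_interval=0.5'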

So when I run the MDS with the default 5-second tick interval:

$ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done

real    1m35.463s
user    0m0.026s
sys     0m0.064s

And now with mds_tick_interval=0.5

time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done

real    0m9.847s
user    0m0.035s
sys     0m0.045s

Actions #5

Updated by Jeff Layton about 4 years ago

Ok, this patch seems to fix it for me. I think it's probably correct, but I need to look over the code a little more, and flesh out the patch description before posting. Jan, can you test it out too and get back to me?

Actions #6

Updated by Jeff Layton about 4 years ago

Jan Fajerski wrote:

Some more info on this:

There is an MDS daemon option 'mds_tick_interval', described as "time in seconds between upkeep tasks". In simple words: every 5 seconds (by default) there is a timer tick, which kicks mdslog to periodically flush the metadata. It is exactly this flush that generates OP_FLUSH_ACK, which the kernel side waits for (as mentioned by David). If this value is reduced to decrease flush latency, say to 0.5 seconds, the runtime decreases significantly.
[...]

Reducing that is probably not good for performance in general. What we really want to do is clue the MDS into the fact that we're (synchronously) waiting on the replies to these updates. That should help it expedite the flushes on the client's behalf.

Actions #7

Updated by David Disseldorp about 4 years ago

Jeff Layton wrote:

Ok, this patch seems to fix it for me. I think it's probably correct, but I need to look over the code a little more, and flesh out the patch description before posting. Jan, can you test it out too and get back to me?

Hi Jeff, I'd like to give this a spin. Which branch did you base it on? __prep_cap() doesn't appear to be present in mainline or ceph-client/testing.

Actions #8

Updated by Jeff Layton about 4 years ago

This is based on ceph-client/testing. This patch refactors some of the code into __prep_cap:

https://github.com/ceph/ceph-client/commit/0c4b255369bcfc48f46b8828dddc7434d39508a0
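For anyone else who wants to try it, a sketch of grabbing that branch and confirming the refactor commit is present (local paths are assumptions):

git clone -b testing https://github.com/ceph/ceph-client.git
cd ceph-client
# confirm the __prep_cap refactor referenced above is in this branch
git merge-base --is-ancestor 0c4b255369bcfc48f46b8828dddc7434d39508a0 HEAD && echo "__prep_cap refactor present"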

Actions #9

Updated by David Disseldorp about 4 years ago

Jeff Layton wrote:

This is based on ceph-client/testing. This patch refactors some of the code into __prep_cap:

https://github.com/ceph/ceph-client/commit/0c4b255369bcfc48f46b8828dddc7434d39508a0

Thanks. It seems I was looking at a stale ceph-client testing. The patch works nicely for me - FS sync time goes from 1.330s to 0.028s :).

Actions #10

Updated by Jan Fajerski about 4 years ago

Thx Jeff! I also gave it a quick spin and it does improve things quite a bit.

I guess I should still open a libcephfs tracker to look into whether this is working correctly or not?

Actions #11

Updated by Jeff Layton about 4 years ago

Jan Fajerski wrote:

Thx Jeff! I also gave it a quick spin and it does improve things quite a bit.

I guess I should still open a libcephfs tracker to look into whether this is working correctly or not?

Yes, please. That sounds like it may be a data-integrity issue.

It's not 100% clear to me that it's required that the kernel write back all of the data before a non-MNT_DETACH umount2() returns, but it does seem like that's what userland would expect.

Actions #12

Updated by Jeff Layton about 4 years ago

The other bit I'm not sure about is whether we want to set this every time CHECK_CAPS_FLUSH is set. I wonder if we could somehow get away with just setting this flag on the last cap flush to be sent.

OTOH, the MDS is pretty simple minded about the ordering of incoming independent requests, so it may not make much difference.

Actions #13

Updated by Jan Fajerski about 4 years ago

  • Project changed from CephFS to Linux kernel client
Actions #14

Updated by Jan Fajerski about 4 years ago

  • Copied to Bug #44850: sync on libcephfs and wait for CEPH_CAP_OP_FLUSH_ACK added
Actions #15

Updated by Greg Farnum about 4 years ago

Jeff Layton wrote:

The other bit I'm not sure about is whether we want to set this every time CHECK_CAPS_FLUSH is set. I wonder if we could somehow get away with just setting this flag on the last cap flush to be sent.

OTOH, the MDS is pretty simple minded about the ordering of incoming independent requests, so it may not make much difference.

It'll make a BIG difference — if you issue that flag only on the last flush, it will force a single journal commit with all the cap updates. If you issue it for each one, every one of those caps will be a separate journal update the MDS has to prepare and send out to RADOS.

Actions #16

Updated by Jeff Layton about 4 years ago

  • Status changed from New to In Progress

Zheng shot down the initial draft, since we'd be setting the CLIENT_CAPS_SYNC flag too often. What we probably need to do is set it on the last cap making the dirty->flush transition on a session.

That's not simple to determine though. We'll need to add new infrastructure for it.

Actions #17

Updated by Jeff Layton about 4 years ago

I dropped a dump_stack() into __mark_caps_flushing during one of the xfs_io+sync tests. I see this:

[ 5864.771799] CPU: 3 PID: 5222 Comm: kworker/3:0 Kdump: loaded Tainted: G           OE     5.6.0+ #32
[ 5864.773114] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 5864.774890] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 5864.775607] Call Trace:
[ 5864.775958]  dump_stack+0x66/0x90
[ 5864.776396]  __mark_caps_flushing+0xb7/0xe0 [ceph]
[ 5864.777029]  ? __cap_is_valid+0x1c/0xb0 [ceph]
[ 5864.777602]  ? __ceph_caps_issued+0x6d/0xc0 [ceph]
[ 5864.778234]  ceph_check_caps+0x683/0x9c0 [ceph]
[ 5864.778840]  ? sched_clock+0x5/0x10
[ 5864.779292]  ? sched_clock_cpu+0xc/0xa0
[ 5864.779806]  ? try_to_wake_up+0x218/0x670
[ 5864.780318]  ? apic_timer_interrupt+0xa/0x20
[ 5864.780875]  ? release_pages+0x3c0/0x470
[ 5864.781387]  ceph_put_wrbuffer_cap_refs+0x1ac/0x300 [ceph]
[ 5864.782103]  writepages_finish+0x2d8/0x420 [ceph]
[ 5864.782718]  __complete_request+0x22/0x70 [libceph]
[ 5864.783350]  dispatch+0x5a/0xb20 [libceph]
[ 5864.783894]  ? calc_signature+0xdb/0x100 [libceph]
[ 5864.784516]  ceph_con_workfn+0xd40/0x2790 [libceph]
[ 5864.785154]  ? __switch_to_asm+0x40/0x70
[ 5864.785662]  ? __switch_to_asm+0x34/0x70
[ 5864.786180]  ? __switch_to_asm+0x40/0x70
[ 5864.786687]  ? __switch_to_asm+0x34/0x70
[ 5864.787199]  ? __switch_to_asm+0x40/0x70
[ 5864.787703]  ? __switch_to_asm+0x40/0x70
[ 5864.788223]  ? __switch_to+0x152/0x420
[ 5864.788716]  ? __switch_to_asm+0x34/0x70
[ 5864.789229]  process_one_work+0x1b5/0x360
[ 5864.790161]  worker_thread+0x50/0x3c0
[ 5864.791020]  kthread+0xf9/0x130
[ 5864.791814]  ? process_one_work+0x360/0x360
[ 5864.792733]  ? kthread_park+0x90/0x90
[ 5864.793577]  ret_from_fork+0x22/0x40

We create the file and write to the pagecache, and then the sync occurs and writeback is done. The cap flush occurs as the wrbuffer refs are put.

I think we probably have to do what Zheng suggests and keep track of the dirty inodes on a per-session basis, but I'd like to avoid adding yet another single-use list_head to the inode.

It might even be nice to consolidate i_dirty_item and i_flushing_item as they don't appear to be used at the same time.

Actions #18

Updated by Jeff Layton about 4 years ago

Fixed in the ceph/testing branch by moving flushing inodes to be tracked per-session. That fix should make v5.7 or v5.8.

Actions #19

Updated by Jeff Layton over 3 years ago

  • Status changed from In Progress to Resolved

This went into v5.8 (commit d67c72e6cce99eab5ab9d62c599e33e5141ff8b4).
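A quick way to check that the fix is in a given release, as a sketch (the path to the local kernel clone is an assumption):

# exits 0 if the fix commit is an ancestor of the v5.8 tag
git -C ~/src/linux merge-base --is-ancestor d67c72e6cce99eab5ab9d62c599e33e5141ff8b4 v5.8 && echo "fix is in v5.8"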

Actions #20

Updated by Patrick Donnelly about 2 years ago

  • Has duplicate Bug #45153: fsync locking up in certain conditions added