Project

General

Profile

Bug #17563

extremely slow ceph_fsync calls

Added by Jeff Layton about 6 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
NFS (Linux Kernel)
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've been seeing problems with very slow fsyncs vs. a vstart cluster when I run ganesha on top of it. This is the ganesha log (with a little extra debugging in place):

12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] cih_get_by_key_latch :HT CACHE :DEBUG :cih cache hit slot 11725
12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :>>> ceph_write2
12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :>>> ceph_find_fd
12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :<<< ceph_find_fd
12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :>>> ceph_ll_write
12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :<<< ceph_ll_write
12/10/2016 13:02:30 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :>>> ceph_ll_fsync
12/10/2016 13:02:35 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :<<< ceph_ll_fsync
12/10/2016 13:02:35 : epoch 57fe6c17 : ganesha.poochiereds.net : ganesha.nfsd-2141[work-66] ceph_write2 :FSAL :NULL :<<< ceph_write2

...it's servicing a write, and finds that it needs to fsync it afterward (for whatever reason). That fsync takes 5s.

Here is the ceph client logs for that period from the machine running ganesha:

2016-10-12 13:02:30.157889 7f520a7c4700  3 client.4114 ll_fsync 0x7f522c005240 10000000004
2016-10-12 13:02:30.157895 7f520a7c4700  3 client.4114 _fsync(0x7f522c005240, data+metadata)
2016-10-12 13:02:30.157900 7f520a7c4700  3 client.4114 _fsync on 10000000004.head(faked_ino=0 ref=7 ll_ref=1 cap_refs={1024=1,4096=0,8192=1} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 274] parents=0x7f522c002620 0x7f522800dda0) (data+metadata)
2016-10-12 13:02:30.157917 7f520a7c4700 15 inode.get on 0x7f522800dda0 10000000004.head now 8
2016-10-12 13:02:30.157918 7f520a7c4700 10 client.4114 _flush 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=1,4096=0,8192=1} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 274] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:30.158050 7f520a7c4700 15 client.4114 using return-valued form of _fsync
2016-10-12 13:02:30.158053 7f520a7c4700 10 client.4114 check_caps on 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=1,4096=0,8192=1} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 274] parents=0x7f522c002620 0x7f522800dda0) wanted pAsxXsxFxcwb used Fcb issued pAsxLsXsxFsxcrwb revoking - is_delayed=1
2016-10-12 13:02:30.158072 7f520a7c4700 10 client.4114  cap mds.0 issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
2016-10-12 13:02:30.158082 7f520a7c4700 10 client.4114 mark_caps_flushing AxFxw 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=1,4096=0,8192=1} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 274] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:30.158095 7f520a7c4700 10 client.4114 send_cap 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=1,4096=0,8192=1} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 274] parents=0x7f522c002620 0x7f522800dda0) mds.0 seq 1 used Fcb want pAsxXsxFxcwb flush AxFxw retain pAsxLsxXsxFsxcrwbl held pAsxLsXsxFsxcrwb revoking - dropping -
2016-10-12 13:02:30.158116 7f520a7c4700 15 client.4114 auth cap, setting max_size = 274
2016-10-12 13:02:30.158160 7f520a7c4700 15 client.4114 waiting on unsafe requests, last tid 7
2016-10-12 13:02:30.158713 7f524dffb700 10 client.4114 ms_handle_connect on 192.168.1.3:6808/28650
2016-10-12 13:02:30.176904 7f524f7fe700 10 client.4114 _flushed 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=1,4096=0,8192=1} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:30.176937 7f524f7fe700  5 client.4114 put_cap_ref dropped last FILE_BUFFER ref on 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:30.176949 7f524f7fe700  5 client.4114 put_cap_ref dropped last FILE_CACHE ref on 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:30.176959 7f524f7fe700 10 client.4114 put_inode on 10000000004.head(faked_ino=0 ref=8 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:30.176969 7f524f7fe700 15 inode.put on 0x7f522800dda0 10000000004.head now 6
2016-10-12 13:02:30.464654 7f524effd700 21 client.4114 tick
2016-10-12 13:02:30.464683 7f524effd700 10 client.4114 check_caps on 10000000001.head(faked_ino=0 ref=2 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100664 size=601/4194304 mtime=2016-10-12 13:02:22.770067 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[10000000001 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f5248002ba0 0x7f522800c840) wanted - used Fc issued pAsxLsXsxFsxcrwb revoking - is_delayed=1
2016-10-12 13:02:30.464717 7f524effd700 10 client.4114  cap mds.0 issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
2016-10-12 13:02:30.464726 7f524effd700 10 client.4114 send_cap 10000000001.head(faked_ino=0 ref=2 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100664 size=601/4194304 mtime=2016-10-12 13:02:22.770067 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[10000000001 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f5248002ba0 0x7f522800c840) mds.0 seq 2 used Fc want - flush - retain pAsLsXsFsc held pAsxLsXsxFsxcrwb revoking - dropping AxXxFxrwb
2016-10-12 13:02:30.464753 7f524effd700 15 client.4114 auth cap, setting max_size = 601
2016-10-12 13:02:30.464821 7f524effd700 20 client.4114 trim_cache size 5 max 16384
2016-10-12 13:02:31.464792 7f524effd700 21 client.4114 tick
2016-10-12 13:02:31.464820 7f524effd700 20 client.4114 trim_cache size 5 max 16384
2016-10-12 13:02:32.464939 7f524effd700 21 client.4114 tick
2016-10-12 13:02:32.464971 7f524effd700 20 client.4114 trim_cache size 5 max 16384
2016-10-12 13:02:33.465091 7f524effd700 21 client.4114 tick
2016-10-12 13:02:33.465121 7f524effd700 20 client.4114 trim_cache size 5 max 16384
2016-10-12 13:02:34.465247 7f524effd700 21 client.4114 tick
2016-10-12 13:02:34.465278 7f524effd700 20 client.4114 trim_cache size 5 max 16384
2016-10-12 13:02:35.104372 7f524dffb700 10 client.4114  mds.0 seq now 8
2016-10-12 13:02:35.104380 7f524dffb700  5 client.4114 handle_cap_grant on in 10000000004 mds.0 seq 2 caps now pAsxLsXsxFsxcrwb was pAsxLsXsxFsxcrwb
2016-10-12 13:02:35.104410 7f524dffb700 30 client.4114 Yay have enough caps to look at our times
2016-10-12 13:02:35.104412 7f524dffb700 10 client.4114   caps unchanged at pAsxLsXsxFsxcrwb
2016-10-12 13:02:35.104437 7f524dffb700 20 client.4114 handle_client_reply got a reply. Safe:1 tid 7
2016-10-12 13:02:35.104440 7f524dffb700 10 client.4114 insert_trace from 2016-10-12 13:02:30.154663 mds.0 is_target=0 is_dentry=0
2016-10-12 13:02:35.104444 7f524dffb700 10 client.4114 insert_trace -- already got unsafe; ignoring
2016-10-12 13:02:35.104500 7f520a7c4700 10 client.4114 put_inode on 10000000004.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:35.104546 7f520a7c4700 15 inode.put on 0x7f522800dda0 10000000004.head now 5
2016-10-12 13:02:35.104550 7f520a7c4700 10 client.4114 put_inode on 10000000000.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={} open={} mode=40775 size=0/0 mtime=2016-10-12 13:02:30.154610 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x7f51800013a0 0x7f5228009d30)
2016-10-12 13:02:35.104561 7f520a7c4700 15 inode.put on 0x7f5228009d30 10000000000.head now 3
2016-10-12 13:02:35.104591 7f520a7c4700 15 client.4114 waiting on data to flush
2016-10-12 13:02:35.104595 7f520a7c4700 15 client.4114 got 0 from flush writeback
2016-10-12 13:02:35.104597 7f520a7c4700 10 client.4114 wait_sync_caps on 10000000004.head(faked_ino=0 ref=5 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0) flushing AxFxw want 5 last 5
2016-10-12 13:02:35.104634 7f524dffb700 10 client.4114  mds.0 seq now 9
2016-10-12 13:02:35.104642 7f524dffb700  5 client.4114 handle_cap_flush_ack mds.0 cleaned AxFxw on 10000000004.head(faked_ino=0 ref=5 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=AxFxw objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0) with AxFxw
2016-10-12 13:02:35.104674 7f524dffb700  5 client.4114   flushing_caps AxFxw -> -
2016-10-12 13:02:35.104678 7f524dffb700 10 client.4114  10000000004.head(faked_ino=0 ref=5 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0) !flushing
2016-10-12 13:02:35.104695 7f524dffb700 10 client.4114 put_inode on 10000000004.head(faked_ino=0 ref=5 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:35.104707 7f524dffb700 15 inode.put on 0x7f522800dda0 10000000004.head now 4
2016-10-12 13:02:35.104727 7f520a7c4700 10 client.4114 ino 10000000004 has no uncommitted writes
2016-10-12 13:02:35.104736 7f520a7c4700 10 client.4114 put_inode on 10000000004.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100664 size=274/4194304 mtime=2016-10-12 13:02:30.157851 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[10000000004 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f522c002620 0x7f522800dda0)
2016-10-12 13:02:35.104751 7f520a7c4700 15 inode.put on 0x7f522800dda0 10000000004.head now 3
2016-10-12 13:02:35.104854 7f520a7c4700 10 client.4114 fill_statx on 10000000004 snap/devhead mode 0100664 mtime 2016-10-12 13:02:30.157851 ctime 2016-10-12 13:02:30.156302

...and here is the server side logs from that period. Note the big delay starting at 13:02:30.887872. The MDS is fairly idle (AFAICT) until it decides to issue caps to the client 5s later.

2016-10-12 13:02:30.486122 7f99a80ff700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6804/28458 -- osd_op(unknown.0.4:42 2.6e5f474 200.00000001 [write 12507~6136 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e11) v7 -- 0x55ead5833380 con 0
2016-10-12 13:02:30.512725 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 33 ==== osd_op_reply(42 200.00000001 [write 12507~6136 [fadvise_dontneed]] v11'14 uv14 ondisk = 0) v7 ==== 132+0+0 (3096853619 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:30.512880 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(grant ino 10000000002 4 seq 2 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxcwb follows 0 size 738/4194304 ts 1/18446744073709551615 mtime 2016-10-12 13:02:25.110851 tws 2) v9 -- 0x55ead5848880 con 0
2016-10-12 13:02:30.513015 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:5 = 0 (0) Success safe) v1 -- 0x55ead58aaa00 con 0
2016-10-12 13:02:30.513121 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(flush_ack ino 10000000002 4 seq 1 tid 3 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=- follows 0 size 0/0 mtime 0.000000) v9 -- 0x55ead5848d00 con 0
2016-10-12 13:02:30.513310 7f99a50f9700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6804/28458 -- osd_op(unknown.0.4:43 2.6e5f474 200.00000001 [write 18643~1976 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e11) v7 -- 0x55ead58343c0 con 0
2016-10-12 13:02:30.525192 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 19 ==== client_request(client.4114:6 create #10000000000/runtests.wrk 2016-10-12 13:02:30.102091 caller_uid=4447, caller_gid=4447{10,4447,}) v4 ==== 208+0+0 (1422016654 0 0) 0x55ead582a900 con 0x55ead58cc800
2016-10-12 13:02:30.525436 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:6 = 0 (0) Success unsafe) v1 -- 0x55ead582dc00 con 0
2016-10-12 13:02:30.529065 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 20 ==== client_caps(update ino 10000000003 5 seq 1 tid 4 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=pAsxXsxFxcwb follows 1 size 4093/4093 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.105686 tws 2) v9 ==== 232+0+0 (3730619151 0 0) 0x55ead5848d00 con 0x55ead58cc800
2016-10-12 13:02:30.538885 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 34 ==== osd_op_reply(43 200.00000001 [write 18643~1976 [fadvise_dontneed]] v11'15 uv15 ondisk = 0) v7 ==== 132+0+0 (3058672592 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:30.539123 7f99a50f9700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6804/28458 -- osd_op(unknown.0.4:44 2.6e5f474 200.00000001 [write 20619~8078 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e11) v7 -- 0x55ead58343c0 con 0
2016-10-12 13:02:30.573063 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 35 ==== osd_op_reply(44 200.00000001 [write 20619~8078 [fadvise_dontneed]] v11'16 uv16 ondisk = 0) v7 ==== 132+0+0 (1730219028 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:30.573218 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(grant ino 10000000003 5 seq 2 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxcwb follows 0 size 4093/4194304 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.105686 tws 2) v9 -- 0x55ead5849180 con 0
2016-10-12 13:02:30.573327 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:6 = 0 (0) Success safe) v1 -- 0x55ead58a8300 con 0
2016-10-12 13:02:30.573434 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(flush_ack ino 10000000003 5 seq 1 tid 4 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=- follows 0 size 0/0 mtime 0.000000) v9 -- 0x55ead5849600 con 0
2016-10-12 13:02:30.577737 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 21 ==== client_request(client.4114:7 create #10000000000/large4.sh 2016-10-12 13:02:30.154610 caller_uid=4447, caller_gid=4447{10,4447,}) v4 ==== 205+0+0 (1564375582 0 0) 0x55ead58a8300 con 0x55ead58cc800
2016-10-12 13:02:30.577919 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:7 = 0 (0) Success unsafe) v1 -- 0x55ead582ca00 con 0
2016-10-12 13:02:30.581220 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 22 ==== client_caps(update ino 10000000004 6 seq 1 tid 5 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=pAsxXsxFxcwb follows 1 size 274/274 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.157851 tws 2) v9 ==== 232+0+0 (4186497105 0 0) 0x55ead5847f80 con 0x55ead58cc800
2016-10-12 13:02:30.887872 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 23 ==== client_caps(update ino 10000000001 3 seq 2 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 601/601 ts 1/18446744073709551615 mtime 2016-10-12 13:02:22.770067 tws 2) v9 ==== 232+0+0 (2291189203 0 0) 0x55ead5848880 con 0x55ead58cc800
2016-10-12 13:02:33.702843 7f99a78fe700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6789/0 -- mdsbeacon(4113/a up:active seq 99 v5) v7 -- 0x55ead571ea80 con 0
2016-10-12 13:02:33.703359 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== mon.0 192.168.1.3:6789/0 136 ==== mdsbeacon(4113/a up:active seq 99 v5) v7 ==== 126+0+0 (2782338162 0 0) 0x55ead591b800 con 0x55ead57b1800
2016-10-12 13:02:35.486201 7f99a80ff700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6804/28458 -- osd_op(unknown.0.4:45 2.6e5f474 200.00000001 [write 28697~8880 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e11) v7 -- 0x55ead5833380 con 0
2016-10-12 13:02:35.526507 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 36 ==== osd_op_reply(45 200.00000001 [write 28697~8880 [fadvise_dontneed]] v11'17 uv17 ondisk = 0) v7 ==== 132+0+0 (852000238 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:35.526648 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(grant ino 10000000004 6 seq 2 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxcwb follows 0 size 274/4194304 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.157851 tws 2) v9 -- 0x55ead5847b00 con 0
2016-10-12 13:02:35.526762 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:7 = 0 (0) Success safe) v1 -- 0x55ead58a8900 con 0
2016-10-12 13:02:35.526858 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(flush_ack ino 10000000004 6 seq 1 tid 5 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=- follows 0 size 0/0 mtime 0.000000) v9 -- 0x55ead5848d00 con 0
2016-10-12 13:02:35.888510 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 24 ==== client_caps(update ino 10000000002 4 seq 2 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 738/738 ts 1/18446744073709551615 mtime 2016-10-12 13:02:25.110851 tws 2) v9 ==== 232+0+0 (3912545188 0 0) 0x55ead5848d00 con 0x55ead58cc800
2016-10-12 13:02:35.888665 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 25 ==== client_caps(update ino 10000000003 5 seq 2 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 4093/4093 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.105686 tws 2) v9 ==== 232+0+0 (2295529959 0 0) 0x55ead5847b00 con 0x55ead58cc800

2016-10-12 13:02:30.525436 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:6 = 0 (0) Success unsafe) v1 -- 0x55ead582dc00 con 0
2016-10-12 13:02:30.529065 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 20 ==== client_caps(update ino 10000000003 5 seq 1 tid 4 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=pAsxXsxFxcwb follows 1 size 4093/4093 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.105686 tws 2) v9 ==== 232+0+0 (3730619151 0 0) 0x55ead5848d00 con 0x55ead58cc800
2016-10-12 13:02:30.538885 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 34 ==== osd_op_reply(43 200.00000001 [write 18643~1976 [fadvise_dontneed]] v11'15 uv15 ondisk = 0) v7 ==== 132+0+0 (3058672592 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:30.539123 7f99a50f9700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6804/28458 -- osd_op(unknown.0.4:44 2.6e5f474 200.00000001 [write 20619~8078 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e11) v7 -- 0x55ead58343c0 con 0
2016-10-12 13:02:30.573063 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 35 ==== osd_op_reply(44 200.00000001 [write 20619~8078 [fadvise_dontneed]] v11'16 uv16 ondisk = 0) v7 ==== 132+0+0 (1730219028 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:30.573218 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(grant ino 10000000003 5 seq 2 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxcwb follows 0 size 4093/4194304 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.105686 tws 2) v9 -- 0x55ead5849180 con 0
2016-10-12 13:02:30.573327 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:6 = 0 (0) Success safe) v1 -- 0x55ead58a8300 con 0
2016-10-12 13:02:30.573434 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(flush_ack ino 10000000003 5 seq 1 tid 4 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=- follows 0 size 0/0 mtime 0.000000) v9 -- 0x55ead5849600 con 0
2016-10-12 13:02:30.577737 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 21 ==== client_request(client.4114:7 create #10000000000/large4.sh 2016-10-12 13:02:30.154610 caller_uid=4447, caller_gid=4447{10,4447,}) v4 ==== 205+0+0 (1564375582 0 0) 0x55ead58a8300 con 0x55ead58cc800
2016-10-12 13:02:30.577919 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:7 = 0 (0) Success unsafe) v1 -- 0x55ead582ca00 con 0
2016-10-12 13:02:30.581220 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 22 ==== client_caps(update ino 10000000004 6 seq 1 tid 5 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=pAsxXsxFxcwb follows 1 size 274/274 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.157851 tws 2) v9 ==== 232+0+0 (4186497105 0 0) 0x55ead5847f80 con 0x55ead58cc800
2016-10-12 13:02:30.887872 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 23 ==== client_caps(update ino 10000000001 3 seq 2 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 601/601 ts 1/18446744073709551615 mtime 2016-10-12 13:02:22.770067 tws 2) v9 ==== 232+0+0 (2291189203 0 0) 0x55ead5848880 con 0x55ead58cc800
2016-10-12 13:02:33.702843 7f99a78fe700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6789/0 -- mdsbeacon(4113/a up:active seq 99 v5) v7 -- 0x55ead571ea80 con 0
2016-10-12 13:02:33.703359 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== mon.0 192.168.1.3:6789/0 136 ==== mdsbeacon(4113/a up:active seq 99 v5) v7 ==== 126+0+0 (2782338162 0 0) 0x55ead591b800 con 0x55ead57b1800
2016-10-12 13:02:35.486201 7f99a80ff700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.3:6804/28458 -- osd_op(unknown.0.4:45 2.6e5f474 200.00000001 [write 28697~8880 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e11) v7 -- 0x55ead5833380 con 0
2016-10-12 13:02:35.526507 7f99ad89f700  1 -- 192.168.1.3:6812/3909720403 <== osd.1 192.168.1.3:6804/28458 36 ==== osd_op_reply(45 200.00000001 [write 28697~8880 [fadvise_dontneed]] v11'17 uv17 ondisk = 0) v7 ==== 132+0+0 (852000238 0 0) 0x55ead5832d00 con 0x55ead58c8000
2016-10-12 13:02:35.526648 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(grant ino 10000000004 6 seq 2 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxcwb follows 0 size 274/4194304 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.157851 tws 2) v9 -- 0x55ead5847b00 con 0
2016-10-12 13:02:35.526762 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_reply(???:7 = 0 (0) Success safe) v1 -- 0x55ead58a8900 con 0
2016-10-12 13:02:35.526858 7f99a58fa700  1 -- 192.168.1.3:6812/3909720403 --> 192.168.1.128:0/2550047853 -- client_caps(flush_ack ino 10000000004 6 seq 1 tid 5 caps=pAsxLsXsxFsxcrwb dirty=AxFxw wanted=- follows 0 size 0/0 mtime 0.000000) v9 -- 0x55ead5848d00 con 0
2016-10-12 13:02:35.888510 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 24 ==== client_caps(update ino 10000000002 4 seq 2 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 738/738 ts 1/18446744073709551615 mtime 2016-10-12 13:02:25.110851 tws 2) v9 ==== 232+0+0 (3912545188 0 0) 0x55ead5848d00 con 0x55ead58cc800
2016-10-12 13:02:35.888665 7f99aa904700  1 -- 192.168.1.3:6812/3909720403 <== client.4114 192.168.1.128:0/2550047853 25 ==== client_caps(update ino 10000000003 5 seq 2 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 4093/4093 ts 1/18446744073709551615 mtime 2016-10-12 13:02:30.105686 tws 2) v9 ==== 232+0+0 (2295529959 0 0) 0x55ead5847b00 con 0x55ead58cc800

...and at this point, I'm a little lost when it comes to MDS code. Why is it taking so long to hand out a cap grant here?

I can reproduce this pretty much at will, so if extra debug logging would help nail it down, then it should be possible to collect it.

ceph_write_fsync.c View - reproducer (1.33 KB) Jeff Layton, 10/13/2016 06:56 PM

createtest.c View - Simple posix testcase (1.1 KB) Jeff Layton, 11/01/2016 01:08 PM

History

#1 Updated by Zheng Yan about 6 years ago

The reason is that client fsync does not force MDS flush its journal. fsync may wait up to a MDS tick if there is no other event that make MDS flush its journal.

#2 Updated by Jeff Layton about 6 years ago

Is that something we can change? Slow fsync() performance is particularly awful for applications.

In any case, the client ends up blocking here in Client::_fsync :

  if (!syncdataonly && !in->unsafe_ops.empty()) {
    MetaRequest *req = in->unsafe_ops.back();
    ldout(cct, 15) << "waiting on unsafe requests, last tid " << req->get_tid() <<  dendl;
    req->get();
    wait_on_list(req->waitfor_safe);
    put_request(req);
  }

So I guess the question is: can we send something to the MDS before this that would force it do to a log flush to unwedge the cap grant?

#3 Updated by Jeff Layton about 6 years ago

  • Subject changed from extremely slow ceph_ll_fsync calls when called from nfs-ganesha to extremely slow ceph_ll_fsync calls

#4 Updated by Jeff Layton about 6 years ago

This program is a reproducer. You can build it with something like

gcc -o ceph_write_fsync ceph_write_fsync.c -lcephfs

When run inside time from a VM and talking to my vstart test rig:

$ time ./ceph_write_fsync 

real    6m20.617s
user    0m0.570s
sys    0m0.912s

When you change the dataonly parameter in the ceph_fsync call from false to true:

$ time ./ceph_write_fsync 

real    0m22.056s
user    0m0.502s
sys    0m1.060s

So that does jive with the delayed MDS journal flushes being the source of latency.

#5 Updated by Jeff Layton about 6 years ago

  • Subject changed from extremely slow ceph_ll_fsync calls to extremely slow ceph_fsync calls

#6 Updated by Greg Farnum about 6 years ago

Oh right, this doesn't even have a way to ask the MDS for immediate service — notice how it just puts already-sent requests on a wait list? And in general we don't want to force an immediate journal flush on every client request, because batching is good. We might be able to add in something via the prior check_caps() call to request earlier service...

What workload are you running that led to this?

#7 Updated by Jeff Layton about 6 years ago

Yep, exactly. I noticed this while running the cthon04 testsuite against it. It copies some source files into place and builds them to prepare for some tests, and that phase of the test run was taking forever. So, nothing exotic, I'm afraid.

On the prior check_caps call...yeah it would be ideal if we could communicate to the MDS that this is a foreground request and not to delay it. Is there already a way to do that in a cap update?

#8 Updated by Zheng Yan about 6 years ago

No method so far, need to extend the MClientCaps message

#9 Updated by Jeff Layton about 6 years ago

OTOH...do we even need a flag at all here? Under what circumstances is it beneficial to delay granting and recalling caps? It seems to me that if we have a client waiting on caps, then the MDS should probably grant them ASAP. After all, it's almost always going to be requesting them in order to do something, so there doesn't seem to be much benefit in delaying them just to try to batch things.

#10 Updated by Greg Farnum about 6 years ago

The client is waiting for an ack to a cap flush, not to get caps granted. Usually flushes happen asynchronously (just triggered by timers) and there's no particular hurry. Since we like batching and reducing RADOS-level IO, we don't want to instantly hit the journal every time a client gives us updated inode state if we don't have to.

#11 Updated by Jeff Layton about 6 years ago

Ok, thanks. That makes sense. I've got a patchset that works as a PoC, but it's pretty ugly and could use some cleanup. What I have now adds a new "foreground" bool argument to check_caps, and passes that along.

Before I add that though, I'd like to understand what the is_delayed parameter in check_caps actually means:

void Client::check_caps(Inode *in, bool is_delayed, bool foreground)            

{
...

  if (!is_delayed)                                                              
    cap_delay_requeue(in);                                                      
  else                                                                          
    in->hold_caps_until = utime_t();                                            

It's currently used in the reverse way from what I'd expect. When is delayed is false, it calls cap_delay_requeue to delay the cap request and when it's true it sets it up to be sent immediately.

Sage, that code was added by you a very long time ago (2008), do you recall the rationale for it?

#12 Updated by Jeff Layton about 6 years ago

It looks like ceph-fuse has the same problem with fsync. Here's a POSIX API reproducer that shows similar improvement with this series.

Next up, see if the kernel client has same behavior...

update: it does. I'll look at patching the kernel client next.

#13 Updated by Jeff Layton about 6 years ago

PR to fix the userland side of things is here:

https://github.com/ceph/ceph/pull/11710

#14 Updated by Jeff Layton about 6 years ago

  • Status changed from New to In Progress

PR for the userland code has been merged, and the kernel patches are in-progress.

#15 Updated by Jeff Layton almost 6 years ago

  • Status changed from In Progress to Resolved

Kernel patches merged for v4.10.

#16 Updated by Cain Miley about 5 years ago

I've created a test cluster (3 nodes, 9 osds spread across, mounting CephFS on 4th machine), and upgraded it to luminous, which should include the patch to fix this:

# ceph --version
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

and have updated the kernel on all machines:

# uname -sr
Linux 4.4.97-1.el7.elrepo.x86_64

However i'm still getting fsyncs that take 0 - 5 seconds (presumably due to some timer firing every 5 seconds?):

# time bash -c "echo 'Hello' > $(uuidgen) && sync"                                                                                                           

real    0m4.386s
user    0m0.000s
sys     0m0.005s

Am I missing something or is this bug still occurring?

#17 Updated by Zheng Yan about 5 years ago

your kernel does not include the fix, try ceph-fuse

#18 Updated by Jeff Layton about 5 years ago

As Zheng says, the kernel you're using doesn't have the fix for this. You need v4.10 or above (or backport the series that ends with 1e4ef0c6332).

#19 Updated by Cain Miley about 5 years ago

Ah thanks, I read that as 4.1 for some reason, my bad!

Also available in: Atom PDF