Project

General

Profile

Bug #20945

get_quota_root sends lookupname op for every buffered write

Added by Dan van der Ster 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
08/08/2017
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
jewel, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
jewel
Component(FS):
Client
Needs Doc:
No

Description

We have a CAD use-case (hspice) which sees very slow buffered writes, apparently due to the quota code. (We haven't yet found a trivial reproducer for this).

hspice creates a dir (input.pvadir), then makes several thousand small writes a file, e.g. input.pvadir/pvaRadh.va.
What we're finding is that each write triggers a lookupname op which gets permission denied. So these several thousand buffered writes results in thousands of round trips for lookupname (even though the actual write doesn't happen until later).

Here is the client ll_write (debug_client = 10, debug_ms = 1), where get_quota_root is getting -13 permission denied:

2017-08-08 09:01:39.085129 7fbd02c29700  3 client.406740560 ll_write 0x7fbd3ea94b00 100031a3a42 0~380
2017-08-08 09:01:39.085166 7fbd02c29700 10 client.406740560 choose_target_mds from caps on inode 100031a3a42.head(faked_ino=0 ref=3 ll_ref=2 cap_refs={} open={3=1} mode=100640 size=0/4194304 mtime=2017-08-08 09:01:39.068420 caps=pAsxLsXsxFsxcrwb(0=pAsx
LsXsxFsxcrwb) objectset[100031a3a42 ts 0/0 objects 0 dirty_or_tx 0] 0x7fbd3f43bc00)
2017-08-08 09:01:39.085178 7fbd02c29700 10 client.406740560 send_request rebuilding request 1993 for mds.0
2017-08-08 09:01:39.085181 7fbd02c29700 10 client.406740560 send_request client_request(unknown.0:1993 lookupname #100031a3a42 2017-08-08 09:01:39.085163) v3 to mds.0
2017-08-08 09:01:39.085193 7fbd02c29700  1 -- 128.141.223.244:0/1329072871 --> 188.184.95.158:6800/65802118 -- client_request(client.406740560:1993 lookupname #100031a3a42 2017-08-08 09:01:39.085163) v3 -- ?+0 0x7fbd3ee58dc0 con 0x7fbd3e850900
2017-08-08 09:01:39.085291 7fbcfec27700  3 client.406740560 ll_forget 100031a3a43 5
2017-08-08 09:01:39.085308 7fbcfec27700 10 client.406740560 put_inode on 100031a3a43.head(faked_ino=0 ref=8 ll_ref=0 cap_refs={1024=1,2048=0,4096=0,8192=1} open={3=0} mode=100640 size=520880/4194304 mtime=2017-08-08 09:01:39.078734 caps=pAsxXsxFsxcrwb(0=pAsxXsxFsxcrwb) flushing_caps=Fw objectset[100031a3a43 ts 0/0 objects 1 dirty_or_tx 520880] 0x7fbd3f43b700)
2017-08-08 09:01:39.086123 7fbd1b740700  1 -- 128.141.223.244:0/1329072871 <== mds.0 188.184.95.158:6800/65802118 3674 ==== client_reply(???:1993 = -13 (13) Permission denied) v1 ==== 27+0+0 (1535740102 0 0) 0x7fbd3ee58dc0 con 0x7fbd3e850900
2017-08-08 09:01:39.086145 7fbd1b740700 10 client.406740560 insert_trace from 2017-08-08 09:01:39.085180 mds.0 is_target=0 is_dentry=0
2017-08-08 09:01:39.086148 7fbd1b740700 10 client.406740560 insert_trace -- no trace
2017-08-08 09:01:39.086237 7fbd02c29700  1 client.406740560 get_quota_root 100031a3a42.head failed to find parent of 100031a3a42.head err -13
2017-08-08 09:01:39.086252 7fbd02c29700 10 client.406740560 get_quota_root 100031a3a42.head -> 10000057b9d.head
2017-08-08 09:01:39.086254 7fbd02c29700 10 client.406740560 cur file size is 0
2017-08-08 09:01:39.086258 7fbd02c29700 10 client.406740560 wanted_max_size 0 -> 380

The actual write happens after all of the lookupname ops finish:

2017-08-08 09:01:52.023870 7fbd1d741700  1 -- 128.141.223.244:0/1329072871 --> 128.142.162.18:6871/2238904 -- osd_op(client.406740560.0:293 173.8bfc0696 fsvolumens_e0054975-38b9-4f1a-a6fa-04a50b59db0d/100031a3a42.00000000 [write 0~11106] snapc 1=[] ondisk+write+known_if_redirected e123636) v7 -- ?+0 0x7fbd3f7c0000 con 0x7fbd3ec61680
2017-08-08 09:01:52.027041 7fbcf54ce700  1 -- 128.141.223.244:0/1329072871 <== osd.74 128.142.162.18:6871/2238904 5 ==== osd_op_reply(293 100031a3a42.00000000 [write 0~11106] v123636'18674150 uv18674150 ondisk = 0) v7 ==== 140+0+0 (447776908 0 0) 0x7fbd3f0ed800 con 0x7fbd3ec61680

(In this case, the writes only take 13 seconds. We have other examples where files take ~3 minutes to finish ~8000 small writes.)
The complete log for above was posted with ceph-post-file: 10312221-6666-4d96-9d38-1c0f9b2c8638.

On the MDS we see that lookupname uses 'check_access stray_prior_path' but also 'check_access path ~mds0/stray0'. The latter check returns 13 permission denied. (note that the logs below are not for the exact same instance of this error -- above is from today, below is from yesterday)

2017-08-07 16:09:33.638612 7f7eb764c700  7 mds.0.server dispatch_client_request client_request(client.406335956:48728 lookupname #1000312b7f0 2017-08-07 16:09:33.638346) v3
2017-08-07 16:09:33.638622 7f7eb764c700 20 Session check_access stray_prior_path /volumes/_nogroup/e0054975-38b9-4f1a-a6fa-04a50b59db0d/simulation/wbialas/ResDir/wbTest/testDutWbInv/adexl/results/data/Interactive.27/1/wbTest:testDutWbIn
v:1/netlist/input.pvadir/pvaRadh.va
2017-08-07 16:09:33.638625 7f7eb764c700 10 MDSAuthCap is_capable inode(path /volumes/_nogroup/e0054975-38b9-4f1a-a6fa-04a50b59db0d/simulation/wbialas/ResDir/wbTest/testDutWbInv/adexl/results/data/Interactive.27/1/wbTest:testDutWbInv:1/n
etlist/input.pvadir/pvaRadh.va owner 10899:2790 mode 0100640) by caller 0:0 mask 0 new 0:0 cap: MDSAuthCaps[allow rw path="/volumes/_nogroup/e0054975-38b9-4f1a-a6fa-04a50b59db0d"]
2017-08-07 16:09:33.638633 7f7eb764c700 10 mds.0.locker acquire_locks mutation(0x7f7ed586a000)
2017-08-07 16:09:33.638636 7f7eb764c700 20 mds.0.locker  must rdlock (dn sync) [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inodepin=1 dirty=1 authpin=0 0x7f
7ef4e73ff0]
2017-08-07 16:09:33.638675 7f7eb764c700 10 mds.0.locker  must authpin [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inodepin=1 dirty=1 authpin=0 0x7f7ef4e73ff
0]
2017-08-07 16:09:33.638684 7f7eb764c700 10 mds.0.locker  auth_pinning [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inodepin=1 dirty=1 authpin=0 0x7f7ef4e73ff
0]
2017-08-07 16:09:33.638690 7f7eb764c700 10 mds.0.cache.den(600 1000312b7f0) auth_pin by 0x7f7ed586a000 on [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 ap=1+0 inode=0x7f7f01b09bd0 state=new | request=1 lock=0
 inodepin=1 dirty=1 authpin=1 0x7f7ef4e73ff0] now 1+0
2017-08-07 16:09:33.638697 7f7eb764c700 15 mds.0.cache.dir(600) adjust_nested_auth_pins 1/1 on [dir 600 ~mds0/stray0/ [2,head] auth v=179384689 cv=179384213/179384213 ap=0+1+1 state=1610645506|complete f(v32 m2017-08-07 16:08:47.395419
9=9+0)/f(v32 m2017-08-07 16:08:47.395419 14=14+0) n(v92 rc2017-08-07 16:08:47.395419 b32418 9=9+0) hs=9+1088,ss=0+0 dirty=23 | child=1 sticky=1 dirty=1 waiter=0 authpin=0 0x7f7ecb1e5500] by 0x7f7ed586a000 count now 0 + 1
2017-08-07 16:09:33.638746 7f7eb764c700 15 mds.0.cache.dir(100) adjust_nested_auth_pins 1/0 on [dir 100 ~mds0/ [2,head] auth v=122638810 cv=122564064/122564064 dir_auth=0 ap=0+0+1 state=1610612738|complete f(v0 10=0+10) n(v807098 rc2017
-08-07 16:09:33.419206 b864793348 422=385+37)/n(v807098 rc2017-08-07 16:09:32.196749 b593074823 369=332+37) hs=10+0,ss=0+0 dirty=10 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 authpin=0 0x7f7ecaec4300] by 0x7f7ed586a000 count now
 0 + 1
2017-08-07 16:09:33.638772 7f7eb764c700  7 mds.0.locker rdlock_start  on (dn sync) on [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 ap=1+0 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inodepin=1 dirty=1
authpin=1 0x7f7ef4e73ff0]
2017-08-07 16:09:33.638787 7f7eb764c700 10 mds.0.locker  got rdlock on (dn sync r=1) [dentry #100/stray0/1000312b7f0 [2,head] auth (dn sync r=1) (dversion lock) v=179384688 ap=1+0 inode=0x7f7f01b09bd0 state=new | request=1 lock=1 inodep
in=1 dirty=1 authpin=1 0x7f7ef4e73ff0]
2017-08-07 16:09:33.638812 7f7eb764c700 20 Session check_access path ~mds0/stray0
2017-08-07 16:09:33.638817 7f7eb764c700 10 MDSAuthCap is_capable inode(path /mds0/stray0 owner 0:0 mode 040500) by caller 0:0 mask 1 new 0:0 cap: MDSAuthCaps[allow rw path="/volumes/_nogroup/e0054975-38b9-4f1a-a6fa-04a50b59db0d"]
2017-08-07 16:09:33.638823 7f7eb764c700 10 mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.406335956:48728 lookupname #1000312b7f0 2017-08-07 16:09:33.638346) v3
2017-08-07 16:09:33.638849 7f7eb764c700 10 mds.0.server apply_allocated_inos 0 / [] / 0
2017-08-07 16:09:33.638857 7f7eb764c700 20 mds.0.server lat 0.000412
2017-08-07 16:09:33.638887 7f7eb764c700  7 mds.0.cache request_finish mutation(0x7f7ed586a000)
2017-08-07 16:09:33.638903 7f7eb764c700 15 mds.0.cache request_cleanup mutation(0x7f7ed586a000)
2017-08-07 16:09:33.638914 7f7eb764c700  7 mds.0.locker rdlock_finish on (dn sync) on [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 ap=1+0 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inodepin=1 dirty=1
authpin=1 0x7f7ef4e73ff0]
2017-08-07 16:09:33.638940 7f7eb764c700 10 mds.0.locker simple_eval (dn sync) on [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 ap=1+0 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inodepin=1 dirty=1 authp
in=1 0x7f7ef4e73ff0]
2017-08-07 16:09:33.638950 7f7eb764c700 10 mds.0.cache.den(600 1000312b7f0) auth_unpin by 0x7f7ed586a000 on [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 inode=0x7f7f01b09bd0 state=new | request=1 lock=0 inod
epin=1 dirty=1 authpin=0 0x7f7ef4e73ff0] now 0+0
2017-08-07 16:09:33.638957 7f7eb764c700 15 mds.0.cache.dir(600) adjust_nested_auth_pins -1/-1 on [dir 600 ~mds0/stray0/ [2,head] auth v=179384689 cv=179384213/179384213 state=1610645506|complete f(v32 m2017-08-07 16:08:47.395419 9=9+0)/
f(v32 m2017-08-07 16:08:47.395419 14=14+0) n(v92 rc2017-08-07 16:08:47.395419 b32418 9=9+0) hs=9+1088,ss=0+0 dirty=23 | child=1 sticky=1 dirty=1 waiter=0 authpin=0 0x7f7ecb1e5500] by 0x7f7ed586a000 count now 0 + 0
2017-08-07 16:09:33.638983 7f7eb764c700 15 mds.0.cache.dir(100) adjust_nested_auth_pins -1/0 on [dir 100 ~mds0/ [2,head] auth v=122638810 cv=122564064/122564064 dir_auth=0 state=1610612738|complete f(v0 10=0+10) n(v807098 rc2017-08-07 1
6:09:33.419206 b864793348 422=385+37)/n(v807098 rc2017-08-07 16:09:32.196749 b593074823 369=332+37) hs=10+0,ss=0+0 dirty=10 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 authpin=0 0x7f7ecaec4300] by 0x7f7ed586a000 count now 0 + 0
2017-08-07 16:09:33.639005 7f7eb764c700 10 mds.0.cache.strays eval_stray [dentry #100/stray0/1000312b7f0 [2,head] auth (dversion lock) v=179384688 inode=0x7f7f01b09bd0 state=new | request=0 lock=0 inodepin=1 dirty=1 authpin=0 0x7f7ef4e7
3ff0]
2017-08-07 16:09:33.639019 7f7eb764c700 10 mds.0.cache.strays  inode is [inode 1000312b7f0 [2,head] ~mds0/stray0/1000312b7f0 auth v179384688 dirtyparent s=17771 nl=0 n(v0 b17771 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion l
ock) cr={406335956=0-4194304@1} caps={406335956=pAsxLsXsxFsxcrwb/pAsxXsxFxcwb@5},l=406335956 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=0 0x7f7f01b09bd0]
2017-08-07 16:09:33.639070 7f7eb764c700 20 mds.0.cache.strays  caps | leases

I suspected that this is because of the mds path restricted prefix, but even after removing that path restriction I still saw thousands of lookupname, but this time they used ino 100, 'check_access path ~mds0', which results in error -2 no such file or directory.

2017-08-07 14:49:59.137276 7ff970837700  4 mds.0.server handle_client_request client_request(client.406370931:7594 lo
okupname #100 2017-08-07 14:49:59.133854) v3
2017-08-07 14:49:59.137279 7ff970837700 20 mds.0.server get_session have 0x7ffa884f0680 client.406370931 128.141.223.
244:0/3251053959 state open
2017-08-07 14:49:59.137281 7ff970837700 15 mds.0.server  oldest_client_tid=7593
2017-08-07 14:49:59.137285 7ff970837700  7 mds.0.cache request_start mutation(0x7ff9858d4800)
2017-08-07 14:49:59.137287 7ff970837700  7 mds.0.server dispatch_client_request client_request(client.406370931:7594
lookupname #100 2017-08-07 14:49:59.133854) v3
2017-08-07 14:49:59.137292 7ff970837700 20 Session check_access path ~mds0
2017-08-07 14:49:59.137293 7ff970837700 10 MDSAuthCap is_capable inode(path /mds0 owner 0:0 mode 040500) by caller 0:
0 mask 0 new 0:0 cap: MDSAuthCaps[allow rw]
2017-08-07 14:49:59.137297 7ff970837700 10 mds.0.server reply_client_request -2 ((2) No such file or directory) client_request(client.406370931:7594 lookupname #100 2017-08-07 14:49:59.133854) v3
2017-08-07 14:49:59.137308 7ff970837700 10 mds.0.server apply_allocated_inos 0 / [] / 0
2017-08-07 14:49:59.137309 7ff970837700 20 mds.0.server lat 0.002961
2017-08-07 14:49:59.137321 7ff970837700  7 mds.0.cache request_finish mutation(0x7ff9858d4800)
2017-08-07 14:49:59.137327 7ff970837700 15 mds.0.cache request_cleanup mutation(0x7ff9858d4800)

Our user is preparing to test without client-quota enabled, just to confirm that that works around this issue.

What's unclear to me is why get_quota_root needs to CEPH_MDS_OP_LOOKUPNAME for this case, instead of getting the parent_in from dn_set. I suspect that understanding that will lead to the solution.


Related issues

Copied to fs - Backport #21112: luminous: get_quota_root sends lookupname op for every buffered write Resolved
Copied to fs - Backport #21113: jewel: get_quota_root sends lookupname op for every buffered write Resolved

History

#1 Updated by Dan van der Ster 3 months ago

Our user confirmed that without client-quota their job finishes quickly:

cephfs - quota off :

           total cpu time          382.62 seconds
           total elapsed time      411.32 seconds

in comparison to cephfs quota ON :

           total cpu time          382.36 seconds
           total elapsed time      676.13 seconds

#2 Updated by Zheng Yan 3 months ago

enabling quota and writing to unlinked file can produce this easily. get_quota_root() uses dentry in dn_set if it has valid lease. If no dentry has valid lease, CEPH_MDS_OP_LOOKUPNAME is used. LOOKUPNAME handles case that inode or its ancestor directory get renamed by other client.

This issue is somewhat design flaw of quota, not easy to fix. As a workaround, you can check inode's link count, skip the LOOKNAME request if it's zero.

#3 Updated by Dan van der Ster 3 months ago

Thanks. You're right. Here's the trivial reproducer:

#!/usr/bin/env python
import os
import time

start = time.time()
f = open('test.txt','w')
for i in xrange(10000):
  f.write('x')
  f.flush()
f.close()
t = time.time()-start

print "Writing 10000 bytes to a file", t, "seconds" 

start = time.time()
f = open('test.txt','w')
os.unlink('test.txt')
for i in xrange(10000):
  f.write('x')
  f.flush()
f.close()
t = time.time()-start

print "Writing 10000 bytes to an unlinked file", t, "seconds" 

and running on my system, each flush triggers an additional lookupname, slowing things down considerably:

# ./writeunlinked.py
Writing 10000 bytes to a file 2.52147316933 seconds
Writing 10000 bytes to an unlinked file 17.2061231136 seconds

As a workaround, you can check inode's link count, skip the LOOKNAME request if it's zero.

I'll work on that patch.

#4 Updated by Dan van der Ster 3 months ago

This seems to work

diff --git a/src/client/Client.cc b/src/client/Client.cc
index d66fdc757a..e15cbff97b 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -12195,6 +12195,12 @@ Inode *Client::get_quota_root(Inode *in)
     if (cur == root_ancestor)
       break;

+    // deleted inode
+    if (cur->nlink == 0) {
+      cur = root_ancestor;
+      break;
+    }
+
     MetaRequest *req = new MetaRequest(CEPH_MDS_OP_LOOKUPNAME);
     filepath path(cur->ino);
     req->set_filepath(path);

but I don't understand in which case the CEPH_MDS_OP_LOOKUPNAME is needed. Can we drop it completely?

An alternative which I didn't yet test is

diff --git a/src/client/Client.cc b/src/client/Client.cc
index d66fdc757a..b99d0f04fb 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -12185,6 +12185,10 @@ Inode *Client::get_quota_root(Inode *in)
       }
     } else if (root_parents.count(cur)) {
       parent_in = root_parents[cur].get();
+    } else {
+      // unlinked inode
+      cur = root_ancestor;
+      break;
     }

     if (parent_in) {

Do you have some advice about that?

#5 Updated by Zheng Yan 3 months ago

lookupname is for following case:

directory /a /b have non-default quota

client A is writing /a/file

client B rename /a/file to /b/file

#7 Updated by Dan van der Ster 2 months ago

Could you also please add the luminous backport tag for this?

#8 Updated by Nathan Cutler 2 months ago

  • Backport changed from jewel to jewel, luminous

#9 Updated by Patrick Donnelly 2 months ago

  • Status changed from New to Pending Backport
  • Target version set to v12.2.0
  • Component(FS) Client added

#10 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #21112: luminous: get_quota_root sends lookupname op for every buffered write added

#11 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #21113: jewel: get_quota_root sends lookupname op for every buffered write added

#12 Updated by Nathan Cutler about 1 month ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF