Feature #12334

nfs-ganesha: handle client cache pressure in NFS Ganesha FSAL

Added by John Spray almost 5 years ago. Updated 28 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: octopus,nautilus
Reviewed:
Affected Versions:
Component(FS): Client, Ganesha FSAL
Labels (FS): task(intern)
Pull request ID:

Description

Reported by Eric Eastman on ceph-users: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003000.html

When writing more files than the MDS cache size allows via an NFS Ganesha client, the user sees "Client <foo> failing to respond to cache pressure" warnings.

Presumably this is due to the NFS layer taking references to inodes using the ll interface to libcephfs, and not also having a hook to be kicked to release cached inodes in response to cache pressure.


Related issues

Related to fs - Feature #18537: libcephfs cache invalidation upcalls Rejected 01/16/2017
Related to fs - Bug #44976: MDS problem slow requests, cache pressure, damaged metadata after upgrading 14.2.7 to 14.2.8 New
Duplicated by fs - Bug #45114: client: make cache shrinking callbacks available via libcephfs Duplicate
Copied to fs - Backport #45688: octopus: nfs-ganesha: handle client cache pressure in NFS Ganesha FSAL Resolved
Copied to fs - Backport #45689: nautilus: nfs-ganesha: handle client cache pressure in NFS Ganesha FSAL Resolved

History

#1 Updated by Greg Farnum almost 5 years ago

Note that NFS-Ganesha created significantly more inodes than the cache size limit before it had too many pinned. So it is dropping some caps. But we just don't have any feedback mechanism.

NFS file handles are probably the right way to deal with this; we ought to be able to drop basically whatever we want if the only user is an NFS client...

#2 Updated by Eric Eastman almost 5 years ago

I ran the same 5 million file create test using a CIFS mount instead of an NFS mount and did not see the "Client <foo> failing to respond to cache pressure" warning. As with the NFS test, the file creates were split between 2 clients. Both clients used the cifs mount type. The Ubuntu Trusty client used SMB 3 with the fstab entry:
//ede-c2-gw01/cephfs /TEST-SMB cifs rw,guest,noauto,vers=3.0
The CentOS 6.6 client used SMB 1 with the fstab entry:
//ede-c2-gw01/cephfs /TEST-SMB cifs rw,guest,noauto

As in the NFS test, there was a single gateway, this time using the Ceph file system VFS interface to Samba. Samba version: 4.3.0pre1-GIT-2c1c567. The smb.conf entry for the file system was:

[cephfs]
path = /
writeable = yes
vfs objects = ceph
ceph:config_file = /etc/ceph/ceph.conf
browseable = yes

Next test will be with the kernel ceph file system interface.

#3 Updated by Eric Eastman almost 5 years ago

I am seeing the "Client <foo> failing to respond to cache pressure" warning using the Ceph kernel driver after creating the 5 million files. This was using only 1 Ceph file system client.

Client and all members of the Ceph cluster are running Ubuntu Trusty with 4.1 kernel and Ceph v9.0.1. Info from the client:

# lsmod | grep ceph
ceph                  323584  1 
libceph               253952  1 ceph
libcrc32c              16384  1 libceph
fscache                65536  1 ceph

# cat /etc/fstab | grep ceph
10.15.2.121,10.15.2.122,10.15.2.123:/ /cephfs ceph name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime,noauto,_netdev

# ceph --version
ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)
# cat /proc/version 
Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19 UTC 2015

This ceph -s was taken about 6 hours after the file creation tool finished, and the system was idle for these 6 hours:
# ceph -s
    cluster 6d8aae1e-1125-11e5-a708-001b78e265be
     health HEALTH_WARN
            mds0: Client ede-c2-gw01:cephfs failing to respond to cache pressure
     monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
            election epoch 22, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
     mdsmap e3512: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
     osdmap e590: 8 osds: 8 up, 8 in
      pgmap v473095: 832 pgs, 4 pools, 162 GB data, 4312 kobjects
            182 GB used, 78319 MB / 263 GB avail
                 832 active+clean

From the active MDS:

# ceph daemon mds.ede-c2-mds03 perf dump mds
{
    "mds": {
        "request": 734724,
        "reply": 734723,
        "reply_latency": {
            "avgcount": 734723,
            "sum": 2207.364066876
        },
        "forward": 0,
        "dir_fetch": 13,
        "dir_commit": 22483,
        "dir_split": 0,
        "inode_max": 100000,
        "inodes": 197476,
        "inodes_top": 0,
        "inodes_bottom": 0,
        "inodes_pin_tail": 197476,
        "inodes_pinned": 197476,
        "inodes_expired": 746886,
        "inodes_with_caps": 192730,
        "caps": 192730,
        "subtrees": 2,
        "traverse": 1476741,
        "traverse_hit": 742325,
        "traverse_forward": 0,
        "traverse_discover": 0,
        "traverse_dir_fetch": 1,
        "traverse_remote_ino": 0,
        "traverse_lock": 0,
        "load_cent": 73472400,
        "q": 0,
        "exported": 0,
        "exported_inodes": 0,
        "imported": 0,
        "imported_inodes": 0
    }
}
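
The counters in this dump can be turned into a few ratios that make the state of the MDS cache easier to read. The following is a minimal sketch: the `summarize` helper and the names of the derived ratios are hypothetical and are not part of any Ceph tool. Only the counter values are taken from the dump above.

```python
import json

# Subset of the `perf dump mds` output quoted above (values copied verbatim).
PERF_DUMP = """
{
    "mds": {
        "reply_latency": {"avgcount": 734723, "sum": 2207.364066876},
        "inode_max": 100000,
        "inodes": 197476,
        "inodes_pinned": 197476,
        "inodes_with_caps": 192730,
        "caps": 192730
    }
}
"""


def summarize(perf_json: str) -> dict:
    """Derive a few ratios from an MDS perf dump (hypothetical helper)."""
    mds = json.loads(perf_json)["mds"]
    latency = mds["reply_latency"]
    return {
        # How far the cache is above its configured inode limit.
        "cache_fill_ratio": mds["inodes"] / mds["inode_max"],
        # Share of cached inodes that are pinned and cannot be trimmed.
        "pinned_fraction": mds["inodes_pinned"] / mds["inodes"],
        # Share of cached inodes on which a client still holds caps.
        "capped_fraction": mds["inodes_with_caps"] / mds["inodes"],
        # Mean request latency in seconds.
        "avg_reply_latency_s": latency["sum"] / latency["avgcount"],
    }


summary = summarize(PERF_DUMP)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

Read this way, the MDS holds almost twice `inode_max` inodes, all of them pinned, and nearly all of them still have client caps, which is consistent with the cache pressure warning persisting on an idle system.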

#4 Updated by Eric Eastman almost 5 years ago

I finished the final test using 1 Ceph file system client and the fuse interface. I ran the 5 million file create test a couple of times with the Ceph file system mounted through the fuse interface and did not see the "Client <foo> failing to respond to cache pressure" warning. The fstab entry used:

# cat /etc/fstab | grep ceph
id=cephfs,keyring=/etc/ceph/client.cephfs.keyring /cephfs fuse.ceph noatime,_netdev,noauto 0 0

Summary of the tests:
I saw the cache pressure warning with Ganesha NFS and the Ceph file system kernel interface. I did not see the warning with the SAMBA or the Ceph fuse.ceph interface.

Let me know if you need additional information.

#5 Updated by John Spray almost 5 years ago

It looks like the ganesha FSAL interface already includes the function `up_async_invalidate` for this sort of thing, though libcephfs itself doesn't currently provide a way to register such a hook. We'll need to update the FSAL and libcephfs.

#6 Updated by John Spray almost 5 years ago

Pinged Matt & Adam about this yesterday; Matt's planning to work on it at some stage. In some cases we may want to simply disable the ganesha cache in favour of Client()'s cache, but it should also be possible to implement these invalidation hooks.

#7 Updated by Patrick Donnelly over 2 years ago

  • Subject changed from Handle client cache pressure in NFS Ganesha FSAL to nfs-ganesha: handle client cache pressure in NFS Ganesha FSAL
  • Category changed from NFS (Linux Kernel) to Correctness/Safety
  • Assignee set to Jeff Layton
  • Priority changed from Normal to High
  • Target version set to v13.0.0
  • Tags set to ganesha,intern
  • Backport set to luminous
  • Release set to luminous
  • Component(FS) Client, Ganesha FSAL added

#8 Updated by Jeff Layton over 2 years ago

This is worth looking into, but I wonder how big of a problem this is once you disable a lot of the ganesha mdcache behavior.

The one thing we can't really disable in ganesha though is the filehandle cache, and each of those holds an Inode reference. We will probably need some mechanism to force some of those to be flushed out, but there is a bit of a problem:

ganesha may not always be able to tear down a specific filehandle if it's still in use, and libcephfs doesn't have any way to know which ones are. What may be best is just some mechanism to ask ganesha to clean out whatever filehandles are not still in use or maybe just ask it to shrink the cache by some number of entries.
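
The "shrink the cache by some number of entries" idea can be illustrated with a toy model. This is only a sketch under assumptions: `FileHandleCache` and its methods are hypothetical names, and nothing here reflects how ganesha actually stores filehandles. Each entry stands for one held Inode reference, and only entries with no active users are eligible for release.

```python
from collections import OrderedDict


class FileHandleCache:
    """Toy filehandle cache; each entry stands for one held Inode reference."""

    def __init__(self):
        # ino -> number of active users; dict order doubles as LRU order.
        self._entries = OrderedDict()

    def get(self, ino):
        """Register an active user and mark the entry as recently used."""
        self._entries[ino] = self._entries.get(ino, 0) + 1
        self._entries.move_to_end(ino)

    def put(self, ino):
        """Drop an active user; the entry stays cached with zero users."""
        self._entries[ino] -= 1

    def shrink(self, count):
        """Release up to `count` entries that have no active users, oldest first."""
        released = []
        for ino in list(self._entries):
            if len(released) >= count:
                break
            if self._entries[ino] == 0:
                del self._entries[ino]
                released.append(ino)
        return released

    def __len__(self):
        return len(self._entries)


cache = FileHandleCache()
for ino in (1, 2, 3, 4):
    cache.get(ino)
for ino in (1, 2, 4):
    cache.put(ino)          # inode 3 stays in use

released = cache.shrink(2)  # the caller asks for two entries back
print(released, len(cache))
```

The caller never names a specific filehandle, so it does not need to know which ones are in use. It asks for a number of entries and the cache decides which ones it can give up.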

#9 Updated by Jeff Layton over 2 years ago

Possibly related tracker:

http://tracker.ceph.com/issues/18537

#10 Updated by Patrick Donnelly about 2 years ago

  • Related to Feature #18537: libcephfs cache invalidation upcalls added

#11 Updated by Patrick Donnelly about 2 years ago

  • Target version changed from v13.0.0 to v14.0.0
  • Tags deleted (ganesha,intern)
  • Backport changed from luminous to mimic,luminous
  • Labels (FS) task(intern) added

#12 Updated by Jeff Layton over 1 year ago

  • Status changed from New to Rejected

I've not heard of anyone hitting this that has set up ganesha to use the CACHEINODE and EXPORT parameters recommended here:

https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/ceph.conf

With that, I don't see a real need for this. I'm going to close this with a resolution of REJECTED for now, but feel free to reopen if you have a use-case you'd like considered for this.

#13 Updated by Zoltan Arnold Nagy 8 months ago

I do see this on a new mds setup, with 14.2.4, having the right ganesha setup:

root@c10n5:~# cat /etc/ganesha/ganesha.conf | grep -i 'cache_size|expiration_time'
Cache_Size = 1;
root@c10n5:~# cat /etc/ganesha/ganesha.conf | grep -i Attr_Expiration_Time
Attr_Expiration_Time = 0;
root@c10n5:~#

and yet ceph -s tells me the same thing:

1 MDSs report oversized cache
1 clients failing to respond to cache pressure

I don't have anything else using this other than ganesha.

What info can I provide?

#14 Updated by Jeff Layton 8 months ago

Zoltan Arnold Nagy wrote:

What info can I provide?

I think it'd be best to open a new tracker ticket for the problem you're having. This one is all about the ceph client failing to respond to cache pressure, which may or may not be the case here.

Be sure to save the MDS logfile, in particular any "MDS cache is too large" messages, which should tell us something about which caches are too large. We'd also need to know something about the workload on the NFS clients. Are they holding a lot of files open?

#15 Updated by Janek Bevendorff 6 months ago

I am seeing similar issues on our cluster. I had the Ganesha node running on the same node as the MONs just for convenience, so you could use the same domain to connect to, but found that it puts too much load on those nodes. Therefore, I moved the Ganesha services to different nodes and the "failing to respond to cache pressure" immediately moved to the new node as well.

Right now, I only have one (!) mostly idle client with two shares connected. The MDS reports 3653 caps allocated, but apparently that's enough to show the warning. When I restart the Ganesha server, the message goes away, but it only takes about 2-5 minutes and a tiny backup task for it to reappear.

#16 Updated by Jeff Layton 3 months ago

  • Status changed from Rejected to In Progress
  • Target version changed from v14.0.0 to v16.0.0

Reopening this bug. We've had some other reports of this upstream as well, and I'm convinced we'll need to add some way to relay cache pressure to ganesha. There are a couple of issues here:

1) libcephfs doesn't provide an interface to set callbacks for this (see tracker #45114). That's fairly simple to solve.

2) libcephfs doesn't have a mechanism to scan the inode_map and ask the application to release inode refs. We currently scan for dentries and release them in trim_cache(), but ganesha looks up inodes by vinodeno_t. There may be no dentries associated with them. We may be able to just walk the inode_map after trimming dentries and ask the application to release what it can, or we may need to add a separate LRU for Inodes.

3) ganesha doesn't have a mechanism to allow libcephfs to request that it release an Inode reference, if it's able. It does have a facility for "upcalls", but it doesn't have one for this. That will need to be added.
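
The three pieces listed above fit together roughly as follows. This is a toy model, not the libcephfs or ganesha API: `ToyClient`, `ToyApplication` and all method names are hypothetical, and the "walk the inode_map" variant is used rather than a separate LRU.

```python
class ToyClient:
    """Stand-in for the client library side; not the real libcephfs API."""

    def __init__(self):
        self.inode_map = {}       # ino -> references held by the application
        self._release_cb = None

    def register_release_callback(self, cb):
        # Piece 1: an interface for the application to install a callback.
        self._release_cb = cb

    def add_ref(self, ino):
        self.inode_map[ino] = self.inode_map.get(ino, 0) + 1

    def put_ref(self, ino):
        self.inode_map[ino] -= 1
        if self.inode_map[ino] == 0:
            del self.inode_map[ino]

    def handle_cache_pressure(self):
        # Piece 2: walk the inode map and ask the application about each inode.
        if self._release_cb is None:
            return 0
        before = len(self.inode_map)
        for ino in list(self.inode_map):   # copy: callbacks mutate the map
            self._release_cb(ino)
        return before - len(self.inode_map)


class ToyApplication:
    """Stand-in for the NFS server: holds references, knows which are busy."""

    def __init__(self, client):
        self.client = client
        self.in_use = set()
        client.register_release_callback(self.on_release_request)

    def lookup(self, ino, busy=False):
        self.client.add_ref(ino)
        if busy:
            self.in_use.add(ino)

    def on_release_request(self, ino):
        # Piece 3: release the reference only if the application is able to.
        if ino not in self.in_use:
            self.client.put_ref(ino)


client = ToyClient()
app = ToyApplication(client)
for ino in range(100, 105):
    app.lookup(ino, busy=(ino == 102))

freed = client.handle_cache_pressure()
print(freed, sorted(client.inode_map))
```

The request is advisory in this model: the application may decline for inodes it still needs, so the client has to tolerate a walk that frees fewer inodes than it asked about.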

#17 Updated by Jeff Layton 3 months ago

Ok, I have a first stab at the ceph piece of this mostly done now. The ganesha piece still needs some work, as it's not trivial to atomically check whether an entry (aka inode) has files open and only unhash and decrement it if it doesn't. Hopefully I'll have something ready for testing soon though.
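
The atomicity concern is that an open can slip in between the "has files open?" check and the removal of the entry. One common way to close that window is to do both steps under the same lock that opens take. The sketch below assumes that design. `EntryTable` and its methods are hypothetical and do not describe ganesha's actual locking or reference counting.

```python
import threading


class Entry:
    def __init__(self, ino):
        self.ino = ino
        self.open_files = 0


class EntryTable:
    """Toy hash table of entries guarded by a single lock (assumed design)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}

    def insert(self, ino):
        with self._lock:
            return self._table.setdefault(ino, Entry(ino))

    def open_file(self, ino):
        """Return True if the entry was still hashed and is now marked open."""
        with self._lock:
            entry = self._table.get(ino)
            if entry is None:
                return False
            entry.open_files += 1
            return True

    def close_file(self, ino):
        with self._lock:
            self._table[ino].open_files -= 1

    def try_release(self, ino):
        """Unhash the entry only if no files are open; check and removal share one lock."""
        with self._lock:
            entry = self._table.get(ino)
            if entry is None or entry.open_files > 0:
                return False
            del self._table[ino]
            return True


table = EntryTable()
table.insert(1)
table.insert(2)
table.open_file(2)

results = (table.try_release(1), table.try_release(2))
print(results)
```

Because `open_file` and `try_release` serialize on the same lock, an entry is either opened before the check (and kept) or unhashed first (and the open fails and must look the entry up again). A single global lock is the simplest way to show the idea; a real implementation would likely need something finer-grained.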

#18 Updated by Jeff Layton 3 months ago

  • Pull request ID set to 34596

#19 Updated by Jeff Layton 3 months ago

Ok, ceph patches are pretty much done. Just waiting on review so I can merge them. There are also some ganesha patches to make it use the new functionality, culminating in this patch:

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/490848

#20 Updated by Jeff Layton about 2 months ago

Ganesha patches are merged and have been for over a week. The libcephfs bits are also still ready, but testing is taking a lot longer than expected.

#21 Updated by Jeff Layton about 2 months ago

  • Status changed from In Progress to Resolved

Ceph patches were merged.

#22 Updated by Nathan Cutler about 2 months ago

  • Backport deleted (mimic,luminous)

Since target version is set to 16.0.0 and the status was changed to "Resolved", I guess backports are not needed (?)

#23 Updated by Patrick Donnelly about 2 months ago

  • Status changed from Resolved to Pending Backport
  • Backport set to octopus,nautilus

No, the backport release list just needs to be updated.

#24 Updated by Nathan Cutler about 2 months ago

  • Copied to Backport #45688: octopus: nfs-ganesha: handle client cache pressure in NFS Ganesha FSAL added

#25 Updated by Nathan Cutler about 2 months ago

  • Copied to Backport #45689: nautilus: nfs-ganesha: handle client cache pressure in NFS Ganesha FSAL added

#26 Updated by Patrick Donnelly about 1 month ago

  • Duplicated by Bug #45114: client: make cache shrinking callbacks available via libcephfs added

#27 Updated by Patrick Donnelly about 1 month ago

  • Related to Bug #44976: MDS problem slow requests, cache pressure, damaged metadata after upgrading 14.2.7 to 14.2.8 added

#28 Updated by Nathan Cutler 28 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
