Project

General

Profile

Bug #57210

NFS client unable to see newly created files when listing directory contents in a FS subvolume clone

Added by Ramana Raja 4 months ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
libcephfs
Labels (FS):
Manila
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Tried the following in a vstart cluster on ceph-main that launches ganesha v3.5 containers

$ MON=1 MGR=1 OSD=1 MDS=0 ../src/vstart.sh -d -n -x --cephadm
$ ./bin/ceph fs volume create a
$ ./bin/ceph fs subvolume create a subvol01
$ ./bin/ceph fs subvolume getpath a subvol01

$ ./bin/ceph nfs cluster create nfs-ganesha
$ ./bin/ceph nfs export create cephfs nfs-ganesha /cephfs3 a `./bin/ceph fs subvolume getpath a subvol01`
$ sudo mount.nfs4 127.0.0.1:/cephfs3 /mnt/nfs1/
$ pushd /mnt/nfs1/
$ sudo touch file00
$ # can see newly created file when listing directory contents
$ ls
file00
$ popd

$ ./bin/ceph fs subvolume snapshot create a subvol01 snap01
$ ./bin/ceph fs subvolume snapshot clone a subvol01 snap01 clone01
$ ./bin/ceph nfs export create cephfs nfs-ganesha /cephfs4 a `./bin/ceph fs subvolume getpath a clone01`
$ sudo mount.nfs4 127.0.0.1:/cephfs4 /mnt/nfs2/
$ pushd /mnt/nfs2/
$ ls
file00
$ sudo touch file01
$ # can see cloned 'file00' but cannot see the newly created file when reading the directory contents within the clone
$ ls
file00
# listing the file directly works
$ ls file01
file01

I suspect it's similar to https://tracker.ceph.com/issues/54501 where libcephfs might not be updating the directory inode's metadata attributes that would invalidate NFS client's readdir cache. It's weird that this issue is observed only in a cloned subvolume and not in a regular subvolume.


Related issues

Copied to CephFS - Backport #57879: quincy: NFS client unable to see newly created files when listing directory contents in a FS subvolume clone In Progress
Copied to CephFS - Backport #57880: pacific: NFS client unable to see newly created files when listing directory contents in a FS subvolume clone In Progress

History

#1 Updated by Greg Farnum 4 months ago

I wonder if the real difference here is not the cloned subvolume, but whether the mount point had files in it prior to mounting.

Perhaps there’s something about the nfs caching and our updates that is preventing some types of metadata updates from invalidating a listing cache? “touch” creates a dentry but may not updates rstats or other values as quickly as an actual file change does.
(Though it still should be bumping directory versions immediately, but I know nfs tries to be clever based on some specific values.)

#2 Updated by Venky Shankar 3 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Ramana Raja
  • Target version set to v18.0.0
  • Backport set to pacific,quincy

#3 Updated by Ramana Raja 3 months ago

Greg Farnum wrote:

I wonder if the real difference here is not the cloned subvolume, but whether the mount point had files in it prior to mounting.

I mounted a subvolume with pre-existing files, and I did not hit this issue. It must be something else that makes
a FS subvolume clone behave differently

$ sudo ./bin/ceph-fuse /mnt/cephfs/ --client-mountpoint=`./bin/ceph fs subvolume getpath a subvol01`
$ pushd /mnt/cephfs
$ sudo touch file00
$ ls
file00
$ popd
$ ./bin/ceph nfs export create cephfs nfs-ganesha /cephfs1 a `./bin/ceph fs subvolume getpath a subvol01`
$ sudo mount.nfs4 127.0.0.1:/cephfs1 /mnt/nfs1/
$ pushd /mnt/nfs1/
$ ls
file00
$ sudo touch file01
$ ls
file00  file01
$ popd
$ umount /mnt/nfs1/
$ sudo mount.nfs4 127.0.0.1:/cephfs1 /mnt/nfs1/
$ pushd /mnt/nfs1/
$ ls
file00  file01
$ sudo touch file02
$ ls
file00  file01  file02

Perhaps there’s something about the nfs caching and our updates that is preventing some types of metadata updates from invalidating a listing cache? “touch” creates a dentry but may not updates rstats or other values as quickly as an actual file change does.
(Though it still should be bumping directory versions immediately, but I know nfs tries to be clever based on some specific values.)

I'll check whether turning off ganesha's dentry caching helps. This issue was originally identified in a setup with ganesha dentry caching disabled , https://bugzilla.redhat.com/show_bug.cgi?id=2118263#c2 . Let me check anyways.

#4 Updated by Ramana Raja 3 months ago

Disabled caching to match the original BZ setup https://bugzilla.redhat.com/show_bug.cgi?id=2118263#c2

$ cat dirent_caching_off.conf
MDCACHE {
         Dir_Chunk = 0;
         NParts = 1;
         Cache_Size = 1;
}

LOG {
 COMPONENTS {
     ALL = FULL_DEBUG;
 }
}

$ ./bin/ceph nfs cluster config set nfs-ganesha dirent_caching_off.conf

I was able to reproduce the issue executing the same steps mentioned in the description.

#5 Updated by Ramana Raja 3 months ago

Also worth noting is that as per https://bugzilla.redhat.com/show_bug.cgi?id=2118263#c11 , this is a regression in the manila CI. This scenario used to pass with previous recent versions of Ceph (pacific and octopus) and NFS-Ganesha.

#6 Updated by Ramana Raja 3 months ago

Setup =====
- ceph vstart cluster (ceph main branch: HEAD 91c57c3fb160db1c95d412b966d703ca08ee75ef)
- NFS-Ganesha built from next-branch using locally built ceph libraries (main branch)

$ ganesha.nfsd -v
NFS-Ganesha Release = V4.0.8
nfs-ganesha compiled on Aug 31 2022 at 07:58:26
Release comment = GANESHA file server is 64 bits compliant and supports NFS v3,4.0,4.1 (pNFS) and 9P
Git HEAD = 98fc2a9bbbc01a91a0b14626b9d8eae679d46bf1
Git Describe = V4.0.8-0-g98fc2a9bb

Steps =====
Created a subvolume (subvol00)

$ ./bin/ceph fs subvolume create a subvol00
$ ./bin/ceph fs subvolume getpath a subvol00
/volumes/_nogroup/subvol00/392dd422-1350-4db4-81b5-8cd268c68535

Created a NFS ganesha export for the CephFS subvolume; Mount the export and create a file.

$ ./bin/ceph auth get-or-create client.subvol00 mon 'allow r' osd 'allow rw tag cephfs data=a' mds 'allow rw path=/volumes/_nogroup/subvol00/392dd422-1350-4db4-81b5-8cd268c68535'
$ ./bin/rados -p cephfs.a.data -N nfs-ganesha get subvol00-export -
EXPORT
{
    Export_ID = 100;
    Protocols = 4;
    Transports = TCP;
    Path = "/volumes/_nogroup/subvol00/392dd422-1350-4db4-81b5-8cd268c68535";
        Pseudo = "/subvol00";
        Access_Type = RW;
        Attr_Expiration_Time = 0;
        FSAL {
        Name = CEPH;
        Filesystem = "a";
        User_Id = "subvol00";
        Secret_Access_Key = "AQDskxdjgh7cIxAAbtrMsecJJ6hyK7XKAjKAwQ==";
        }
        CLIENT {
        Clients = *;
                Squash = None;
        }
}
$ ./bin/rados -p cephfs.a.data -N nfs-ganesha get conf-nfs-ganesha -
%url rados://cephfs.a.data/nfs-ganesha/subvol00-export
$ cat /etc/ganesha/ganesha.conf
NFS_CORE_PARAM {
    Enable_NLM = false;
        Enable_RQUOTA = false;
    Protocols = 4;
        NFS_Port = 2049;
}
MDCACHE {
    # Size the dirent cache down as small as possible.
    Dir_Chunk = 0;
}
NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}
RADOS_KV {
        # ceph_conf = /home/rraja/git/ceph/build/ceph.conf;
        UserId = admin;
        nodeid = localhost;
        pool = cephfs.a.data;
        namespace = "nfs-ganesha";
}
RADOS_URLS {
        UserId = admin;
        watch_url = "rados://cephfs.a.data/nfs-ganesha/conf-nfs-ganesha";
       #  ceph_conf = /home/rraja/git/ceph/build/ceph.conf;
}
LOG {
    COMPONENTS {
        ALL = FULL_DEBUG;
    }
}
%url    rados://cephfs.a.data/nfs-ganesha/conf-nfs-ganesha

$ sudo ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_FULL_DEBUG
$ sudo mount.nfs 127.0.0.1:/subvol00 /mnt/nfs
$ pushd /mnt/nfs/
$ sudo touch subvol00-file00.txt
$ popd
$ sudo umount /mnt/nfs

Create a CephFS subvolume clone (clone00) from subvolume (subvol00) snapshot. Create a ganesha export for the subvolume clone.

$ ./bin/ceph fs subvolume snapshot create a subvol00 snap00
$ ./bin/ceph fs subvolume snapshot clone a subvol00 snap00 clone0
$ ./bin/ceph fs subvolume getpath a clone00
/volumes/_nogroup/clone00/024e1b0b-e883-44a4-a572-a4f640b8937e
$ ./bin/ceph auth get-or-create client.clone00 mon 'allow r' osd "allow rw tag cephfs data=a" mds 'allow rw path=/volumes/_nogroup/clone00/024e1b0b-e883-44a4-a572-a4f640b8937e'
$ ./bin/rados -p cephfs.a.data -N nfs-ganesha get clone00 -
EXPORT
{
    Export_ID = 101;
    Protocols = 4;
    Transports = TCP;
    Path = "/volumes/_nogroup/clone00/024e1b0b-e883-44a4-a572-a4f640b8937e";
        Pseudo = "/clone00";
        Access_Type = RW;
        Attr_Expiration_Time = 0;
        FSAL {
        Name = CEPH;
        Filesystem = "a";
        User_Id = "clone00";
        Secret_Access_Key = "AQA6oBdj/rc9BBAAGr4+lf1OZjEO3nFIZIAGlA==";
        }
        CLIENT {
        Clients = *;
                Squash = None;
        }
}
$ ./bin/rados -p cephfs.a.data -N nfs-ganesha notify conf-nfs-ganesha "reread-export" 

Remount subvolume export using NFS-client. List the contents of subvolume. Touch a file. Relist directory to check file.

$ sudo mount.nfs 127.0.0.1:/subvol00 /mnt/nfs/ 
$ ls
subvol00-file00.txt
$ sudo touch subvol00-file01.txt && date
Tue Sep  6 03:36:57 PM EDT 2022
$ ls && date
subvol00-file00.txt  subvol00-file01.txt
Tue Sep  6 03:37:04 PM EDT 2022
$ # can see the newly created file

Corresponding to the "touch" command, I can see mdcache_dirent_invalidate_all() being called by ganesha. And in the libcephfs log I can see create request being sent to MDS.

grep -e 'ceph_fsal' -e 'subvol00-file01.txt' -e 'invalidate' ganesha.log | grep 15:36:57
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup :NFS READDIR :F_DBG :Lookup subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_lookup :FSAL :F_DBG :Lookup subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup subvol00-file01.txt failed with No such file or directory
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] open4_ex :STATE :F_DBG :Calling open2 for subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup :NFS READDIR :F_DBG :Lookup subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_lookup :FSAL :F_DBG :Lookup subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup subvol00-file01.txt failed with No such file or directory
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_open2 :FSAL :F_DBG :attrs  set attributes Valid Mask=00000080  mode=0644
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_open2 :RW LOCK :F_DBG :Got write lock on 0x7f1ec4010508 (&(*new_obj)->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:1606
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_open2 :RW LOCK :F_DBG :Unlocked 0x7f1ec4010508 (&(*new_obj)->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:1611
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_alloc_and_check_handle :INODE :F_DBG :open2 Created entry 0x7f1ec4010600 FSAL Ceph for subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] open2_by_name :FSAL :F_DBG :Created entry 0x7f1ec4010638 FSAL MDCACHE for subvol00-file01.txt
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_dirent_invalidate_all :NFS READDIR :F_DBG :Invalidating directory for 0x1b13af0, clearing MDCACHE_DIR_POPULATED setting MDCACHE_TRUST_CONTENT and MDCACHE_TRUST_DIR_CHUNKS
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] ceph_fsal_setattr2 :FSAL :F_DBG :attrs  set attributes Valid Mask=00600000  atime=SERVER mtime=SERVER
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] ceph_fsal_close2 :RW LOCK :F_DBG :Got write lock on 0x7f1ec4010508 (&obj_hdl->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2622
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] ceph_fsal_close2 :RW LOCK :F_DBG :Unlocked 0x7f1ec4010508 (&obj_hdl->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2628
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] ceph_fsal_close2 :RW LOCK :F_DBG :Got write lock on 0x7f1ec400fe78 (&my_fd->fdlock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2634
06/09/2022 15:36:57 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] ceph_fsal_close2 :RW LOCK :F_DBG :Unlocked 0x7f1ec400fe78 (&my_fd->fdlock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2636

$ grep 'send_request client_request' client.subvol00.221904.log | grep -e 15:36:57
2022-09-06T15:36:57.832-0400 7f1fb86f7640 10 client.4238 send_request client_request(unknown.0:19 create #0x10000000003/subvol00-file01.txt 2022-09-06T15:36:57.832746-0400 caller_uid=0, caller_gid=0{0,}) v6 to mds.0

Corresponding to the "ls" command after creating the file, I can see the NFS client sending a readdir request to the ganesha server in the ganesha log, and libcephfs client sending a readdir request to the MDS in the licephfs client logs.

$ grep -e 'ceph_fsal' -e 'subvol00-file01.txt' -e 'process_one_op :NFS4 :DEBUG' ganesha.log | grep 15:37:04
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] process_one_op :NFS4 :DEBUG :Request 0: opcode 53 is OP_SEQUENCE
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] process_one_op :NFS4 :DEBUG :Request 1: opcode 22 is OP_PUTFH
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] process_one_op :NFS4 :DEBUG :Request 2: opcode 9 is OP_GETATTR
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] process_one_op :NFS4 :DEBUG :Request 0: opcode 53 is OP_SEQUENCE
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] process_one_op :NFS4 :DEBUG :Request 1: opcode 22 is OP_PUTFH
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] process_one_op :NFS4 :DEBUG :Request 2: opcode 26 is OP_READDIR
06/09/2022 15:37:04 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_18] nfs4_readdir_callback :NFS READDIR :F_DBG :Entry subvol00-file01.txt pos 0 mem_left 3992

$  grep 'send_request client_request' client.subvol00.221904.log | grep -e 15:37:04
2022-09-06T15:37:04.428-0400 7f1fd04f5640 10 client.4238 send_request client_request(unknown.0:20 getattr pAsLsXsFsa #0x10000000003 2022-09-06T15:37:04.428807-0400 caller_uid=1000, caller_gid=1000{10,1000,}) v6 to mds.0
2022-09-06T15:37:04.429-0400 7f1fd04f5640 10 client.4238 send_request client_request(unknown.0:21 getattr pAsLsXsFsa #0x10000000003 2022-09-06T15:37:04.429616-0400 caller_uid=1000, caller_gid=1000{10,1000,}) v6 to mds.0
2022-09-06T15:37:04.429-0400 7f1fd04f5640 10 client.4238 send_request client_request(unknown.0:22 readdir #0x10000000003 2022-09-06T15:37:04.430192-0400 caller_uid=1000, caller_gid=1000{10,1000,}) v6 to mds.0

Mount the subvolume clone export using NFS-client. List the contents of the subvolume clone. Touch a file. Relist directory to check file.

$ sudo mount.nfs 127.0.0.1:/clone00 /mnt/nfs/ 
$ ls
subvol00-file00.txt
$ sudo touch subvol00-clone01.txt && date 
Tue Sep  6 03:37:46 PM EDT 2022
$ ls && date
subvol00-file00.txt
Tue Sep  6 03:37:53 PM EDT 2022
$ # cannot see the newly created file

Corresponding to the "touch" command, I can see mdcache_dirent_invalidate_all() being called by ganesha

$ grep -e 'ceph_fsal' -e 'subvol00-clone01.txt' -e 'invalidate' ganesha.log | grep 15:37:46
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup :NFS READDIR :F_DBG :Lookup subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_lookup :FSAL :F_DBG :Lookup subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup subvol00-clone01.txt failed with No such file or directory
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] open4_ex :STATE :F_DBG :Calling open2 for subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup :NFS READDIR :F_DBG :Lookup subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_lookup :FSAL :F_DBG :Lookup subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup subvol00-clone01.txt failed with No such file or directory
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_open2 :FSAL :F_DBG :attrs  set attributes Valid Mask=00000080  mode=0644
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_open2 :RW LOCK :F_DBG :Got write lock on 0x7f1ec4012618 (&(*new_obj)->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:1606
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_open2 :RW LOCK :F_DBG :Unlocked 0x7f1ec4012618 (&(*new_obj)->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:1611
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_alloc_and_check_handle :INODE :F_DBG :open2 Created entry 0x7f1ec4013fd0 FSAL Ceph for subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] open2_by_name :FSAL :F_DBG :Created entry 0x7f1ec4014008 FSAL MDCACHE for subvol00-clone01.txt
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_dirent_invalidate_all :NFS READDIR :F_DBG :Invalidating directory for 0x7f1f000719e0, clearing MDCACHE_DIR_POPULATED setting MDCACHE_TRUST_CONTENT and MDCACHE_TRUST_DIR_CHUNKS
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] ceph_fsal_setattr2 :FSAL :F_DBG :attrs  set attributes Valid Mask=00600000  atime=SERVER mtime=SERVER
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_close2 :RW LOCK :F_DBG :Got write lock on 0x7f1ec4012618 (&obj_hdl->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2622
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_close2 :RW LOCK :F_DBG :Unlocked 0x7f1ec4012618 (&obj_hdl->obj_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2628
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_close2 :RW LOCK :F_DBG :Got write lock on 0x7f1ec4011e88 (&my_fd->fdlock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2634
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] ceph_fsal_close2 :RW LOCK :F_DBG :Unlocked 0x7f1ec4011e88 (&my_fd->fdlock) at /home/rraja/git/nfs-ganesha/src/FSAL/FSAL_CEPH/handle.c:2636

$ # looking closer at the mdcache_dirent_invalidate_all
$ grep -e 'invalidate' -A 5 -B 5 ganesha.log | grep 15:37:46
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_getattrs :RW LOCK :F_DBG :Got read lock on 0x7f1f000719e0 (&entry->attr_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:948
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_getattrs :RW LOCK :F_DBG :Unlocked 0x7f1f000719e0 (&entry->attr_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:956
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_getattrs :RW LOCK :F_DBG :Got write lock on 0x7f1f000719e0 (&entry->attr_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:957
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_refresh_attrs :INODE :F_DBG :attrs  obj attributes Request Mask=0111dfce Valid Mask=0011ffce Supported Mask=0111dfee DIRECTORY numlinks=0x2 size=0x0 mode=0755 owner=0x0 group=0x0 atime=06/09/2022 14:37:33 mtime=06/09/2022 15:37:46
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_refresh_attrs :RW LOCK :F_DBG :Got write lock on 0x7f1f00071c50 (&entry->content_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:923
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_dirent_invalidate_all :NFS READDIR :F_DBG :Invalidating directory for 0x7f1f000719e0, clearing MDCACHE_DIR_POPULATED setting MDCACHE_TRUST_CONTENT and MDCACHE_TRUST_DIR_CHUNKS
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_refresh_attrs :RW LOCK :F_DBG :Unlocked 0x7f1f00071c50 (&entry->content_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:925
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_getattrs :RW LOCK :F_DBG :Unlocked 0x7f1f000719e0 (&entry->attr_lock) at /home/rraja/git/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:987
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] mdcache_getattrs :INODE :F_DBG :attrs  obj attributes Request Mask=00100000 Valid Mask=0011ffce Supported Mask=0111dfee DIRECTORY numlinks=0x2 size=0x0 mode=0755 owner=0x0 group=0x0 atime=06/09/2022 14:37:33 mtime=06/09/2022 15:37:46
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] update_stateid :STATE :DEBUG :Update OPEN stateid to OTHER=0x040000004e9e176302000000 {{CLIENTID Epoch=0x63179e4e Counter=0x00000004} StateIdCounter=0x00000002} seqid=1 for response
06/09/2022 15:37:46 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] dec_nfs4_state_ref :STATE :F_DBG :Decrement refcount now=1 {STATE 0x7f1ec4011da0 OTHER=0x040000004e9e176302000000 {{CLIENTID Epoch=0x63179e4e Counter=0x00000004} StateIdCounter=0x00000002} obj=0x7f1ec4011e08 type=SHARE seqid=1 owner={STATE_OPEN_OWNER_NFSV4 0x7f1ec400fba0: clientid={0x7f1f1c002920 ClientID={Epoch=0x63179e4e Counter=0x00000004} CONFIRMED Client={0x7f1f2c003300 name=(23:Linux NFSv4.2 localhost) refcount=1} t_delta=0 reservations=1 refcount=4} owner=(24:0x6f70656e2069643a000000400000000000012de1a94e981b) confirmed=1 seqid=0 refcount=2} refccount=2}

In the libcephfs log, I can see create and getattr requests to the MDS. Following the create request, I can see that parent directory inode's mtime is updated, and Client::clear_dir_complete_and_ordered() is called that clears the libcephfs client's readdir cache.

$ grep -e 'send_request client_request' -e 'clearing (I_COMPLETE|I_DIR_ORDERED)' -e 'add_update_inode' -e 'update_inode_file_time' client.clone00.221904.log | grep "^2022-09-06T15:37:46" 
2022-09-06T15:37:46.717-0400 7f1fb86f7640 10 client.4275 send_request client_request(unknown.0:9 create #0x100000003f1/subvol00-clone01.txt 2022-09-06T15:37:46.718234-0400 caller_uid=0, caller_gid=0{0,}) v6 to mds.0
2022-09-06T15:37:46.738-0400 7f1efcff9640 10 client.4275 update_inode_file_time 0x100000005ed.head(faked_ino=0 nref=1 ll_ref=0 cap_refs={} open={} mode=100644 size=0/0 nlink=1 btime=2022-09-06T15:37:46.718234-0400 mtime=0.000000 ctime=0.000000 caps=- objectset[0x100000005ed ts 0/0 objects 0 dirty_or_tx 0] 0x7f1ed80086a0) - ctime 2022-09-06T15:37:46.718234-0400 mtime 2022-09-06T15:37:46.718234-0400
2022-09-06T15:37:46.738-0400 7f1efcff9640 12 client.4275 add_update_inode adding 0x100000005ed.head(faked_ino=0 nref=1 ll_ref=0 cap_refs={} open={} mode=100644 size=0/0 nlink=1 btime=2022-09-06T15:37:46.718234-0400 mtime=2022-09-06T15:37:46.718234-0400 ctime=2022-09-06T15:37:46.718234-0400 caps=- objectset[0x100000005ed ts 0/0 objects 0 dirty_or_tx 0] 0x7f1ed80086a0) caps pAsxLsXsxFsxcrwb
2022-09-06T15:37:46.738-0400 7f1efcff9640 12 client.4275 add_update_inode had 0x100000003f1.head(faked_ino=0 nref=6 ll_ref=3 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2022-09-06T15:30:46.198892-0400 mtime=2022-09-06T15:27:16.000000-0400 ctime=2022-09-06T15:30:48.939083-0400 caps=pAsLsXs(0=pAsLsXs) COMPLETE has_dir_layout 0x7f1ed80076d0) caps pAsLsXs
2022-09-06T15:37:46.738-0400 7f1efcff9640 10 client.4275 update_inode_file_time 0x100000003f1.head(faked_ino=0 nref=6 ll_ref=3 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2022-09-06T15:30:46.198892-0400 mtime=2022-09-06T15:27:16.000000-0400 ctime=2022-09-06T15:30:48.939083-0400 caps=pAsLsXs(0=pAsLsXs) COMPLETE has_dir_layout 0x7f1ed80076d0) pAsLsXs ctime 2022-09-06T15:37:46.718234-0400 mtime 2022-09-06T15:37:46.718234-0400
2022-09-06T15:37:46.739-0400 7f1fb86f7640 10 client.4275 send_request client_request(unknown.0:10 getattr pAsLsXsFs #0x100000003f1 2022-09-06T15:37:46.740166-0400 caller_uid=0, caller_gid=0{0,}) v6 to mds.0
2022-09-06T15:37:46.740-0400 7f1efcff9640 12 client.4275 add_update_inode had 0x100000003f1.head(faked_ino=0 nref=7 ll_ref=3 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2022-09-06T15:30:46.198892-0400 mtime=2022-09-06T15:37:46.718234-0400 ctime=2022-09-06T15:37:46.718234-0400 caps=pAsLsXs(0=pAsLsXs) COMPLETE has_dir_layout 0x7f1ed80076d0) caps pAsLsXsFs
2022-09-06T15:37:46.740-0400 7f1efcff9640 10 client.4275 update_inode_file_time 0x100000003f1.head(faked_ino=0 nref=7 ll_ref=3 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2022-09-06T15:30:46.198892-0400 mtime=2022-09-06T15:37:46.718234-0400 ctime=2022-09-06T15:37:46.718234-0400 caps=pAsLsXs(0=pAsLsXs) COMPLETE has_dir_layout 0x7f1ed80076d0) pAsLsXs ctime 2022-09-06T15:37:46.718234-0400 mtime 2022-09-06T15:37:46.718234-0400
2022-09-06T15:37:46.740-0400 7f1efcff9640 10 client.4275  clearing (I_COMPLETE|I_DIR_ORDERED) on 0x100000003f1.head(faked_ino=0 nref=7 ll_ref=3 cap_refs={} open={} mode=40755 size=0/0 nlink=1 btime=2022-09-06T15:30:46.198892-0400 mtime=2022-09-06T15:37:46.718234-0400 ctime=2022-09-06T15:37:46.718234-0400 caps=pAsLsXs(0=pAsLsXs) COMPLETE has_dir_layout 0x7f1ed80076d0)

Corresponding to the "ls" command issued after creating the file, I do not see a readdir request from the NFS client in the ganesha logs. I'm not sure why. Since the parent directory's mtime was updated by FSAL's libcephfs, I expected the NFS kclient's readdir cache to be invalidated.

$ grep -e 'ceph_fsal' -e 'subvol00-clone01.txt' -e 'process_one_op :NFS4 :DEBUG' ganesha.log | grep 15:37:53
06/09/2022 15:37:53 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] process_one_op :NFS4 :DEBUG :Request 0: opcode 53 is OP_SEQUENCE
06/09/2022 15:37:53 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] process_one_op :NFS4 :DEBUG :Request 1: opcode 22 is OP_PUTFH
06/09/2022 15:37:53 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_16] process_one_op :NFS4 :DEBUG :Request 2: opcode 9 is OP_GETATTR
06/09/2022 15:37:53 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] process_one_op :NFS4 :DEBUG :Request 0: opcode 53 is OP_SEQUENCE
06/09/2022 15:37:53 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] process_one_op :NFS4 :DEBUG :Request 1: opcode 22 is OP_PUTFH
06/09/2022 15:37:53 : epoch 63179e4e : localhost : nfs-ganesha-221904[svc_17] process_one_op :NFS4 :DEBUG :Request 2: opcode 9 is OP_GETATTR

$  grep 'send_request client_request' client.clone00.221904.log | grep -e 15:37:53
$ # no call from libcephfs to MDS

Edit:
The full ganesha, licephfs client logs, and MDS log are at ceph-post-file: 73057506-56ca-400e-8629-d6a32d1dd97c

#7 Updated by Ramana Raja 3 months ago

Based on inputs from Frank Filz and Jeff Layton on #ganesha IRC and https://github.com/nfs-ganesha/nfs-ganesha/issues/852#issuecomment-1247138244, I performed the following test.

The first time a file is created within a FS subvolume export dir using libcephfs, I see that the mtime and ctime of the subvolume export dir changes, but the stx_version(change_attr) does not.

Create a subvolume

$ ./bin/ceph fs subvolume create a subvol05
$ ./bin/ceph fs subvolume getpath a subvol05
/volumes/_nogroup/subvol05/13a05d52-b6b5-46ad-ae47-7446499335bc

Mount the file system using libcephfs. Fetch the mtime, ctime, and stx_version of the subvolume export directory. Create a file within it. Again fetch the mtime, ctime and stx_version of the subvolume export directory.

>>> import rados
>>> import cephfs
>>> rados = rados.Rados(name="client.admin", clustername="ceph", conffile="/home/rraja/git/ceph/build/ceph.conf", conf={})
>>> rados.connect()
>>> fs = cephfs.LibCephFS(rados_inst=rados)
>>> fs.init()
>>> fs.mount(filesystem_name="a".encode('utf-8'))
>>> print(fs.statx(b'/volumes/_nogroup/subvol05/13a05d52-b6b5-46ad-ae47-7446499335bc', cephfs.CEPH_STATX_ATIME|cephfs.CEPH_STATX_MTIME|cephfs.CEPH_STATX_CTIME|cephfs.CEPH_STATX_VERSION, 0))
{'atime': datetime.datetime(2022, 9, 16, 11, 59, 46), 'mtime': datetime.datetime(2022, 9, 16, 11, 59, 46), 'ctime': datetime.datetime(2022, 9, 16, 11, 59, 46), 'version': 1}

>>> fd=fs.open(b'/volumes/_nogroup/subvol05/13a05d52-b6b5-46ad-ae47-7446499335bc/file00', os.O_WRONLY|os.O_CREAT|os.O_TRUNC, 0o755); fs.close(fd)

>>>print(fs.statx(b'/volumes/_nogroup/subvol05/13a05d52-b6b5-46ad-ae47-7446499335bc', cephfs.CEPH_STATX_ATIME|cephfs.CEPH_STATX_MTIME|cephfs.CEPH_STATX_CTIME|cephfs.CEPH_STATX_VERSION, 0))
{'atime': datetime.datetime(2022, 9, 16, 11, 59, 46), 'mtime': datetime.datetime(2022, 9, 16, 12, 2, 35), 'ctime': datetime.datetime(2022, 9, 16, 12, 2, 35), 'version': 1}

The mtime, and ctime changed, but the stx_version of the directory did not change post creation of the file.

#8 Updated by Ramana Raja 3 months ago

Continued to root cause... As described in https://tracker.ceph.com/issues/57210#note-7, the stx_version (change_attr) of the subvolume export directory (/volumes/<gp>/<subvol>/<UUID>) did not change after the first time a file was created within it. It was 1 before and after creation of the file. Proceeded to create a file within a vanilla CephFS directory. The first time a file was created, the parent directory's change_attr increased from 0 to 1. What made a FS subvolume export dir behave differently from a regular CephFS directory? Figured out that a CephFS subvolume export directory is a CephFS directory with vxattr 'ceph.dir.layout.pool' set on it.

Using libcephfs, created a directory; set 'ceph.dir.layout.pool' vxattr on the directory; created a file within the directory; observed the mtime, ctime, and change_attribute of directory after each step.

>>> import os
>>> import rados; import cephfs; rados = rados.Rados(name="client.admin", clustername="ceph", conffile="/home/rraja/git/ceph/build/ceph.conf", conf={}); rados.connect(); fs = cephfs.LibCephFS(rados_inst=rados); fs.init(); fs.mount(filesystem_name="a".encode('utf-8'))
>>> fs.mkdir('/testdir00', 0o755)
>>> print(fs.statx(b'/testdir00', cephfs.CEPH_STATX_ATIME|cephfs.CEPH_STATX_MTIME|cephfs.CEPH_STATX_CTIME|cephfs.CEPH_STATX_VERSION, 0))
{'atime': datetime.datetime(2022, 9, 19, 14, 17, 41), 'mtime': datetime.datetime(2022, 9, 19, 14, 17, 41), 'ctime': datetime.datetime(2022, 9, 19, 14, 17, 41), 'version': 0}
>>> fs.setxattr('/testdir00', "ceph.dir.layout.pool", b"cephfs.a.data", 0)
>>> print(fs.statx(b'/testdir00', cephfs.CEPH_STATX_ATIME|cephfs.CEPH_STATX_MTIME|cephfs.CEPH_STATX_CTIME|cephfs.CEPH_STATX_VERSION, 0))
{'atime': datetime.datetime(2022, 9, 19, 14, 17, 41), 'mtime': datetime.datetime(2022, 9, 19, 14, 17, 41), 'ctime': datetime.datetime(2022, 9, 19, 14, 18, 4), 'version': 1}
>>> fd=fs.open(b'/testdir00/testfile00', os.O_WRONLY|os.O_CREAT|os.O_TRUNC, 0o755); fs.close(fd)
>>> print(fs.statx(b'/testdir00', cephfs.CEPH_STATX_ATIME|cephfs.CEPH_STATX_MTIME|cephfs.CEPH_STATX_CTIME|cephfs.CEPH_STATX_VERSION, 0))
{'atime': datetime.datetime(2022, 9, 19, 14, 17, 41), 'mtime': datetime.datetime(2022, 9, 19, 14, 18, 33), 'ctime': datetime.datetime(2022, 9, 19, 14, 18, 33), 'version': 1}

After `mkdir`, the stx_version/change_attr of the directory is 0. Setting the "ceph.dir.layout.pool" vxattr, updated the directory's ctime and change_attribute (0 to 1). Checked Server::Server::handle_set_vxattr(), confirmed that it changes ctime and bumps up change_attr, https://github.com/ceph/ceph/blob/v17.2.3/src/mds/Server.cc#L5976 . When a file was created, the mtime and ctime of the directory was updated, but the change_attr remained as 1. I expected change_attr to be bumped to 2. This seems to be the bug.

#9 Updated by Venky Shankar 3 months ago

Ramana Raja wrote:

Continued to root cause... As described in https://tracker.ceph.com/issues/57210#note-7, the stx_version (change_attr) of the subvolume export directory (/volumes/<gp>/<subvol>/<UUID>) did not change after the first time a file was created within it. It was 1 before and after creation of the file. Proceeded to create a file within a vanilla CephFS directory. The first time a file was created, the parent directory's change_attr increased from 0 to 1. What made a FS subvolume export dir behave differently from a regular CephFS directory? Figured out that a CephFS subvolume export directory is a CephFS directory with vxattr 'ceph.dir.layout.pool' set on it.

Using libcephfs, created a directory; set 'ceph.dir.layout.pool' vxattr on the directory; created a file within the directory; observed the mtime, ctime, and change_attribute of directory after each step.
[...]

After `mkdir`, the stx_version/change_attr of the directory is 0. Setting the "ceph.dir.layout.pool" vxattr, updated the directory's ctime and change_attribute (0 to 1). Checked Server::Server::handle_set_vxattr(), confirmed that it changes ctime and bumps up change_attr, https://github.com/ceph/ceph/blob/v17.2.3/src/mds/Server.cc#L5976 . When a file was created, the mtime and ctime of the directory was updated, but the change_attr remained as 1. I expected change_attr to be bumped to 2. This seems to be the bug.

Good analysis, Ramana. That is likely the cause of the issue. Don't we cover such scenarios in our tests?

#10 Updated by Venky Shankar 3 months ago

We started running fs:workload tests w/ cephfs subvolumes (and that has caught a number of bugs). I think we do not run against clones subvolumes. Might be a good idea to add that.

#11 Updated by Ramana Raja 3 months ago

Venky Shankar wrote:

Ramana Raja wrote:

Continued to root cause... As described in https://tracker.ceph.com/issues/57210#note-7, the stx_version (change_attr) of the subvolume export directory (/volumes/<gp>/<subvol>/<UUID>) did not change after the first time a file was created within it. It was 1 before and after creation of the file. Proceeded to create a file within a vanilla CephFS directory. The first time a file was created, the parent directory's change_attr increased from 0 to 1. What made a FS subvolume export dir behave differently from a regular CephFS directory? Figured out that a CephFS subvolume export directory is a CephFS directory with vxattr 'ceph.dir.layout.pool' set on it.

Using libcephfs, created a directory; set 'ceph.dir.layout.pool' vxattr on the directory; created a file within the directory; observed the mtime, ctime, and change_attribute of directory after each step.
[...]

After `mkdir`, the stx_version/change_attr of the directory is 0. Setting the "ceph.dir.layout.pool" vxattr, updated the directory's ctime and change_attribute (0 to 1). Checked Server::Server::handle_set_vxattr(), confirmed that it changes ctime and bumps up change_attr, https://github.com/ceph/ceph/blob/v17.2.3/src/mds/Server.cc#L5976 . When a file was created, the mtime and ctime of the directory was updated, but the change_attr remained as 1. I expected change_attr to be bumped to 2. This seems to be the bug.

Good analysis, Ramana. That is likely the cause of the issue. Don't we cover such scenarios in our tests?

This is the PR of interest, https://github.com/ceph/ceph/pull/10922/
especially commits
https://github.com/ceph/ceph/pull/10922/commits/b77605f9d0b4ab7de92523cefeefae91c971baea and
https://github.com/ceph/ceph/pull/10922/commits/0d441dcd6af553d11d6be6df56d577c5659904a0

We've 2 tests in src/test/libcephfs/test.cc, "ChangeAttr" and "DirChangeAttr"

TEST DirChangeAttr missed catching this. It does mkdir + create file within directory. mkdir + setxattr on directory + create file within directory, would have caught this.
https://github.com/ceph/ceph/blob/v17.2.3/src/test/libcephfs/test.cc#L1873

I'll make changes to DirChangeAttr test: add other directory mutating operations that should update change_attr of the directory.

Background reading:
- https://jtlayton.wordpress.com/2016/09/01/cephfs-and-the-nfsv4-change-attribute/
- Comments in PR https://github.com/ceph/ceph/pull/10922/ and its previous incarnation https://github.com/ceph/ceph/pull/10842/
- https://github.com/ceph/ceph/pull/10842/commits/a2e302372b12d0eb64be5c5fe142ed6cdba79f31#r76188222

#12 Updated by Ramana Raja 3 months ago

Identified that the issue was in MDCache::predirty_journal_parents()

Making the following change fixed the issue. :-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 368f1eb57a7..8becdf40bfc 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2288,7 +2288,7 @@ void MDCache::predirty_journal_parents(MutationRef mut, EMetaBlob *blob,
       if (touched_mtime)
        pi.inode->mtime = pi.inode->ctime = pi.inode->dirstat.mtime;
       if (touched_chattr)
-       pi.inode->change_attr = pi.inode->dirstat.change_attr;
+       pi.inode->change_attr++;
       dout(20) << "predirty_journal_parents     gives " << pi.inode->dirstat << " on " << *pin << dendl;

#13 Updated by Ramana Raja 2 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 48241

#14 Updated by Venky Shankar about 2 months ago

  • Status changed from Fix Under Review to Pending Backport

#15 Updated by Backport Bot about 2 months ago

  • Copied to Backport #57879: quincy: NFS client unable to see newly created files when listing directory contents in a FS subvolume clone added

#16 Updated by Backport Bot about 2 months ago

  • Copied to Backport #57880: pacific: NFS client unable to see newly created files when listing directory contents in a FS subvolume clone added

#17 Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed

Also available in: Atom PDF