Bug #16668
client: nlink count is not maintained correctly
Status: Closed
Description
Frank reported in #ceph-devel that we don't seem to update nlink correctly from the Client. Looking through the source, we don't appear to ever change nlink except in response to an update from the MDS — which we only accept if we don't have CEPH_CAP_LINK_EXCL.
But we often do have CEPH_CAP_LINK_EXCL on files we've created, and if we unlink them we expect to see that reflected in the link count. We could try and maintain nlink ourselves, but that's somewhat odd since we have to route unlinks through the MDS anyway. That makes me think we should just let the MDS maintain nlink, and accept what it tells us.
...but that seems really odd since we have a cap for covering link count. Maybe we really only care about it for the associated inode locks on the MDS?
Updated by Jeff Layton almost 8 years ago
I suspect the kclient has a similar problem. I'll test it out when I get a chance. I do agree that we probably ought to just accept nlink changes from the MDS regardless of cap, and maybe just stop the client from relying on CEPH_CAP_LINK_EXCL at all? Seems pretty useless since the nlink is only going to change with namespace operations.
Updated by Greg Farnum almost 8 years ago
- Category changed from 46 to Correctness/Safety
- Component(FS) Client, Common/Protocol, MDS added
Updated by Zheng Yan almost 8 years ago
The MDS revokes CEPH_CAP_LINK_EXCL when unlinking files. It's odd, but I can't see how it causes a problem.
Updated by Jeff Layton almost 8 years ago
I rolled up a testcase for this:
TEST(LibCephFS, Nlink) {
  struct ceph_mount_info *cmount;
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL));
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);

  char filename[32], linkname[32];
  sprintf(filename, "/nlinkorig%x", getpid());
  sprintf(linkname, "/nlinklink%x", getpid());

  ceph_unlink(cmount, filename);
  int fd = ceph_open(cmount, filename, O_RDWR|O_CREAT|O_EXCL, 0666);
  ASSERT_LT(0, fd);
  // ASSERT_EQ(ceph_close(cmount, fd), 0);

  struct stat st;
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ASSERT_EQ(ceph_link(cmount, filename, linkname), 0);
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 2);

  ASSERT_EQ(ceph_unlink(cmount, linkname), 0);
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ceph_shutdown(cmount);
}
...and it passes on my test rig. Frank, am I doing something wrong here? Are you doing something different to reproduce this?
Updated by Greg Farnum almost 8 years ago
- Status changed from New to In Progress
Noted on irc that cap handling that involves the root directory (so, anything in root and frequently things in its immediate children) is often wonky. Hopefully putting them in subdirs a level or two deep will reproduce.
Updated by Jeff Layton almost 8 years ago
It also occurred to me yesterday that I was using the path-based calls, whereas ganesha would likely be using the ll calls. I changed the testcase to look like this:
TEST(LibCephFS, Nlink) {
  struct ceph_mount_info *cmount;
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL));
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);

  Inode *root, *dir, *file;
  ASSERT_EQ(ceph_ll_lookup_root(cmount, &root), 0);

  char dirname[32], filename[32], linkname[32];
  sprintf(dirname, "nlinkdir%x", getpid());
  sprintf(filename, "nlinkorig%x", getpid());
  sprintf(linkname, "nlinklink%x", getpid());

  struct stat st;
  Fh *fh;
  ASSERT_EQ(ceph_ll_mkdir(cmount, root, dirname, 0755, &st, &dir, getuid(), getgid()), 0);
  ASSERT_EQ(ceph_ll_create(cmount, dir, filename, 0666, O_RDWR|O_CREAT|O_EXCL,
                           &st, &file, &fh, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ASSERT_EQ(ceph_ll_link(cmount, file, dir, linkname, &st, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 2);

  ASSERT_EQ(ceph_ll_unlink(cmount, dir, linkname, getuid(), getgid()), 0);
  ASSERT_EQ(ceph_ll_getattr(cmount, file, &st, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ceph_shutdown(cmount);
}
...and the test still passes. At this point, I think we need the folks working with ganesha to give us a better clue about what their test actually does.
EDIT: remove bogus ceph_unlink call that's no longer necessary in this test. Still passes.
Updated by Jeff Layton almost 8 years ago
I set up a ganesha + ceph test rig today and was able to reproduce the problem. Interestingly, it does not reproduce with nfsv4.1, but does with v3. Here's a stat before and after:
[jlayton@pnfsclnt ganesha]$ touch testfile; ln testfile testlink; stat testfile; rm testlink; stat testfile
  File: ‘testfile’
  Size: 0               Blocks: 0          IO Block: 1048576 regular empty file
Device: 2ah/42d         Inode: 1099511627782  Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:nfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.613251073 -0400
 Birth: -
  File: ‘testfile’
  Size: 0               Blocks: 0          IO Block: 1048576 regular empty file
Device: 2ah/42d         Inode: 1099511627782  Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:nfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.613251073 -0400
 Birth: -
Note too that the ctime didn't change there. When I stat from a completely different client though:
$ stat /mnt/cephfs/testfile
  File: '/mnt/cephfs/testfile'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 0h/0d           Inode: 1099511627782  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:cephfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.649113873 -0400
 Birth: -
...now to determine whether this is libcephfs not updating the attributes or something else.
Updated by Jeff Layton almost 8 years ago
Tracked down the problem with the ctime and it appears to be a fairly simple bug in fill_stat(). It was only looking at the seconds field when comparing the ctime and mtime, but it needs to compare both. I have a (one line) patch that should fix that bug.
For a while yesterday I was unable to reproduce the nlink issue, but it's back today. Basically, what happens is that after the unlink, ganesha purges its metadata cache, and then will reissue a ll_lookup call for the main file when we go to stat it afterward. When the problem reproduces, libcephfs seems to be satisfying the lookup call out of its cache, instead of querying the server for it. Going over that logic now.
Updated by Jeff Layton almost 8 years ago
Successful test -- the lookup after the unlink calls into _do_lookup:
2016-07-19 11:25:15.718608 7f6535be3700  3 client.4117 ll_lookup 0x243a580 testfile
2016-07-19 11:25:15.718616 7f6535be3700 20 client.4117 _lookup have dn testfile mds.-1 ttl 0.000000 seq 0
2016-07-19 11:25:15.718623 7f6535be3700 15 inode.get on 0x243a580 10000000000.head now 3
2016-07-19 11:25:15.718626 7f6535be3700 10 client.4117 _do_lookup on #10000000000/testfile
Unsuccessful test -- no call into _do_lookup:
2016-07-19 11:25:22.897200 7f6535be3700  3 client.4117 ll_getattr 10000000000.head = 0
2016-07-19 11:25:22.897218 7f6535be3700  3 client.4117 ll_lookup 0x243a580 testfile
2016-07-19 11:25:22.897223 7f6535be3700 20 client.4117 _lookup have dn testfile mds.0 ttl 2016-07-19 11:25:52.792341 seq 6
2016-07-19 11:25:22.897234 7f6535be3700 15 inode.get on 0x243b080 100000007d3.head now 3
2016-07-19 11:25:22.897236 7f6535be3700 10 client.4117 _lookup 10000000000.head(faked_ino=0 ref=2 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 mtime=2016-07-19 11:25:22.869444 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x2442000 0x243a580) testfile = 100000007d3.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={3=0} mode=100664 size=0/4194304 mtime=2016-07-19 11:25:22.819661 caps=pAsXsFs(0=pAsXsFs) objectset[100000007d3 ts 0/0 objects 0 dirty_or_tx 0] parents=0x2442090 0x243b080)
The successful test has "mds.-1", which is dn->lease_mds, whereas it's "mds.0" on the unsuccessful one. I'm guessing that means that the difference is in whether we have a dentry lease.
Updated by Jeff Layton almost 8 years ago
Ok, I think I sort of get it now. Here's my reproducer:
#!/bin/bash
MOUNTPOINT=/mnt/ganesha
rm -f $MOUNTPOINT/testfile
touch $MOUNTPOINT/testfile
stat $MOUNTPOINT/testfile
ln $MOUNTPOINT/testfile $MOUNTPOINT/testlink
stat $MOUNTPOINT/testfile
rm $MOUNTPOINT/testlink
stat $MOUNTPOINT/testfile
Ganesha does the unlink and purges its metadata cache, so when the GETATTR (aka the stat call) comes in, it redoes the lookup.
The unlink reply has no attributes for the inode (which makes sense in most cases -- by the time you get there, it's gone).
In ceph-land, when we initially create the file, we're given a lease on "testfile" (which allows us to satisfy lookups involving it from the cache). That lease is not revoked when either the link or the unlink occurs on testlink, so the subsequent ceph_ll_lookup is satisfied out of the cache.
But, ceph_ll_lookup also fills out the struct stat that is supplied in the call, and that also uses the cached attributes.
I'm a little unclear on the right way to fix this. Should the MDS be revoking the lease on "testfile" when "testlink" is unlinked? That seems a little suspect since we didn't do anything with that dentry. What we really want is to ensure that the inode attrs are refetched before we fill_stat.
In NFS-land, we'd flag the inode as having possibly bogus attributes so that when the next ->getattr request comes in, we'd re-fetch the attrs from the server. Is there some equivalent here? After the unlink, we could mark the attrs as being potentially bogus.
Alternately, we could ensure that the MDS sends the inode info in the trace when the link count hasn't gone to zero. That would also likely fix this.
Updated by Jeff Layton almost 8 years ago
Actually we could probably just always return the updated inode attrs on unlink. There's always the possibility that we have something holding it open. It would be a bit of a waste in the common case where the link count goes to zero and nothing is holding it open, but it wouldn't be that big a deal, and that would almost certainly take care of this problem on the client.
Updated by Greg Farnum almost 8 years ago
I think the actual bug here is that, as you note, ll_lookup calls fill_stat without checking that it has As (and whatever else is needed; Ls is probably also required, and probably isn't checked in most places).
Having the MDS return a trace as well (at least, if nlink > 0 now) would be a good performance optimization.
Updated by Jeff Layton almost 8 years ago
Ok, I have a couple of small patches that fix the testcase. One is a client-side patch to fix the ctime handling in fill_stat, and the other is a patch to make the MDS send is_target traces in UNLINK replies.
The real fix though is to ensure that when we're doing a fill_stat, we actually hold the necessary caps. This part is a bit more tricky to get right, and requires reworking some of the client code (and me getting a better understanding of the general caps semantics as well).
Updated by Jeff Layton almost 8 years ago
Pull request with the fix is up here:
https://github.com/ceph/ceph/pull/10386
Updated by Jeff Layton over 7 years ago
- Status changed from In Progress to Resolved
Ok, PR is now merged!
Updated by John Spray over 7 years ago
- Status changed from Resolved to Pending Backport
- Backport set to jewel
Updated by Nathan Cutler over 7 years ago
- Copied to Backport #16946: jewel: client: nlink count is not maintained correctly added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved