Ceph CephFS - Bug #16668: client: nlink count is not maintained correctly (https://tracker.ceph.com/issues/16668)

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74318 - 2016-07-12T21:33:21Z - Jeff Layton <jlayton@redhat.com>
I suspect the kclient has a similar problem. I'll test it out when I get a chance. I do agree that we probably ought to just accept nlink changes from the MDS regardless of caps, and maybe stop the client from relying on CEPH_CAP_LINK_EXCL at all? That seems pretty useless, since nlink is only going to change with namespace operations.
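As a minimal, self-contained sketch of that idea (illustrative only; CachedInode and the function name are stand-ins, not the real Client/Inode code):

<pre>
#include <cstdint>

// Stand-in for the client's cached inode; not the real struct Inode.
struct CachedInode {
  uint32_t nlink = 1;
};

// The idea: take the MDS-supplied nlink unconditionally, instead of gating the
// update on LINK cap state. nlink only changes through namespace operations,
// which the MDS serializes, so an nlink carried in an MDS reply or cap message
// is always at least as fresh as whatever the client has cached.
void apply_nlink_from_mds(CachedInode &in, uint32_t nlink_from_mds)
{
  in.nlink = nlink_from_mds;
}
</pre>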
CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74345 - 2016-07-12T23:56:33Z - Greg Farnum <gfarnum@redhat.com>

Category changed from 46 to Correctness/Safety
Component(FS): Client, Common/Protocol, MDS added

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74706 - 2016-07-13T12:44:56Z - Zheng Yan <ukernel@gmail.com>
The MDS revokes CEPH_CAP_LINK_EXCL when unlinking files. It's odd, but I can't see how it causes a problem.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74709 - 2016-07-13T13:40:24Z - Jeff Layton <jlayton@redhat.com>
Assignee set to Jeff Layton

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74726 - 2016-07-13T18:21:00Z - Jeff Layton <jlayton@redhat.com>
I rolled up a testcase for this:
<pre>
TEST(LibCephFS, Nlink) {
  struct ceph_mount_info *cmount;
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL));
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);

  char filename[32], linkname[32];
  sprintf(filename, "/nlinkorig%x", getpid());
  sprintf(linkname, "/nlinklink%x", getpid());

  ceph_unlink(cmount, filename);

  int fd = ceph_open(cmount, filename, O_RDWR|O_CREAT|O_EXCL, 0666);
  ASSERT_LT(0, fd);
  // ASSERT_EQ(ceph_close(cmount, fd), 0);

  struct stat st;
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ASSERT_EQ(ceph_link(cmount, filename, linkname), 0);
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 2);

  ASSERT_EQ(ceph_unlink(cmount, linkname), 0);
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ceph_shutdown(cmount);
}
</pre>
...and it passes on my test rig. Frank, am I doing something wrong here? Are you doing something different to reproduce this?

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74743 - 2016-07-13T22:49:24Z - Greg Farnum <gfarnum@redhat.com>
Status changed from New to In Progress

Noted on IRC that cap handling involving the root directory (so, anything in root, and frequently things in its immediate children) is often wonky. Hopefully putting the test files in subdirs a level or two deep will reproduce it.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74791 - 2016-07-14T13:25:05Z - Jeff Layton <jlayton@redhat.com>
It also occurred to me yesterday that I was using the path-based calls, whereas ganesha would likely be using the ll calls. I changed the testcase to look like this:
<pre>
TEST(LibCephFS, Nlink) {
  struct ceph_mount_info *cmount;
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL));
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);

  Inode *root, *dir, *file;
  ASSERT_EQ(ceph_ll_lookup_root(cmount, &root), 0);

  char dirname[32], filename[32], linkname[32];
  sprintf(dirname, "nlinkdir%x", getpid());
  sprintf(filename, "nlinkorig%x", getpid());
  sprintf(linkname, "nlinklink%x", getpid());

  struct stat st;
  Fh *fh;

  ASSERT_EQ(ceph_ll_mkdir(cmount, root, dirname, 0755, &st, &dir, getuid(), getgid()), 0);
  ASSERT_EQ(ceph_ll_create(cmount, dir, filename, 0666, O_RDWR|O_CREAT|O_EXCL,
                           &st, &file, &fh, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ASSERT_EQ(ceph_ll_link(cmount, file, dir, linkname, &st, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 2);

  ASSERT_EQ(ceph_ll_unlink(cmount, dir, linkname, getuid(), getgid()), 0);
  ASSERT_EQ(ceph_ll_getattr(cmount, file, &st, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ceph_shutdown(cmount);
}
</pre>
...and the test still passes. At this point, I think we need the folks working with ganesha to give us a better clue about what their test actually does.
EDIT: removed the bogus ceph_unlink call that's no longer necessary in this test. It still passes.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74862 - 2016-07-15T19:18:29Z - Jeff Layton <jlayton@redhat.com>
I set up a ganesha + ceph test rig today and was able to reproduce the problem. Interestingly, it does not reproduce with NFSv4.1, but does with v3. Here's a stat of the file before and after removing the link:
<pre>
[jlayton@pnfsclnt ganesha]$ touch testfile; ln testfile testlink; stat testfile; rm testlink; stat testfile
File: ‘testfile’
Size: 0 Blocks: 0 IO Block: 1048576 regular empty file
Device: 2ah/42d Inode: 1099511627782 Links: 2
Access: (0664/-rw-rw-r--) Uid: ( 4447/ jlayton) Gid: ( 4447/ jlayton)
Context: system_u:object_r:nfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.613251073 -0400
Birth: -
File: ‘testfile’
Size: 0 Blocks: 0 IO Block: 1048576 regular empty file
Device: 2ah/42d Inode: 1099511627782 Links: 2
Access: (0664/-rw-rw-r--) Uid: ( 4447/ jlayton) Gid: ( 4447/ jlayton)
Context: system_u:object_r:nfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.613251073 -0400
Birth: -
</pre>
Note too that the ctime didn't change there. When I stat from a completely different client, though:
<pre>
$ stat /mnt/cephfs/testfile
File: '/mnt/cephfs/testfile'
Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: 0h/0d Inode: 1099511627782 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 4447/ jlayton) Gid: ( 4447/ jlayton)
Context: system_u:object_r:cephfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.649113873 -0400
Birth: -
</pre>
...now to determine whether this is libcephfs not updating the attributes, or something else.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74994 - 2016-07-19T15:41:19Z - Jeff Layton <jlayton@redhat.com>
Tracked down the problem with the ctime, and it appears to be a fairly simple bug in fill_stat(): when deciding whether the ctime or the mtime is newer, it was only comparing the seconds fields, but it needs to compare the nanoseconds as well. I have a (one line) patch that should fix that bug.
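A minimal sketch of that kind of one-line fix (illustrative only, not the actual Client::fill_stat() code; the helper names and the timespec representation are assumptions):

<pre>
#include <ctime>

// Returns true if a is strictly newer than b. The buggy comparison looked only
// at tv_sec, so a ctime that differs from the mtime only in the sub-second
// field (as in the stat output above) was never treated as the newer of the two.
static bool ts_newer(const struct timespec &a, const struct timespec &b)
{
  if (a.tv_sec != b.tv_sec)
    return a.tv_sec > b.tv_sec;
  return a.tv_nsec > b.tv_nsec;   // the fix: fall back to the nanoseconds field
}

// fill_stat()-style usage: report the later of ctime and mtime as the ctime.
static struct timespec pick_ctime(const struct timespec &ctim, const struct timespec &mtim)
{
  return ts_newer(ctim, mtim) ? ctim : mtim;
}
</pre>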
For a while yesterday I was unable to reproduce the nlink issue, but it's back today. Basically, what happens is that after the unlink, ganesha purges its metadata cache and then reissues an ll_lookup call for the main file when we go to stat it afterward. When the problem reproduces, libcephfs seems to be satisfying that lookup call out of its cache instead of querying the server for it. Going over that logic now.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74995 - 2016-07-19T16:13:38Z - Jeff Layton <jlayton@redhat.com>
Successful test -- the lookup after the unlink calls into _do_lookup:
<pre>
2016-07-19 11:25:15.718608 7f6535be3700 3 client.4117 ll_lookup 0x243a580 testfile
2016-07-19 11:25:15.718616 7f6535be3700 20 client.4117 _lookup have dn testfile mds.-1 ttl 0.000000 seq 0
2016-07-19 11:25:15.718623 7f6535be3700 15 inode.get on 0x243a580 10000000000.head now 3
2016-07-19 11:25:15.718626 7f6535be3700 10 client.4117 _do_lookup on #10000000000/testfile
</pre>
Unsuccessful test -- no call into _do_lookup:
<pre>
2016-07-19 11:25:22.897200 7f6535be3700 3 client.4117 ll_getattr 10000000000.head = 0
2016-07-19 11:25:22.897218 7f6535be3700 3 client.4117 ll_lookup 0x243a580 testfile
2016-07-19 11:25:22.897223 7f6535be3700 20 client.4117 _lookup have dn testfile mds.0 ttl 2016-07-19 11:25:52.792341 seq 6
2016-07-19 11:25:22.897234 7f6535be3700 15 inode.get on 0x243b080 100000007d3.head now 3
2016-07-19 11:25:22.897236 7f6535be3700 10 client.4117 _lookup 10000000000.head(faked_ino=0 ref=2 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 mtime=2016-07-19 11:25:22.869444 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x2442000 0x243a580) testfile = 100000007d3.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={3=0} mode=100664 size=0/4194304 mtime=2016-07-19 11:25:22.819661 caps=pAsXsFs(0=pAsXsFs) objectset[100000007d3 ts 0/0 objects 0 dirty_or_tx 0] parents=0x2442090 0x243b080)
</pre>
The successful test has "mds.-1", which is dn->lease_mds, whereas it's "mds.0" in the unsuccessful one. I'm guessing that means the difference is whether we hold a dentry lease.
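For readers following along, here is a rough, self-contained sketch of the decision those log lines reflect; it is not the actual Client::_lookup() code, and the type and field names are stand-ins:

<pre>
#include <ctime>

// Stand-in for the client's dentry lease state; not the real Dentry struct.
struct CachedDentry {
  int    lease_mds = -1;   // -1 means no MDS granted us a lease ("mds.-1" above)
  time_t lease_ttl = 0;    // lease expiry, seconds since the epoch
};

// Roughly the decision visible in the two logs: with a valid dentry lease the
// lookup (and the struct stat it fills) is answered from the cache; without
// one, the client falls through to _do_lookup() and asks the MDS. (The real
// client can also trust the cache when it holds suitable shared caps on the
// directory; that path is omitted here.)
bool can_answer_lookup_from_cache(const CachedDentry &dn, time_t now)
{
  return dn.lease_mds >= 0 && now < dn.lease_ttl;
}
</pre>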
CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=74997 - 2016-07-19T16:46:49Z - Jeff Layton <jlayton@redhat.com>

Ok, I think I sort of get it now. Here's my reproducer:
<pre>
#!/bin/bash
MOUNTPOINT=/mnt/ganesha
rm -f $MOUNTPOINT/testfile
touch $MOUNTPOINT/testfile
stat $MOUNTPOINT/testfile
ln $MOUNTPOINT/testfile $MOUNTPOINT/testlink
stat $MOUNTPOINT/testfile
rm $MOUNTPOINT/testlink
stat $MOUNTPOINT/testfile
</pre>
Ganesha does the unlink and purges its metadata cache, so when the GETATTR (i.e. the stat call) comes in, it redoes the lookup.
The unlink reply has no attributes for the inode (which makes sense in most cases -- by the time you get there, it's gone).
In ceph-land, when we initially create the file, we're given a lease on "testfile" (which allows us to satisfy lookups involving it from the cache). That lease is not revoked when either the link or the unlink occurs on testlink, so the subsequent ceph_ll_lookup is satisfied out of the cache.

But ceph_ll_lookup also fills out the struct stat supplied in the call, and that too uses the cached attributes.
I'm a little unclear on the right way to fix this. Should the MDS be revoking the lease on "testfile" when "testlink" is unlinked? That seems a little suspect, since we didn't do anything with that dentry. What we really want is to ensure that the inode attrs are refetched before we fill_stat.
In NFS-land, we'd flag the inode as having possibly bogus attributes, so that when the next ->getattr request comes in we'd re-fetch the attrs from the server. Is there some equivalent here? After the unlink, we could mark the attrs as being potentially bogus.
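A tiny sketch of that idea (illustrative only; the flag and function names are hypothetical and not existing libcephfs or Client APIs):

<pre>
// Hypothetical "attrs may be bogus" flag, in the spirit of the NFS client's
// attribute-cache invalidation.
struct CachedAttrs {
  unsigned nlink = 1;
  bool     maybe_stale = false;
};

// An unlink that touches the inode marks its cached attributes suspect...
void on_unlink_reply(CachedAttrs &attrs)
{
  attrs.maybe_stale = true;      // stop trusting nlink/ctime until refreshed
}

// ...and the next getattr-like operation refetches from the MDS before
// filling the caller's struct stat.
void on_getattr(CachedAttrs &attrs)
{
  if (attrs.maybe_stale) {
    // refetch_attrs_from_mds(attrs);   // hypothetical refresh call
    attrs.maybe_stale = false;
  }
  // fill the stat from attrs here
}
</pre>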
Alternately, we could ensure that the MDS sends the inode info in the trace when the link count hasn't gone to zero. That would also likely fix this.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75000 - 2016-07-19T17:01:03Z - Jeff Layton <jlayton@redhat.com>
Actually, we could probably just always return the updated inode attrs on unlink. There's always the possibility that something is holding the inode open. It would be a bit of a waste in the common case where the link count goes to zero and nothing has it open, but it wouldn't be that big a deal, and it would almost certainly take care of this problem on the client.

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75004 - 2016-07-19T20:05:08Z - Greg Farnum <gfarnum@redhat.com>
<ul></ul><p>I think the actual bug here is that, as you note, ll_lookup calls fill_stat without checking that it has As (and whatever else is needed, like Ls may be needed and probably isn't checked in most places).</p>
Having the MDS return a trace as well (at least when nlink > 0 afterward) would be a good performance optimization.
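To illustrate the cap check being described, here is a minimal sketch; it is not the actual ll_lookup/fill_stat code, the cap-bit values are placeholders (the real CEPH_CAP_AUTH_SHARED / CEPH_CAP_LINK_SHARED constants live in ceph_fs.h), and the surrounding type is a stand-in:

<pre>
#include <cstdint>

// Placeholder bits standing in for CEPH_CAP_AUTH_SHARED / CEPH_CAP_LINK_SHARED
// (the "As" / "Ls" shorthand above).
constexpr uint32_t CAP_AUTH_SHARED = 1u << 0;
constexpr uint32_t CAP_LINK_SHARED = 1u << 1;

struct CachedInode {           // stand-in for the client's Inode
  uint32_t issued_caps = 0;    // caps currently issued to this client
  unsigned nlink = 1;
};

// The point being made: before filling a struct stat from the cached inode,
// check that the caps covering those fields (As for owner/mode, Ls for the
// link count) are actually issued; if not, go back to the MDS with a getattr
// instead of trusting the cache.
bool may_fill_stat_from_cache(const CachedInode &in)
{
  const uint32_t needed = CAP_AUTH_SHARED | CAP_LINK_SHARED;
  return (in.issued_caps & needed) == needed;
}
</pre>

Per the later comments, the eventual fix takes roughly this shape on the client side: make sure the necessary caps are held (going back to the MDS if not) before fill_stat is called.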
CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75119 - 2016-07-20T14:00:44Z - Jeff Layton <jlayton@redhat.com>

Ok, I have a couple of small patches that fix the testcase. One is a client-side patch to fix the ctime handling in fill_stat, and the other is a patch to make the MDS send is_target traces in UNLINK replies.
The real fix, though, is to ensure that when we do a fill_stat we actually hold the necessary caps. That part is a bit more tricky to get right, and requires reworking some of the client code (and me getting a better understanding of the general caps semantics as well).

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75148 - 2016-07-21T14:42:24Z - Jeff Layton <jlayton@redhat.com>
Pull request with the fix is up here:
https://github.com/ceph/ceph/pull/10386

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75766 - 2016-08-02T15:33:22Z - Jeff Layton <jlayton@redhat.com>
Status changed from In Progress to Resolved

Ok, the PR is now merged!

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75838 - 2016-08-03T11:08:29Z - John Spray <jcspray@gmail.com>
Status changed from Resolved to Pending Backport
Backport set to jewel

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=75973 - 2016-08-05T21:04:23Z - Nathan Cutler <ncutler@suse.cz>
Copied to Backport #16946 (jewel: client: nlink count is not maintained correctly, https://tracker.ceph.com/issues/16946) added

CephFS - Bug #16668: client: nlink count is not maintained correctly - https://tracker.ceph.com/issues/16668?journal_id=79904 - 2016-10-17T15:14:46Z - Loïc Dachary <loic@dachary.org>
Status changed from Pending Backport to Resolved