Bug #16668

client: nlink count is not maintained correctly

Added by Greg Farnum about 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
07/12/2016
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Component(FS):
Client, Common/Protocol, MDS
Needs Doc:

Description

Frank reported in #ceph-devel that we don't seem to update nlink correctly from the Client. Looking through the source, we don't appear to ever change nlink except in response to an update from the MDS — which we only accept if we don't have CEPH_CAP_LINK_EXCL.
But we often do have CEPH_CAP_LINK_EXCL on files we've created, and if we unlink them we expect to see that reflected in the link count. We could try and maintain nlink ourselves, but that's somewhat odd since we have to route unlinks through the MDS anyway. That makes me think we should just let the MDS maintain nlink, and accept what it tells us.

...but that seems really odd since we have a cap for covering link count. Maybe we really only care about it for the associated inode locks on the MDS?
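The cap-gated update described above can be sketched in miniature. This is a hypothetical illustration, not the real Client code: the names, struct layout, and the cap bit value are all made up (the actual CEPH_CAP_LINK_EXCL encoding differs).

```cpp
#include <cassert>

// Illustrative stand-in for CEPH_CAP_LINK_EXCL; not the real bit value.
constexpr unsigned LINK_EXCL = 0x10;

struct InodeState {
  unsigned issued_caps; // caps this client currently holds
  int nlink;            // cached link count
};

// Sketch of the gating: an MDS-supplied nlink is only applied when the
// client does NOT hold the exclusive link cap -- which is exactly why a
// local unlink never shows up in the cached count while the cap is held.
void maybe_update_nlink(InodeState &in, int mds_nlink) {
  if (!(in.issued_caps & LINK_EXCL))
    in.nlink = mds_nlink;
}
```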


Related issues

Copied to fs - Backport #16946: jewel: client: nlink count is not maintained correctly Resolved

History

#1 Updated by Jeff Layton about 1 year ago

I suspect the kclient has a similar problem. I'll test it out when I get a chance. I do agree that we probably ought to just accept nlink changes from the MDS regardless of cap, and maybe just stop the client from relying on CEPH_CAP_LINK_EXCL at all? Seems pretty useless since the nlink is only going to change with namespace operations.

#2 Updated by Greg Farnum about 1 year ago

  • Category changed from 46 to Correctness/Safety
  • Component(FS) Client, Common/Protocol, MDS added

#3 Updated by Zheng Yan about 1 year ago

MDS revokes CEPH_CAP_LINK_EXCL when unlinking files. It's odd, but I can't see how it causes a problem.

#4 Updated by Jeff Layton about 1 year ago

  • Assignee set to Jeff Layton

#5 Updated by Jeff Layton about 1 year ago

I rolled up a testcase for this:

TEST(LibCephFS, Nlink) {
  struct ceph_mount_info *cmount;
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL));
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);

  char filename[32], linkname[32];
  sprintf(filename, "/nlinkorig%x", getpid());
  sprintf(linkname, "/nlinklink%x", getpid());

  ceph_unlink(cmount, filename);
  int fd = ceph_open(cmount, filename, O_RDWR|O_CREAT|O_EXCL, 0666);
  ASSERT_LT(0, fd);
  // ASSERT_EQ(ceph_close(cmount, fd), 0);

  struct stat   st;
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ASSERT_EQ(ceph_link(cmount, filename, linkname), 0);
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 2);

  ASSERT_EQ(ceph_unlink(cmount, linkname), 0);
  ASSERT_EQ(ceph_stat(cmount, filename, &st), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ceph_shutdown(cmount);
}

...and it passes on my test rig. Frank, am I doing something wrong here? Are you doing something different to reproduce this?

#6 Updated by Greg Farnum about 1 year ago

  • Status changed from New to In Progress

Noted on irc that cap handling that involves the root directory (so, anything in root and frequently things in its immediate children) is often wonky. Hopefully putting them in subdirs a level or two deep will reproduce.

#7 Updated by Jeff Layton about 1 year ago

It also occurred to me yesterday that I was using the path-based calls, whereas ganesha would likely be using the ll calls. I changed the testcase to look like this:

TEST(LibCephFS, Nlink) {
  struct ceph_mount_info *cmount;
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL));
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);

  Inode *root, *dir, *file;

  ASSERT_EQ(ceph_ll_lookup_root(cmount, &root), 0);

  char dirname[32], filename[32], linkname[32];
  sprintf(dirname, "nlinkdir%x", getpid());
  sprintf(filename, "nlinkorig%x", getpid());
  sprintf(linkname, "nlinklink%x", getpid());

  struct stat    st;
  Fh *fh;

  ASSERT_EQ(ceph_ll_mkdir(cmount, root, dirname, 0755, &st, &dir, getuid(), getgid()), 0);
  ASSERT_EQ(ceph_ll_create(cmount, dir, filename, 0666, O_RDWR|O_CREAT|O_EXCL,
               &st, &file, &fh, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ASSERT_EQ(ceph_ll_link(cmount, file, dir, linkname, &st, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 2);

  ASSERT_EQ(ceph_ll_unlink(cmount, dir, linkname, getuid(), getgid()), 0);
  ASSERT_EQ(ceph_ll_getattr(cmount, file, &st, getuid(), getgid()), 0);
  ASSERT_EQ(st.st_nlink, 1);

  ceph_shutdown(cmount);
}

...and the test still passes. At this point, I think we need the folks working with ganesha to give us a better clue about what their test actually does.

EDIT: remove bogus ceph_unlink call that's no longer necessary in this test. Still passes.

#8 Updated by Jeff Layton about 1 year ago

I set up a ganesha + ceph test rig today and was able to reproduce the problem. Interestingly, it does not reproduce with nfsv4.1, but does with v3. Here's a stat before and after:

[jlayton@pnfsclnt ganesha]$ touch testfile; ln testfile testlink; stat testfile; rm testlink; stat testfile
  File: ‘testfile’
  Size: 0             Blocks: 0          IO Block: 1048576 regular empty file
Device: 2ah/42d    Inode: 1099511627782  Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:nfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.613251073 -0400
 Birth: -
  File: ‘testfile’
  Size: 0             Blocks: 0          IO Block: 1048576 regular empty file
Device: 2ah/42d    Inode: 1099511627782  Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:nfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.613251073 -0400
 Birth: -

Note too that the ctime didn't change there. When I stat from a completely different client though:

$ stat /mnt/cephfs/testfile
  File: '/mnt/cephfs/testfile'
  Size: 0             Blocks: 0          IO Block: 4194304 regular empty file
Device: 0h/0d    Inode: 1099511627782  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:cephfs_t:s0
Access: 2016-07-15 15:10:48.613250953 -0400
Modify: 2016-07-15 15:10:48.613251073 -0400
Change: 2016-07-15 15:10:48.649113873 -0400
 Birth: -

...now to determine whether this is libcephfs not updating the attributes or something else.

#9 Updated by Jeff Layton about 1 year ago

Tracked down the problem with the ctime and it appears to be a fairly simple bug in fill_stat(). It was only looking at the seconds field when comparing the ctime and mtime, but it needs to compare both. I have a (one line) patch that should fix that bug.
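The kind of fix described can be sketched as follows. This is a hypothetical helper, not the actual one-line fill_stat() patch: the point is simply that two timestamps must be compared on both fields, seconds and then nanoseconds, rather than on seconds alone.

```cpp
#include <ctime>

// Compare two timespecs fully. Comparing only tv_sec treats two stamps in
// the same second as equal even when their nanosecond fields differ --
// the bug described above.
static int timespec_cmp(const struct timespec &a, const struct timespec &b) {
  if (a.tv_sec != b.tv_sec)
    return a.tv_sec < b.tv_sec ? -1 : 1;
  if (a.tv_nsec != b.tv_nsec)
    return a.tv_nsec < b.tv_nsec ? -1 : 1;
  return 0;
}
```

In the stat output above, mtime and ctime share the same second (15:10:48) and differ only in nanoseconds, which is exactly the case a seconds-only comparison gets wrong.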

For a while yesterday I was unable to reproduce the nlink issue, but it's back today. Basically, what happens is that after the unlink, ganesha purges its metadata cache, and then will reissue a ll_lookup call for the main file when we go to stat it afterward. When the problem reproduces, libcephfs seems to be satisfying the lookup call out of its cache, instead of querying the server for it. Going over that logic now.

#10 Updated by Jeff Layton about 1 year ago

Successful test -- the lookup after the unlink calls into _do_lookup:

2016-07-19 11:25:15.718608 7f6535be3700  3 client.4117 ll_lookup 0x243a580 testfile
2016-07-19 11:25:15.718616 7f6535be3700 20 client.4117 _lookup have dn testfile mds.-1 ttl 0.000000 seq 0
2016-07-19 11:25:15.718623 7f6535be3700 15 inode.get on 0x243a580 10000000000.head now 3
2016-07-19 11:25:15.718626 7f6535be3700 10 client.4117 _do_lookup on #10000000000/testfile

Unsuccessful test -- no call into _do_lookup:

2016-07-19 11:25:22.897200 7f6535be3700  3 client.4117 ll_getattr 10000000000.head = 0
2016-07-19 11:25:22.897218 7f6535be3700  3 client.4117 ll_lookup 0x243a580 testfile
2016-07-19 11:25:22.897223 7f6535be3700 20 client.4117 _lookup have dn testfile mds.0 ttl 2016-07-19 11:25:52.792341 seq 6
2016-07-19 11:25:22.897234 7f6535be3700 15 inode.get on 0x243b080 100000007d3.head now 3
2016-07-19 11:25:22.897236 7f6535be3700 10 client.4117 _lookup 10000000000.head(faked_ino=0 ref=2 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 mtime=2016-07-19 11:25:22.869444 caps=pAsLsXsFs(0=pAsLsXsFs) parents=0x2442000 0x243a580) testfile = 100000007d3.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={3=0} mode=100664 size=0/4194304 mtime=2016-07-19 11:25:22.819661 caps=pAsXsFs(0=pAsXsFs) objectset[100000007d3 ts 0/0 objects 0 dirty_or_tx 0] parents=0x2442090 0x243b080)

The successful test has "mds.-1", which is dn->lease_mds, whereas it's "mds.0" on the unsuccessful one. I'm guessing that means that the difference is in whether we have a dentry lease.
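The lease test implied by those log lines can be sketched like this. These are illustrative fields and names only; the real Dentry and lease bookkeeping in Client.cc is considerably more involved.

```cpp
#include <cassert>

// Minimal sketch: a dentry with lease_mds == -1 (no lease, as in the
// successful run above) forces a round trip via _do_lookup, while a
// valid, unexpired lease lets _lookup answer from the cache.
struct Dentry {
  int lease_mds;    // -1 means no lease held
  double lease_ttl; // absolute expiry time, seconds
};

bool can_use_cached_dentry(const Dentry &dn, double now) {
  return dn.lease_mds >= 0 && now < dn.lease_ttl;
}
```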

#11 Updated by Jeff Layton about 1 year ago

Ok, I think I sort of get it now. Here's my reproducer:

#!/bin/bash

MOUNTPOINT=/mnt/ganesha

rm -f $MOUNTPOINT/testfile
touch $MOUNTPOINT/testfile
stat $MOUNTPOINT/testfile
ln $MOUNTPOINT/testfile $MOUNTPOINT/testlink
stat $MOUNTPOINT/testfile
rm $MOUNTPOINT/testlink
stat $MOUNTPOINT/testfile

So ganesha does the unlink and purges its metadata cache, so when the GETATTR (aka stat call) comes in it redoes the lookup.

The unlink reply has no attributes for the inode (which makes sense in most cases -- by the time you get there, it's gone).

In ceph-land, when we initially create the file, we're given a lease on "testfile" (which allows us to satisfy lookups involving it from the cache). That lease is not revoked when either the link or the unlink occurs on testlink, so the subsequent ceph_ll_lookup is satisfied out of the cache.
But, ceph_ll_lookup also fills out the struct stat that is supplied in the call, and that also uses the cached attributes.

I'm a little unclear on the right way to fix this. Should the MDS be revoking the lease on "testfile" when "testlink" is unlinked? That seems a little suspect since we didn't do anything with that dentry. What we really want is to ensure that the inode attrs are refetched before we fill_stat.

In NFS-land, we'd flag the inode as having possibly bogus attributes so that when the next ->getattr request comes in, we'd re-fetch the attrs from the server. Is there some equivalent here? After the unlink, we could mark the attrs as being potentially bogus.
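A hedged sketch of that NFS-style invalidation idea (hypothetical names; this is not code that exists in libcephfs, just the shape of the mechanism described above):

```cpp
#include <cassert>

// Sketch: after an unlink touches an inode's links, mark its cached attrs
// as suspect so the next getattr/lookup refetches from the MDS instead of
// filling the stat from stale cached values.
struct CachedInode {
  bool attrs_valid = true;
};

void on_unlink_affecting_inode(CachedInode &in) {
  in.attrs_valid = false;
}

bool needs_getattr_roundtrip(const CachedInode &in) {
  return !in.attrs_valid;
}
```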

Alternately, we could ensure that the MDS sends the inode info in the trace when the link count hasn't gone to zero. That would also likely fix this.

#12 Updated by Jeff Layton about 1 year ago

Actually we could probably just always return the updated inode attrs on unlink. There's always the possibility that we have something holding it open. It would be a bit of a waste in the common case where the link count goes to zero and nothing is holding it open, but it wouldn't be that big a deal, and that would almost certainly take care of this problem on the client.

#13 Updated by Greg Farnum about 1 year ago

I think the actual bug here is that, as you note, ll_lookup calls fill_stat without checking that it has As (and whatever else is needed, like Ls may be needed and probably isn't checked in most places).

Having the MDS return a trace as well (at least, if nlink > 0 now) would be a good performance optimization.
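The point about checking caps before filling a stat can be sketched as a guard in front of the cached fill. The cap constants here are illustrative; the real As/Ls bit values and the cap plumbing in the Client differ.

```cpp
#include <cassert>

// Illustrative stand-ins for the As (auth shared) and Ls (link shared)
// caps; not the real CEPH_CAP_* encodings.
enum : unsigned { CAP_AS = 0x1, CAP_LS = 0x2 };

struct CachedAttrs {
  unsigned held_caps;
  int nlink;
};

// Sketch: only serve nlink from cache when the caps covering it are held;
// otherwise report failure so the caller refetches from the MDS.
bool fill_stat_from_cache(const CachedAttrs &in, int &out_nlink) {
  const unsigned needed = CAP_AS | CAP_LS;
  if ((in.held_caps & needed) != needed)
    return false;
  out_nlink = in.nlink;
  return true;
}
```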

#14 Updated by Jeff Layton about 1 year ago

Ok, I have a couple of small patches that fix the testcase. One is a client-side patch to fix the ctime handling in fill_stat, and the other is a patch to make the MDS send is_target traces in UNLINK replies.

The real fix though is to ensure that when we're doing a fill_stat, we actually hold the necessary caps. This part is a bit more tricky to get right, and requires reworking some of the client code (and me getting a better understanding of the general caps semantics as well).

#15 Updated by Jeff Layton about 1 year ago

Pull request with the fix is up here:

https://github.com/ceph/ceph/pull/10386

#16 Updated by Jeff Layton about 1 year ago

  • Status changed from In Progress to Resolved

Ok, PR is now merged!

#17 Updated by John Spray about 1 year ago

  • Status changed from Resolved to Pending Backport
  • Backport set to jewel

#18 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #16946: jewel: client: nlink count is not maintained correctly added

#19 Updated by Loic Dachary 11 months ago

  • Status changed from Pending Backport to Resolved
