
Bug #46828

cephfs kernel client s390x: Missing the first created directory when running ls

Added by Tuan Hoang over 1 year ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
kcephfs
Crash signature (v1):
Crash signature (v2):

Description

Hi ceph masters,

I am running ceph octopus from the Ubuntu repository on Ubuntu 20.04, s390x arch. The cluster was created manually.

I am seeing the following behavior: when creating directories with the cephfs kernel client, the first created directory is always missing from the output of $ls, BUT if I list it explicitly, it is there. For example:

... creating the cluster ...
... creating the data/metadata cephfs pools ...

# mount -t ceph :/ /mnt -o name=admin

# mkdir /mnt/1

# ls -al /mnt
total 4
drwxr-xr-x  3 root root    1 Aug  4 09:24 .
drwxr-xr-x 18 root root 4096 Apr 30 07:35 ..

# ls -al /mnt/1
total 0
drwxr-xr-x 3 root root 1 Aug  4 09:24 ..

# mkdir /mnt/2

# ls -al /mnt
total 4
drwxr-xr-x  4 root root    2 Aug  4 09:24 .
drwxr-xr-x 18 root root 4096 Apr 30 07:35 ..
drwxr-xr-x  2 root root    0 Aug  4 09:24 2

At first I saw another behavior: all created directories seemed to be missing if they were created less than ~20 seconds after the cluster came up (HEALTH_OK, etc.) and the data/metadata pools were created (that is, before or while the PG autoscaler was running). That is now quite hard to reproduce, and I see the missing-first-directory behavior more often and more consistently, so I think that is the real issue.

I have tried many different clusters with these configurations:
- Different kernels (5.4.0-37 and later on Ubuntu) and ceph deb versions (15.2.1-0ubuntu1 and 15.2.3-0ubuntu0.20.04.1)
- Client machine either a node of the cluster (osd, mds, etc.) or outside the cluster
- Clusters with 1 or 3 MONs/MGRs, 1-3 OSDs, 1-3 MDSs
- Creating the directory during/after PG autoscaling and/or a few seconds after the data/metadata pools are created

In the meantime, I am trying an x86_64 client instead of an s390x client; the cluster is still pure s390x. This behavior does not exist on a pure x86_64 cluster+client.

I have tried unmounting and remounting the cephfs on the same and on different client machines; the directory is still missing from $ls output but shows up when listed explicitly. So I guess it is an issue with the MDS rather than with the kernel cephfs client: the cephfs kernel driver seems to correctly tell the RADOS backend (and the MDS?) to create the directory, because the directory can be read from the data pool when addressed directly. It is only listdir that fails.

I wonder whether you have seen symptoms like this in the past. Also, the logs don't seem to show anything out of order; if you have any suggestions about which part of the logs I should look into, please kindly let me know.

Also, I think this is

Thanks !

0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch View - v4 (11.7 KB) Jeff Layton, 08/18/2020 01:50 PM

History

#1 Updated by Tuan Hoang over 1 year ago

This also applies to the first created file.

#2 Updated by Tuan Hoang over 1 year ago

The x86_64 client can still see the first created file on the s390x ceph cluster, regardless of whether the x86_64 client or the s390x client created the files/folders.

... on either x86_64 or s390x client, mount cephfs at /mnt ...

# cd /mnt
# touch 1-file; mkdir 1-dir; touch 1-dir/2-file

On x86_64 client :

root@ubuntu20x86:/mnt# ls -al /mnt/
total 4
drwxr-xr-x  3 root root    2 Aug 10 11:51 .
drwxr-xr-x 19 root root 4096 Aug 10 10:54 ..
drwxr-xr-x  2 root root    1 Aug 10 11:51 1-dir
-rw-r--r--  1 root root    0 Aug 10 11:51 1-file
root@ubuntu20x86:/mnt# ls -al /mnt/1-dir/
total 0
drwxr-xr-x 2 root root 1 Aug 10 11:51 .
drwxr-xr-x 3 root root 2 Aug 10 11:51 ..
-rw-r--r-- 1 root root 0 Aug 10 11:51 2-file

On s390x client:
root@ubuntu20s390x:/tmp# ls -al /mnt/
total 4
drwxr-xr-x  3 root root    2 Aug 10 11:51 .
drwxr-xr-x 19 root root 4096 Aug 10 10:05 ..
drwxr-xr-x  2 root root    1 Aug 10 11:51 1-dir
root@ubuntus390x:/tmp# ls -al /mnt/1-dir/
total 0
drwxr-xr-x 2 root root 1 Aug 10 11:51 .
drwxr-xr-x 3 root root 2 Aug 10 11:51 ..
-rw-r--r-- 1 root root 0 Aug 10 11:51 2-file

#3 Updated by Tuan Hoang over 1 year ago

Using $find:

root@ubuntu20x86:/home/tmhoang# find /mnt/ | xargs ls -l
-rw-r--r-- 1 root root 0 Aug 10 11:58 /mnt/1-dir/2-file
-rw-r--r-- 1 root root 0 Aug 10 11:58 /mnt/1-file

/mnt/:
total 0
drwxr-xr-x 2 root root 1 Aug 10 11:58 1-dir
-rw-r--r-- 1 root root 0 Aug 10 11:58 1-file

/mnt/1-dir:
total 0
-rw-r--r-- 1 root root 0 Aug 10 11:58 2-file
root@ubuntu20s390x:/mnt# find /mnt/ | xargs ls -l 
find: File system loop detected; ‘/mnt/1-dir’ is part of the same file system loop as ‘/mnt/’.
total 0
drwxr-xr-x 2 root root 1 Aug 10 11:58 1-dir


#4 Updated by Jeff Layton over 1 year ago

I don't have anything specific to suggest. I concur that it does sound like a bug in the kclient though, and I'd suspect that it's some sort of endianness problem. As to where it is...I'm not sure. Since it mostly manifests in readdir, I'd focus on the readdir handling routines -- ceph_readdir, parse_reply_info_readdir, etc.

#5 Updated by Tuan Hoang over 1 year ago

So I see that the inode of the first file/directory is 0, which seems wrong. Normally on x86_64, the first inode is 1099511627776 (0x10000000000). So it very much looks like a byte flip is missing somewhere.

Ulrich (https://tracker.ceph.com/users/10290) came up with this patch, which shares some similarities with kernel commit 1ee9f4bd1a97026a7b2d7ae9f1f74b45680d0003, and it seems to work:

diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index bb12c9f..b472acd 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -449,29 +449,26 @@ static inline u32 ceph_ino_to_ino32(__u64 vino)
  */
 static inline ino_t ceph_vino_to_ino(struct ceph_vino vino)
 {
-#if BITS_PER_LONG == 32
-    return ceph_ino_to_ino32(vino.ino);
-#else
-    return (ino_t)vino.ino;
-#endif
+    if ((sizeof(ino_t)) < (sizeof(u64)))
+        return ceph_ino_to_ino32(vino.ino);
+    else
+        return (ino_t)vino.ino;
 }

 /*
  * user-visible ino (stat, filldir)
  */
-#if BITS_PER_LONG == 32
 static inline ino_t ceph_translate_ino(struct super_block *sb, ino_t ino)
 {
-    return ino;
-}
-#else
-static inline ino_t ceph_translate_ino(struct super_block *sb, ino_t ino)
-{
-    if (ceph_test_mount_opt(ceph_sb_to_client(sb), INO32))
-        ino = ceph_ino_to_ino32(ino);
-    return ino;
+    if ((sizeof(ino_t)) < (sizeof(u64))) {
+        return ino;
+    }
+    else {
+        if (ceph_test_mount_opt(ceph_sb_to_client(sb), INO32))
+            ino = ceph_ino_to_ino32(ino);
+        return ino;
+    }
 }
-#endif

 /* for printf-style formatting */

To my understanding of the situation, while BITS_PER_LONG on s390x is 64, s390x's ino_t is 32-bit, so casting (ino_t)vino.ino is wrong. As to why s390x's ino_t is 32-bit, we are seeking help from the s390x kernel maintainers.

#6 Updated by Tuan Hoang over 1 year ago

Below is the inode information on the s390x and x86_64 ceph clients respectively, with things working.

The first inode on s390x is now 256, then 257, 258, and so on. Previously they were 0, 1, and 2.

root@ubuntu2004s390x:/home/tmhoang/# find /mnt/ | xargs stat
  File: /mnt/2
  Size: 1             Blocks: 0          IO Block: 65536  directory
Device: 34h/52d    Inode: 257         Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-13 11:13:02.356384026 -0400
Modify: 2020-08-13 11:12:51.256374030 -0400
Change: 2020-08-13 11:12:51.256374030 -0400
 Birth: -
  File: /mnt/2/3
  Size: 2             Blocks: 1          IO Block: 4194304 regular file
Device: 34h/52d    Inode: 258         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-13 11:12:51.256374030 -0400
Modify: 2020-08-13 11:12:54.786374030 -0400
Change: 2020-08-13 11:12:54.786374030 -0400
 Birth: -
  File: /mnt/1
  Size: 0             Blocks: 0          IO Block: 65536  directory
Device: 34h/52d    Inode: 256         Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-13 11:12:58.216374030 -0400
Modify: 2020-08-13 11:12:39.816374030 -0400
Change: 2020-08-13 11:12:39.816374030 -0400
 Birth: -


root@ubuntu20x86:/home/tmhoang# find /mnt/| xargs stat
  File: /mnt/2
  Size: 1             Blocks: 0          IO Block: 65536  directory
Device: 32h/50d    Inode: 1099511627777  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-13 12:01:02.871887377 -0400
Modify: 2020-08-13 11:12:51.256374030 -0400
Change: 2020-08-13 11:12:51.256374030 -0400
 Birth: -
  File: /mnt/2/3
  Size: 2             Blocks: 1          IO Block: 4194304 regular file
Device: 32h/50d    Inode: 1099511627778  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-13 11:12:51.256374030 -0400
Modify: 2020-08-13 11:12:54.786374030 -0400
Change: 2020-08-13 11:12:54.786374030 -0400
 Birth: -
  File: /mnt/1
  Size: 0             Blocks: 0          IO Block: 65536  directory
Device: 32h/50d    Inode: 1099511627776  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-13 12:01:02.871887377 -0400
Modify: 2020-08-13 11:12:39.816374030 -0400
Change: 2020-08-13 11:12:39.816374030 -0400
 Birth: -

#7 Updated by Jeff Layton over 1 year ago

Interesting! The patch looks sane at first glance, but what would be best is to post this patch so we can debate it publicly.

Incidentally, do you have the same issue with NFS? It looks like the is_32bit_api() function in the NFS driver might need similar adjustment.

#8 Updated by Jeff Layton over 1 year ago

Actually, I take it back...

The stat command will use stat64() or statx() to do this, and that has a 64-bit field for the inode number. I think the thing to do here is to make ceph fill out the inode number from ci->i_vino in ceph_getattr (and in readdir).

#9 Updated by Jeff Layton over 1 year ago

  • File 0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch added

How about something like this (untested) patch? This should make the driver return 64-bit inode numbers regardless of arch, and leave the upper kernel layers (or userland) to deal with squashing it down if necessary.

#10 Updated by Jeff Layton over 1 year ago

  • File 0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch added

Revised patch -- let's fix up some of the dout messages too...

#11 Updated by Ulrich Weigand over 1 year ago

If we can get by with using 64-bit inode numbers everywhere, that is of course great. However, looking at your patch, it seems it replaces all uses of ceph_translate_ino, but not the uses of ceph_vino_to_ino, which also calls the hash function.

Can we replace those uses too? If not, it would seem we still need to fix the hash function call in ceph_vino_to_ino.

(Also, if ceph_translate_ino is never called anymore, maybe it should be removed?)

#12 Updated by Jeff Layton over 1 year ago

  • File 0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch added
  • Assignee set to Jeff Layton

Third patch...fixes a couple of compiler warnings.

#13 Updated by Jeff Layton over 1 year ago

  • File deleted (0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch)

#14 Updated by Jeff Layton over 1 year ago

  • File deleted (0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch)

#15 Updated by Jeff Layton over 1 year ago

Yes, we can remove that stuff too. It was just used to generate a hash chain index for the inode table, but I think the lower bits of the inode number will work just as well. Can you give this a test and see whether it fixes the issues for you?

There is a small risk with this patchset on real 32-bit arches. If someone is running really old binaries that use old-school stat() calls, they may hit EOVERFLOW errors when they see inode numbers this large. We could mitigate that the same way NFS did, with an enable_ino64 module option, but it's probably not worthwhile; I'm not aware of anyone using that option on NFS in the last decade.

#16 Updated by Jeff Layton over 1 year ago

  • File deleted (0001-ceph-fix-inode-number-presentation-on-32-bit-platfor.patch)

#17 Updated by Tuan Hoang over 1 year ago

Your v3 and v4 patches both work with mainline kernel 5.9-rc1 (I was using 5.4.x in Ubuntu 20.04). Inodes now start from 1099511627776.

#18 Updated by Ulrich Weigand over 1 year ago

One more comment on the latest patch:

@@ -324,17 +322,14 @@ static int ceph_readdir(struct file *file, struct dir_context *ctx)
     /* always start with . and .. */
     if (ctx->pos == 0) {
         dout("readdir off 0 -> '.'\n");
-        if (!dir_emit(ctx, ".", 1, 
-                ceph_translate_ino(inode->i_sb, inode->i_ino),
+        if (!dir_emit(ctx, ".", 1, ceph_ino(inode),
                 inode->i_mode >> 12))
             return 0;
         ctx->pos = 1;
     }
     if (ctx->pos == 1) {
-        ino_t ino = parent_ino(file->f_path.dentry);
         dout("readdir off 1 -> '..'\n");
-        if (!dir_emit(ctx, "..", 2,
-                ceph_translate_ino(inode->i_sb, ino),
+        if (!dir_emit(ctx, "..", 2, ceph_ino(inode),
                 inode->i_mode >> 12))
             return 0;
         ctx->pos = 2;

In that second if, doesn't this now use the inode of the current directory for ".." where it should use that of the parent directory? This was one of the places where I wasn't sure how to fix things properly, as parent_ino returns ino_t ...

#19 Updated by Jeff Layton over 1 year ago

OK, I reworked the thing once more and sent Tuan and Ulrich a patch. I think the right thing to do here is to reduce the code's dependency on inode->i_ino altogether. See:

https://lore.kernel.org/ceph-devel/20200818162316.389462-1-jlayton@kernel.org/T/#u

Hopefully this should be sane for all arches.

#20 Updated by Tuan Hoang about 1 year ago

Fixed in kernel commit : ebce3eb2f7ef9f6ef01a60874ebd232450107c9a
Closing this. Many thanks.

#21 Updated by Tuan Hoang about 1 year ago

Could somebody help me to close this one ? I don't see a change status button. Thanks.

#22 Updated by Greg Farnum 9 months ago

  • Project changed from Ceph to Linux kernel client
  • Status changed from New to Resolved
