Bug #45283

Kernel log flood "ceph: Failed to find inode for 1"

Added by Michael Robertson over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Rook v1.2.7 (official chart) and ceph v14.2.9 in an AKS cluster with VMSS.
OS provided by AKS is currently Ubuntu 16.04.6 LTS, kernel 4.15.0-1071-azure.

Every block written by a pod to a ceph CSI volume generates 2 warning lines in the node's system logs (kern.log, syslog, messages, warn):
"Apr 24 09:37:46 aks-<nodename> kernel: [242123.654538] ceph: Failed to find inode for 1"
Under production load, eventually the node succumbs to DiskPressure as the drive fills up.
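
To gauge how quickly the warning accumulates, something like the following can count the matching lines in one of the log files. This is only a minimal sketch: the helper name `count_ceph_inode_warnings` and the default path `/var/log/kern.log` are assumptions for illustration, not part of any ceph or Rook tooling.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines that end with the warning
# quoted in this report. The default log path is an assumption.
count_ceph_inode_warnings() {
    logfile="${1:-/var/log/kern.log}"
    # -c prints the number of matching lines; the trailing '$' anchors the
    # match so that only inode 1 (not 10, 12, ...) is counted.
    # grep exits with status 1 when nothing matches, so '|| true' keeps the
    # helper's own exit status at 0 in that case.
    grep -c 'ceph: Failed to find inode for 1$' "$logfile" || true
}
```

Calling `count_ceph_inode_warnings /var/log/syslog` before and after a test write gives a rough per-write line count.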

Jos Collin in Slack says: "It looks like the MDS is sending CEPH_MSG_CLIENT_QUOTA messages to the client with the root ino in them and the client doesn't recognise that inode (likely because it didn't mount the root). Inode 1 is typically the root of the fs."
Submitting ticket here, as requested.

One more thing: I forced an OS upgrade on an experimental node to Ubuntu 18 with kernel 5.0.0. The warning goes away, and write speed doubles.
However, this OS is not GA in AKS yet.

History

#1 Updated by Luis Henriques over 1 year ago

My memory on this code starts to be somewhat blurry, but looking at git log (in mainline) the following 4 patches seem to be relevant to this issue:

d557c48db730 ("ceph: quota: add counter for snaprealms with quota")
e3161f17d926 ("ceph: quota: cache inode pointer in ceph_snap_realm")
0eb6bbe4d9cf ("ceph: fix root quota realm check")
2596366907f8 ("ceph: don't check quota for snap inode")

Looking at the Ubuntu kernel, I see that it only includes backports of 2 of them (d557c48db730 and 0eb6bbe4d9cf).

Commit e3161f17d926 in particular is optimizing the realms hierarchy walkthrough. Picking this patch should fix that warning, I believe.
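
A rough way to repeat this check against a checked-out kernel tree is sketched below. It assumes that distro backports keep the upstream subject line while receiving new commit hashes, so it compares subjects rather than SHAs. The helper name `check_quota_backports` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which of the four quota patches listed above
# appear, by exact subject line, in the history of a kernel git tree.
# Assumption: backported commits keep the upstream subject unchanged.
check_quota_backports() {
    tree="${1:-.}"
    while IFS= read -r subject; do
        # List every commit subject reachable from HEAD and look for an
        # exact, whole-line, fixed-string match (-x -F). Output is sent to
        # /dev/null instead of using -q so grep reads the whole stream.
        if git -C "$tree" log --format=%s | grep -xF -- "$subject" >/dev/null; then
            echo "present: $subject"
        else
            echo "missing: $subject"
        fi
    done <<'EOF'
ceph: quota: add counter for snaprealms with quota
ceph: quota: cache inode pointer in ceph_snap_realm
ceph: fix root quota realm check
ceph: don't check quota for snap inode
EOF
}
```

Run it as `check_quota_backports /path/to/kernel-tree`. A "missing" line only means that no commit with that exact subject was found in the checked-out branch.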

#2 Updated by Michael Robertson over 1 year ago

Sounds good. Anything I can do to support the process?

#3 Updated by Luis Henriques over 1 year ago

Michael Robertson wrote:

Sounds good. Anything I can do to support the process?

I believe that the best thing to do is to open a bug report in Ubuntu and see if they can get these missing patches backported into the appropriate kernel.

I've done a quick test and, after compiling the bionic kernel 4.15.0-96.97 (the latest released), I can reproduce the issue. Cherry-picking the 2 missing commits (2596366907f8 and e3161f17d926) fixes it.

#4 Updated by Jeff Layton over 1 year ago

  • Assignee set to Luis Henriques

Thanks for taking a look, Luis. Assigning this to you for now. Feel free to close as you see fit.

#5 Updated by Michael Robertson over 1 year ago

Cool, thanks guys.
I reviewed the Ubuntu bug reporting guidelines and created a Launchpad account.
The bug report requires a package name. Would that be "ceph" in this case?

#6 Updated by Luis Henriques over 1 year ago

Michael Robertson wrote:

Cool, thanks guys.
I reviewed the Ubuntu bug reporting guidelines and created a Launchpad account.
The bug report requires a package name. Would that be "ceph" in this case?

From your description, I believe it would be 'linux-azure': https://launchpad.net/ubuntu/xenial/+source/linux-azure

#7 Updated by Michael Robertson over 1 year ago

Thanks Luis, I appreciate the help.

https://bugs.launchpad.net/ubuntu/+source/linux-meta-azure/+bug/1875884

I think that's all we can do here. Do with this ticket as you will.

#8 Updated by Michael Robertson over 1 year ago

Hmm, they are requesting details on how to reproduce the problem. I don't think they would appreciate me directing them to create an AKS cluster and install Rook, which is the only way I currently know. :)
Have you a simple testcase setup I can share? I'll check through the ceph docs to see what I can find.

#9 Updated by Luis Henriques over 1 year ago

Michael Robertson wrote:

Hmm, they are requesting details on how to reproduce the problem. I don't think they would appreciate me directing them to create an AKS cluster and install Rook, which is the only way I currently know. :)
Have you a simple testcase setup I can share? I'll check through the ceph docs to see what I can find.

Sure, something like this will reproduce the issue:

# mount -t ceph <mon>:<port>:/ /mnt/ceph -o name=admin,secret=<my-secret>
# mkdir /mnt/ceph/quotadir
# setfattr -n ceph.quota.max_files -v 10 /mnt/ceph/quotadir
# umount /mnt/ceph
# mount -t ceph <mon>:<port>:/quotadir /mnt/ceph -o name=admin,secret=<my-secret> # <== Note the 'quotadir' here!!!
# touch /mnt/ceph/newfile

#10 Updated by Patrick Donnelly over 1 year ago

  • Status changed from New to Triaged
  • Component(FS) kceph added

#11 Updated by Luis Henriques over 1 year ago

  • Status changed from Triaged to Closed

Closing; the issue is being handled by the Ubuntu kernel team in the Launchpad bug linked in comment #7.
