Feature #56140

cephfs: tooling to identify inode (metadata) corruption

Added by Venky Shankar 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Category: fsck/damage handling
Target version:
% Done: 0%
Source:
Tags:
Backport: quincy
Reviewed:
Affected Versions:
Component(FS): tools
Labels (FS):
Pull request ID:

Related issues

Related to CephFS - Bug #54546: mds: crash due to corrupt inode and omap entry (status: New)

History

#1 Updated by Venky Shankar 5 months ago

  • Subject changed from "cephfs: tooling to identify inode (metadata" to "cephfs: tooling to identify inode (metadata) corruption"
  • Category set to fsck/damage handling
  • Assignee set to Patrick Donnelly
  • Target version set to v18.0.0
  • Backport set to quincy
  • Component(FS) tools added

Tracker https://tracker.ceph.com/issues/54546 relates to metadata corruption seen when running databases (especially PostgreSQL) on the Ceph File System. While we work on identifying the bug, we need to give users a way to detect such corruption, which might already exist in their clusters (possible if they ran PostgreSQL on CephFS).

The tool should scan the inodes in the metadata pool and alert on any anomalies in file metadata (and optionally fix them, which is a bit tricky at the moment). For example, the corruption mentioned above clobbers an inode's "first" field, setting it to -2 (i.e. "head") or to some other value that is not sane. We may want to include these checks as part of file system scrub.
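A minimal Python sketch of the check described above follows. It assumes the dentries have already been read from the metadata pool and decoded (that step is not shown). `DentryRecord`, `first_is_sane`, `find_corrupt`, and the `next_snap` bound are hypothetical names chosen for illustration and are not part of any Ceph tool; only the value of CEPH_NOSNAP, (u64)-2, printed as "head", comes from Ceph itself.

```python
from dataclasses import dataclass
from typing import Iterable, List

# CEPH_NOSNAP ("head") is defined in Ceph as (u64)-2.
CEPH_NOSNAP = (1 << 64) - 2


@dataclass(frozen=True)
class DentryRecord:
    """Hypothetical, already-decoded view of one dentry from the metadata pool."""
    dirfrag: str  # directory fragment object name (illustrative only)
    name: str     # dentry name
    first: int    # decoded "first" snapid, as an unsigned 64-bit value


def first_is_sane(first: int, next_snap: int) -> bool:
    """Return True if `first` looks like a plausible snapid.

    Assumption: a clobbered value is either CEPH_NOSNAP ("head") or larger
    than `next_snap`, the next snapshot id the file system would allocate.
    """
    if first == CEPH_NOSNAP:
        return False
    return first <= next_snap


def find_corrupt(records: Iterable[DentryRecord], next_snap: int) -> List[DentryRecord]:
    """Report-only scan: return the records whose "first" field fails the check."""
    return [r for r in records if not first_is_sane(r.first, next_snap)]


if __name__ == "__main__":
    sample = [
        DentryRecord("10000000000.00000000", "PG_VERSION", 2),
        DentryRecord("10000000000.00000000", "postmaster.pid", CEPH_NOSNAP),
        DentryRecord("10000000001.00000000", "16384", 1 << 40),
    ]
    for r in find_corrupt(sample, next_snap=5):
        print(f"{r.dirfrag}/{r.name}: first={r.first:#x}")
```

The sketch only reports; it never rewrites `first`, since choosing a correct replacement value is the tricky part of any fix.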

#2 Updated by Patrick Donnelly 5 months ago

Normally, metadata pool scans are performed by MDS scrub. Are you suggesting a separate tool?

#3 Updated by Venky Shankar 5 months ago

Patrick Donnelly wrote:

Normally, metadata pool scans are performed by MDS scrub. Are you suggesting a separate tool?

I'm fine with having this as part of scrub.

#4 Updated by Venky Shankar 5 months ago

Posting an update here based on a discussion between Greg, Patrick, and me:

Short-term plan: a helper script to identify corrupted files. This will be an offline script (it does not require ceph-mds to be running) that lists corrupted files. The script will not perform any remedial action on the corrupted files. The idea is to give users a tool to assess the extent of corruption in the file system and to perform any backup operations if required (such as copying files out of the file system). This tool will be maintained in the ceph repo.
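An offline, list-only report of this kind could look roughly like the following sketch. It assumes a hypothetical dump of decoded primary dentries (parent inode number, name, inode number, `first`) obtained without a running ceph-mds; reading and decoding the metadata pool objects is not shown, and hard links are ignored. `Dentry`, `dir_path`, `list_corrupted`, and `next_snap` are illustrative names, not taken from the tool referenced in the pull request. The only Ceph constants used are CEPH_NOSNAP, (u64)-2, and the root directory inode number, 1.

```python
import posixpath
from typing import Dict, List, NamedTuple, Optional

CEPH_NOSNAP = (1 << 64) - 2  # "head" snapid, (u64)-2 in Ceph
ROOT_INO = 0x1               # inode number of the CephFS root directory


class Dentry(NamedTuple):
    """Hypothetical, already-decoded primary dentry (hard links ignored)."""
    parent_ino: int  # inode number of the containing directory
    name: str
    ino: int         # inode number the dentry points to
    first: int       # decoded "first" snapid


def dir_path(ino: int, by_ino: Dict[int, Dentry]) -> Optional[str]:
    """Rebuild a directory's path by following parent links up to the root."""
    parts: List[str] = []
    seen = set()
    while ino != ROOT_INO:
        if ino in seen or ino not in by_ino:
            return None  # missing ancestor or a loop: path cannot be resolved
        seen.add(ino)
        parts.append(by_ino[ino].name)
        ino = by_ino[ino].parent_ino
    return "/" + "/".join(reversed(parts))


def list_corrupted(dentries: List[Dentry], next_snap: int) -> List[str]:
    """Return sorted paths of files whose "first" is not sane; changes nothing."""
    by_ino = {d.ino: d for d in dentries}
    bad: List[str] = []
    for d in dentries:
        # Assumption: same sanity rule as a simple check, NOSNAP or beyond next_snap.
        if d.first == CEPH_NOSNAP or d.first > next_snap:
            parent = dir_path(d.parent_ino, by_ino)
            if parent is None:
                bad.append(f"<unresolved dir {d.parent_ino:#x}>/{d.name}")
            else:
                bad.append(posixpath.join(parent, d.name))
    return sorted(bad)


if __name__ == "__main__":
    dump = [
        Dentry(ROOT_INO, "pgdata", 0x10000000000, 2),
        Dentry(0x10000000000, "pg_wal", 0x10000000001, 2),
        Dentry(0x10000000001, "000000010000000000000001", 0x10000000002, CEPH_NOSNAP),
        Dentry(0x1000000FFFF, "orphan", 0x10000000003, 1 << 40),
    ]
    paths = list_corrupted(dump, next_snap=5)
    for p in paths:
        print(p)
    print(f"{len(paths)} of {len(dump)} dentries look corrupted")
```

Printing full paths rather than raw object names lets users decide which files to copy out before any repair; entries whose ancestry cannot be resolved are still listed so the count stays honest.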

Long-term plan: fix the bug (obviously!), but at the same time assess whether it is really necessary to assert when certain conditions are not satisfied (such as the one in the unlink path).

#5 Updated by Patrick Donnelly 4 months ago

  • Parent task deleted (#54546)

#6 Updated by Patrick Donnelly 4 months ago

  • Related to Bug #54546: mds: crash due to corrupt inode and omap entry added

#7 Updated by Patrick Donnelly 4 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 47542

#8 Updated by Venky Shankar 3 months ago

  • Status changed from Fix Under Review to Resolved
