Bug #18532

open

mds: forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able)

Added by Dan Mick over 7 years ago. Updated about 4 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
fsck/damage handling
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
scrub
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A directory in the long-running cluster has somehow ended up with corrupted file and subdirectory counts, and "rm -rf" on it fails with "cannot remove, directory not empty". There are no visible files in the directory (find -type f | xargs rm had already completed successfully). scrub_path repair notices the error but does not resolve it. The directory's rstats are similarly corrupt.

ls -ld /a/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi/541839/remote/smithi059/log

drwxrwxr-x 1 teuthworker teuthworker 18446744073315575061 Jan 11 01:05 /a/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi/541839/remote/smithi059/log

excerpt from daemon dump tree:

    {
        "ino": 1100004785907,
        "rdev": 0,
        "ctime": "2017-01-11 01:05:34.995307",
        "btime": "0.000000",
        "mode": 16893,
        "uid": 1001,
        "gid": 1001,
        "nlink": 1,
        "dir_layout": {
            "dir_hash": 2
        },
        "layout": {
            "stripe_unit": 0,
            "stripe_count": 0,
            "object_size": 0,
            "pool_id": -1,
            "pool_ns": "" 
        },
        "old_pools": [],
        "size": 0,
        "truncate_seq": 1,
        "truncate_size": 18446744073709551615,
        "truncate_from": 0,
        "truncate_pending": 0,
        "mtime": "2017-01-11 01:05:34.995307",
        "atime": "2016-11-12 10:08:48.440278",
        "time_warp_seq": 0,
        "change_attr": 9699,
        "client_ranges": [],
        "dirstat": {
            "version": 1,
            "mtime": "2017-01-11 01:05:34.995307",
            "num_files": 18446744073709550716,
            "num_subdirs": 18446744073709551615
        },
        "rstat": {
            "version": 9,
            "rbytes": 18446744073315575061,
            "rfiles": 18446744073709550711,
            "rsubdirs": 0,
            "rsnaprealms": 0,
            "rctime": "2017-01-11 01:05:34.995307" 
        },
        "accounted_rstat": {
            "version": 9,
            "rbytes": 18446744073315575061,
            "rfiles": 18446744073709550711,
            "rsubdirs": 0,
            "rsnaprealms": 0,
            "rctime": "2017-01-11 01:05:34.995307" 
        },
        "version": 56343,
        "file_data_version": 0,
        "xattr_version": 1,
        "backtrace_version": 2,
        "stray_prior_path": "",
        "symlink": "",
        "old_inodes": [],
        "dirfragtree": {
            "splits": []
        },
        "is_auth": true,
        "auth_state": {
            "replicas": {}
        },
        "replica_state": {
            "authority": [
                0,
                -2
            ],
            "replica_nonce": 0
        },
        "auth_pins": 0,
        "nested_auth_pins": 0,
        "is_frozen": false,
        "is_freezing": false,
        "pins": {
            "request": 0,
            "lock": 0,
            "dirfrag": 0,
            "caps": 1,
            "scrubqueue": 0,
            "authpin": 0
        },
        "nref": 1,
        "versionlock": {
            "gather_set": [],
            "num_client_lease": 0,
            "num_rdlocks": 0,
            "num_wrlocks": 0,
            "num_xlocks": 0,
            "xlock_by": {}
        },
        "authlock": {},
        "linklock": {},
        "dirfragtreelock": {},
        "filelock": {
            "gather_set": [],
            "num_client_lease": 0,
            "num_rdlocks": 0,
            "num_wrlocks": 0,
            "num_xlocks": 0,
            "xlock_by": {}
        },
        "xattrlock": {},
        "snaplock": {},
        "nestlock": {
            "gather_set": [],
            "num_client_lease": 0,
            "num_rdlocks": 0,
            "num_wrlocks": 0,
            "num_xlocks": 0,
            "xlock_by": {}
        },
        "flocklock": {},
        "policylock": {},
        "states": [
            "auth" 
        ],
        "client_caps": [
            {
                "client_id": 25937231,
                "pending": "pAsLsXsFsx",
                "issued": "pAsLsXsFsx",
                "wanted": "-",
                "last_sent": "Asx" 
            }
        ],
        "loner": 25937231,
        "want_loner": 25937231,
        "mds_caps_wanted": [],
        "dirfrags": [
            {
                "path": "\/teuthology-archive\/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi\/541839\/remote\/smithi059\/log",
                "dirfrag": "1001d64fef3",
                "snapid_first": 2,
                "projected_version": "84312",
                "version": "84312",
                "committing_version": "84312",
                "committed_version": "84312",
                "is_rep": false,
                "dir_auth": "",
                "states": [
                    "auth",
                    "complete" 
                ],
                "is_auth": true,
                "auth_state": {
                    "replicas": {}
                },
                "replica_state": {
                    "authority": [
                        0,
                        -2
                    ],
                    "replica_nonce": 0
                },
                "auth_pins": 0,
                "nested_auth_pins": 0,
                "is_frozen": false,
                "is_freezing": false,
                "pins": {
                    "waiter": 0,
                    "authpin": 0
                },
                "nref": 0,
                "dentries": []
            }
        ]
    }
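The very large numbers in the dump are what small negative counters look like when printed as unsigned 64-bit integers. The following is a minimal sketch in Python that reinterprets the four suspicious fields as signed values. The helper name `as_signed64` is hypothetical, and reading these as underflowed counters is an interpretation of the dump, not something confirmed in this report.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


# Values copied from the dirstat and rstat sections of the dump above.
suspicious = {
    "dirstat.num_files": 18446744073709550716,
    "dirstat.num_subdirs": 18446744073709551615,
    "rstat.rbytes": 18446744073315575061,
    "rstat.rfiles": 18446744073709550711,
}

for name, raw in suspicious.items():
    print(f"{name}: {raw} -> {as_signed64(raw)}")
```

Note that the rstat rbytes value is the same number that ls -ld reports as the directory size.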

Actions #1

Updated by John Spray over 7 years ago

  • Subject changed from subdir with corrupted dirstat is un-rm-able to Forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able)
  • Category set to fsck/damage handling
  • Target version set to v12.0.0
  • Backport set to jewel kraken
  • Component(FS) MDS added

Without having looked into this in detail yet, my presumption would be that the bug is that the repair code isn't fixing the stats -- I think the refusal to delete is probably not a bug in itself.

Actions #2

Updated by Zheng Yan over 7 years ago

maybe there is a bad remote link in the directory

Actions #3

Updated by Dan Mick about 7 years ago

I don't know how to repair this or even identify other instances.

Actions #4

Updated by Zheng Yan about 7 years ago

Fixed by:

ceph daemon mds.mira049 scrub_path /teuthology-archive/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi/541839/remote/smithi059/log repair recursive force

Actions #5

Updated by Zheng Yan about 7 years ago

"ceph daemon mds.mira049 scrub_path / repair recursive force" will find and fix any other issues. But it will take a very long time; I don't know if it's worth the effort.

Actions #6

Updated by Zheng Yan about 7 years ago

  • Status changed from New to 4
Actions #7

Updated by Dan Mick about 7 years ago

I would have sworn Greg directed me to try that, but perhaps we didn't include 'force'. Shrug. Thanks for the help.

I have a multi-day 'find' command looking for huge directories that I'll let continue; it's found a few more so far:

teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi
teuthology-2016-12-21_10:00:22-rbd-jewel-distro-basic-smithi

although, you know, actually, it occurs to me, I only need to check the toplevel directories.

I will scrub the bad ones when I get a full list.

Actions #8

Updated by Dan Mick about 7 years ago

Duh, ls was fine:

ls -ld * | sort -n -k 5
drwxrwxr-x 1 1001 1001 18446744057908416832 Jan 23 02:23 teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi
drwxrwxr-x 1 1001 1001 18446744073298154785 Jan 23 04:49 teuthology-2016-12-19_19:25:02-upgrade:hammer-jewel-x-kraken-distro-basic-vps
drwxrwxr-x 1 1001 1001 18446744073703811907 Jan 23 00:30 teuthology-2016-12-21_10:00:22-rbd-jewel-distro-basic-smithi

scrubbing those now.
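The same check can be scripted rather than eyeballed from the sorted ls output. This is a rough Python sketch, assuming (as the listing above suggests) that the size reported by stat on a directory reflects its recursive byte count. The function names and the 2**63 threshold are hypothetical choices for illustration; a size may surface as a huge positive number or as a negative one depending on how the platform converts it, so both are treated as suspect.

```python
import os


def looks_wrapped(size: int) -> bool:
    """True if a size looks like a negative counter printed as unsigned 64-bit."""
    return size < 0 or size >= 2**63


def find_suspect_dirs(root: str) -> list[tuple[str, int]]:
    """Return (name, size) for top-level directories under root with suspect sizes."""
    suspects = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                if looks_wrapped(size):
                    suspects.append((entry.name, size))
    # Sort by size, mirroring "sort -n -k 5" on the ls output.
    return sorted(suspects, key=lambda item: item[1])
```

Only the top-level entries are examined, in line with the earlier observation that checking the toplevel directories is enough.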

Actions #9

Updated by Dan Mick about 7 years ago

# ceph daemon mds.mira049 scrub_path teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi repair recursive force
{
    "return_code": -116
}
Actions #10

Updated by Dan Mick about 7 years ago

They all return ESTALE. Not sure what else I need to be doing
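The -116 return code above is a negated errno value, and on Linux errno 116 is ESTALE. A small Python sketch for decoding such return codes follows; `describe_return_code` is a hypothetical helper, and the number-to-name mapping it prints depends on the platform it runs on.

```python
import errno
import os


def describe_return_code(return_code: int) -> tuple[str, str]:
    """Map a negated-errno return code to its symbolic name and message."""
    code = abs(return_code)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The return code reported by scrub_path in the comment above.
print(describe_return_code(-116))
```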

Actions #11

Updated by Dan Mick about 7 years ago

Tried them again tonight after repairing the broken stray object, and they worked this time. <shrug>

I guess the answer should have been scrub_path, but I don't know why sometimes it returns ESTALE.

Actions #12

Updated by Zheng Yan about 7 years ago

Dan Mick wrote:

[...]

teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi is not in root directory, it's in teuthology-archive
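To avoid passing a path relative to the wrong directory, the scrub command can be assembled from the full path under the filesystem root. This is a sketch only: the command words are taken from the invocation in comment #4, while `build_scrub_command` and its refusal of relative paths are hypothetical precautions. This thread does not establish that the relative path was the cause of the ESTALE results.

```python
import posixpath
import shlex


def build_scrub_command(mds_name: str, fs_path: str) -> list[str]:
    """Build the scrub_path argv used in this thread for an absolute fs path."""
    if not fs_path.startswith("/"):
        raise ValueError("fs_path must be absolute from the filesystem root")
    return [
        "ceph", "daemon", f"mds.{mds_name}",
        "scrub_path", fs_path,
        "repair", "recursive", "force",
    ]


archive_root = "/teuthology-archive"
run_dir = "teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi"
cmd = build_scrub_command("mira049", posixpath.join(archive_root, run_dir))
print(shlex.join(cmd))
```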

Actions #13

Updated by Dan Mick about 7 years ago

Would that have caused ESTALE?

Actions #14

Updated by John Spray about 7 years ago

Current status of lab cluster is:

  • Fixed the "missing dirfrag object" damage with a script that removed the offending omap entries.
  • Ran scrub_path on teuthology-archive/ to fix the stats, which threw up a load of (imho bogus) "bad backtrace" damage (http://tracker.ceph.com/issues/18743)
  • Restarted the MDS and did not run scrub again, so damage table currently empty
  • If someone runs scrub again on /teuthology-archive, they're liable to see a load of "bad backtrace" damage again; worry only if you see other types of damage.
  • Let's install an updated ceph-mds as soon as http://tracker.ceph.com/issues/18743 is fixed and backported to kraken.
Actions #15

Updated by Patrick Donnelly about 6 years ago

  • Subject changed from Forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able) to mds: forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able)
  • Target version changed from v12.0.0 to v13.0.0
  • Source set to Development
  • Tags set to scrub
  • Backport changed from jewel kraken to jewel,luminous
Actions #16

Updated by Patrick Donnelly almost 6 years ago

  • Priority changed from Normal to High
  • Target version changed from v13.0.0 to v14.0.0
  • Backport changed from jewel,luminous to mimic,luminous
Actions #17

Updated by Patrick Donnelly over 5 years ago

  • Status changed from 4 to New
Actions #18

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #19

Updated by Patrick Donnelly about 5 years ago

  • Target version deleted (v15.0.0)
Actions #20

Updated by Patrick Donnelly about 4 years ago

  • Tags deleted (scrub)
  • Backport deleted (mimic,luminous)
  • Labels (FS) scrub added