Bug #18532
open
mds: forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able)
0%
Description
Somehow a path in the long-running cluster got a corrupted number of files/subdirs, and responds to "rm -rf" with "cannot remove, directory not empty". There are no visible files in the directory (and find -type f | xargs rm had been done successfully). scrub_path repair notices but does not resolve the error. Also, its rstats are similarly corrupt.
ls -ld /a/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi/541839/remote/smithi059/log
drwxrwxr-x 1 teuthworker teuthworker 18446744073315575061 Jan 11 01:05 /a/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi/541839/remote/smithi059/log
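The size shown by ls, and the dirstat/rstat counters in the daemon dump in this description, all sit just below 2^64. A minimal sketch that reinterprets them as signed 64-bit integers is below. That the counters were decremented below zero and wrapped is an assumption; the ticket does not state the cause. The helper name as_signed64 is made up for illustration.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


# Values copied from the ls output and the daemon dump in this ticket.
reported = {
    "dirstat.num_files": 18446744073709550716,
    "dirstat.num_subdirs": 18446744073709551615,
    "rstat.rbytes": 18446744073315575061,
    "rstat.rfiles": 18446744073709550711,
}

signed = {name: as_signed64(value) for name, value in reported.items()}

for name, value in signed.items():
    print(f"{name}: {value}")
```

Read this way, num_subdirs is -1 and num_files is -900, which would be consistent with "directory not empty" on a directory that has no dentries.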
excerpt from daemon dump tree:
{ "ino": 1100004785907, "rdev": 0, "ctime": "2017-01-11 01:05:34.995307", "btime": "0.000000", "mode": 16893, "uid": 1001, "gid": 1001, "nlink": 1, "dir_layout": { "dir_hash": 2 }, "layout": { "stripe_unit": 0, "stripe_count": 0, "object_size": 0, "pool_id": -1, "pool_ns": "" }, "old_pools": [], "size": 0, "truncate_seq": 1, "truncate_size": 18446744073709551615, "truncate_from": 0, "truncate_pending": 0, "mtime": "2017-01-11 01:05:34.995307", "atime": "2016-11-12 10:08:48.440278", "time_warp_seq": 0, "change_attr": 9699, "client_ranges": [], "dirstat": { "version": 1, "mtime": "2017-01-11 01:05:34.995307", "num_files": 18446744073709550716, "num_subdirs": 18446744073709551615 }, "rstat": { "version": 9, "rbytes": 18446744073315575061, "rfiles": 18446744073709550711, "rsubdirs": 0, "rsnaprealms": 0, "rctime": "2017-01-11 01:05:34.995307" }, "accounted_rstat": { "version": 9, "rbytes": 18446744073315575061, "rfiles": 18446744073709550711, "rsubdirs": 0, "rsnaprealms": 0, "rctime": "2017-01-11 01:05:34.995307" }, "version": 56343, "file_data_version": 0, "xattr_version": 1, "backtrace_version": 2, "stray_prior_path": "", "symlink": "", "old_inodes": [], "dirfragtree": { "splits": [] }, "is_auth": true, "auth_state": { "replicas": {} }, "replica_state": { "authority": [ 0, -2 ], "replica_nonce": 0 }, "auth_pins": 0, "nested_auth_pins": 0, "is_frozen": false, "is_freezing": false, "pins": { "request": 0, "lock": 0, "dirfrag": 0, "caps": 1, "scrubqueue": 0, "authpin": 0 }, "nref": 1, "versionlock": { "gather_set": [], "num_client_lease": 0, "num_rdlocks": 0, "num_wrlocks": 0, "num_xlocks": 0, "xlock_by": {} }, "authlock": {}, "linklock": {}, "dirfragtreelock": {}, "filelock": { "gather_set": [], "num_client_lease": 0, "num_rdlocks": 0, "num_wrlocks": 0, "num_xlocks": 0, "xlock_by": {} }, "xattrlock": {}, "snaplock": {}, "nestlock": { "gather_set": [], "num_client_lease": 0, "num_rdlocks": 0, "num_wrlocks": 0, "num_xlocks": 0, "xlock_by": {} }, "flocklock": {}, 
"policylock": {}, "states": [ "auth" ], "client_caps": [ { "client_id": 25937231, "pending": "pAsLsXsFsx", "issued": "pAsLsXsFsx", "wanted": "-", "last_sent": "Asx" } ], "loner": 25937231, "want_loner": 25937231, "mds_caps_wanted": [], "dirfrags": [ { "path": "\/teuthology-archive\/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi\/541839\/remote\/smithi059\/log", "dirfrag": "1001d64fef3", "snapid_first": 2, "projected_version": "84312", "version": "84312", "committing_version": "84312", "committed_version": "84312", "is_rep": false, "dir_auth": "", "states": [ "auth", "complete" ], "is_auth": true, "auth_state": { "replicas": {} }, "replica_state": { "authority": [ 0, -2 ], "replica_nonce": 0 }, "auth_pins": 0, "nested_auth_pins": 0, "is_frozen": false, "is_freezing": false, "pins": { "waiter": 0, "authpin": 0 }, "nref": 0, "dentries": [] } ] }
Updated by John Spray over 7 years ago
- Subject changed from subdir with corrupted dirstat is un-rm-able to Forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able)
- Category set to fsck/damage handling
- Target version set to v12.0.0
- Backport set to jewel kraken
- Component(FS) MDS added
Without having looked into this in detail yet, my presumption would be that the bug is that the repair code isn't fixing the stats -- I think the refusal to delete is probably not a bug in itself.
Updated by Zheng Yan over 7 years ago
maybe there is a bad remote link in the directory
Updated by Dan Mick about 7 years ago
I don't know how to repair this or even identify other instances.
Updated by Zheng Yan about 7 years ago
Fixed by:
ceph daemon mds.mira049 scrub_path /teuthology-archive/sage-2016-11-12_02:26:45-rados-wip-sage-testing---basic-smithi/541839/remote/smithi059/log repair recursive force
Updated by Zheng Yan about 7 years ago
"ceph daemon mds.mira049 scrub_path / repair recursive force" will find and fix any other issues. But it will take a very long time; I don't know if it's worth the effort.
Updated by Dan Mick about 7 years ago
I would have sworn Greg directed me to try that, but perhaps we didn't include 'force'. Shrug. Thanks for the help.
I have a multi-day 'find' command looking for huge directories that I'll let continue; it's found a few more so far:
teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi
teuthology-2016-12-21_10:00:22-rbd-jewel-distro-basic-smithi
although, you know, actually, it occurs to me, I only need to check the toplevel directories.
I will scrub the bad ones when I get a full list.
Updated by Dan Mick about 7 years ago
Duh, ls was fine:
ls -ld * | sort -n -k 5
drwxrwxr-x 1 1001 1001 18446744057908416832 Jan 23 02:23 teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi
drwxrwxr-x 1 1001 1001 18446744073298154785 Jan 23 04:49 teuthology-2016-12-19_19:25:02-upgrade:hammer-jewel-x-kraken-distro-basic-vps
drwxrwxr-x 1 1001 1001 18446744073703811907 Jan 23 00:30 teuthology-2016-12-21_10:00:22-rbd-jewel-distro-basic-smithi
scrubbing those now.
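The same top-level check can be sketched in Python instead of sorting ls output. This is only an illustration: the function name oversized_dirs and the 2^63 threshold are assumptions, not anything from the MDS or the lab tooling. Python usually exposes st_size as a signed value, so a wrapped size may show up as negative rather than as a huge positive number; the sketch flags both.

```python
import os


def oversized_dirs(root, threshold=2**63):
    """Return (name, size) for immediate subdirectories of root whose
    reported size is negative or at least threshold, sorted by name."""
    hits = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            size = entry.stat(follow_symlinks=False).st_size
            # A wrapped counter may appear as a negative size or as a
            # value at or above the (assumed) threshold.
            if size < 0 or size >= threshold:
                hits.append((entry.name, size))
    return sorted(hits)
```

Calling oversized_dirs on the archive root would list candidates for scrub_path; on a healthy tree it returns an empty list.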
Updated by Dan Mick about 7 years ago
# ceph daemon mds.mira049 scrub_path teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi repair recursive force
{ "return_code": -116 }
Updated by Dan Mick about 7 years ago
They all return ESTALE. Not sure what else I need to be doing
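For reference, the return_code above is a negated errno value, and 116 is ESTALE on Linux. A small sketch for decoding such a code follows; describe_return_code is a hypothetical helper, and the numeric value of ESTALE can differ on other platforms.

```python
import errno
import os


def describe_return_code(rc: int) -> str:
    """Map a negative return_code to its errno name and message."""
    if rc >= 0:
        return f"{rc} (not an error code)"
    name = errno.errorcode.get(-rc, "unknown errno")
    return f"{name}: {os.strerror(-rc)}"


# On Linux this prints a line starting with "ESTALE".
print(describe_return_code(-116))
```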
Updated by Dan Mick about 7 years ago
Tried them again tonight after repairing the broken stray object, and they worked this time. <shrug>
I guess the answer should have been scrub_path, but I don't know why sometimes it returns ESTALE.
Updated by Zheng Yan about 7 years ago
Dan Mick wrote:
[...]
teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi is not in root directory, it's in teuthology-archive
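One way to avoid passing a path that is not rooted at the filesystem root is to build the command from an absolute path and reject anything else. A minimal sketch, assuming the command form used earlier in this ticket; scrub_repair_argv is a made-up helper, and whether the relative path was the actual cause of the ESTALE is not confirmed here.

```python
import posixpath


def scrub_repair_argv(mds_name, path):
    """Build the argv for a forced recursive scrub repair of path.

    The path must be absolute within the filesystem, for example
    /teuthology-archive/<run name>.
    """
    if not posixpath.isabs(path):
        raise ValueError(f"path must be absolute within the filesystem: {path!r}")
    return [
        "ceph", "daemon", f"mds.{mds_name}",
        "scrub_path", path, "repair", "recursive", "force",
    ]


run = "teuthology-2016-12-18_02:01:14-rbd-master-distro-basic-smithi"
argv = scrub_repair_argv("mira049", posixpath.join("/teuthology-archive", run))
print(" ".join(argv))
```

The sketch only builds and prints the command; it does not run it.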
Updated by John Spray about 7 years ago
Current status of lab cluster is:
- Fixed the "missing dirfrag object" damage with a script that removed the offending omap entries.
- scrub_path on teuthology-archive/ to fix the stats, which threw up a load of (imho bogus) "bad backtrace" damage (http://tracker.ceph.com/issues/18743)
- Restarted the MDS and did not run scrub again, so damage table currently empty
- If someone runs scrub again on /teuthology-archive, they're liable to see a load of "bad backtrace" damage again. If you see other types of damage, then worry.
- Let's install an updated ceph-mds as soon as http://tracker.ceph.com/issues/18743 is fixed and backported to kraken.
Updated by Patrick Donnelly about 6 years ago
- Subject changed from Forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able) to mds: forward scrub failing to repair dir stats (was: subdir with corrupted dirstat is un-rm-able)
- Target version changed from v12.0.0 to v13.0.0
- Source set to Development
- Tags set to scrub
- Backport changed from jewel kraken to jewel,luminous
Updated by Patrick Donnelly almost 6 years ago
- Priority changed from Normal to High
- Target version changed from v13.0.0 to v14.0.0
- Backport changed from jewel,luminous to mimic,luminous
Updated by Patrick Donnelly about 5 years ago
- Target version changed from v14.0.0 to v15.0.0
Updated by Patrick Donnelly about 4 years ago
- Tags deleted (scrub)
- Backport deleted (mimic,luminous)
- Labels (FS) scrub added