Bug #43543

mds: scrub on directory with recently created files may fail to load backtraces and report damage

Added by Patrick Donnelly about 1 month ago. Updated 29 days ago.

Status: Triaged
Priority: High
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): scrub
Pull request ID:
Crash signature:
Description

On a vstart cluster, copy a directory tree into CephFS and do a recursive scrub concurrently:

$ bin/ceph tell mds.a scrub start / recursive
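
Concretely, the reproduction looks like this (the mount point and source tree are examples; any freshly created directory tree works):

$ bin/ceph-fuse /mnt/cephfs
$ cp -a /usr/bin /mnt/cephfs/bin &
$ bin/ceph tell mds.a scrub start / recursive

The key point is that the scrub reaches the newly created inodes before their backtraces have been flushed from the journal to the metadata pool, so the on-disk backtrace read fails with -61 (ENODATA).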

Then in the MDS log:

2020-01-09T23:03:23.519+0000 7f88482f3700 10 MDSContext::complete: 19MDSIOContextWrapper
2020-01-09T23:03:23.519+0000 7f88482f3700 20 mds.0.cache.ino(0x1000000009f) ondisk_read_retval: -61
2020-01-09T23:03:23.519+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) scrub: inotable ino = 0x1000000009f
2020-01-09T23:03:23.519+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) scrub: inotable free says 0
2020-01-09T23:03:23.519+0000 7f88482f3700 10 MDSContext::complete: 16C_InodeValidated
2020-01-09T23:03:23.519+0000 7f88482f3700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x1000000009f (/bin/nl-neigh-delete) see mds.a log and `damage ls` output for details
2020-01-09T23:03:23.519+0000 7f88482f3700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x1000000009f [2,head] /bin/nl-neigh-delete auth v364 dirtyparent s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4305=0-4194304@1} caps={4305=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@2},l=4305 | request=0 caps=1 dirtyparent=1 scrubqueue=0 dirty=1 0x55fe1bf7f800]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//","memoryvalue":"(2)0x1000000009f:[<0x10000000000/nl-neigh-delete v364>,<0x1/bin v776>]//","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirrstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":-61}
2020-01-09T23:03:23.519+0000 7f88482f3700 20 mds.0.cache.ino(0x1000000009f) scrub_finished
2020-01-09T23:03:23.519+0000 7f88482f3700 20 mds.0.cache.dir(0x10000000000) scrub_dentry_finished on dn [dentry #0x1/bin/nl-neigh-delete [2,head] auth (dversion lock) v=364 ino=0x1000000009f state=1610612736 | request=0 lock=0 inodepin=1 scrubparent=1 dirty=1 authpin=0 0x55fe1bfa6900]

History

#1 Updated by Patrick Donnelly about 1 month ago

If you flush the journal:

2020-01-09T23:14:02.155+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) _stored_backtrace v 364
2020-01-09T23:14:02.155+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) auth_unpin by 0x55fe1bf7f800 on [inode 0x1000000009f [2,head] /bin/nl-neigh-delete auth v1570 dirtyparent s=13776 n(v0 rc2020-01-09T23:03:18.966105+0000 b13776 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 caps=0 dirtyparent=1 scrubqueue=0 dirty=1 authpin=0 0x55fe1bf7f800] now 0
2020-01-09T23:14:02.155+0000 7f88482f3700 15 mds.0.cache.dir(0x10000000000) adjust_nested_auth_pins -1 on [dir 0x10000000000 /bin/ [2,head] auth v=7257 cv=7257/0 ap=1+1602 state=1610612769|complete|committing f(v0 m2020-01-09T23:03:55.149261+0000 1793=1793+0) n(v0 rc2020-01-09T23:03:55.162878+0000 b270047911 1777=1777+0) hs=1793+0,ss=0+0 dirty=1793 | child=1 dirty=1 authpin=1 0x55fe1b94d700] by 0x55fe1bf7f800 count now 1/1602
2020-01-09T23:14:02.155+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) clear_dirty_parent

and then a new scrub completes without error.

2020-01-09T23:16:05.892+0000 7f88482f3700 10 MDSContext::complete: 19MDSIOContextWrapper
2020-01-09T23:16:05.892+0000 7f88482f3700 20 mds.0.cache.ino(0x1000000009f) ondisk_read_retval: 0
2020-01-09T23:16:05.892+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) decoded 100 bytes of backtrace successfully
2020-01-09T23:16:05.892+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) scrub: inotable ino = 0x1000000009f
2020-01-09T23:16:05.892+0000 7f88482f3700 10 mds.0.cache.ino(0x1000000009f) scrub: inotable free says 0
2020-01-09T23:16:05.892+0000 7f88482f3700 10 MDSContext::complete: 16C_InodeValidated
2020-01-09T23:16:05.892+0000 7f88482f3700 10 mds.0.scrubstack _validate_inode_done scrub passed on inode [inode 0x1000000009f [2,head] /bin/nl-neigh-delete auth v1570 s=13776 n(v0 rc2020-01-09T23:03:18.966105+0000 b13776 1=1+0) (iversion lock) 0x55fe1bf7f800]
2020-01-09T23:16:05.892+0000 7f88482f3700 20 mds.0.cache.ino(0x1000000009f) scrub_finished
2020-01-09T23:16:05.892+0000 7f88482f3700 20 mds.0.cache.dir(0x10000000000) scrub_dentry_finished on dn [dentry #0x1/bin/nl-neigh-delete [2,head] auth (dversion lock) v=1570 ino=0x1000000009f state=1073741824 | request=0 lock=0 inodepin=0 scrubparent=1 dirty=0 authpin=0 0x55fe1bfa6900]
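
For reference, the workaround above corresponds to something like the following (command spelling may vary by release; `ceph daemon mds.a flush journal` is the admin-socket equivalent):

$ bin/ceph tell mds.a flush journal
$ bin/ceph tell mds.a scrub start / recursive

Flushing the journal forces the dirty backtraces to be written to the metadata pool, so the subsequent scrub finds them on disk.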

#2 Updated by Zheng Yan about 1 month ago

This issue has existed since scrub was first implemented. It should be easy to fix: just skip the backtrace check when dirty_parent is set.
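
A minimal sketch of that idea (illustrative only, not the actual patch; the real check would live in CInode's disk-state validation in the MDS):

  // If the inode's backtrace is dirty, it has not been persisted yet,
  // so a missing or stale on-disk backtrace is expected rather than damage.
  if (in->is_dirty_parent()) {
    // skip (or soft-pass) the on-disk backtrace comparison
  }

`is_dirty_parent()` is the in-memory flag visible as `dirtyparent` in the inode dumps above.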

#3 Updated by Patrick Donnelly about 1 month ago

  • Status changed from New to Triaged
  • Assignee changed from Zheng Yan to Milind Changire
  • Priority changed from Urgent to Normal

#4 Updated by Patrick Donnelly 29 days ago

  • Priority changed from Normal to High
  • Target version changed from v15.0.0 to v16.0.0
  • Labels (FS) scrub added
