Bug #54557

scrub repair does not clear earlier damage health status

Added by Milind Changire 9 months ago. Updated 20 days ago.

Status: Fix Under Review
Priority: Normal
Category: fsck/damage handling
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): scrub, task(easy)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From Chris Palmer on the ceph-users@ceph.io mailing list ...

Reading this thread made me realise I had overlooked cephfs scrubbing, so I tried it on a small 16.2.7 cluster. The normal forward scrub showed nothing. However "ceph tell mds.0 scrub start ~mdsdir recursive" did find one backtrace error (putting the cluster into HEALTH_ERR). I then did a repair, which according to the log did rewrite the inode, and subsequent scrubs have not found the error again.

However the cluster health is still ERR, and the MDS still shows the damage:

ceph@xxxx1:~$ ceph tell mds.0 damage ls 
2022-03-12T18:42:01.609+0000 7f1b817fa700  0 client.173985213 ms_handle_reset on v2:192.168.80.121:6824/939134894
2022-03-12T18:42:01.625+0000 7f1b817fa700  0 client.173985219 ms_handle_reset on v2:192.168.80.121:6824/939134894
[
    {
        "damage_type": "backtrace",
        "id": 3308827822,
        "ino": 256,
        "path": "~mds0" 
    }
]

What are the right steps from here? Has the error actually been corrected but just needs clearing or is it still there?

In case it is relevant: there is one active and two standby MDS. The log is from the node currently hosting rank 0.
From the mds log:

2022-03-12T18:13:41.593+0000 7f61d30c1700  1 mds.xxxx1 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2022-03-12T18:13:41.593+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2022-03-12T18:13:41.593+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:41.593+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2022-03-12T18:13:41.601+0000 7f61cb0b1700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x100 (~mds0) see mds.xxxx1 log and `damage ls` output for details
2022-03-12T18:13:41.601+0000 7f61cb0b1700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x100 [...2,head] ~mds0/ auth v6798 ap=1 snaprealm=0x55d595484800 f(v0 10=0+10) n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)/n(v0 rc2019-10-29T10:52:34.302967+0000 11=0+11) (inest lock) (iversion lock) | dirtyscattered=0 lock=0 dirfrag=1 openingsnapparents=0 dirty=1 authpin=1 scrubqueue=0 0x55d595486000]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 10=0+10)","ondisk_value.rstat":"n(v0 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","memory_value.dirstat":"f(v0 10=0+10)","memory_value.rstat":"n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","error_str":""},"return_code":-61}
2022-03-12T18:13:41.601+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:45.317+0000 7f61cf8ba700  0 log_channel(cluster) log [INF] : scrub summary: idle

2022-03-12T18:13:52.881+0000 7f61d30c1700  1 mds.xxxx1 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive,repair]} (starting...)
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x100(~mds0), rewriting it
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : Scrub repaired inode 0x100 (~mds0)
2022-03-12T18:13:52.881+0000 7f61cb0b1700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x100 [...2,head] ~mds0/ auth v6798 ap=1 snaprealm=0x55d595484800 DIRTYPARENT f(v0 10=0+10) n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)/n(v0 rc2019-10-29T10:52:34.302967+0000 11=0+11) (inest lock) (iversion lock) | dirtyscattered=0 lock=0 dirfrag=1 openingsnapparents=0 dirtyparent=1 dirty=1 authpin=1 scrubqueue=0 0x55d595486000]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 10=0+10)","ondisk_value.rstat":"n(v0 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","memory_value.dirstat":"f(v0 10=0+10)","memory_value.rstat":"n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","error_str":""},"return_code":-61}
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:55.317+0000 7f61cf8ba700  0 log_channel(cluster) log [INF] : scrub summary: idle

2022-03-12T18:14:12.608+0000 7f61d30c1700  1 mds.xxxx1 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive,repair]} (starting...)
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:14:15.316+0000 7f61cf8ba700  0 log_channel(cluster) log [INF] : scrub summary: idle
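
For anyone reading these logs, the `_validate_inode_done` lines end with a JSON object that carries the actual validation result. The sketch below pulls that object out of a log line and summarises the backtrace check. The function names and the small errno table are illustrative assumptions, not part of Ceph. On Linux, errno 61 is ENODATA; reading that as "the on-disk backtrace attribute was absent" is an inference from the log, not something confirmed in this report.

```python
import json

# Linux errno values only (assumption: the MDS runs on Linux).
LINUX_ERRNO_NAMES = {2: "ENOENT", 61: "ENODATA"}


def parse_scrub_result(log_line: str) -> dict:
    """Return the JSON validation result embedded at the end of a scrub log line."""
    # The result object begins at the "performed_validation" key.
    start = log_line.index('{"performed_validation"')
    return json.loads(log_line[start:])


def describe_backtrace(result: dict) -> str:
    """Summarise the backtrace section of a validation result."""
    bt = result["backtrace"]
    if bt["passed"]:
        return "backtrace ok"
    ret = bt["read_ret_val"]
    name = LINUX_ERRNO_NAMES.get(-ret, "unknown")
    return f"backtrace failed: read_ret_val={ret} ({name}): {bt['error_str']}"


# Sample trimmed from the log above: the inode dump is omitted and only the
# backtrace and return_code fields of the JSON tail are kept.
sample = (
    "scrub error on inode [inode 0x100 (dump omitted)]: "
    '{"performed_validation":true,"passed_validation":false,'
    '"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,'
    '"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]",'
    '"error_str":"failed to read off disk; see retval"},'
    '"return_code":-61}'
)

print(describe_backtrace(parse_scrub_result(sample)))
# prints: backtrace failed: read_ret_val=-61 (ENODATA): failed to read off disk; see retval
```

Note that the repair pass at 18:13:52 logs the same failed result even though it also logs "Scrub repaired inode", so a summary like this reflects what the scrub read before the rewrite, not the state afterwards.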

History

#1 Updated by Chris Palmer 9 months ago

Milind asked me to try the following: after running "scrub repair" followed by a "scrub" that finds no issues, if "damage ls" still shows an error, run "damage rm" and then re-run "scrub" to see whether the system still reports damage.

The cluster I originally tried this on is now fine. However it has also appeared on two other clusters (both originally octopus, now pacific 16.2.7). One is a test cluster with 2 MDS (active/standby). The following results are from this test cluster. In essence the "rm" works, but after a restart the problem reappears:

- Starting position: node 01 active rank 0, node 02 standby, damage ls [], HEALTH_OK
- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

- ceph tell mds.0 scrub start ~mdsdir recursive repair

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

- ceph tell mds.0 damage rm 184292443

- damage ls [], HEALTH_OK

- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [], HEALTH_OK

============================================
- stop mds@node01

- mds@node02 now active rank 0, HEALTH_WARN (insufficient standbys)

- start mds@node01

- mds@node01 now standby, HEALTH_OK

- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

- ceph tell mds.0 scrub start ~mdsdir recursive repair

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

- ceph tell mds.0 damage rm 2358075647

- damage ls [], HEALTH_OK

- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [], HEALTH_OK

============================================
- stop mds@node02

- mds@node01 now active rank 0, HEALTH_WARN (insufficient standbys)

- start mds@node02

- mds@node02 now standby, HEALTH_OK
(Now back to the original starting point)

- ceph tell mds.0 scrub start ~mdsdir recursive

- damage ls [ "backtrace", "~mds0" ], HEALTH_ERR

(repeats.....)
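
The manual "damage ls" then "damage rm <id>" step above can be sketched as a small helper that reads the `damage ls` output and builds the matching `damage rm` commands. It only prints commands and does not run them. The function names are hypothetical; the command strings are the ones used in this report. As the sequence above shows, this only clears the damage table entry, and the entry came back after an MDS failover and a fresh scrub.

```python
import json


def parse_damage_ls(output: str) -> list:
    """Parse `damage ls` output, skipping log lines printed before the JSON array."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        # The JSON array starts at the first line beginning with "[".
        if line.lstrip().startswith("["):
            return json.loads("\n".join(lines[i:]))
    return []


def damage_rm_commands(entries, mds="mds.0", damage_type="backtrace"):
    """Build one `damage rm` command per damage entry of the given type."""
    return [
        f"ceph tell {mds} damage rm {entry['id']}"
        for entry in entries
        if entry.get("damage_type") == damage_type
    ]


# Output copied from the description, including the ms_handle_reset noise.
sample_output = """\
2022-03-12T18:42:01.609+0000 7f1b817fa700  0 client.173985213 ms_handle_reset on v2:192.168.80.121:6824/939134894
2022-03-12T18:42:01.625+0000 7f1b817fa700  0 client.173985219 ms_handle_reset on v2:192.168.80.121:6824/939134894
[
    {
        "damage_type": "backtrace",
        "id": 3308827822,
        "ino": 256,
        "path": "~mds0"
    }
]
"""

for cmd in damage_rm_commands(parse_damage_ls(sample_output)):
    print(cmd)
# prints: ceph tell mds.0 damage rm 3308827822
```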

#2 Updated by Venky Shankar 9 months ago

  • Category set to fsck/damage handling
  • Status changed from New to Triaged
  • Assignee set to Milind Changire
  • Target version set to v18.0.0
  • Source set to Community (dev)
  • Backport set to quincy, pacific
  • Labels (FS) task(easy) added

#3 Updated by Venky Shankar 2 months ago

  • Assignee changed from Milind Changire to Neeraj Pratap Singh

Neeraj, please take this one.

#4 Updated by Kotresh Hiremath Ravishankar about 2 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 48450

#5 Updated by Neeraj Pratap Singh 20 days ago

  • Pull request ID changed from 48450 to 48895
