Bug #52688

mds: possibly corrupted entry in journal (causes replay failure with file system marked as damaged)

Added by Venky Shankar over 2 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Replay failed after a standby took over as active; this marks the file system as damaged.

The journal entries for inode 0x10000002380 are as follows:

2021-08-25T18:28:37.257+0000 7f6b80584700 10 mds.0.log _replay 291969831360~2337 / 292187671447 2021-07-29T08:11:13.592092+0000: EUpdate mkdir [metablob 0x10000002380, 3 dirs]
...
...
...
2021-08-25T18:28:37.411+0000 7f6b80584700 10 mds.0.log _replay 291970795266~6487 / 292187671447 2021-07-29T08:11:17.416164+0000: EUpdate unlink_local [metablob 0x10000002380, 4 dirs]

During replay of the unlink_local(0x10000002380) log entry, the MDS trims the inodes in a subtree. As seen from the log, the inode/dentry is removed for [0x10000002380, 0x1/kvol3] (a simplified sketch of this trim follows the log excerpt):

2021-08-25T18:28:39.028+0000 7f6b80584700 12 mds.0.cache.dir(0x10000002380) unlink_inode [dentry #0x1/kvol3/level06 [28c,head] auth (dversion lock) v=0 ino=0x10001454b87 state=1073741824 0x55c2c911b680] [inode 0x10001454b87 [...28c,head] /kvol3/level06/ auth v5252 f(v0 m2021-07-12T10:51:36.103856+0000 101=100+1) n\
(v8 rc2021-07-12T10:51:47.351634+0000 b2599308 1010=1000+10) (iversion lock) 0x55c2c9117b80]
2021-08-25T18:28:39.028+0000 7f6b80584700 14 mds.0.cache remove_inode [inode 0x10001454b87 [...28c,head] #10001454b87/ auth v5252 f(v0 m2021-07-12T10:51:36.103856+0000 101=100+1) n(v8 rc2021-07-12T10:51:47.351634+0000 b2599308 1010=1000+10) (iversion lock) 0x55c2c9117b80]
2021-08-25T18:28:39.028+0000 7f6b80584700 12 mds.0.cache.dir(0x10000002380) remove_dentry [dentry #0x1/kvol3/level06 [28c,head] auth NULL (dversion lock) v=0 ino=(nil) state=1073741824 0x55c2c911b680]
2021-08-25T18:28:39.028+0000 7f6b80584700 14 mds.0.cache.ino(0x10000002380) close_dirfrag *
2021-08-25T18:28:39.028+0000 7f6b80584700 12 mds.0.cache.dir(0x10000002380) remove_null_dentries [dir 0x10000002380 /kvol3/ [2,head] rep@-2.0 state=536870912 f(v0 m2021-07-29T08:04:20.470393+0000 11=0+11) n(v32 rc2021-07-29T08:11:17.411054+0000 b25682446 10130=10001+129)/n(v32 rc2021-07-29T08:11:17.135059+0000 b25\
682454 10132=10002+130) hs=0+0,ss=0+0 | child=0 dirty=1 0x55c2c6363a80]
2021-08-25T18:28:39.028+0000 7f6b80584700 10 mds.0.cache.dir(0x10000002380) mark_clean [dir 0x10000002380 /kvol3/ [2,head] rep@-2.0 state=536870912 f(v0 m2021-07-29T08:04:20.470393+0000 11=0+11) n(v32 rc2021-07-29T08:11:17.411054+0000 b25682446 10130=10001+129)/n(v32 rc2021-07-29T08:11:17.135059+0000 b25682454 101\
32=10002+130) hs=0+0,ss=0+0 | child=0 dirty=1 0x55c2c6363a80] version 5271
2021-08-25T18:28:39.028+0000 7f6b80584700 20 mds.0.cache trim_non_auth_subtree(0x55c2bf032900) removing inode 0x55c2c1828580 with dentry0x55c2c1821400
2021-08-25T18:28:39.028+0000 7f6b80584700 12 mds.0.cache.dir(0x1) unlink_inode [dentry #0x1/kvol3 [232,head] auth (dversion lock) v=800649 ino=0x10000002380 state=1610612736 | inodepin=1 dirty=1 0x55c2c1821400] [inode 0x10000002380 [...840,head] /kvol3/ auth v800649 snaprealm=0x55c2bef06200 f(v0 m2021-07-29T08:04:\
20.470393+0000 11=0+11) n(v32 rc2021-07-29T08:11:17.135059+0000 b25682454 rs17 10133=10002+131) old_inodes=17 (inest sync dirty) (iversion lock) | dirtyscattered=1 dirfrag=0 dirty=1 0x55c2c1828580]
2021-08-25T18:28:39.028+0000 7f6b80584700 14 mds.0.cache remove_inode [inode 0x10000002380 [...840,head] #10000002380/ auth v800649 snaprealm=0x55c2bef06200 f(v0 m2021-07-29T08:04:20.470393+0000 11=0+11) n(v32 rc2021-07-29T08:11:17.135059+0000 b25682454 rs17 10133=10002+131) old_inodes=17 (inest sync dirty) (ivers\
ion lock) | dirtyscattered=1 dirfrag=0 dirty=1 0x55c2c1828580]
2021-08-25T18:28:39.028+0000 7f6b80584700 10 mds.0.cache.ino(0x10000002380) mark_clean [inode 0x10000002380 [...840,head] #10000002380/ auth v800649 snaprealm=0x55c2bef06200 f(v0 m2021-07-29T08:04:20.470393+0000 11=0+11) n(v32 rc2021-07-29T08:11:17.135059+0000 b25682454 rs17 10133=10002+131) old_inodes=17 (inest s\
ync dirty) (iversion lock) | dirtyscattered=1 dirfrag=0 dirty=1 0x55c2c1828580]
2021-08-25T18:28:39.028+0000 7f6b80584700 10 mds.0.cache.ino(0x10000002380) clear_dirty_scattered 128 on [inode 0x10000002380 [...840,head] #10000002380/ auth v800649 snaprealm=0x55c2bef06200 f(v0 m2021-07-29T08:04:20.470393+0000 11=0+11) n(v32 rc2021-07-29T08:11:17.135059+0000 b25682454 rs17 10133=10002+131) old_\
inodes=17 (iversion lock) 0x55c2c1828580]
2021-08-25T18:28:39.028+0000 7f6b80584700 15 mds.0.cache.ino(0x10000002380) close_snaprealm snaprealm(0x10000002380 seq 83d lc 83d cr 83d cps 232 snaps={80d=snap(80d 0x10000002380 'snapk1' 2021-07-28T15:53:09.093595+0000),812=snap(812 0x10000002380 'scheduled-2021-07-28-17_00_00' 2021-07-28T17:00:00.824060+0000),8\
15=snap(815 0x10000002380 'scheduled-2021-07-28-18_00_00' 2021-07-28T18:00:00.833468+0000),818=snap(818 0x10000002380 'scheduled-2021-07-28-19_00_00' 2021-07-28T19:00:00.843013+0000),81b=snap(81b 0x10000002380 'scheduled-2021-07-28-20_00_00' 2021-07-28T20:00:00.852568+0000),81e=snap(81e 0x10000002380 'scheduled-20\
21-07-28-21_00_00' 2021-07-28T21:00:00.862462+0000),821=snap(821 0x10000002380 'scheduled-2021-07-28-22_00_00' 2021-07-28T22:00:00.871611+0000),824=snap(824 0x10000002380 'scheduled-2021-07-28-23_00_00' 2021-07-28T23:00:00.881157+0000),827=snap(827 0x10000002380 'scheduled-2021-07-29-00_00_00' 2021-07-29T00:00:00.\
891256+0000),82a=snap(82a 0x10000002380 'scheduled-2021-07-29-01_00_00' 2021-07-29T01:00:00.900723+0000),82d=snap(82d 0x10000002380 'scheduled-2021-07-29-02_00_00' 2021-07-29T02:00:00.912899+0000),830=snap(830 0x10000002380 'scheduled-2021-07-29-03_00_00' 2021-07-29T03:00:00.923015+0000),833=snap(833 0x10000002380\
 'scheduled-2021-07-29-04_00_00' 2021-07-29T04:00:00.932550+0000),836=snap(836 0x10000002380 'scheduled-2021-07-29-05_00_00' 2021-07-29T05:00:00.941772+0000),839=snap(839 0x10000002380 'scheduled-2021-07-29-06_00_00' 2021-07-29T06:00:00.951627+0000),83c=snap(83c 0x10000002380 'scheduled-2021-07-29-07_00_01' 2021-0\
7-29T07:00:01.027067+0000),83d=snap(83d 0x10000002380 'scheduled-2021-07-29-08_00_00' 2021-07-29T08:00:00.075102+0000)} past_parent_snaps= 0x55c2bef06200)
2021-08-25T18:28:39.028+0000 7f6b80584700 12 mds.0.cache.dir(0x1) remove_dentry [dentry #0x1/kvol3 [232,head] auth NULL (dversion lock) v=800649 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x55c2c1821400]
2021-08-25T18:28:39.028+0000 7f6b80584700 10 mds.0.cache.den(0x1 kvol3) mark_clean [dentry #0x1/kvol3 [232,head] auth NULL (dversion lock) v=800649 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x55c2c1821400]
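The trim shown above can be pictured with a small self-contained model (hypothetical C++; the inode numbers are copied from the log, everything else is a simplification and not the actual MDCache code): dropping the subtree removes the directory inode 0x10000002380 itself from the in-memory cache, together with its children.

#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

// Toy stand-in for the MDS metadata cache: inode number -> parent inode number.
using Cache = std::unordered_map<uint64_t, uint64_t>;

// Hypothetical model of the subtree trim seen during unlink_local replay:
// the directory inode and everything beneath it are dropped from the cache.
void trim_subtree(Cache& cache, uint64_t root_ino) {
  std::unordered_set<uint64_t> doomed{root_ino};
  bool grew = true;
  while (grew) {                          // collect all descendants
    grew = false;
    for (const auto& [ino, parent] : cache)
      if (doomed.count(parent) && doomed.insert(ino).second)
        grew = true;
  }
  for (uint64_t ino : doomed)
    if (cache.erase(ino))                 // analogous to the remove_inode lines in the log
      std::printf("remove_inode 0x%" PRIx64 "\n", ino);
}

int main() {
  Cache cache = {
      {0x10000002380, 0x1},               // /kvol3
      {0x10001454b87, 0x10000002380},     // /kvol3/level06
  };
  trim_subtree(cache, 0x10000002380);
  // 0x10000002380 is now gone; a later journal entry that names it as a
  // dirlump can no longer find it.
  return 0;
}

Compiled and run, this prints remove_inode for both 0x10001454b87 and 0x10000002380, mirroring the remove_inode lines in the log; once the trim completes, the cache holds no entry for 0x10000002380.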

Later, there is another log entry for the same inode:

2021-08-25T18:28:40.054+0000 7f6b80584700 10 mds.0.log _replay 291970801773~2340 / 292187671447 2021-07-29T08:11:17.416559+0000: EUpdate mkdir [metablob 0x10000002380, 3 dirs]
2021-08-25T18:28:40.054+0000 7f6b83d8b700 20 mds.0.cache upkeep thread waiting interval 1.000000000s
2021-08-25T18:28:40.054+0000 7f6b80584700 10 mds.0.journal EMetaBlob.replay 3 dirlumps by client.2918134
2021-08-25T18:28:40.054+0000 7f6b80584700 10 mds.0.journal EMetaBlob.replay dir 0x10000002380
2021-08-25T18:28:40.054+0000 7f6b80584700  0 mds.0.journal EMetaBlob.replay missing dir ino  0x10000002380
2021-08-25T18:28:40.054+0000 7f6b80584700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)

However, the MDS has already trimmed the corresponding inode during the earlier unlink_local replay, so this entry cannot be replayed and the MDS marks itself as damaged.
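
The failing check can be sketched the same way (again hypothetical and heavily simplified, only loosely modeled on what the EMetaBlob.replay messages imply): replaying a dirlump requires its directory inode to still be present in the cache, and when it is not, the entry cannot be applied and the rank is marked damaged.

#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct Inode {};  // placeholder; the real cache entry carries far more state

// Hypothetical, simplified version of the check the log output implies:
// a dirlump can only be replayed if its directory inode is in the cache.
bool replay_dirlump(const std::unordered_map<uint64_t, Inode>& cache,
                    uint64_t dir_ino) {
  if (cache.find(dir_ino) == cache.end()) {
    // Corresponds to "EMetaBlob.replay missing dir ino 0x10000002380".
    std::fprintf(stderr, "missing dir ino 0x%" PRIx64 "\n", dir_ino);
    return false;
  }
  return true;  // ...otherwise the dentries under this dirfrag would be replayed...
}

int main() {
  std::unordered_map<uint64_t, Inode> cache;  // 0x10000002380 already trimmed
  if (!replay_dirlump(cache, 0x10000002380)) {
    // Corresponds to "failure replaying journal (EMetaBlob)"; the MDS then
    // marks the rank damaged rather than continue with inconsistent metadata.
    std::fprintf(stderr, "failure replaying journal (EMetaBlob)\n");
  }
  return 0;
}

With an empty cache this reproduces the two quoted log lines in spirit: the lookup for 0x10000002380 fails, replay aborts, and the rank is marked damaged instead of continuing with inconsistent metadata.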

This looks like a bug somewhere in the journaling code. It can be reproduced by failing an MDS during active I/O so that a standby takes over, but it does not happen every time.

#1 Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)