Bug #329

closed

mds: mislinked dentry found during journal replay

Added by Sage Weil over 13 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%


Description

A FIXME error is logged during replay when we encounter a dentry that is already linked and a journal entry tries to link it to something else. The question is how we got into that state in the first place.

To find the problem, we need full mds logs from when the entry was originally logged, all the way through the failed replay.

Wido has hit this a couple times now with an rsync of kernel.org. The mds needs to be restarted at some point to detect the replay issue.

Actions #1

Updated by Sage Weil over 13 years ago

  • Target version set to v0.21.1
Actions #2

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.21.1 to v0.21.2
Actions #3

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.21.2 to v0.21.3
Actions #4

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.21.3 to v0.21.4
Actions #5

Updated by Sage Weil over 13 years ago

This can come up with multiple MDSs. (Wido saw it with one MDS; not sure how that happened.)

With multiple MDSs, the situation can be something like:

- mds0: /a/b -> ino1
- export /a from mds0->mds1
- mds1: /a/b relinked to ino2
- export /a from mds1->mds0
- crash
- replay journal
- mds0 replay sees /a/b link to ino1, then ino2
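
The sequence above can be sketched as a toy replay loop. This is only an illustration of why the second link looks wrong to mds0; the event tuples and the replay() helper are hypothetical and are not taken from the MDS code. It assumes mds0's journal records the first link, the export, and later /a/b -> ino2 once /a is back on mds0, with no unlink of ino1 in between (that happened on mds1).

```python
# Toy model only: event shapes and names are hypothetical, not Ceph's MDS code.

def replay(journal):
    """Replay (kind, path, ino) events and report dentries relinked while still linked."""
    dentries = {}    # path -> ino currently linked in the replaying MDS's cache
    mislinked = []   # (path, old_ino, new_ino) for each conflicting link
    for kind, path, ino in journal:
        if kind != "link":
            continue  # in this model an export leaves stale cache content in place
        old = dentries.get(path)
        if old is not None and old != ino:
            mislinked.append((path, old, ino))
        dentries[path] = ino
    return mislinked

# What mds0 sees on replay: the relink to ino2 was done by mds1,
# so mds0's journal never unlinked ino1 from /a/b.
mds0_journal = [
    ("link", "/a/b", 1),
    ("export", "/a", None),
    ("link", "/a/b", 2),
]

print(replay(mds0_journal))
```

In this model the stale /a/b -> ino1 entry survives the export, so the later link to ino2 is flagged as a mislinked dentry.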
Actions #6

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.21.4 to v0.22

I suspect the solution (for the clustered case) is something like:

- trim the non-auth content of a subtree when we replay EExport, and when we disambiguate_imports and determine a subtree is non-auth.
- trim_non_auth() should now be a no-op, since any non-auth subtree has already been trimmed. Make it warn/assert if it finds any work to do.
- trim_unlinked_inodes() should also be a no-op (right?). Warn/assert if it's not.
- This should keep the current FIXME case from coming up, since we won't have any stale subtree content from prior periods of auth-ness.

?
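
A rough sketch of the idea, as a toy replay loop rather than the real MDS code: when an export event is replayed, everything cached under the exported subtree is dropped immediately, so a later link into that subtree never meets stale content. The event tuples and the replay_with_trim() helper are hypothetical names made up for this illustration.

```python
# Toy model only: event shapes and names are hypothetical, not Ceph's MDS code.

def replay_with_trim(journal):
    """Replay (kind, path, ino) events, trimming a subtree as soon as it is exported."""
    dentries = {}    # path -> ino currently linked in the replaying MDS's cache
    mislinked = []   # (path, old_ino, new_ino) for each conflicting link
    for kind, path, ino in journal:
        if kind == "export":
            # Drop every cached dentry at or below the exported subtree root.
            prefix = path.rstrip("/") + "/"
            stale = [p for p in dentries if p == path or p.startswith(prefix)]
            for p in stale:
                del dentries[p]
        elif kind == "link":
            old = dentries.get(path)
            if old is not None and old != ino:
                mislinked.append((path, old, ino))
            dentries[path] = ino
    return dentries, mislinked

journal = [
    ("link", "/a/b", 1),
    ("export", "/a", None),   # /a leaves this MDS; its cached content is trimmed
    ("link", "/a/b", 2),      # /a is back, with /a/b relinked elsewhere in the meantime
]

print(replay_with_trim(journal))
```

With the trim done at export time, the final pass has nothing left to clean up in this model, which is the condition the warn/assert in trim_non_auth() would check.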

Actions #7

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.22 to v0.23
Actions #8

Updated by Sage Weil over 13 years ago

  • Assignee set to Greg Farnum
Actions #9

Updated by Greg Farnum over 13 years ago

  • Status changed from New to Resolved

The multi-mds fix has been pushed to the mds_journal branch in commit:aa83e11c67165878e1ca1b0fe66ff9b8c3a906c8, and then merged into unstable.

Closing for now. If we get a single-MDS occurrence of the original problem we should probably open a new ticket.

Actions #10

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.23)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
