Bug #1041

closed

standby-replay fails on multi-mds fsstress journals

Added by Greg Farnum almost 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%


Description

Things break, figure out why.

Actions #1

Updated by Greg Farnum almost 13 years ago

  • Subject changed from standby-replay fails on mds journals to standby-replay fails on multi-mds fsstress journals
Actions #2

Updated by Greg Farnum almost 13 years ago

  • Assignee set to Greg Farnum
Actions #3

Updated by Greg Farnum almost 13 years ago

I've got a log in kai:~gregf/logs/fsstress/standby-replay

Actions #4

Updated by Greg Farnum almost 13 years ago

  • Status changed from New to In Progress

The problem is that the journal (for mds0) refers to mds1's stray directory. It's replaying a rename operation, where the srci is in mds1's stray dir but the srcdn is not. The inode was kept in the stray dir because when it got moved there, srcdn was on mds0. But it got exported to mds1, which makes me think that the inode shouldn't live in the stray dir any longer and that's the bug?
On the other hand I'm not sure what would happen if the srcdn was still on mds0 and the srci was still in mds1's stray dir. Maybe the journal should just be able to handle stray dirs on other MDSes (though Sage says it shouldn't).
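
As a rough illustration of the condition described above, the sketch below scans a toy journal for rename events whose source inode sits in a stray directory owned by a different rank than the journal's owner. This is not Ceph code: `RenameEvent`, its fields, and `foreign_stray_refs` are hypothetical names invented for this example, and the real journal format is considerably richer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RenameEvent:
    """Hypothetical, heavily simplified stand-in for a journaled rename."""
    journal_rank: int               # rank whose journal contains this event
    srci_stray_rank: Optional[int]  # rank whose stray dir holds srci, or None


def foreign_stray_refs(events: List[RenameEvent]) -> List[RenameEvent]:
    """Return events that reference a stray dir owned by another rank."""
    return [
        e for e in events
        if e.srci_stray_rank is not None and e.srci_stray_rank != e.journal_rank
    ]


# Toy journal for mds0: only the second event points at mds1's stray dir.
events = [
    RenameEvent(journal_rank=0, srci_stray_rank=0),
    RenameEvent(journal_rank=0, srci_stray_rank=1),
    RenameEvent(journal_rank=0, srci_stray_rank=None),
]
bad = foreign_stray_refs(events)
```

A check like this only detects the situation; it does not settle whether the inode should have left the stray dir on export or whether replay should tolerate foreign stray dirs.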

Actions #5

Updated by Sage Weil almost 13 years ago

  • Position set to 379
Actions #6

Updated by Sage Weil almost 13 years ago

  • Story points set to 3
  • Position deleted (380)
  • Position set to 380
Actions #7

Updated by Greg Farnum almost 13 years ago

Back from vacation, and I'm trying to remember what's still broken here. Looking through my logs:
1) mds 1 gets a request to rename, as it's auth on srcdn
2) srci is located on mds 0
3) mds 1 requests an auth pin from mds 0 for srci
4) mds 0 is now a slave for the op and journals extra crap that it's not auth for.

Similar but not identical to the previous cause, which we dealt with by fixing up some of our branching code.
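
The invariant that the sequence above violates can be sketched as a filter: a rank participating in a rename should journal only the pieces it is authoritative for. This is a toy model, not the actual logic in _rename_prepare; `RenamePart` and `parts_to_journal` are hypothetical names, and placing destdn on mds 1 is an assumption made only to complete the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RenamePart:
    """Hypothetical record of one object touched by a rename."""
    name: str  # e.g. "srcdn", "srci", "destdn"
    auth: int  # rank that is authoritative for this object


def parts_to_journal(parts: List[RenamePart], my_rank: int) -> List[str]:
    """Keep only the objects this rank is auth for."""
    return [p.name for p in parts if p.auth == my_rank]


# Scenario from the log: mds 1 is auth on srcdn, srci lives on mds 0.
# The destdn placement is an assumption for illustration.
parts = [
    RenamePart("srcdn", auth=1),
    RenamePart("srci", auth=0),
    RenamePart("destdn", auth=1),
]
mds0_parts = parts_to_journal(parts, my_rank=0)
mds1_parts = parts_to_journal(parts, my_rank=1)
```

Under this model mds 0, acting as slave, would journal only srci, while mds 1 would journal the two dentries.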

Actions #8

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.29 to v0.30
Actions #9

Updated by Sage Weil almost 13 years ago

  • Position deleted (390)
  • Position set to 7
Actions #10

Updated by Greg Farnum almost 13 years ago

  • Status changed from In Progress to 7

All right, I went over _rename_prepare pretty carefully and reworked a lot of the checks on journaling, and now I haven't seen a crash in a while. Running a few more tests with the next branch (and Sage's changes there) merged before I push.

Actions #11

Updated by Greg Farnum almost 13 years ago

  • Status changed from 7 to Resolved

Okay, after 3 or 4 more runs I've only seen #1128.

Actions #12

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.30)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
