Bug #1041: standby-replay fails on multi-mds fsstress journals - CephFS - Ceph

Bug #1041

closed

standby-replay fails on multi-mds fsstress journals

Added by Greg Farnum almost 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: Greg Farnum

Description

Things break, figure out why.

Updated by Greg Farnum almost 13 years ago

Subject changed from standby-replay fails on mds journals to standby-replay fails on multi-mds fsstress journals

Updated by Greg Farnum almost 13 years ago

Assignee set to Greg Farnum

Updated by Greg Farnum almost 13 years ago

I've got a log in kai:~gregf/logs/fsstress/standby-replay

Updated by Greg Farnum almost 13 years ago

Status changed from New to In Progress

The problem is that the journal (for mds0) refers to mds1's stray directory. It's replaying a rename operation, where the srci is in mds1's stray dir but the srcdn is not. The inode was kept in the stray dir because when it got moved there, srcdn was on mds0. But it got exported to mds1, which makes me think that the inode shouldn't live in the stray dir any longer and that's the bug?
On the other hand I'm not sure what would happen if the srcdn was still on mds0 and the srci was still in mds1's stray dir. Maybe the journal should just be able to handle stray dirs on other MDSes (though Sage says it shouldn't).
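
As a rough illustration of the failure described above, here is a toy Python model, not Ceph code. Every name in it (`JournalEvent`, `foreign_dirs`, `replay`, the `stray0`/`stray1` labels) is hypothetical. It only shows why a replayer working through one rank's journal trips over an event that references a directory it does not have, such as another rank's stray dir.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEvent:
    """Toy journal entry: an operation name plus the directories it touches."""
    op: str
    dirs: tuple


def foreign_dirs(event, known_dirs):
    """Return the directories the event references that the replayer cannot resolve."""
    return [d for d in event.dirs if d not in known_dirs]


def replay(journal, known_dirs):
    """Apply events in order; raise on the first one that references an unknown dir."""
    applied = []
    for event in journal:
        missing = foreign_dirs(event, known_dirs)
        if missing:
            raise LookupError(f"{event.op} references dirs not in cache: {missing}")
        applied.append(event.op)
    return applied


# Hypothetical journal for rank 0: the rename refers to rank 1's stray dir.
known = {"/a", "stray0"}
ok_event = JournalEvent("mkdir", ("/a",))
bad_event = JournalEvent("rename", ("stray1", "/a"))
```

In this sketch `replay([ok_event], known)` succeeds, while `replay([ok_event, bad_event], known)` raises because `stray1` is not among the directories the replayer knows about.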

Updated by Sage Weil almost 13 years ago

Story points set to 3

Updated by Greg Farnum almost 13 years ago

Back from vacation, and I'm trying to remember what's still broken here. Looking through my logs:
1) mds 1 gets the request to rename, as it's auth on srcdn
2) srci is located on mds 0
3) mds 1 requests an auth pin from mds 0 for srci
4) mds 0 is now a slave for the op and journals extra crap that it's not auth for.

Similar but not identical to the previous cause, which we dealt with by fixing up some of our branching code.
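
The four steps above can be sketched as a toy model. This is not Ceph's actual logic: the function name `items_to_journal`, the `auth_of` mapping, and the `destdn` entry with its assumed authority are all hypothetical, chosen only to show the idea that a rank taking part in an operation would journal just the items it is auth for.

```python
def items_to_journal(rank, touched, auth_of):
    """Keep only the metadata items that this MDS rank is authoritative for.

    rank    -- rank of the MDS writing the journal entry
    touched -- names of the metadata items involved in the operation
    auth_of -- mapping from item name to the rank that is auth for it
    """
    return [item for item in touched if auth_of[item] == rank]


# Hypothetical rename matching the steps above: mds 1 is auth for srcdn,
# mds 0 is auth for srci and takes part as a slave.
# The destdn entry and its authority are assumptions made for illustration.
auth_of = {"srcdn": 1, "srci": 0, "destdn": 1}
touched = ["srcdn", "srci", "destdn"]

slave_entries = items_to_journal(0, touched, auth_of)   # what mds 0 would journal
master_entries = items_to_journal(1, touched, auth_of)  # what mds 1 would journal
```

Under these assumptions mds 0 would journal only `srci`, so its journal would no longer carry items owned by mds 1.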

Updated by Sage Weil almost 13 years ago

Target version changed from v0.29 to v0.30

#10

Updated by Greg Farnum almost 13 years ago

Status changed from In Progress to 7

All right, I went over _rename_prepare pretty carefully and reworked a lot of the checks on journaling, and now I haven't seen a crash in a while. I'm running a few more tests with the next branch (and Sage's changes there) merged before I push.

#11

Updated by Greg Farnum almost 13 years ago

Status changed from 7 to Resolved

Okay, after 3 or 4 more runs I've only seen #1128.

#12

Updated by John Spray over 7 years ago

Project changed from Ceph to CephFS
Category deleted (1)
Target version deleted (~~v0.30~~)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
