Project

General

Profile

Bug #3997

xfs: insert memory barriers before wake_up_bit()

Added by Alex Elder about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
xfs
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I looked at this briefly last week and found what could explain
a hang on an osd node due to a bug in XFS. I ran it by Dave
Chinner, who indicated it was plausible, so after some review
of some stuff about memory barriers and the code involving
waiting on bit locks, I concluded this hunch was probably
at the very least a good thing to fix, and may likely explain
the problem.

I have no way to reproduce this, however, so I'm just going to
post the fix, and see what others in XFS land think.

Oh, and it's possible there's another tracker issue open
for this but I couldn't find it...

History

#1 Updated by Alex Elder about 11 years ago

I have posted two patches to the XFS mailing list for review.
I am also waiting for a build to complete before doing some
testing of the result.

The first patch addresses the specific problem I identified,
a process (or more than one) waiting on the flush lock for an
inode. The second patch addresses the identical issue, but
related to a separate flag bit (the IPINNED bit, which is used
to indicate an inode is pinned in memory until a transaction
referencing it completes).

I am hoping that this second one addresses a problem I noticed
recently involving processes blocked during directory creation
during one of the tests in the xfstests suite. I'll have to
track down which test that is so I can confirm the fix.

#2 Updated by Alex Elder about 11 years ago

Sorry, I meant to include these in the last one:

[PATCH 1/2] xfs: memory barrier before wake_up_bit()
[PATCH 2/2] xfs: another memory barrier before wake_up_bit()

#3 Updated by Alex Elder about 11 years ago

And here is something Sage provided that led me to believe
this could be the source of the problem. I'm not sure how
long before this expires, but I'll record it here anyway.

http://pastebin.com/TuR09b8A

#4 Updated by Alex Elder about 11 years ago

  • Status changed from In Progress to Fix Under Review

The first patch was ACK'd by Dave Chinner.
The second one he explained wasn't needed,
because an atomic increment already implies
the needed barrier operation.

At this point we're waiting for Ben Meyers to
add them to XFS upstream code.

#5 Updated by Alex Elder about 11 years ago

  • Category set to xfs

Ben has committed my fix to the upstream XFS tree.
I'm not sure when it will hit Linus' tree, but I
think we can call this done from our point of view.

Should I mark it resolved, or should I wait until
it's really upstream?

#6 Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved

Our work is done! Thanks!

Also available in: Atom PDF