I looked at this on Tuesday, and sent a note to Sage that should
have instead been put here. Here it is.
I was looking through the system log and found a path through
XFS where things were blocking, waiting for space in the
log. I sent a note to Ben Meyers (my successor at SGI) to ask
whether there were any recent reports of anything like that.
However I just looked a bit further back in the log and I see
several of these:
[ 3752.862107] XFS (rbd2): Ending clean mount
[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3764.980290] XFS (rbd2): xfs_do_force_shutdown(0x1) called from line 1020 of file /srv/autobuild-ceph/gitbuilder.git/build/fs/xfs/xfs_buf.c. Return address = 0xffffffffa0317e34
[ 3765.076044] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3765.102599] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3767.485684] XFS (rbd2): Mounting Filesystem
So this means EIO got returned by xfs_log_force(). That's the
function that syncs all queued activity to the log, which should
result in a log representing a completed set of committed changes
as of a particular point in time--and a consistent file system.
I don't know the exact, complete consequences of this but I can
tell you it's not good and is likely directly related to the
subsequent problems (somehow).
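For anyone who wants to check the same thing in their own logs, here is a
rough sketch that pulls the "error N returned" lines out of a kernel log and
translates N into its errno name. The sample lines are copied from the excerpt
above; the regular expression and the helper name decode_xfs_errors are my own
assumptions for illustration, not anything XFS provides.

```python
import errno
import re

# Sample lines copied from the log excerpt above.
LOG_LINES = [
    "[ 3752.862107] XFS (rbd2): Ending clean mount",
    "[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.",
    "[ 3765.076044] XFS (rbd2): xfs_log_force: error 5 returned.",
    "[ 3765.102599] XFS (rbd2): xfs_log_force: error 5 returned.",
    "[ 3767.485684] XFS (rbd2): Mounting Filesystem",
]

# Assumed message shape: "[ <ts>] XFS (<dev>): <func>: error <n> returned."
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+XFS \((?P<dev>[^)]+)\): "
    r"(?P<func>\w+): error (?P<code>\d+) returned\."
)


def decode_xfs_errors(lines):
    """Return (timestamp, device, function, errno name) per matching line."""
    results = []
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # not an "error N returned" message
        code = int(m.group("code"))
        # Look the number up in the errno table; 5 is EIO.
        name = errno.errorcode.get(code, "UNKNOWN")
        results.append(
            (float(m.group("ts")), m.group("dev"), m.group("func"), name)
        )
    return results


for ts, dev, func, name in decode_xfs_errors(LOG_LINES):
    print(f"{ts:.6f} {dev} {func} -> {name}")
```

Run against the excerpt, every matching line decodes to EIO from
xfs_log_force on rbd2.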
You had asked me a month or two ago about what could cause EIO
to get returned from the block layer, and I think this may be
connected to the source of those questions.
I'll let you know if I hear anything back from Ben.
========
I did hear back from Ben, and here's what he said:
These log hangs are an ongoing saga. There was one fix that went in here:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=76e8f1386673b864cfca3c24c4d5814740e76465
You can look for stale inodes on the ail to see if that will help you.
Otherwise, Mark and I have been discussing the issue on and off for a while, and
there are a few workarounds running around on the list. There are a few
different failure modes. Looks like you have a dummy transaction blocked on
reserving log space, this is blocking the ail push in the sync_worker which
would probably free up enough log space to get things progressing. I posted a
workaround for this to the list.
http://oss.sgi.com/archives/xfs/2012-05/msg00312.html
One of the challenges with this bug is that we don't have a single place with
all of the accounting for who has reserved the space.