Bug #2608
closed
rbd: hung xfstest 270
Description
Logs are available in ubuntu@teuthology:/a/teuthology-2012-06-19_00:00:09-regression-next-testing-basic/1792
2012-06-19T01:42:16.779 INFO:teuthology.orchestra.run.out:223 29s
2012-06-19T01:42:31.146 INFO:teuthology.orchestra.run.out:224 11s
2012-06-19T01:43:06.704 INFO:teuthology.orchestra.run.out:225 34s
2012-06-19T01:44:29.369 INFO:teuthology.orchestra.run.out:226 81s
2012-06-19T01:44:49.369 INFO:teuthology.orchestra.run.out:234 17s
2012-06-19T01:44:56.139 INFO:teuthology.orchestra.run.out:238 4s
2012-06-19T01:45:10.869 INFO:teuthology.orchestra.run.out:244 13s
2012-06-19T01:45:20.349 INFO:teuthology.orchestra.run.out:253 8s
2012-06-19T01:45:26.999 INFO:teuthology.orchestra.run.out:262 2s
2012-06-19T01:48:51.756 INFO:teuthology.orchestra.run.out:269 202s
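To find where a run stalled, it can help to pull the last completed test out of the teuthology output. The following is only a sketch: it assumes every completed test appears as a line of the form shown above (three-digit test number followed by a duration in seconds), and the helper names are hypothetical, not part of teuthology.

```python
import re

# Assumed line shape, taken from the excerpt above:
# "2012-06-19T01:42:16.779 INFO:teuthology.orchestra.run.out:223 29s"
LINE_RE = re.compile(
    r"^(?P<stamp>\S+) INFO:teuthology\.orchestra\.run\.out:"
    r"(?P<test>\d{3}) (?P<secs>\d+)s$"
)


def completed_tests(lines):
    """Return (test_number, seconds) for each line that reports a finished test."""
    results = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            results.append((match.group("test"), int(match.group("secs"))))
    return results


def last_completed(lines):
    """Return the last finished test, or None if no line matched."""
    results = completed_tests(lines)
    return results[-1] if results else None


SAMPLE = [
    "2012-06-19T01:45:20.349 INFO:teuthology.orchestra.run.out:253 8s",
    "2012-06-19T01:45:26.999 INFO:teuthology.orchestra.run.out:262 2s",
    "2012-06-19T01:48:51.756 INFO:teuthology.orchestra.run.out:269 202s",
]
```

Applied to the excerpt above, the last completed test is 269, so the hang occurred in whatever ran next (test 270, per the title of this bug).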
Updated by Sage Weil almost 12 years ago
- Status changed from New to 12
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-06-23_00:00:02-regression-next-testing-basic/1471
machines are still hung. kdb is enabled.
Updated by Alex Elder almost 12 years ago
I looked at this on Tuesday, and sent a note to Sage that should
have instead been put here. Here it is.
I was looking through the system log and found a path through
XFS where things were blocking on XFS waiting for space on the
log. I sent a note to Ben Meyers (my successor at SGI) to ask
whether there were any recent reports of anything like that.
However I just looked a bit further back in the log and I see
several of these:
[ 3752.862107] XFS (rbd2): Ending clean mount
[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3764.980290] XFS (rbd2): xfs_do_force_shutdown(0x1) called from line 1020 of file /srv/autobuild-ceph/gitbuilder.git/build/fs/xfs/xfs_buf.c. Return address = 0xffffffffa0317e34
[ 3765.076044] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3765.102599] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3767.485684] XFS (rbd2): Mounting Filesystem
So this means EIO got returned by xfs_log_force(). That's the
function that syncs all queued activity to the log, which should
result in a log representing a completed set of committed changes
as of a particular point in time--and a consistent file system.
I don't know the exact, complete consequences of this but I can
tell you it's not good and is likely directly related to the
subsequent problems (somehow).
You had asked me a month or two ago about what could cause EIO
to get returned from the block layer, and I think this may be
connected to the source of those questions.
I'll let you know if I hear anything back from Ben.
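For reference, the "error 5" in the log excerpt above is errno 5, which is EIO on Linux. A small sketch for scanning a saved kernel log for these messages and naming the errno follows; the regular expression is an assumption based only on the message format quoted here, and the function name is hypothetical.

```python
import errno
import re

# Assumed message shape, from the excerpt above:
# "[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned."
XFS_ERR_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): (?P<func>\w+): error (?P<code>\d+) returned"
)


def xfs_errors(lines):
    """Return (device, function, errno_name) for each matching log line."""
    found = []
    for line in lines:
        match = XFS_ERR_RE.search(line)
        if match:
            code = int(match.group("code"))
            # errno.errorcode maps numeric codes to symbolic names.
            name = errno.errorcode.get(code, "UNKNOWN")
            found.append((match.group("dev"), match.group("func"), name))
    return found
```

Counting the results per device gives a quick view of how often the log force failed before the forced shutdown.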
========
I did hear back from Ben, and here's what he said:
These log hangs are an ongoing saga. There was one fix that went in here:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=76e8f1386673b864cfca3c24c4d5814740e76465
You can look for stale inodes on the ail to see if that will help you.
Otherwise, Mark and I have been discussing the issue on and off for a while, and
there are a few workarounds running around on the list. There are a few
different failure modes. Looks like you have a dummy transaction blocked on
reserving log space, this is blocking the ail push in the sync_worker which
would probably free up enough log space to get things progressing. I posted a
workaround for this to the list.
http://oss.sgi.com/archives/xfs/2012-05/msg00312.html
One of the challenges with this bug is that we don't have a single place with
all of the accounting for who has reserved the space.
Updated by Alex Elder almost 12 years ago
Just to summarize what I just added...
There are some recent XFS problems that might explain this,
irrespective of RBD or RADOS. I think we should wait and
see if the XFS problems settle down before investing any
time troubleshooting this problem.
We could leave this bug open as a place holder for any
future occurrences. (Or close it, I don't know what the
convention is on this sort of thing.)
Updated by Alex Elder over 11 years ago
We should re-evaluate this with XFS found in newer kernels.
Maybe this should just be closed and re-opened (or open
a new one) if we get a similar report with newer code.
Updated by Alex Elder about 11 years ago
TODO: Try xfstests #270 on recent kernel (current testing
should be fine).
Updated by Alex Elder about 11 years ago
Test 270 now doesn't run because:
270 [not run] fsgqa user not defined.
There are a few tests that require a distinct user
in the passwd file. I've been meaning to go back
and make these get added to the run_xfstests.sh script
at some point. Now, apparently, in order to run
test 270 this is going to be a requirement.
So that is, I guess, a prerequisite task for getting
test 270 back again. I created this to cover doing
that:
http://tracker.ceph.com/issues/4605
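A quick way to check the prerequisite before a run is to look the user up in the password database. This is only a sketch of the check, not what run_xfstests.sh does; the function name is hypothetical, and "fsgqa" is the user name from the "not run" message above.

```python
import pwd


def user_exists(name):
    """Return True if the named user is present in the password database."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        return False
    return True


if __name__ == "__main__":
    # "fsgqa" is the user that test 270 reported as undefined.
    if user_exists("fsgqa"):
        print("fsgqa is defined")
    else:
        print("fsgqa is missing; tests that require it will not run")
```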