Bug #2608


rbd: hung xfstest 270

Added by Tamilarasi muthamizhan almost 12 years ago. Updated over 10 years ago.



Logs are available in ubuntu@teuthology:/a/teuthology-2012-06-19_00:00:09-regression-next-testing-basic/1792

2012-06-19T01:42:16.779 29s
2012-06-19T01:42:31.146 11s
2012-06-19T01:43:06.704 34s
2012-06-19T01:44:29.369 81s
2012-06-19T01:44:49.369 17s
2012-06-19T01:44:56.139 4s
2012-06-19T01:45:10.869 13s
2012-06-19T01:45:20.349 8s
2012-06-19T01:45:26.999 2s
2012-06-19T01:48:51.756 202s

Related issues: 1 (0 open, 1 closed)

Related to rbd - Feature #4605: rbd xfstests: define qa user, group, etc. (Duplicate, Alex Elder, 04/01/2013)

Actions #1

Updated by Sage Weil almost 12 years ago

  • Status changed from New to 12


machines are still hung. kdb is enabled.

Actions #2

Updated by Alex Elder almost 12 years ago

I looked at this on Tuesday, and sent a note to Sage that should
have instead been put here. Here it is.

I was looking through the system log and found a path through
XFS where things were blocked waiting for space in the
log. I sent a note to Ben Meyers (my successor at SGI) to ask
whether there had been any recent reports of anything like that.

However I just looked a bit further back in the log and I see
several of these:

[ 3752.862107] XFS (rbd2): Ending clean mount
[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3764.980290] XFS (rbd2): xfs_do_force_shutdown(0x1) called from line 1020 of file /srv/autobuild-ceph/gitbuilder.git/build/fs/xfs/xfs_buf.c. Return address =
[ 3765.076044] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3765.102599] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3767.485684] XFS (rbd2): Mounting Filesystem

So this means EIO got returned by xfs_log_force(). That's the
function that syncs all queued activity to the log, which should
result in a log representing a completed set of committed changes
as of a particular point in time--and a consistent file system.
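For anyone scanning a kernel log for the same symptom, here is a minimal sketch (not part of the original investigation) that picks out the xfs_log_force error lines shown above and maps the numeric code to its errno name. The function name and the regular expression are assumptions based only on the line format in this excerpt.

```python
import errno
import re

# Matches kernel log lines of the form seen in the excerpt above, e.g.
# "[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned."
LOG_FORCE_ERROR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+XFS \((?P<dev>[^)]+)\): "
    r"xfs_log_force: error (?P<err>\d+) returned\."
)


def find_log_force_errors(lines):
    """Return (timestamp, device, errno name) for each xfs_log_force error line."""
    hits = []
    for line in lines:
        match = LOG_FORCE_ERROR.search(line)
        if match:
            code = int(match.group("err"))
            # errno.errorcode maps numeric codes to symbolic names (5 -> "EIO").
            name = errno.errorcode.get(code, "errno %d" % code)
            hits.append((float(match.group("ts")), match.group("dev"), name))
    return hits


sample = [
    "[ 3752.862107] XFS (rbd2): Ending clean mount",
    "[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.",
    "[ 3767.485684] XFS (rbd2): Mounting Filesystem",
]

for ts, dev, name in find_log_force_errors(sample):
    print(ts, dev, name)
```

Feeding it the saved console or dmesg output line by line should show how many times, and on which rbd device, the log force failed before the hang.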

I don't know the exact, complete consequences of this but I can
tell you it's not good and is likely directly related to the
subsequent problems (somehow).

You had asked me a month or two ago about what could cause EIO
to get returned from the block layer, and I think this may be
connected to the source of those questions.

I'll let you know if I hear anything back from Ben.


I did hear back from Ben, and here's what he said:

These log hangs are an ongoing saga. There was one fix that went in here:;a=commit;h=76e8f1386673b864cfca3c24c4d5814740e76465

You can look for stale inodes on the ail to see if that will help you.

Otherwise, Mark and I have been discussing the issue on and off for a while, and
there are a few workarounds running around on the list. There are a few
different failure modes. Looks like you have a dummy transaction blocked on
reserving log space; this is blocking the ail push in the sync_worker, which
would probably free up enough log space to get things progressing. I posted a
workaround for this to the list.

One of the challenges with this bug is that we don't have a single place with
all of the accounting for who has reserved the space.

Actions #3

Updated by Alex Elder almost 12 years ago

Just to summarize what I just added...

There are some recent XFS problems that might explain this,
irrespective of RBD or RADOS. I think we should wait and
see if the XFS problems settle down before investing any
time troubleshooting this problem.

We could leave this bug open as a place holder for any
future occurrences. (Or close it, I don't know what the
convention is on this sort of thing.)

Actions #4

Updated by Sage Weil almost 12 years ago

  • Project changed from Ceph to rbd

Actions #5

Updated by Sage Weil over 11 years ago

  • Assignee deleted (Alex Elder)

Actions #6

Updated by Alex Elder over 11 years ago

We should re-evaluate this with XFS found in newer kernels.
Maybe this should just be closed and re-opened (or open
a new one) if we get a similar report with newer code.

Actions #7

Updated by Sage Weil over 11 years ago

  • Priority changed from High to Normal

Actions #8

Updated by Alex Elder about 11 years ago

TODO: Try xfstests #270 on recent kernel (current testing
should be fine).

Actions #9

Updated by Alex Elder about 11 years ago

Trying to run 270 right now.

Actions #10

Updated by Sage Weil about 11 years ago

  • Assignee set to Alex Elder

Actions #11

Updated by Alex Elder about 11 years ago

Test 270 now doesn't run because:
270 [not run] fsgqa user not defined.

There are a few tests that require a distinct user
in the passwd file. I've been meaning to go back
and make these get added to the script
at some point. Now, apparently, in order to run
test 270 this is going to be a requirement.
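As a rough illustration of the prerequisite, the check below reports which required users have no passwd entry before the tests are started. It is only a sketch: the helper name missing_users is hypothetical, and the only user name taken from this report is fsgqa, from the "not run" message above.

```python
import pwd


def missing_users(names, lookup=pwd.getpwnam):
    """Return the names that have no passwd entry.

    `lookup` defaults to pwd.getpwnam, which raises KeyError when the
    user is not found; it is a parameter so the check can be exercised
    without touching the real passwd database.
    """
    missing = []
    for name in names:
        try:
            lookup(name)
        except KeyError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "fsgqa" is the user named in the "[not run]" message for test 270.
    for name in missing_users(["fsgqa"]):
        print("user %s not defined" % name)
```

Running this on a test node before the xfstests run would flag the missing account up front instead of having 270 silently skipped.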

So that's I guess a prerequisite task for getting
test 270 back again. I created this to cover doing

Actions #12

Updated by Ian Colle over 10 years ago

  • Assignee deleted (Alex Elder)

Actions #13

Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Closed
