Bug #2608

closed

rbd: hung xfstest 270

Added by Tamilarasi muthamizhan almost 12 years ago. Updated over 10 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Q/A

Description

Logs are available in ubuntu@teuthology:/a/teuthology-2012-06-19_00:00:09-regression-next-testing-basic/1792

2012-06-19T01:42:16.779 INFO:teuthology.orchestra.run.out:223 29s
2012-06-19T01:42:31.146 INFO:teuthology.orchestra.run.out:224 11s
2012-06-19T01:43:06.704 INFO:teuthology.orchestra.run.out:225 34s
2012-06-19T01:44:29.369 INFO:teuthology.orchestra.run.out:226 81s
2012-06-19T01:44:49.369 INFO:teuthology.orchestra.run.out:234 17s
2012-06-19T01:44:56.139 INFO:teuthology.orchestra.run.out:238 4s
2012-06-19T01:45:10.869 INFO:teuthology.orchestra.run.out:244 13s
2012-06-19T01:45:20.349 INFO:teuthology.orchestra.run.out:253 8s
2012-06-19T01:45:26.999 INFO:teuthology.orchestra.run.out:262 2s
2012-06-19T01:48:51.756 INFO:teuthology.orchestra.run.out:269 202s
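
For anyone triaging a similar run, here is a minimal Python sketch for pulling the per-test timings out of teuthology output like the lines above, to see which test last completed before the hang. The function names are hypothetical, and the regular expression assumes the exact line layout shown in this report.

```python
import re

# Matches xfstests result lines as relayed by teuthology, for example:
# "2012-06-19T01:48:51.756 INFO:teuthology.orchestra.run.out:269 202s"
RESULT_RE = re.compile(
    r"^(?P<stamp>\S+) INFO:teuthology\.orchestra\.run\.out:"
    r"(?P<test>\d+) (?P<secs>\d+)s$"
)


def completed_tests(lines):
    """Return a list of (test_number, seconds) for each result line."""
    results = []
    for line in lines:
        match = RESULT_RE.match(line.strip())
        if match:
            results.append((int(match.group("test")), int(match.group("secs"))))
    return results


def last_completed(lines):
    """Return the last (test_number, seconds) seen, or None if there is none."""
    results = completed_tests(lines)
    return results[-1] if results else None
```

Run against the excerpt above, the last completed entry is test 269; the test that hung produces no result line.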


Related issues 1 (0 open, 1 closed)

Related to rbd - Feature #4605: rbd xfstests: define qa user, group, etc. (Duplicate, Alex Elder, 04/01/2013)

Actions #1

Updated by Sage Weil almost 12 years ago

  • Status changed from New to 12

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-06-23_00:00:02-regression-next-testing-basic/1471

machines are still hung. kdb is enabled.

Actions #2

Updated by Alex Elder almost 12 years ago

I looked at this on Tuesday, and sent a note to Sage that should
have instead been put here. Here it is.

I was looking through the system log and found a code path in
XFS where things were blocking, waiting for space in the
log. I sent a note to Ben Myers (my successor at SGI) to ask
whether there were any recent reports of anything like that.

However I just looked a bit further back in the log and I see
several of these:

[ 3752.862107] XFS (rbd2): Ending clean mount
[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3764.980290] XFS (rbd2): xfs_do_force_shutdown(0x1) called from line 1020 of file /srv/autobuild-ceph/gitbuilder.git/build/fs/xfs/xfs_buf.c. Return address = 0xffffffffa0317e34
[ 3765.076044] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3765.102599] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3767.485684] XFS (rbd2): Mounting Filesystem

So this means EIO got returned by xfs_log_force(). That's the
function that syncs all queued activity to the log, which should
result in a log representing a completed set of committed changes
as of a particular point in time--and a consistent file system.
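
A minimal Python sketch for spotting this pattern in a saved kernel log follows. It counts xfs_log_force errors and forced shutdowns per device, and translates the numeric error (5 here) to its errno name. The function name and the shape of the returned dictionary are hypothetical, and the patterns assume the message format shown in the excerpt above.

```python
import errno
import re

# Patterns assume the kernel message format shown in this report.
LOG_FORCE_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): xfs_log_force: error (?P<err>\d+) returned"
)
SHUTDOWN_RE = re.compile(r"XFS \((?P<dev>[^)]+)\): xfs_do_force_shutdown\(")


def summarize_xfs_errors(lines):
    """Count xfs_log_force errors (by errno name) and shutdowns per device."""
    summary = {}

    def entry_for(dev):
        return summary.setdefault(dev, {"log_force_errors": {}, "shutdowns": 0})

    for line in lines:
        match = LOG_FORCE_RE.search(line)
        if match:
            # The kernel prints the positive errno value; 5 maps to EIO.
            name = errno.errorcode.get(int(match.group("err")), "UNKNOWN")
            counts = entry_for(match.group("dev"))["log_force_errors"]
            counts[name] = counts.get(name, 0) + 1
            continue
        match = SHUTDOWN_RE.search(line)
        if match:
            entry_for(match.group("dev"))["shutdowns"] += 1
    return summary
```

Feeding it the lines quoted above would report three EIO results from xfs_log_force and one forced shutdown on rbd2.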

I don't know the exact, complete consequences of this but I can
tell you it's not good and is likely directly related to the
subsequent problems (somehow).

You had asked me a month or two ago about what could cause EIO
to get returned from the block layer, and I think this may be
connected to the source of those questions.

I'll let you know if I hear anything back from Ben.

========

I did hear back from Ben, and here's what he said:

These log hangs are an ongoing saga. There was one fix that went in here:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=76e8f1386673b864cfca3c24c4d5814740e76465

You can look for stale inodes on the ail to see if that will help you.

Otherwise, Mark and I have been discussing the issue on and off for a while, and
there are a few workarounds running around on the list. There are a few
different failure modes. Looks like you have a dummy transaction blocked on
reserving log space, this is blocking the ail push in the sync_worker which
would probably free up enough log space to get things progressing. I posted a
workaround for this to the list.

http://oss.sgi.com/archives/xfs/2012-05/msg00312.html

One of the challenges with this bug is that we don't have a single place with
all of the accounting for who has reserved the space.

Actions #3

Updated by Alex Elder almost 12 years ago

Just to summarize what I just added...

There are some recent XFS problems that might explain this,
irrespective of RBD or RADOS. I think we should wait and
see if the XFS problems settle down before investing any
time troubleshooting this problem.

We could leave this bug open as a placeholder for any
future occurrences. (Or close it; I don't know what the
convention is on this sort of thing.)

Actions #4

Updated by Sage Weil almost 12 years ago

  • Project changed from Ceph to rbd
Actions #5

Updated by Sage Weil almost 12 years ago

  • Assignee deleted (Alex Elder)
Actions #6

Updated by Alex Elder over 11 years ago

We should re-evaluate this with the XFS found in newer kernels.
Maybe this should just be closed, and re-opened (or a new
one opened) if we get a similar report with newer code.

Actions #7

Updated by Sage Weil over 11 years ago

  • Priority changed from High to Normal
Actions #8

Updated by Alex Elder about 11 years ago

TODO: Try xfstests #270 on a recent kernel (current testing
should be fine).

Actions #9

Updated by Alex Elder about 11 years ago

Trying to run 270 right now.

Actions #10

Updated by Sage Weil about 11 years ago

  • Assignee set to Alex Elder
Actions #11

Updated by Alex Elder about 11 years ago

Test 270 now doesn't run because:
270 [not run] fsgqa user not defined.

There are a few tests that require a distinct user
in the passwd file. I've been meaning to go back
and have the run_xfstests.sh script add these users
at some point. Now, apparently, this is going to be
a requirement in order to run test 270.
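
As a rough illustration of that prerequisite check, here is a minimal Python sketch that tests whether a user such as fsgqa is present in the passwd database. The helper names are hypothetical, and the useradd invocation is only an assumption about how the account might be created; it is returned as a command list and not executed. This is not what run_xfstests.sh does today.

```python
import pwd


def user_exists(name):
    """Return True if `name` is present in the passwd database."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def missing_users(names):
    """Return the subset of `names` that have no passwd entry."""
    return [name for name in names if not user_exists(name)]


def useradd_command(name):
    """Build (but do not run) a command that would create the user.

    Assumption: a plain `useradd -m` is sufficient for the test user.
    """
    return ["useradd", "-m", name]


if __name__ == "__main__":
    for name in missing_users(["fsgqa"]):
        print("would run:", " ".join(useradd_command(name)))
```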

So that's, I guess, a prerequisite task for getting
test 270 running again. I created this issue to cover
doing that:
http://tracker.ceph.com/issues/4605

Actions #12

Updated by Ian Colle over 10 years ago

  • Assignee deleted (Alex Elder)
Actions #13

Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Closed