Bug #2608: rbd: hung xfstest 270 - rbd - Ceph

Bug #2608

closed

rbd: hung xfstest 270

Added by Tamilarasi muthamizhan almost 12 years ago. Updated over 10 years ago.

Status: Closed

Priority: Normal

Source: Q/A

Description

Logs are available in ubuntu@teuthology:/a/teuthology-2012-06-19_00:00:09-regression-next-testing-basic/1792

2012-06-19T01:42:16.779 INFO:teuthology.orchestra.run.out:223 29s
2012-06-19T01:42:31.146 INFO:teuthology.orchestra.run.out:224 11s
2012-06-19T01:43:06.704 INFO:teuthology.orchestra.run.out:225 34s
2012-06-19T01:44:29.369 INFO:teuthology.orchestra.run.out:226 81s
2012-06-19T01:44:49.369 INFO:teuthology.orchestra.run.out:234 17s
2012-06-19T01:44:56.139 INFO:teuthology.orchestra.run.out:238 4s
2012-06-19T01:45:10.869 INFO:teuthology.orchestra.run.out:244 13s
2012-06-19T01:45:20.349 INFO:teuthology.orchestra.run.out:253 8s
2012-06-19T01:45:26.999 INFO:teuthology.orchestra.run.out:262 2s
2012-06-19T01:48:51.756 INFO:teuthology.orchestra.run.out:269 202s
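
For anyone reading the output above: each line appears to be a test number followed by its elapsed time, printed once the test finishes, so the last line with a duration marks the last test that completed. Below is a minimal sketch for pulling that out of a teuthology log. It assumes the line format shown above; the function name `completed_tests` and the regular expression are illustrative and not part of teuthology or xfstests.

```python
import re

# Assumed format, taken from the lines above:
#   <timestamp> INFO:teuthology.orchestra.run.out:<test> <seconds>s
RUN_RE = re.compile(r"run\.out:(?P<test>\d+) (?P<secs>\d+)s\s*$")


def completed_tests(lines):
    """Return (test number, elapsed seconds) for each finished test, in order."""
    finished = []
    for line in lines:
        match = RUN_RE.search(line)
        if match:
            finished.append((match.group("test"), int(match.group("secs"))))
    return finished


sample = [
    "2012-06-19T01:45:20.349 INFO:teuthology.orchestra.run.out:253 8s",
    "2012-06-19T01:45:26.999 INFO:teuthology.orchestra.run.out:262 2s",
    "2012-06-19T01:48:51.756 INFO:teuthology.orchestra.run.out:269 202s",
]
done = completed_tests(sample)
last_test, last_secs = done[-1]
print(f"last completed test: {last_test} ({last_secs}s)")
```

In this run the last completed test is 269, which is consistent with the hang being in 270.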

Related issues 1 (0 open — 1 closed)

Updated by Sage Weil almost 12 years ago

Status changed from New to 12

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-06-23_00:00:02-regression-next-testing-basic/1471

machines are still hung. kdb is enabled.

Updated by Alex Elder almost 12 years ago

I looked at this on Tuesday, and sent a note to Sage that should
have instead been put here. Here it is.

I was looking through the system log and found a path through
XFS where things were blocking, waiting for space in the
log. I sent a note to Ben Meyers (my successor at SGI) to ask
whether there were any recent reports of anything like that.

However I just looked a bit further back in the log and I see
several of these:

[ 3752.862107] XFS (rbd2): Ending clean mount
[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3764.980290] XFS (rbd2): xfs_do_force_shutdown(0x1) called from line 1020 of file /srv/autobuild-ceph/gitbuilder.git/build/fs/xfs/xfs_buf.c. Return address = 0xffffffffa0317e34
[ 3765.076044] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3765.102599] XFS (rbd2): xfs_log_force: error 5 returned.
[ 3767.485684] XFS (rbd2): Mounting Filesystem

So this means EIO got returned by xfs_log_force(). That's the
function that syncs all queued activity to the log, which should
result in a log representing a completed set of committed changes
as of a particular point in time--and a consistent file system.
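
As a quick illustration of the "error 5" to EIO mapping, here is a minimal sketch that scans kernel log lines for xfs_log_force failures and decodes the error number with Python's standard errno table. The function name `decode_log_force_errors` and the regular expression are my own and assume the message format shown above; the decoding assumes Linux errno numbering.

```python
import errno
import os
import re

# Matches lines such as:
#   [ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.
LOG_FORCE_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): xfs_log_force: error (?P<code>\d+) returned"
)


def decode_log_force_errors(lines):
    """Yield (device, errno name, description) for each xfs_log_force error."""
    for line in lines:
        match = LOG_FORCE_RE.search(line)
        if match:
            code = int(match.group("code"))
            name = errno.errorcode.get(code, "UNKNOWN")
            yield match.group("dev"), name, os.strerror(code)


sample = [
    "[ 3752.862107] XFS (rbd2): Ending clean mount",
    "[ 3764.958031] XFS (rbd2): xfs_log_force: error 5 returned.",
]
results = list(decode_log_force_errors(sample))
for dev, name, text in results:
    print(f"{dev}: {name} ({text})")
```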

I don't know the exact, complete consequences of this but I can
tell you it's not good and is likely directly related to the
subsequent problems (somehow).

You had asked me a month or two ago about what could cause EIO
to get returned from the block layer, and I guess this may be
connected to the source of those questions.

I'll let you know if I hear anything back from Ben.

========

I did hear back from Ben, and here's what he said:

These log hangs are an ongoing saga. There was one fix that went in here:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=76e8f1386673b864cfca3c24c4d5814740e76465

You can look for stale inodes on the ail to see if that will help you.

Otherwise, Mark and I have been discussing the issue on and off for a while, and
there are a few workarounds running around on the list. There are a few
different failure modes. Looks like you have a dummy transaction blocked on
reserving log space; this is blocking the ail push in the sync_worker, which
would probably free up enough log space to get things progressing. I posted a
workaround for this to the list.

http://oss.sgi.com/archives/xfs/2012-05/msg00312.html

One of the challenges with this bug is that we don't have a single place with
all of the accounting for who has reserved the space.

Updated by Alex Elder almost 12 years ago

Just to summarize what I just added...

There are some recent XFS problems that might explain this,
irrespective of RBD or RADOS. I think we should wait and
see if the XFS problems settle down before investing any
time troubleshooting this problem.

We could leave this bug open as a placeholder for any
future occurrences. (Or close it; I don't know what the
convention is on this sort of thing.)

Updated by Sage Weil almost 12 years ago

Project changed from Ceph to rbd

Updated by Sage Weil almost 12 years ago

Assignee deleted (Alex Elder)

Updated by Alex Elder over 11 years ago

We should re-evaluate this with the XFS found in newer kernels.
Maybe this should just be closed, then re-opened (or a new
one opened) if we get a similar report with newer code.

Updated by Sage Weil over 11 years ago

Priority changed from High to Normal

Updated by Alex Elder about 11 years ago

TODO: Try xfstests #270 on a recent kernel (current testing
should be fine).

Updated by Alex Elder about 11 years ago

Trying to run 270 right now.

#10

Updated by Sage Weil about 11 years ago

Assignee set to Alex Elder

#11

Updated by Alex Elder about 11 years ago

Test 270 now doesn't run because:
270 [not run] fsgqa user not defined.

There are a few tests that require a distinct user
in the passwd file. I've been meaning to go back
and update the run_xfstests.sh script to add these
users at some point. Now, apparently, that is going
to be a requirement in order to run test 270.

So I guess that's a prerequisite task for getting
test 270 back again. I created this to cover doing
that:
http://tracker.ceph.com/issues/4605
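
For reference, a minimal sketch of the kind of pre-flight check that would catch the missing user before the test is skipped. It is written in Python using the standard pwd module; the helper name `user_exists` is hypothetical and this is not what run_xfstests.sh does today.

```python
import pwd


def user_exists(name):
    """Return True if `name` has an entry in the password database."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        # getpwnam raises KeyError when no matching entry is found.
        return False
    return True


if not user_exists("fsgqa"):
    print("fsgqa user not defined; tests that need it will not run")
```

Actually creating the user is left out on purpose, since how that should be done is the subject of the linked issue.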

#12

Updated by Ian Colle over 10 years ago

Assignee deleted (Alex Elder)

#13

Updated by Sage Weil over 10 years ago

Status changed from 12 to Closed
