Bug #3525
closed
kclient+iozone hang on ceph-client testing
Description
kernel:
  branch: testing
  kdb: true
nuke-on-error: true
overrides:
  ceph:
    branch: next
    fs: btrfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/iozone.sh
seems to do it every time.
Updated by Sage Weil over 11 years ago
- Status changed from New to 12
Also, the direct I/O test fails on testing but passes on master. Maybe the same bug? It's a shorter test, so probably easier to bisect.
kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - direct_io
Updated by Alex Elder over 11 years ago
I bisected the direct I/O issue down to this commit:
    81f97dd7 libceph: pass num_op with ops
I tried testing iozone and did hit a problem, but it's not clear from the original description ("seems to hang") what specific symptoms would confirm it's the same thing.
I'm pretty sure I've found the bug, at least for the direct I/O problem. I'm testing a fix right now.
That patch changed the loop that encodes provided ops into a message to look like this:
    while (num_op--)
        osd_req_encode_op(req, op, src_op++);
The problem is that the target pointer (op) was not getting incremented as the ops got copied, so every source op was encoded into the same target slot. Multi-op requests ended up trashed.
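The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the kernel code: the struct names, `encode_op()`, `encode_ops_buggy()` and `encode_ops_fixed()` are hypothetical stand-ins, and the "fixed" variant shows one plausible shape of a correction (advancing the target pointer alongside the source pointer). The actual fix commit is not quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the in-memory op and its encoded form. */
struct src_op  { int opcode; };
struct wire_op { int opcode; };

/* Hypothetical per-op encoder: copies one source op into one target slot. */
static void encode_op(struct wire_op *dst, const struct src_op *src)
{
    dst->opcode = src->opcode;
}

/*
 * Same shape as the loop quoted above: the source pointer advances but
 * the target pointer does not, so every op is written into dst[0].
 */
void encode_ops_buggy(struct wire_op *dst, const struct src_op *src,
                      unsigned int num_op)
{
    assert(dst != NULL && src != NULL);
    while (num_op--)
        encode_op(dst, src++);
}

/* Assumed correction: advance both pointers in step. */
void encode_ops_fixed(struct wire_op *dst, const struct src_op *src,
                      unsigned int num_op)
{
    assert(dst != NULL && src != NULL);
    while (num_op--)
        encode_op(dst++, src++);
}
```

With a single op the two variants behave identically, which would be consistent with only multi-op requests being affected. With three ops, the buggy variant leaves the last op in the first slot and the remaining slots untouched.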
Updated by Alex Elder over 11 years ago
I just finished testing my fix with the iozone test and it
appears to have made the hang I saw go away. I'm now running
the direct I/O test. If it too passes I'll commit my fix as
well as the commits that follow this one back to the testing
branch.
Updated by Alex Elder over 11 years ago
The direct I/O test now passes with my fix. I'm going to do
a final test run of the rebased patches in the testing branch,
then will push the result.
Updated by Alex Elder over 11 years ago
- Status changed from 12 to Resolved
My testing did not fail for iozone or direct io using
ceph-fuse.
I get an error when using rbd to back the file system that
gets tested. I've seen this before, and have now created
a bug to track that:
http://tracker.newdream.net/issues/3547
Anyway, I've committed my fix along with its rebased
successor commits in the testing branch, so this work
is complete.