Bug #3525

kclient+iozone hang on ceph-client testing

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Source: Q/A
Description

kernel:
  branch: testing
  kdb: true
nuke-on-error: true
overrides:
  ceph:
    branch: next
    fs: btrfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/iozone.sh

This job seems to reproduce the hang every time.
Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12

Also, the direct I/O test fails on testing but passes on master. Maybe the same bug? It's a shorter test, so probably easier to bisect.

kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- kclient: null
- workunit:
    clients:
      all:
      - direct_io
Actions #2

Updated by Alex Elder over 11 years ago

I bisected the direct I/O issue down to this commit:
    81f97dd7 libceph: pass num_op with ops
I tried testing iozone and did hit a problem, but it's
not clear from the original description ("seems to hang")
what specific symptoms would confirm it's the same thing.

I'm pretty sure I've found the bug, at least for the direct
I/O problem.  I'm testing a fix right now.

That patch changed the loop that encodes provided ops into
a message to look like this:

   while (num_op--)
           osd_req_encode_op(req, op, src_op++);

The problem is that the target pointer (op) was never
incremented as the ops were copied, so every source op was
encoded into the same slot and multi-op requests ended up
trashed.

Actions #3

Updated by Alex Elder over 11 years ago

I just finished testing my fix with the iozone test, and it
appears to have made the hang I saw go away. I'm now running
the direct I/O test. If it passes too, I'll commit my fix,
along with the commits that follow it, back to the testing
branch.

Actions #4

Updated by Alex Elder over 11 years ago

The direct I/O test now passes with my fix. I'm going to do
a final test run of the rebased patches in the testing branch,
then will push the result.

Actions #5

Updated by Alex Elder over 11 years ago

  • Status changed from 12 to Resolved

My testing did not fail for iozone or direct I/O using
ceph-fuse.

I get an error when using rbd to back the file system that
gets tested. I've seen this before, and have now created
a bug to track that:
http://tracker.newdream.net/issues/3547

Anyway, I've committed my fix, along with its rebased
successor commits, to the testing branch, so this work
is complete.
