Bug #3795 (closed): loadgen task gets into msgr loop

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      osd:
        debug ms: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: next
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - rados/load-gen-big.sh

This fills up the teuthology disk each time it runs.

I ran the above once and reproduced the issue, but don't have time right now to investigate.

Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12

This appears to be a simple cycle:

- objecter has lots of requests outstanding
- there is a fault (msgr failure injection)
- we resend everything outstanding, which is enough to essentially guarantee we hit another fault

What I can't figure out is why this started happening now and didn't always happen.
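
A back-of-the-envelope sketch of why the resend burst all but guarantees another fault (this assumes "ms inject socket failures: 500" means each message has roughly a 1-in-500 chance of tripping an injected failure, and the outstanding-op counts below are illustrative, not measured from the run):

# Probability that resending N outstanding ops trips at least one injected
# socket failure, with each message assumed to fail independently at 1/500.
def p_fault_during_resend(outstanding, inject_every=500):
    p_per_msg = 1.0 / inject_every
    return 1.0 - (1.0 - p_per_msg) ** outstanding

for n in (100, 500, 1000, 5000):
    print(f"{n:5d} outstanding ops -> P(another fault) ~ {p_fault_during_resend(n):.3f}")

# With a few thousand ops queued by load-gen-big.sh, the probability is
# essentially 1, so each fault triggers a resend that triggers another fault.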

Actions #2

Updated by Sage Weil over 11 years ago

Going to see if the recent msgr changes are to blame... bisecting!

Actions #3

Updated by Tamilarasi muthamizhan over 11 years ago

Taking another look at the nightly runs, it looks like this issue has been happening on the next branch since 01-01-2013, which also means this test passed on 12-31-2012.

Actions #4

Updated by Sage Weil over 11 years ago

I looked a bit more and I see some failures before that, and also some passes after, e.g. teuthology-2013-01-11_07:00:03-regression-testing-master-basic/38417.

I can't see that anything major changed in the code's behavior between the runs that pass and the runs that get stuck. The problem is just that the client has a lot of queued messages, enough that we are virtually guaranteed to inject a failure when it resends them all after a fault. This isn't something that would ever happen in the "real world". For now I'm just going to adjust the failure injection rate down to a lower level in the suite.
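
For a sense of how far the rate would need to drop, the same toy model from above can be inverted (the 10% target and the 1000-op burst size are illustrative assumptions, not values from the suite):

import math

# Smallest "ms inject socket failures" interval N such that a resend burst of
# `outstanding` ops hits another injected fault with probability <= target,
# assuming each message fails independently with probability 1/N.
def required_inject_interval(outstanding, target=0.10):
    return math.ceil(1.0 / (1.0 - (1.0 - target) ** (1.0 / outstanding)))

print(required_inject_interval(1000))  # roughly 9500: far less frequent than 1 in 500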

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Resolved