Bug #3795 (closed): loadgen task gets into msgr loop

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      osd:
        debug ms: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: next
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - rados/load-gen-big.sh

This fills up the teuthology disk each time it runs.

I ran the above once and reproduced the issue, but don't have time right now to investigate.

Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12

This appears to be a simple cycle:

- objecter has lots of requests outstanding
- there is a fault (msgr failure injection)
- we resend everything outstanding, which is enough to essentially guarantee we hit another fault

What I can't figure out is why this started happening now and didn't always happen.
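
A back-of-the-envelope sketch of why the resend burst all but guarantees another fault (this assumes "ms inject socket failures: 500" means each message has roughly a 1-in-500 chance of tripping an injected failure, and the outstanding-op counts below are illustrative, not measured from the run):

# Probability that resending N outstanding ops trips at least one injected
# socket failure, with each message assumed to fail independently at 1/500.
def p_fault_during_resend(outstanding, inject_every=500):
    p_per_msg = 1.0 / inject_every
    return 1.0 - (1.0 - p_per_msg) ** outstanding

for n in (100, 500, 1000, 5000):
    print(f"{n:5d} outstanding ops -> P(another fault) ~ {p_fault_during_resend(n):.3f}")

# With a few thousand ops queued by load-gen-big.sh, the probability is
# essentially 1, so each fault triggers a resend that triggers another fault.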

Actions #2

Updated by Sage Weil over 11 years ago

Going to see if the recent msgr changes are to blame... bisecting!

Actions #3

Updated by Tamilarasi muthamizhan over 11 years ago

Taking another look at the nightly runs, it looks like this issue has been happening on the next branch since 01-01-2013, which also means this test passed on 12-31-2012.

Actions #4

Updated by Sage Weil over 11 years ago

I looked a bit more and I see some failures before that, and also some passes after, e.g. teuthology-2013-01-11_07:00:03-regression-testing-master-basic/38417.

I can't see that anything major changed in the code's behavior between the runs that pass and the runs that get stuck. The problem is just that the client has a lot of queued messages, enough that we are virtually guaranteed to inject a failure when it resends them all after a fault. This isn't something that would ever happen in the "real world". For now I'm just going to adjust the failure injection rate down to a lower level in the suite.
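
For a sense of how far the rate would need to drop, the same toy model from above can be inverted (the 10% target and the 1000-op burst size are illustrative assumptions, not values from the suite):

import math

# Smallest "ms inject socket failures" interval N such that a resend burst of
# `outstanding` ops hits another injected fault with probability <= target,
# assuming each message fails independently with probability 1/N.
def required_inject_interval(outstanding, target=0.10):
    return math.ceil(1.0 / (1.0 - (1.0 - target) ** (1.0 / outstanding)))

print(required_inject_interval(1000))  # roughly 9500: far less frequent than 1 in 500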

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Resolved