Bug #3795
closed
loadgen task gets into msgr loop
Added by Sage Weil over 11 years ago.
Updated over 11 years ago.
Description
kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      osd:
        debug ms: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: next
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - rados/load-gen-big.sh
This fills up the teuthology disk each time it runs.
I ran the above once and reproduced it, but don't have time right now to investigate.
- Status changed from New to 12
This appears to be a simple cycle:
- objecter has lots of requests outstanding
- there is a fault (msgr failure injection)
- we resend everything outstanding, which is enough to essentially guarantee we hit another fault
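A rough back-of-the-envelope sketch of why the resend step above is nearly certain to hit another fault. It assumes that `ms inject socket failures: 500` means each socket operation fails independently with probability of about 1/500, and that each resent request costs at least one such operation. The function name and the request counts are made up for illustration and are not taken from the logs.

```python
def p_fault_during_resend(outstanding_ops: int, inject_every: int = 500) -> float:
    """Probability that at least one injected failure fires while resending.

    Assumption: each socket op fails independently with probability 1/inject_every.
    """
    p_op = 1.0 / inject_every
    # P(at least one fault) = 1 - P(no fault on any of the ops)
    return 1.0 - (1.0 - p_op) ** outstanding_ops


# Hypothetical queue depths, chosen only to show the trend.
for n in (10, 100, 1000, 5000):
    print(n, round(p_fault_during_resend(n), 4))
```

Under these assumptions the chance of a clean resend drops off exponentially with the number of outstanding operations, so a client with thousands of queued messages almost never gets through a full resend.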
What I can't figure out is why this started happening now rather than having always happened.
Going to see if the recent msgr changes are to blame... bisecting!
Taking another look at the nightly runs, it looks like this issue has been happening on the next branch since 01-01-2013, which also means this test passed on 12-31-2012.
I looked a bit more and I see some failures before that, and also some passes after, e.g. teuthology-2013-01-11_07:00:03-regression-testing-master-basic/38417.
I can't see that anything major changed with the behavior of the code vs the behavior when it gets stuck. The problem is just that the client has a lot of queued messages, enough so that we are virtually guaranteed to inject a failure when it resends them all after a fault. This isn't something that would ever happen in the "real world". For now I'm just going to adjust the failure injection rate down to a lower level in the suite.
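A toy simulation of the resend loop and of what lowering the injection rate does to it. This is only a sketch under simplifying assumptions: every fault forces a full resend, each socket operation fails independently with probability 1/`inject_every`, and the queue depth of 10,000 operations and the lower rates are hypothetical values, not the ones used in the suite.

```python
import random


def rounds_until_clean_resend(outstanding_ops, inject_every, max_rounds=200, seed=0):
    """Return how many resend rounds it takes to get through without a fault,
    or None if the client is still stuck after max_rounds."""
    rng = random.Random(seed)
    # Probability that one full resend completes with no injected failure.
    p_clean = (1.0 - 1.0 / inject_every) ** outstanding_ops
    for attempt in range(1, max_rounds + 1):
        if rng.random() < p_clean:
            return attempt
    return None


# 500 is the rate from the job above; the other rates are illustrative only.
for rate in (500, 5000, 50000):
    print(rate, rounds_until_clean_resend(10_000, rate))
```

In this model the client at rate 500 effectively never escapes the loop, while a less aggressive rate lets a resend complete within a handful of rounds.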
- Status changed from 12 to Resolved