Bug #3795
closedloadgen task gets into msgr loop
Description
kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      osd:
        debug ms: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: next
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - rados/load-gen-big.sh
This fills up the teuthology disk each time it runs.
I ran the above once and reproduced it, but don't have time right now to investigate.
Updated by Sage Weil over 11 years ago
- Status changed from New to 12
This appears to be a simple cycle:
- objecter has lots of requests outstanding
- there is a fault (msgr failure injection)
- we resend everything outstanding, which is enough to essentially guarantee we hit another fault
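The cycle above can be illustrated with a rough back-of-the-envelope model. The sketch below assumes that `ms inject socket failures: 500` means each socket operation fails independently with probability 1/500, and that each resent request costs exactly one such operation. Both are simplifying assumptions rather than a description of the real messenger, and the function name and queue depths are hypothetical.

```python
def clean_resend_probability(outstanding: int, inject_every: int = 500) -> float:
    """Probability that `outstanding` messages are all resent with no injected fault.

    Assumes each message independently hits an injected failure with
    probability 1 / inject_every (a simplification of the real messenger).
    """
    if inject_every <= 0:
        return 1.0  # injection disabled in this model
    return (1.0 - 1.0 / inject_every) ** outstanding


if __name__ == "__main__":
    # Hypothetical queue depths, chosen only to show the trend.
    for n in (10, 500, 5000):
        p = clean_resend_probability(n)
        print(f"{n:>5} outstanding -> P(clean resend) = {p:.6f}")
```

Under these assumptions the chance of a clean resend shrinks exponentially with the number of outstanding requests, which is consistent with a large enough queue being all but guaranteed to hit another fault.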
What I can't figure out is why this started happening now and didn't happen always.
Updated by Sage Weil over 11 years ago
going to see if the recent msgr changes are to blame.. bisecting!
Updated by Tamilarasi muthamizhan over 11 years ago
Looking at the nightly runs again, this issue appears to have been happening on the next branch since 01-01-2013, which also means the test passed on 12-31-2012.
Updated by Sage Weil over 11 years ago
I looked a bit more and I see some failures before that, and also some passes after, e.g. teuthology-2013-01-11_07:00:03-regression-testing-master-basic/38417.
I can't see that anything major changed between the behavior of the code in the passing runs and its behavior when it gets stuck. The problem is just that the client has a lot of queued messages, enough that we are virtually guaranteed to inject a failure when it resends them all after a fault. This isn't something that would ever happen in the "real world". For now I'm just going to lower the failure injection rate in the suite.
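To show why a lower injection rate lets the client escape the loop, here is a small simulation sketch. It assumes a fault is injected with a 1-in-`inject_every` chance per resent message and that any fault forces a full resend of the queue. The function name, queue depth, and both rates are hypothetical and are not the values used in the suite.

```python
import random
from typing import Optional


def rounds_until_drained(outstanding: int, inject_every: int,
                         max_rounds: int = 200, seed: int = 0) -> Optional[int]:
    """Return the resend round that completes without a fault, or None if stuck.

    Hypothetical model: every message in a round has a 1-in-inject_every
    chance of triggering an injected fault; a fault aborts the round and
    the whole queue is resent in the next one.
    """
    rng = random.Random(seed)
    for round_no in range(1, max_rounds + 1):
        faulted = False
        for _ in range(outstanding):
            if inject_every > 0 and rng.randrange(inject_every) == 0:
                faulted = True
                break  # connection resets; everything is resent next round
        if not faulted:
            return round_no
    return None  # never drained within max_rounds: the "msgr loop"


if __name__ == "__main__":
    queue_depth = 20000  # hypothetical number of outstanding requests
    for rate in (500, 10_000_000):
        result = rounds_until_drained(queue_depth, rate)
        status = "stuck" if result is None else f"drained in round {result}"
        print(f"inject 1 in {rate}: {status}")
```

In this model the aggressive rate never produces a fault-free round for a deep queue, while a much rarer injection lets the resend finish, which matches the reasoning for turning the rate down.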