Bug #42705

Messenger/MessengerTest.ConnectionRaceTest/0 hangs

Added by Sage Weil over 4 years ago. Updated over 3 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-11-08T00:17:47.474 INFO:teuthology.orchestra.run.smithi008.stdout:[==========] Running 22 tests from 1 test suite.
2019-11-08T00:17:47.474 INFO:teuthology.orchestra.run.smithi008.stdout:[----------] Global test environment set-up.
2019-11-08T00:17:47.474 INFO:teuthology.orchestra.run.smithi008.stdout:[----------] 22 tests from Messenger/MessengerTest
2019-11-08T00:17:47.474 INFO:teuthology.orchestra.run.smithi008.stdout:[ RUN      ] Messenger/MessengerTest.ConnectionRaceTest/0
2019-11-08T00:17:47.476 INFO:teuthology.orchestra.run.smithi008.stderr:2019-11-08T00:17:47.473+0000 7f8620361700 -1  ceph_test_msgr intercept conn(0x564150e73800) intercept called on step=1
2019-11-08T00:17:47.476 INFO:teuthology.orchestra.run.smithi008.stderr:2019-11-08T00:17:47.473+0000 7f861f35f700 -1  ceph_test_msgr intercept conn(0x564150e73400) intercept called on step=1
2019-11-08T00:17:47.476 INFO:teuthology.orchestra.run.smithi008.stderr:2019-11-08T00:17:47.473+0000 7f8620361700 -1  ceph_test_msgr intercept conn(0x564150e73800) resuming step=1 with decision=0
...
2019-11-08T11:56:46.612 INFO:teuthology.orchestra.run.smithi008.stderr:2019-11-08T11:56:46.608+0000 7f861f35f700 -1  ceph_test_msgr intercept conn(0x564150e73400) resuming step=1 with decision=0
2019-11-08T11:56:46.612 INFO:teuthology.orchestra.run.smithi008.stderr:2019-11-08T11:56:46.608+0000 7f861f35f700 -1  ceph_test_msgr intercept conn(0x564150e73400) intercept called on step=3
2019-11-08T11:56:46.613 INFO:teuthology.orchestra.run.smithi008.stderr:2019-11-08T11:56:46.608+0000 7f861f35f700 -1  ceph_test_msgr intercept conn(0x564150e73400) resuming step=3 with decision=0

/a/sage-2019-11-07_22:38:52-rados-wip-sage-testing-2019-11-07-1412-distro-basic-smithi/4480741
Actions #1

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #2

Updated by Matthew Oliver over 3 years ago

I have a script that I've been using to try and recreate this issue. I ran just the ConnectionRaceTest 2500 times without an issue or pause. I then changed the script to run the whole set of ceph_test_msgr tests over and over again until it fails, with the thinking that maybe one of the other tests doesn't always clean up after itself. Well, I left it going today; it's currently at 585 runs and still no failure.

Either my script isn't stopping on failure (though I've used it before, and it does stop on a return code > 0), or the bug simply isn't reproducing for me.

So maybe the error is more subtle? Maybe it only happens in teuthology environments?
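
For what it's worth, the repeat loop is roughly the following (a minimal sketch, not the exact script; the binary path and the use of --gtest_filter are assumptions for illustration):

#!/bin/bash
# Sketch: re-run the messenger tests until one returns a non-zero exit code.
# The ./bin path and the gtest filter are placeholders, not the real script.
runs=0
while ./bin/ceph_test_msgr --gtest_filter='Messenger/MessengerTest.*'; do
    runs=$((runs + 1))
    echo "run ${runs} passed"
done
echo "non-zero exit code after ${runs} successful runs"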

Actions #3

Updated by Matthew Oliver over 3 years ago

  • Left it going for a few days.
Actions #4

Updated by Nathan Cutler over 3 years ago

Sage's teuthology run took place in November 2019. The problem might have been fixed since then.

Actions #5

Updated by Nathan Cutler over 3 years ago

Another possibility is that the bug only ever existed in Sage's wip branch.

Be that as it may, I looked up the run in question:

https://pulpito.ceph.com/sage-2019-11-07_22:38:52-rados-wip-sage-testing-2019-11-07-1412-distro-basic-smithi/4480741/

Now, there's nothing particularly special about it. It deploys ceph on a single-node cluster with the following roles:

[u'mon.a', u'mgr.x', u'osd.0', u'osd.1', u'client.0']

Once the cluster gets to HEALTH_OK, it runs the following two binaries from the ceph-test RPM:

ceph_test_async_driver
ceph_test_msgr

The logs are gone. Most likely they were lost in the recent mass-deletion event on the LRC.

Actions #6

Updated by Matthew Oliver over 3 years ago

Cool, thanks Nathan. I'll create a new script that uses sesdev (though Sage probably uses CentOS/Red Hat) to create a single-node deploy and run the test binaries. Let's see what happens :)
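
Something along these lines (a rough sketch; the sesdev release name, deployment name, and exact command-line options here are assumptions, not tested commands):

#!/bin/bash
# Sketch: bring up a single-node cluster with sesdev, then run the two
# binaries from the ceph-test package on it. Names and options are placeholders.
sesdev create octopus --single-node msgr-repro
sesdev ssh msgr-repro ceph_test_async_driver
sesdev ssh msgr-repro ceph_test_msgr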

Actions #7

Updated by Matthew Oliver over 3 years ago

Nathan Cutler wrote:

[...]

Although, I thought `ceph_test_msgr` just ran the msgr unit tests in general, and didn't actually test against the built cluster. But it's been a few weeks since I looked at this, and a lot has happened in that time. So I'll just revisit it and see :)

I'm pretty sure `ceph_test_msgr` is the binary I've been running without a cluster (just compiled).

Actions #8

Updated by Nathan Cutler over 3 years ago

Matthew Oliver wrote:

Although, I thought `ceph_test_msgr` just ran the msgr unit tests in general, and didn't actually test against the built cluster. But it's been a few weeks since I looked at this, and a lot has happened in that time. So I'll just revisit it and see :)

I'm pretty sure `ceph_test_msgr` is the binary I've been running without a cluster (just compiled).

The developers try pretty hard to keep the unit tests (in master) in a state where they are always passing. (PRs are not supposed to be merged unless they pass the "make check" Jenkins test, for example.) So... if it's failing (in master) I doubt it's a unit test.

To investigate that possibility further, I ran the following commands in the top-level directory of the master source tree:

find src/ -name '*CMakeLists*' -exec grep -3 -H add_ceph_test {} \; | grep ceph_test_msgr
find src/ -name '*.sh' -exec grep ceph_test_msgr {} \;

The first command looks for a unit test that runs the ceph_test_msgr executable directly.

Since many unit tests are implemented as shell scripts, the second command looks for shell scripts that trigger ceph_test_msgr.

Neither of these commands returned any output, so at this point I cannot say for sure that ceph_test_msgr is used in any unit test.
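
For completeness, one more way to check would be to list the tests registered with CTest in a configured build tree and grep for the binary (a sketch, assuming a ./build directory produced by do_cmake.sh):

cd build
# List registered tests without running them, then look for anything
# that references ceph_test_msgr.
ctest -N | grep -i ceph_test_msgr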

It's also not clear how useful it would be to test the messenger in a unit test, which, by definition, only ever runs on a single node.
