Bug #13876

closed

qa: openstack MPI connection failures

Added by Greg Farnum over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Testing
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2015-11-12_18:00:02-fs-hammer---basic-openstack/14418/

2015-11-12T23:32:58.272 INFO:teuthology.orchestra.run.target088041:Running: 'mpiexec -f /home/ubuntu/cephtest/mpi-hosts /home/ubuntu/cephtest/mdtest-1.9.3/mdtest -d /home/ubuntu/cephtest/gmnt -I 20 -z 5 -b 2 -R'
2015-11-12T23:32:58.604 INFO:teuthology.orchestra.run.target088041.stderr:Warning: Permanently added '158.69.88.43' (ECDSA) to the list of known hosts.
2015-11-12T23:32:58.606 INFO:teuthology.orchestra.run.target088041.stderr:Warning: Permanently added '158.69.88.40' (ECDSA) to the list of known hosts.
2015-11-12T23:35:07.155 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:1@target088043.ovh.sepia.ceph.com] HYDU_sock_connect (./utils/sock/sock.c:174): unable to connect from "target088043.ovh.sepia.ceph.com" to "158.69.88.41" (Connection timed out)
2015-11-12T23:35:07.155 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:1@target088043.ovh.sepia.ceph.com] main (./pm/pmiserv/pmip.c:189): unable to connect to server 158.69.88.41 at port 54948 (check for firewalls!)
2015-11-12T23:35:07.207 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:2@target088040.ovh.sepia.ceph.com] HYDU_sock_connect (./utils/sock/sock.c:174): unable to connect from "target088040.ovh.sepia.ceph.com" to "158.69.88.41" (Connection timed out)
2015-11-12T23:35:07.207 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:2@target088040.ovh.sepia.ceph.com] main (./pm/pmiserv/pmip.c:189): unable to connect to server 158.69.88.41 at port 54948 (check for firewalls!)

Actions #1

Updated by Greg Farnum over 8 years ago

  • Subject changed from qa: openstack mdtest connection failures to qa: openstack MPI connection failures

Actions #2

Updated by John Spray over 8 years ago

Those runs are from a couple of weeks ago. Are you sure they aren't from before Loic updated the firewall rules to be more permissive? I can't remember the date that happened. The rules used to open everything up to port 10000, iirc, and were amended to go all the way to 2**16.
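
To make the port-range reasoning concrete, here is a minimal sketch that checks whether the port from the mpiexec error in the description (54948) falls inside a security group range. The helper names `parse_port_range` and `port_allowed` are hypothetical and only illustrate the comparison; they are not part of teuthology or the OpenStack client.

```python
def parse_port_range(spec: str) -> tuple:
    """Parse a 'low:high' port range (or a single port) into two integers."""
    low, _, high = spec.partition(":")
    # A single port such as "53" is treated as the range 53:53.
    return int(low), int(high or low)


def port_allowed(port: int, spec: str) -> bool:
    """Return True if `port` lies inside the inclusive range `spec`."""
    low, high = parse_port_range(spec)
    return low <= port <= high


# Port taken from the "unable to connect to server" lines in the description.
MPI_PROXY_PORT = 54948

for spec in ("1:10000", "1:65535"):
    print(spec, port_allowed(MPI_PROXY_PORT, spec))
```

Under these assumptions the old 1:10000 range excludes port 54948 and the 1:65535 range includes it, which would be consistent with the timeouts if the older rule were still in effect on the lab.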

Actions #3

Updated by Greg Farnum over 8 years ago

No, no I'm not sure. I didn't see any more recent runs that succeeded in the pulpito comparison page, but I might have missed one.

Actions #4

Updated by John Spray over 8 years ago

Here's another more recent one:
teuthology-2015-11-26_18:00:01-fs-hammer---basic-openstack/ ['20901', '20933', '20917', '20885']

I wonder if the fix was just on master and needs backporting.

Actions #5

Updated by John Spray over 8 years ago

Hmm, so the update to firewall rules was in teuthology, and is this:

commit a6e705bc27090c14bcb90c6129970bbd77137977
Author: Loic Dachary <ldachary@redhat.com>
Date:   Mon Oct 19 23:56:37 2015 +0200

    openstack: open ports 1:65356 for all targets

    Signed-off-by: Loic Dachary <ldachary@redhat.com>

diff --git a/teuthology/openstack/__init__.py b/teuthology/openstack/__init__.py
index 8a575cc..d755616 100644
--- a/teuthology/openstack/__init__.py
+++ b/teuthology/openstack/__init__.py
@@ -485,7 +485,7 @@ ssh access   : ssh {identity}{username}@{ip} # logs in /usr/share/nginx/html
         # for the rest.
         misc.sh(""" 
 openstack security group create teuthology
-openstack security group rule create --dst-port 1:10000 teuthology
+openstack security group rule create --dst-port 1:65535 teuthology
 openstack security group rule create --proto udp --dst-port 53 teuthology # dns
         """)

and it is indeed the master branch in use for these tests, so something else must be going on here...

Actions #6

Updated by Greg Farnum over 8 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High
Actions #7

Updated by Loïc Dachary about 8 years ago

  • Status changed from 12 to Resolved
  • Assignee set to Loïc Dachary

The firewall on the OVH lab was configured manually. The code that is quoted is only used when dynamically provisioning a teuthology cluster with teuthology-openstack. I modified the teuthology security group to change the range from 1:10000 to 1:65355.
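
One way to confirm that a change like this took effect is to probe a high port between two targets. The following is a rough sketch, assuming something is listening on the chosen port on the destination host (mpiexec picks its port per run, so a test listener would be needed); `can_connect` is a hypothetical helper, not something teuthology provides.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Try a TCP connection and report whether it succeeded.

    A firewall that silently drops packets shows up as a timeout,
    a closed port as a refused connection; both return False here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Run from one of the other targets against the first host with a port above 10000, a result of `False` after the full timeout would suggest the security group is still filtering that port.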
