Bug #13876
closed
qa: openstack MPI connection failures
Added by Greg Farnum over 8 years ago.
Updated about 8 years ago.
Description
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2015-11-12_18:00:02-fs-hammer---basic-openstack/14418/
2015-11-12T23:32:58.272 INFO:teuthology.orchestra.run.target088041:Running: 'mpiexec -f /home/ubuntu/cephtest/mpi-hosts /home/ubuntu/cephtest/mdtest-1.9.3/mdtest -d /home/ubuntu/cephtest/gmnt -I 20 -z 5 -b 2 -R'
2015-11-12T23:32:58.604 INFO:teuthology.orchestra.run.target088041.stderr:Warning: Permanently added '158.69.88.43' (ECDSA) to the list of known hosts.
2015-11-12T23:32:58.606 INFO:teuthology.orchestra.run.target088041.stderr:Warning: Permanently added '158.69.88.40' (ECDSA) to the list of known hosts.
2015-11-12T23:35:07.155 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:1@target088043.ovh.sepia.ceph.com] HYDU_sock_connect (./utils/sock/sock.c:174): unable to connect from "target088043.ovh.sepia.ceph.com" to "158.69.88.41" (Connection timed out)
2015-11-12T23:35:07.155 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:1@target088043.ovh.sepia.ceph.com] main (./pm/pmiserv/pmip.c:189): unable to connect to server 158.69.88.41 at port 54948 (check for firewalls!)
2015-11-12T23:35:07.207 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:2@target088040.ovh.sepia.ceph.com] HYDU_sock_connect (./utils/sock/sock.c:174): unable to connect from "target088040.ovh.sepia.ceph.com" to "158.69.88.41" (Connection timed out)
2015-11-12T23:35:07.207 INFO:teuthology.orchestra.run.target088041.stderr:[proxy:0:2@target088040.ovh.sepia.ceph.com] main (./pm/pmiserv/pmip.c:189): unable to connect to server 158.69.88.41 at port 54948 (check for firewalls!)
- Subject changed from qa: openstack mdtest connection failures to qa: openstack MPI connection failures
Those runs are from a couple of weeks ago. Are you sure they aren't from before Loic updated the firewall rules to be more permissive? I can't remember the date that happened. IIRC the rules used to open everything up to port 10000, and were amended to go all the way to 2**16.
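To make the reasoning concrete, here is a minimal sketch that checks the port from the log (54948) against the old and new ranges. The `LOW:HIGH` string form mirrors the `--dst-port` argument of the security group rule; the helper names `parse_port_range` and `port_allowed` are hypothetical and not part of teuthology. Note that the highest valid TCP port is 65535 (2**16 - 1).

```python
def parse_port_range(spec):
    """Parse a 'LOW:HIGH' port range string into an inclusive (low, high) tuple."""
    low, high = spec.split(":")
    return int(low), int(high)


def port_allowed(port, spec):
    """Return True if `port` falls inside the inclusive range described by `spec`."""
    low, high = parse_port_range(spec)
    return low <= port <= high


# Port reported by the Hydra proxy in the log above.
MPI_PORT = 54948

for spec in ("1:10000", "1:65535"):
    print(spec, "allows" if port_allowed(MPI_PORT, spec) else "blocks", MPI_PORT)
```

Under the old 1:10000 rule the port is outside the range, which would be consistent with the "Connection timed out" errors, assuming that rule was the one in effect on the targets.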
No, no I'm not sure. I didn't see any more recent runs that succeeded in the pulpito comparison page, but I might have missed one.
Here's another more recent one:
teuthology-2015-11-26_18:00:01-fs-hammer---basic-openstack/ ['20901', '20933', '20917', '20885']
I wonder if the fix was just on master and needs backporting.
Hmm, so the update to firewall rules was in teuthology, and is this:
commit a6e705bc27090c14bcb90c6129970bbd77137977
Author: Loic Dachary <ldachary@redhat.com>
Date: Mon Oct 19 23:56:37 2015 +0200
openstack: open ports 1:65356 for all targets
Signed-off-by: Loic Dachary <ldachary@redhat.com>
diff --git a/teuthology/openstack/__init__.py b/teuthology/openstack/__init__.py
index 8a575cc..d755616 100644
--- a/teuthology/openstack/__init__.py
+++ b/teuthology/openstack/__init__.py
@@ -485,7 +485,7 @@ ssh access : ssh {identity}{username}@{ip} # logs in /usr/share/nginx/html
# for the rest.
misc.sh("""
openstack security group create teuthology
-openstack security group rule create --dst-port 1:10000 teuthology
+openstack security group rule create --dst-port 1:65535 teuthology
openstack security group rule create --proto udp --dst-port 53 teuthology # dns
""")
and it is indeed the master branch in use for these tests, so something else must be going on here...
- Status changed from New to 12
- Priority changed from Normal to High
- Status changed from 12 to Resolved
- Assignee set to Loïc Dachary
The firewall on the OVH lab was configured manually. The code that is quoted is only used when dynamically provisioning a teuthology cluster with teuthology-openstack. I modified the teuthology security group to change the range from 1:10000 to 1:65355.
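For anyone wanting to confirm a change like this from another target, a rough sketch of a TCP reachability check follows. The function name `can_connect` is hypothetical. Since the mpiexec server port is chosen dynamically on each run, this only shows whether one given high port is reachable while something is listening on it; it does not reproduce the mdtest run.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (e.g. packets dropped by a firewall) and refused connections.
        return False
```

For example, `can_connect("158.69.88.41", 54948)` run from one of the other targets would return False both when a firewall drops the traffic and when nothing is listening on that port, so a listener needs to be running on the server side for the result to be meaningful.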