Feature #6502: provision targets using OpenStack

Added by Loïc Dachary over 10 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
% Done: 0%
Source: other
Tags: -
Backport: -
Reviewed: -
Affected Versions: -

Description

Instead of relying exclusively on pre-provisioned targets, teuthology could use the OpenStack API to allocate targets dynamically.

See https://github.com/ceph/teuthology/pull/592 and https://github.com/dachary/teuthology/tree/wip-6502-openstack-v3/


Files

test-result.txt (128 KB): successful integration test output (Loïc Dachary, 06/20/2015 07:44 PM)
tox-results.txt (121 KB): results of tox -e openstack-integration (Loïc Dachary, 06/24/2015 12:12 AM)
upgrade-output.txt.gz (448 KB): successful upgrade job output (Loïc Dachary, 06/24/2015 04:14 PM)

Related issues (3): 1 open, 2 closed

Blocked by teuthology - Bug #12250: ansible /etc/fstab regexp fails on OpenStack instances (Resolved, Andrew Schoen, 07/08/2015)

Blocked by teuthology - Feature #12256: ansible should fallback to sensible defaults (Resolved, Andrew Schoen, 07/09/2015)

Blocks teuthology - Feature #12295: teuthology integration tests (New)
Actions #1

Updated by Loïc Dachary over 10 years ago

  • Subject changed from provision targets using EC2 to provision targets using a cloud API
  • Description updated (diff)
Actions #2

Updated by Loïc Dachary over 10 years ago

Maybe use OpenStack Heat?

Actions #3

Updated by Loïc Dachary almost 9 years ago

  • Status changed from New to In Progress
  • Assignee set to Loïc Dachary
  • Priority changed from Low to Urgent
Actions #4

Updated by Loïc Dachary almost 9 years ago

Actions #5

Updated by Loïc Dachary almost 9 years ago

<loicd> zackc: I'd like to understand how teuthology waits for a vps to be ready to answer ssh requests. Do you know where I should look to understand that ? 
<zackc> loicd: i'm actually typing out a reply to your semi-related email right now! let me see if i can find that codepath real quick
<loicd> ah cool
<zackc> loicd: it works by using ssh-keyscan
<zackc> you'll see calls to lock.do_update_keys() and lock.ssh_keyscan() in lock.py and task/internal.py
-*- loicd exploring
<loicd> zackc: thanks for the pointer
<zackc> loicd: np! also, a point i'm going to mention in the response is using provision.Downburst as a starting point and maybe implementing a similar class for OpenStack
<zackc> it occurs to me that maybe that ssh-keyscan magic should be rolled into a Downburst.is_ready() function - or something with a more thought-out name
<zackc> i'd like to move toward using some sort of more-general API for things like this so that it's less confusing when we add things like OpenStack and bare-metal provisioning using Cobbler (i have a wip branch for the latter)
<loicd> zackc: I'm not sure how to decide whether to use Downburst or OpenStack. A hack would be to decide based on the machine name and have openstack machine names be openstackXXX or something. But I feel you will frown upon that and I'd like something cleaner.
<zackc> loicd: i was actually going to suggest 'openstack' as a machine type
<zackc> i'm not sure how 'vps'/'vpm' was chosen for downburst
<zackc> we could also use a random machine_type and add a field to Paddles; e.g. provision_backend
<zackc> (normally i'd think harder about the names i'm suggesting)
<loicd> openstack as a machine_type works for me
<sage> loicd: zackc: a machine_type seems a bit off because that means the instances are precreated, though, right?  it seems like ideally we want to tell teuthology to create/destroy nova instances on the fly
<zackc> maybe i misunderstand the plan. if it's not to create/destroy, what is it?
<loicd> sage: but the instance names are pre-created in paddles, even for VPS, right?
<zackc> they are
<loicd> not the instance themselves, but the slots are reserved in paddles
<sage> we did the precreation for vps to minimize the effort involved.  not sure it's the cleanest approach
<loicd> sage: I'm going to create the VM when it's locked, not when the name is reserved
<loicd> i.e. https://github.com/ceph/teuthology/blob/master/teuthology/provision.py#L54 will create the openstack vm 
<loicd> and
<loicd> https://github.com/ceph/teuthology/blob/master/teuthology/provision.py#L111
<sage> yeah, that makes the current locking effectively throttle the number of instances we can use, and probably minimizes the changes
<loicd> will destroy it
<sage> anyway i'll leave it to you folks, no strong opinions
<zackc> it also seems like something we might be able to implement in phases
<zackc> i.e. follow the current pattern for now and potentially do what sage is suggesting later
<sage> yeah.  perfect is the enemy of good :)
<loicd> zackc: limiting the number of instances via pre-allocated names in paddles is actually comfortable, I think :-D
<zackc> yeah, to do this other work we'd have to come up with some other way to answer "how many instances do we potentially have" 
<loicd> zackc: about the VM installation, I'm aiming at creating a config drive (  http://cloudinit.readthedocs.org/en/latest/topics/datasources.html#config-drive ) to install the necessary packages since it's all we seem to need. And I know very little about chef. And it's being replaced by ansible anyway. Do you think that may be a problem ? 
<loicd> the good thing being that the config drive can conveniently be changed at a later time to delegate to ansible instead of doing things itself
<zackc> i'm not clear on exactly what you aim to do with the config drive
<zackc> the only thing blocking the chef->ansible migration atm is RHEL entitlements in sepia (and additional testing)
<zackc> outside of that it should actually be feasible to use teuthology.task.ansible.cephlab
<zackc> we will of course need the proper key in ~ubuntu/.ssh/authorized_keys for that (or anything) to work
<loicd> the config drive will apt-get install packages after the machine is created. 
<zackc> it might be work that you just don't have to do though
<loicd> actually it will install packages in an operating system agnostic way
<loicd> zackc: I should use ansible instead right away ? 
<loicd> openstack installs the ssh key, it's automagic and we have nothing to do in that regard
<loicd> we can assume ssh access once the machine is booted
<loicd> AND the ssh server is up
<zackc> if you are ok with not being able to use RHEL for now, that might be the best thing
<loicd> I don't think rhel will be an issue right now
<zackc> task.ansible uses the "new-style" task implementation so you can even use its tasks as context managers
<loicd> zackc: the issue is that I have no clue how to tell ansible : do the right thing on this target
<loicd> zackc: does that mean task.ansible needs to be activated somewhere ? 
<zackc> the CephLab task is designed to do exactly that
<zackc> https://github.com/ceph/teuthology/commit/59b7c0aeb4ef69132cfc6ee3251ba819a830210a#diff-2313ccf8d658fc1076ca2c2e5c66f855R743
<zackc> ^ a commit in wip-drop-chef that is not merged into master
<zackc> but CephLab itself is in master
<loicd> zackc: should I wait for it to be committed or is it going to take time?
<zackc> hmm. i'm unsure. i need to talk to some RH people about logistics of getting RHEL entitlements in sepia before it can be merged
<zackc> there are only two commits in that branch
<zackc> just thinking aloud, i wonder if it might be reasonable to just have you cherry-pick those commits into your branch for now, and remove/rebase once the transition is done
<zackc> i certainly won't demand that you go that route though :)
<zackc> on the other hand if it's not a ton of work to go your original route, that might be just fine as well
<zackc> but at some point we'd of course want to get it to use ansible
<loicd> zackc: I'll give it a try
<loicd> https://github.com/ceph/ceph-cm-ansible/blob/master/tools/vmlist.py#L128 what is this ? 
<loicd> zackc: ^
<zackc> that's a handy tool dmick wrote, i don't actually know too much about it
<loicd> ok
<loicd> I was kind of hopeful to find some embedded ansible knowledge around it :-D
<zackc> ah i think those are unrelated
<dmick> it's just in ceph-cm-ansible because "a lab machine tool" 
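
The pattern discussed above (create the instance when its slot is locked, destroy it when unlocked, readiness check via ssh-keyscan) can be sketched as follows. This is a minimal illustration with hypothetical names and signatures, not teuthology's actual provision.py API:

    import subprocess
    import time

    class OpenStackProvisioner(object):
        """Hypothetical OpenStack counterpart to provision.Downburst:
        created when a paddles slot is locked, destroyed on unlock."""

        def __init__(self, name, image, flavor):
            self.name = name
            self.image = image
            self.flavor = flavor

        def create(self):
            # shell out to the unified OpenStack CLI; --wait blocks until
            # the instance is ACTIVE (but not necessarily reachable)
            subprocess.check_call([
                'openstack', 'server', 'create', '--wait',
                '--image', self.image, '--flavor', self.flavor, self.name])

        def destroy(self):
            subprocess.check_call(['openstack', 'server', 'delete', self.name])

        def is_ready(self, timeout=300):
            # same idea as lock.ssh_keyscan(): the instance is usable once
            # sshd answers with a host key
            end = time.time() + timeout
            while time.time() < end:
                try:
                    keys = subprocess.check_output(['ssh-keyscan', self.name])
                except subprocess.CalledProcessError:
                    keys = ''
                if keys.strip():
                    return True
                time.sleep(5)
            return False
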
Actions #6

Updated by Loïc Dachary almost 9 years ago

Actions #7

Updated by Loïc Dachary almost 9 years ago

  • set up dnsmasq as a local DNS server to name all machines in advance, using IPs from the desired subnet (see the sketch after this list)
    $ neutron subnet-list                                                               
    +--------------------------------------+---------------+-------------+--------------------------------------------+                                                                                                 
    | id                                   | name          | cidr        | allocation_pools                           |                                                                                                 
    +--------------------------------------+---------------+-------------+--------------------------------------------+
    | 145dd132-3016-4607-a0ad-a58d569e53fa | fsf-lan       | 10.0.3.0/24 | {"start": "10.0.3.2", "end": "10.0.3.254"} |
    
    
  • when creating the VM the IP can be set explicitly
        import io  # needed for the userdata file handle below
        server = client_manager.compute.servers.create(
            'test', image, flavor, key_name=keypair.name,
            nics=[{'net-id': network.id, 'v4-fixed-ip': '10.0.3.51'}],
            userdata=io.open('user-data.txt'))
    
  • this however assumes that no other tenant on the same subnet will grab this IP. Instead, the IP should be reserved permanently using
        n = client_manager.network
        net = n.list_networks(name='fsf-lan')
        print str(net)
        network_id = net['networks'][0]['id']
        body_value = {
            "port": {
                "admin_state_up": True,
                # "device_id": server_id,  # optionally bind the port to a server
                "name": "port1",
                "network_id": network_id,
            }
        }
        p = n.create_port(body_value)
        print str(p)
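
A hypothetical dnsmasq snippet for the naming scheme in the first bullet above (host names, domain and addresses are illustrative, not taken from this ticket):

    # /etc/dnsmasq.d/teuthology.conf
    # one host-record per pre-allocated slot in the 10.0.3.0/24 subnet
    host-record=target000001.teuthology,10.0.3.50
    host-record=target000002.teuthology,10.0.3.51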
    
Actions #8

Updated by Loïc Dachary almost 9 years ago

Setting the resolver of the virtual machine to a designated name server where the names have been predefined should be done with the manage_resolv_conf cloud-init plugin, but for some reason that does not work. A workaround such as

#cloud-config
bootcmd:
 - echo "nameserver 10.0.3.31" | sudo tee -a /etc/resolvconf/resolv.conf.d/head
 - sudo resolvconf -u

can be used instead.
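
For reference, the intended (non-working) configuration would be along these lines; this is an assumption based on cloud-init's documented resolv_conf module, not a snippet from this ticket:

#cloud-config
manage_resolv_conf: true
resolv_conf:
  nameservers: ['10.0.3.31']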

Actions #9

Updated by Loïc Dachary almost 9 years ago

  • Subject changed from provision targets using a cloud API to provision targets using a OpenStack
  • Description updated (diff)
Actions #10

Updated by Loïc Dachary almost 9 years ago

Figured out that creating a network dedicated to teuthology would make things simpler (see the sketch after this list):

  • no other tenant can use the IPs of the subnet, and we do not need http://dachary.org/?p=3702
  • we can choose a subnet that does not conflict with an existing teuthology cluster and interconnect the two via a VPN
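
A possible way to create such a dedicated network with the 2015-era neutron CLI (network name and CIDR are illustrative):

    neutron net-create teuthology
    neutron subnet-create --name teuthology-subnet teuthology 10.50.0.0/24
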
Actions #11

Updated by Loïc Dachary almost 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #12

Updated by Loïc Dachary almost 9 years ago

  • File test-result.txt added

Attaching the output of the integration tests, for comparison when running against another OpenStack cluster.

Actions #13

Updated by Loïc Dachary almost 9 years ago

wget http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2
glance add name="centos-7" disk_format=qcow2 container_format=bare < CentOS-7-x86_64-GenericCloud.qcow2
glance image-update --is-public True centos-7
Actions #14

Updated by Loïc Dachary almost 9 years ago

  • File deleted (test-result.txt)
Actions #15

Updated by Loïc Dachary almost 9 years ago

  • File test-result.txt added

update test results with one that passes on ubuntu-14.04 and centos-7

Actions #16

Updated by Loïc Dachary almost 9 years ago

successfully scheduled & ran an empty job with --machine-type openstack

Actions #17

Updated by Loïc Dachary almost 9 years ago

  • Subject changed from provision targets using a OpenStack to provision targets using OpenStack
Actions #18

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Fix Under Review to In Progress

will propose for review at a later time, when all preliminary pull requests have been merged.

Actions #21

Updated by Loïc Dachary almost 9 years ago

fixes so that teuthology-suite works

Actions #23

Updated by Loïc Dachary almost 9 years ago

  • File deleted (test-result.txt)
Actions #25

Updated by Loïc Dachary almost 9 years ago

  • Description updated (diff)
Actions #26

Updated by Loïc Dachary almost 9 years ago

Trying

./virtualenv/bin/teuthology-suite --filter='upgrade:firefly/newer/{4-finish-upgrade.yaml 0-cluster/start.yaml 1-install/v0.80.8.yaml 2-workload/{blogbench.yaml rbd.yaml s3tests.yaml testrados.yaml} 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 5-final/{monthrash.yaml osdthrash.yaml rbd.yaml testrgw.yaml} distros/ubuntu_14.04.yaml}' --suite upgrade/firefly --suite-branch firefly --machine-type openstack --ceph firefly

ran into a utf-8 decoding bug

Try on the sepia lab with

./virtualenv/bin/teuthology-suite --priority 101 --filter='upgrade:firefly/newer/{4-finish-upgrade.yaml 0-cluster/start.yaml 1-install/v0.80.8.yaml 2-workload/{blogbench.yaml rbd.yaml s3tests.yaml testrados.yaml} 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 5-final/{monthrash.yaml osdthrash.yaml rbd.yaml testrgw.yaml} distros/ubuntu_14.04.yaml}' --suite upgrade/firefly --suite-branch firefly --machine-type vps --ceph firefly

  • failed because of environmental issues on sepia
  • failed again because of environmental issues on sepia

Gave up on this attempt because the environmental issue is persistent.

Actions #27

Updated by Loïc Dachary almost 9 years ago

explored http://www.openstack.org/marketplace/public-clouds/ and found http://entercloudsuite.com/ which provides the neutron API and can create networks. Verified with Nathan that his cloud also does. Verified that OS1 does not provide the feature.

Actions #28

Updated by Loïc Dachary almost 9 years ago

Trying

./virtualenv/bin/teuthology-suite --priority 101 --filter 'upgrade:firefly-x/stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/ec-rados-default.yaml 6-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-rados-plugin=jerasure-k=3-m=1.yaml distros/ubuntu_14.04.yaml}' -c hammer -k distro -m vps -s upgrade/firefly-x

  • failed because of environmental issues on sepia

It looks like this is a problem with the ubuntu mirror; all ubuntu upgrade tests are going to fail.

Actions #29

Updated by Loïc Dachary almost 9 years ago

./virtualenv/bin/teuthology-suite --priority 101 --filter 'upgrade:firefly-x/stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/ec-rados-default.yaml 6-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-rados-plugin=jerasure-k=3-m=1.yaml distros/centos_6.5.yaml}' -c hammer -k distro -m vps -s upgrade/firefly-x
./virtualenv/bin/teuthology-suite --filter 'upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/{ec-rados-parallel.yaml rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml} 3-upgrade-sequence/upgrade-all.yaml 4-final-workload/{rados-snaps-few-objects.yaml rados_loadgenmix.yaml rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_swift.yaml} distros/debian_7.0.yaml}' -c hammer -k distro -m vps -s upgrade/firefly-x
  • failed on sepia (not sure why, it shows lots of errors)
Actions #30

Updated by Loïc Dachary almost 9 years ago

Trying to re-run the successful test from yesterday: http://pulpito.ceph.com/teuthology-2015-06-22_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/944390/

./virtualenv/bin/teuthology-suite --filter 'upgrade:firefly-x/stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/ec-rados-default.yaml 6-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-rados-plugin=jerasure-k=3-m=1.yaml distros/debian_7.0.yaml}' -c hammer -k distro -m vps -s upgrade/firefly-x
Actions #31

Updated by Loïc Dachary almost 9 years ago

Integration tests run successfully on http://entercloudsuite.com/, see the output at tox-results.txt

Actions #34

Updated by Loïc Dachary almost 9 years ago

The ssh server must be configured to allow more than 10 simultaneous sessions, otherwise runs fail with:

  File "/home/ubuntu/teuthology/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 363, in exec_command
    chan = self._transport.open_session()
  File "/home/ubuntu/teuthology/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 658, in open_session
    return self.open_channel('session')
  File "/home/ubuntu/teuthology/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 755, in open_channel
    raise e
ChannelException: Administratively prohibited

The chef run (which is not executed for openstack) takes care of that: https://github.com/ceph/ceph-qa-chef/blob/master/cookbooks/ceph-qa/recipes/ubuntu.rb#L800
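
The OpenSSH default is MaxSessions 10, which matches the failure above; raising it in sshd_config would look like this (the exact value is an assumption, not taken from the chef recipe):

    # /etc/ssh/sshd_config
    MaxSessions 1000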

Actions #35

Updated by Loïc Dachary almost 9 years ago

./virtualenv/bin/teuthology-suite --priority 101 --filter 'upgrade:firefly-x/stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/ec-rados-default.yaml 6-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-rados-plugin=jerasure-k=3-m=1.yaml distros/ubuntu_14.04.yaml}' -c hammer -k distro -m openstack -s upgrade/firefly-x
Actions #37

Updated by Loïc Dachary almost 9 years ago

teuthology-suite --priority 101 --filter ubuntu_14.04 -c hammer -k distro -m openstack -s upgrade/firefly-x
Actions #38

Updated by Loïc Dachary almost 9 years ago

Running upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/{ec-rados-parallel.yaml rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml} 3-upgrade-sequence/upgrade-all.yaml 4-final-workload/{rados-snaps-few-objects.yaml rados_loadgenmix.yaml rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_swift.yaml} distros/ubuntu_14.04.yaml}

2015-06-24 06:58:04,835.835 INFO:tasks.workunit.client.4.entercloudsuite012.stderr:/home/ubuntu/cephtest/workunit.client.4/rbd/test_librbd_python.sh: 7: /home/ubuntu/cephtest/workunit.client.4/rbd/test_librbd_python.sh: nosetests: not found
2015-06-24 06:58:04,835.835 INFO:tasks.workunit:Stopping ['rbd/test_librbd_python.sh'] on client.4...

looks like nodes are expected to have nosetests pre-installed?
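
If so, the config-drive approach from comment #5 could pre-install it. A minimal user-data sketch (assuming an Ubuntu image, where the python-nose package provides nosetests):

#cloud-config
packages:
 - python-nose
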
Actions #39

Updated by Loïc Dachary almost 9 years ago

  • Description updated (diff)
Actions #40

Updated by Loïc Dachary almost 9 years ago

Reworked the implementation entirely so that it does not require the neutron API, because most OpenStack clusters do not have it. Integration tests ran successfully on OVH and the internal Red Hat cluster.

Actions #41

Updated by Loïc Dachary almost 9 years ago

  • Description updated (diff)
Actions #42

Updated by Loïc Dachary almost 9 years ago

handle lab_domain instead of ignoring it

Actions #43

Updated by Loïc Dachary almost 9 years ago

The cloud-init regular expression failed to capture the case when the instance ID is automatically appended to the hostname (with --num 3 for instance). As a result the name is longer than 80 chars and triggers bug http://tracker.ceph.com/issues/12205. Fixed the regular expression.

Actions #44

Updated by Loïc Dachary almost 9 years ago

Do not deactivate chef: there is a lot more to it than installing a few additional packages, and it breaks in various mysterious ways.

Actions #45

Updated by Loïc Dachary almost 9 years ago

An active instance may not have an IP assigned to it yet, i.e. openstack server create --wait does not wait for an IP to be assigned to the instance. Retry a few times until the IP is assigned, instead of assuming it is always available.
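
A minimal sketch of such a retry loop, as a hypothetical helper built on the openstack CLI (the attempt count and delay are illustrative):

    import json
    import subprocess
    import time

    def wait_for_ip(name, attempts=30, delay=10):
        # `openstack server create --wait` returns once the instance is
        # ACTIVE, but its addresses field may still be empty for a while
        for _ in range(attempts):
            out = subprocess.check_output(
                ['openstack', 'server', 'show', '-f', 'json', name])
            addresses = json.loads(out)['addresses']
            if addresses:
                return addresses
            time.sleep(delay)
        raise RuntimeError('no IP assigned to ' + name)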

Actions #46

Updated by Loïc Dachary almost 9 years ago

teuthology-suite --filter='upgrade:firefly/singleton/versions-steps/{versions-steps.yaml distros/ubuntu_14.04.yaml}' --suite upgrade/firefly --suite-branch wip-chef-firefly --machine-type openstack --ceph firefly

passed
Actions #47

Updated by Loïc Dachary almost 9 years ago

Running

teuthology-suite --filter=ubuntu_14.04 --suite upgrade/firefly --suite-branch wip-chef-firefly --machine-type openstack --ceph firefly

Actions #48

Updated by Loïc Dachary almost 9 years ago

Actions #49

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Fix Under Review to 12
  • Assignee deleted (Loïc Dachary)
Actions #50

Updated by Loïc Dachary almost 9 years ago

switching to ansible generated problems I can't deal with

Actions #51

Updated by Loïc Dachary almost 9 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Loïc Dachary
Actions #52

Updated by Loïc Dachary almost 9 years ago

  • Status changed from In Progress to Fix Under Review

Now that it actually works, it's time for review :-)

Actions #53

Updated by Loïc Dachary over 8 years ago

  • Description updated (diff)
Actions #54

Updated by Loïc Dachary over 8 years ago

  • Description updated (diff)
Actions #55

Updated by Loïc Dachary over 8 years ago

  • Status changed from Fix Under Review to Resolved