Bug #9409
libvirt: QEMU error "Requested operation is not valid: domain is not running" -- cannot undefine transient domain
Status: Closed
Description
I git cloned the latest teuthology, ran ./bootstrap and ran the following command:
./virtualenv/bin/teuthology --lock --machine-type vps --os-type ubuntu --os-version precise ~/tests/test94.yaml
the output was:
./virtualenv/bin/teuthology --lock --machine-type vps --os-type ubuntu --os-version precise ~/tests/test94.yaml
2014-09-09 16:06:03,973.973 WARNING:teuthology.report:No job_id found; not reporting results
2014-09-09 16:06:03,976.976 INFO:teuthology.run:Tasks not found; will attempt to fetch
2014-09-09 16:06:03,976.976 INFO:teuthology.repo_utils:Fetching from upstream into /home/wusui/src/ceph-qa-suite_master
2014-09-09 16:06:04,808.808 INFO:teuthology.repo_utils:Resetting repo at /home/wusui/src/ceph-qa-suite_master to branch master
2014-09-09 16:06:04,823.823 INFO:teuthology.run_tasks:Running task internal.lock_machines...
2014-09-09 16:06:04,824.824 INFO:teuthology.task.internal:Locking machines...
2014-09-09 16:06:06,936.936 INFO:teuthology.provision:Downburst completed on ubuntu@vpm144.front.sepia.ceph.com:
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): ceph.com
2014-09-09 16:06:07,880.880 INFO:teuthology.provision:Downburst completed on ubuntu@vpm115.front.sepia.ceph.com:
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): ceph.com
downburst: Virtual machine with this name exists already: vpm115
2014-09-09 16:06:07,880.880 INFO:teuthology.provision:Guest files exist. Re-creating guest: ubuntu@vpm115.front.sepia.ceph.com
2014-09-09 16:06:08,443.443 ERROR:teuthology.provision:libvir: QEMU error : Requested operation is not valid: domain is not running
libvir: QEMU error : Requested operation is not valid: cannot undefine transient domain
Traceback (most recent call last):
  File "/home/wusui/src/downburst/virtualenv/bin/downburst", line 9, in <module>
    load_entry_point('downburst==0.0.1', 'console_scripts', 'downburst')()
  File "/home/wusui/src/downburst/downburst/cli.py", line 59, in main
    return args.func(args)
  File "/home/wusui/src/downburst/downburst/destroy.py", line 70, in destroy
    | libvirt.VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA,
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1386, in undefineFlags
    if ret == -1: raise libvirtError ('virDomainUndefineFlags() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: cannot undefine transient domain
[the "Re-creating guest" attempt for vpm115 and the identical traceback then repeat twice more, at 16:06:09 and 16:06:10]
^C2014-09-09 16:06:11,598.598 INFO:teuthology.run:Summary data: {owner: wusui@aardvark, success: true}
2014-09-09 16:06:11,598.598 WARNING:teuthology.report:No job_id found; not reporting results
2014-09-09 16:06:11,598.598 INFO:teuthology.run:pass
~/tests/test94.yaml is fairly simple:
roles:
- [mon.a, mds.a, osd.0, osd.1]
- [mon.b, client.0, osd.2, osd.3]
tasks:
- install:
    branch: dumpling
- ceph:
    fs: xfs
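For context on the traceback above: downburst's destroy path calls virDomainUndefineFlags(), which libvirt rejects for a transient (non-persistent) domain because a transient domain has no on-disk definition to remove; it simply vanishes when destroyed. A minimal sketch of a destroy path that tolerates this case follows — this is not the actual downburst destroy.py, just an illustration using the libvirt-python virDomain method names (isActive, isPersistent, destroy, undefineFlags) seen in the traceback:

```python
def destroy_domain(dom):
    """Tear down a libvirt domain, tolerating transient ones.

    `dom` is assumed to expose the libvirt-python virDomain methods
    isActive(), isPersistent(), destroy(), and undefineFlags().
    Returns the list of calls that were actually made, for inspection.
    """
    calls = []
    if dom.isActive():
        dom.destroy()            # stop the running guest
        calls.append('destroy')
    if dom.isPersistent():
        # Only persistent domains have a definition to undefine.
        dom.undefineFlags(0)
        calls.append('undefine')
    # Undefining a transient domain instead raises:
    #   "Requested operation is not valid: cannot undefine transient domain"
    # which matches the error in the log above.
    return calls
```

In the failing run the domain was already not running and apparently transient, so both destroy() and undefineFlags() had nothing valid to do.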
Updated by Anonymous over 9 years ago
I have locked vpm115 and vpm144. I ran this again on another set of vms and things worked fine.
I suspect that this is an issue with these machines. I may have gotten them into a bad state
by locking and freeing them in the past with downburst sometimes available and sometimes not.
I will continue to keep these vms locked, rather than release them back into the wild.
Updated by Sandon Van Ness over 9 years ago
Basically, we should be hiding exceptions that are not fatal and do not prevent the machine from coming up.
Updated by Christina Meno over 9 years ago
I saw something kind of like this in #10140.
Updated by Zack Cerza about 9 years ago
- Status changed from New to Need More Info
I'm strongly leaning toward just disabling this "Re-creating guest" feature, as I've never seen it work. I have, however, seen a lot of this nonsense:
http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-11_17:00:03-upgrade:firefly-firefly-distro-basic-vps/752626/teuthology.log
Sandon, objections? I'm prepared to do the work myself since I'm already working on the relevant functions anyway.
Updated by Sandon Van Ness about 9 years ago
I have a strong objection to disabling that. For whatever reason, on rhel6/centos6 (I haven't seen it on any other distro, including centos7/rhel7) a guest will sometimes simply not boot correctly (it gets a kernel panic on boot). Simply powercycling the guest fixes the issue (or re-creating it does). This was added because, before this code existed, we could get a lot of failures.
Sure, it would be nice to fix the underlying issue, but the fact that it is limited to rhel/centos 6.x makes me think it is some sort of kernel issue that will not be easy to find and fix.
I believe it should attempt it 2 (maybe 3) times, and be a bit smarter about checking which guest is still not up (currently it just checks $upcount/$totalcount and re-creates them all), and mark the guest down with a description like 'teuthology: guest did not boot after X loops' or something.
Look at logs for jobs 750105, 750120, 750121 for this run:
These are all passes that would have otherwise failed had it not been for the 're-creating guest' feature.
It would be awesome if teuthology could then unlock the remaining guests and go back into its 'waiting for machines' state (so the job would not fail due to, say, a failed disk), but that is probably a bit too much to ask.
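The retry behavior suggested above could be sketched roughly as follows. This is a hypothetical illustration, not teuthology code; is_guest_up() and recreate_guest() are placeholder callables, and the point is simply to re-create only the guests that are still down and to stop after a bounded number of loops instead of re-creating everything based on $upcount/$totalcount:

```python
def wait_for_guests(guests, is_guest_up, recreate_guest, max_loops=3):
    """Retry booting guests a bounded number of times.

    Returns the set of guests still down after at most max_loops
    re-create attempts, so the caller can mark them down with a
    message like 'teuthology: guest did not boot after X loops'.
    """
    down = {g for g in guests if not is_guest_up(g)}
    for _loop in range(max_loops):
        if not down:
            break
        for guest in sorted(down):
            recreate_guest(guest)   # only touch guests that failed to boot
        down = {g for g in down if not is_guest_up(g)}
    return down
```

A guest that panics once but boots after a single re-create would no longer drag healthy guests through re-creation with it, and a guest that never comes up is reported instead of being retried forever.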
Updated by Zack Cerza about 9 years ago
- Status changed from Need More Info to In Progress
- Assignee changed from Sandon Van Ness to Zack Cerza
Thanks for the great explanation.
Okay, so I think task.internal.lock_machines(), provision.create_if_vm(), and provision.destroy_if_vm() can be refactored to give up after a few tries.
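One generic shape such a refactor could take — purely a sketch, not the actual teuthology change; provision_fn is a hypothetical placeholder for any of the calls named above — is a bounded-retry wrapper that swallows intermediate failures (per the earlier comment about hiding non-fatal exceptions) and only surfaces an error once all attempts are exhausted:

```python
class ProvisionFailed(Exception):
    """Raised once all provisioning attempts are exhausted."""

def with_retries(provision_fn, tries=3):
    """Call provision_fn() up to `tries` times.

    Intermediate exceptions are retried; only the final failure is
    surfaced. A real implementation would whitelist specific
    retryable exception types rather than catching everything.
    """
    last_exc = None
    for _attempt in range(tries):
        try:
            return provision_fn()
        except Exception as exc:
            last_exc = exc
    raise ProvisionFailed('gave up after %d tries: %r' % (tries, last_exc))
```

Wrapping the create/destroy paths this way would replace the unbounded "Re-creating guest" loop seen in the original log with a few attempts and a clean failure.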
Updated by Zack Cerza about 9 years ago
- Status changed from In Progress to Fix Under Review
Updated by Zack Cerza about 9 years ago
- Status changed from Fix Under Review to Resolved