Bug #9409

libvirt: QEMU errors: operation is not valid: domain is not running -- cannot undefine transient domain

Added by Anonymous over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

I cloned the latest teuthology from git, ran ./bootstrap, and then ran the following command:

./virtualenv/bin/teuthology --lock --machine-type vps --os-type ubuntu --os-version precise ~/tests/test94.yaml

The output was:

./virtualenv/bin/teuthology --lock --machine-type vps --os-type ubuntu --os-version precise ~/tests/test94.yaml
2014-09-09 16:06:03,973.973 WARNING:teuthology.report:No job_id found; not reporting results
2014-09-09 16:06:03,976.976 INFO:teuthology.run:Tasks not found; will attempt to fetch
2014-09-09 16:06:03,976.976 INFO:teuthology.repo_utils:Fetching from upstream into /home/wusui/src/ceph-qa-suite_master
2014-09-09 16:06:04,808.808 INFO:teuthology.repo_utils:Resetting repo at /home/wusui/src/ceph-qa-suite_master to branch master
2014-09-09 16:06:04,823.823 INFO:teuthology.run_tasks:Running task internal.lock_machines...
2014-09-09 16:06:04,824.824 INFO:teuthology.task.internal:Locking machines...
2014-09-09 16:06:06,936.936 INFO:teuthology.provision:Downburst completed on ubuntu@vpm144.front.sepia.ceph.com: INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): ceph.com

2014-09-09 16:06:07,880.880 INFO:teuthology.provision:Downburst completed on ubuntu@vpm115.front.sepia.ceph.com: INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): ceph.com
downburst: Virtual machine with this name exists already: vpm115

2014-09-09 16:06:07,880.880 INFO:teuthology.provision:Guest files exist. Re-creating guest: ubuntu@vpm115.front.sepia.ceph.com
2014-09-09 16:06:08,443.443 ERROR:teuthology.provision:libvir: QEMU error : Requested operation is not valid: domain is not running
libvir: QEMU error : Requested operation is not valid: cannot undefine transient domain
Traceback (most recent call last):
  File "/home/wusui/src/downburst/virtualenv/bin/downburst", line 9, in <module>
    load_entry_point('downburst==0.0.1', 'console_scripts', 'downburst')()
  File "/home/wusui/src/downburst/downburst/cli.py", line 59, in main
    return args.func(args)
  File "/home/wusui/src/downburst/downburst/destroy.py", line 70, in destroy
    | libvirt.VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA,
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1386, in undefineFlags
    if ret == -1: raise libvirtError ('virDomainUndefineFlags() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: cannot undefine transient domain

[The same "Re-creating guest" attempt and "cannot undefine transient domain" traceback repeat twice more, at 16:06:09 and 16:06:10.]

^C2014-09-09 16:06:11,598.598 INFO:teuthology.run:Summary data:
{owner: wusui@aardvark, success: true}

2014-09-09 16:06:11,598.598 WARNING:teuthology.report:No job_id found; not reporting results
2014-09-09 16:06:11,598.598 INFO:teuthology.run:pass

~/tests/test94.yaml is fairly simple:

roles:
- [mon.a, mds.a, osd.0, osd.1,]
- [mon.b, client.0, osd.2, osd.3,]
tasks:
- install:
   branch: dumpling
- ceph:
   fs: xfs
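
For context on the traceback above: a transient libvirt domain exists only while it is running and has no persistent configuration, so virDomainUndefineFlags() is an invalid operation on it. Here is a minimal sketch of the distinction -- not downburst's actual code; the qemu:///system URI is an assumption, and vpm115 is just the guest name from the log:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vpm115')

# Record persistence *before* destroying: destroying a transient domain
# removes it entirely, because it has no on-disk definition to fall back to.
persistent = dom.isPersistent()

if dom.isActive():
    dom.destroy()

if persistent:
    # Only a persistent domain has a configuration to undefine. Calling
    # undefineFlags() on a transient one raises VIR_ERR_OPERATION_INVALID
    # ("cannot undefine transient domain"), which is the error in the log.
    dom.undefineFlags(
        libvirt.VIR_DOMAIN_UNDEFINE_MANAGED_SAVE
        | libvirt.VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA,
    )

conn.close()
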
Update #1 by Anonymous over 9 years ago

I have locked vpm115 and vpm144. I ran this again on another set of vms and things worked fine.

I suspect that this is an issue with these machines. I may have gotten them into a bad state
by locking and freeing them in the past, with downburst available sometimes and not others.

I will continue to keep these vms locked, rather than release them back into the wild.

Update #2 by Zack Cerza over 9 years ago

  • Assignee set to Sandon Van Ness
Update #3 by Sandon Van Ness over 9 years ago

Basically, we should be hiding exceptions that are not fatal and do not prevent the machine from coming up.
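
A minimal sketch of that idea, assuming we wrap the undefine step and treat the "operation is not valid" case as non-fatal; undefine_quietly is a hypothetical helper, not an existing downburst function:

import logging

import libvirt

log = logging.getLogger(__name__)

def undefine_quietly(dom):
    # Hypothetical helper: swallow the non-fatal "cannot undefine transient
    # domain" error, since the guest is already gone at that point, but let
    # anything unexpected propagate.
    try:
        dom.undefineFlags(
            libvirt.VIR_DOMAIN_UNDEFINE_MANAGED_SAVE
            | libvirt.VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA,
        )
    except libvirt.libvirtError as err:
        if err.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
            log.debug('Ignoring non-fatal libvirt error: %s', err)
        else:
            raise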

Update #4 by Christina Meno over 9 years ago

I saw something similar to this in #10140.

Update #5 by Zack Cerza about 9 years ago

  • Status changed from New to Need More Info

I'm strongly leaning toward just disabling this "Re-creating guest" feature, as I've never seen it work. I have, however, seen a lot of this nonsense:
http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-11_17:00:03-upgrade:firefly-firefly-distro-basic-vps/752626/teuthology.log

Sandon, objections? I'm prepared to do the work myself since I'm already working on the relevant functions anyway.

Update #6 by Sandon Van Ness about 9 years ago

I have a strong objection to disabling that. For whatever reason, on rhel6/centos6 (I haven't seen it on any other distro, including centos7/rhel7) a guest sometimes simply will not boot correctly (it gets a kernel panic on boot). Simply power-cycling the guest fixes the issue (as does re-creating it). This was added because originally, before this code existed, we got a lot of failures.

Sure, it would be nice to fix the underlying issue, but the fact that it is limited to RHEL/CentOS 6.x makes me think it is some sort of kernel issue, and not going to be an easy one to find and fix.

I believe it should attempt it 2 (maybe 3) times, be a bit smarter about checking which guest is still not up (currently it just checks $upcount/$totalcount and re-creates them all), and mark the guest down with a description like 'teuthology: guest did not boot after X loops' or something.
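
A rough sketch of that retry shape; is_up, recreate, and mark_down are hypothetical callables standing in for whatever teuthology actually uses, passed in as arguments so nothing here pretends to be a real teuthology API:

def ensure_guests_up(guests, is_up, recreate, mark_down, max_tries=3):
    # Check each guest individually instead of only comparing
    # $upcount/$totalcount, and re-create just the ones still down.
    for _ in range(max_tries):
        down = [g for g in guests if not is_up(g)]
        if not down:
            return []
        for guest in down:
            recreate(guest)
    # Give up on whatever never came up, leaving a breadcrumb that
    # explains why the guest was marked down.
    still_down = [g for g in guests if not is_up(g)]
    for guest in still_down:
        mark_down(guest,
                  'teuthology: guest did not boot after %d loops' % max_tries)
    return still_down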

Look at logs for jobs 750105, 750120, 750121 for this run:

http://pulpito.front.sepia.ceph.com/teuthology-2015-02-11_01:10:02-ceph-deploy-firefly-distro-basic-vps/

These are all passes that would have otherwise failed had it not been for the 're-creating guest' feature.

It would be awesome if teuthology could then unlock the remaining guests and go back into its 'waiting for machines' state (so the job would not fail due to, say, a failed disk), but that is probably a bit too much to ask.

Update #7 by Zack Cerza about 9 years ago

  • Status changed from Need More Info to In Progress
  • Assignee changed from Sandon Van Ness to Zack Cerza

Thanks for the great explanation.

Okay, so I think task.internal.lock_machines(), provision.create_if_vm() and provision.destroy_if_vm() can be refactored to give up after a few tries.
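
A hedged sketch of what "give up after a few tries" could look like as a small wrapper; give_up_after is hypothetical, and it assumes the wrapped provisioning call returns something truthy on success:

import functools
import time

def give_up_after(tries=3, delay=10):
    # Hypothetical bounded-retry decorator for calls like create_if_vm()
    # or destroy_if_vm(): retry a few times, then return False instead of
    # looping forever on a guest that will never come up.
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                if fn(*args, **kwargs):
                    return True
                if attempt < tries - 1:
                    time.sleep(delay)
            return False
        return wrapper
    return decorate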

Update #8 by Zack Cerza about 9 years ago

  • Status changed from In Progress to Fix Under Review