Bug #19078: Downburst failing to complete sometimes
Status: Closed
Description
scheduled_yuriw@teuthology
========
2017-02-24 16:49:22,111.111 INFO:teuthology.nuke:targets:
vpm069.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9unY+2im/4Tj6LcOZ1FtOALrbHoL/GOrzaDeDPUeHT/GzihtWyqeTMF/9NyKXfgs2zPUbd+zy03UkbQuQa3n0zdJ0iIMiIAF5YIWuyu06OD7rsRV9gMTamJ9xS/I+JpCJiZyGEnlgetEaklqlDA5c94CSXLSLUsA+2ArebosCuCXVXj4JrbC/phqA1+zi6CSVI6aaQdxBM8l0eHfncAk86fRrj9u/D9siSHqGWzk5Kw6qcHw2nOGrg0rKWfry8i4Oz69H6mOXiqyvK68sR8T6l9U2QGXQxUoOC9rF6lBapieQHu6KJpL4PjASUiItIfBYzrEw2xCDJ1nlQ2kzSCC7
vpm175.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDFQfKUv0vsZmQMxwKYHkv59z0dgaDfUZCb/0IdI6jk77eW60MMyMBEH9EnMDl1gYqC7ThOyTFCfVOzErmieFlsqgs67VyGot8UjTjHEEZfT5vM5vuVfzhbq/CbVq+pE0LnDBnWh1o6cLjAJIgDjctLGX0u3ep3QTCtoJlglw4ZNTcf5SynkmGqoLcXlRNKp15dvPFE8KhLEqBhJiZFlFgAfyrFgV8G2i7rRPADy9jm9mVGVfgqSru13grEEOhH9mzLoINY6UeNms3B7uaknwzZVY6gU5OKy7lzt3nqwNObOQUcFEpVJbqXKJZ0POkDFNbQX0vFXFZBld3dItIPW0+F
2017-02-24 16:49:23,368.368 ERROR:teuthology.provision.downburst:Error destroying vpm069.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm069'
2017-02-24 16:49:23,368.368 ERROR:teuthology.lock.ops:destroy failed for vpm069.front.sepia.ceph.com
2017-02-24 16:49:23,447.447 ERROR:teuthology.provision.downburst:Error destroying vpm175.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm175'
Updated by David Galloway about 7 years ago
- Description updated (diff)
Looks like they never got created...?
Log just stops here:
2017-02-24T00:01:59.387 INFO:teuthology.task.internal.lock_machines:waiting for more vps machines to be free (need 2 + 3, have 2)...
2017-02-24T00:02:09.920 DEBUG:teuthology.lock:lock_many request: {'count': 3, 'locked_by': 'scheduled_yuriw@teuthology', 'description': '/home/teuthworker/archive/yuriw-2017-02-23_22:44:01-upgrade:hammer-x-jewel-distro-basic-vps/854374', 'machine_type': 'vps'}
2017-02-24T00:02:09.950 DEBUG:teuthology.lock:locked ubuntu@vpm099.front.sepia.ceph.com, ubuntu@vpm175.front.sepia.ceph.com, ubuntu@vpm069.front.sepia.ceph.com
2017-02-24T00:02:10.013 INFO:teuthology.provision.downburst:Provisioning a ubuntu 14.04 vps
From the worker log:
2017-02-23T23:39:04.671 INFO:teuthology.worker:Job PID: 16704
2017-02-23T23:39:04.672 INFO:teuthology.worker:Running with watchdog
2017-02-23T23:41:04.673 DEBUG:teuthology.worker:Worker log: /home/teuthworker/archive/worker_logs/worker.vps.3022
2017-02-24T11:39:20.910 WARNING:teuthology.worker:Job ran longer than 43200s. Killing...
2017-02-24T11:39:21.186 INFO:teuthology.kill:Killing Pids: set([16704])
2017-02-24T11:39:23.837 INFO:teuthology.kill:No locked machines. Not nuking anything
2017-02-24T11:41:23.936 ERROR:teuthology.worker:Child exited with code -15
Why didn't it see it had the machines locked?
Updated by David Galloway about 7 years ago
Looks like vpm099 did come up... or at least tried to.
From mira017:/var/log/libvirt/qemu/vpm099.log
2017-02-24 00:02:11.916+0000: starting up LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name vpm099 -S -machine pc-1.0,accel=kvm,usb=off -m 3892 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 4a2de258-c2d2-490e-92ea-c2d6bf22a0fb -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vpm099.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/srv/libvirtpool/vpm099/vpm099.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/srv/libvirtpool/vpm099/vpm099-2.img,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/srv/libvirtpool/vpm099/vpm099-3.img,if=none,id=drive-virtio-disk2,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2 -drive file=/srv/libvirtpool/vpm099/vpm099-4.img,if=none,id=drive-virtio-disk3,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk3,id=virtio-disk3 -drive file=/srv/libvirtpool/vpm099/cloud-init.vpm099.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=34,id=hostnet0,vhost=on,vhostfd=37 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:46:7b:2b,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
W: kvm binary is deprecated, please use qemu-system-x86_64 instead
char device redirected to /dev/pts/1 (label charserial0)
Domain id=3085 is tainted: custom-monitor
qemu: terminating on signal 15 from pid 1549
2017-02-24 16:04:59.972+0000: shutting down
Updated by Dan Mick about 7 years ago
downburst might need to download a new image from ubuntu.com to provision a vps. If that process hangs or fails, it may not recover well?...
14.04 was just updated recently (today)
Updated by Dan Mick about 7 years ago
downburst only gets the newer image if --forcenew is passed to create, which it isn't. Still mystified.
Updated by David Galloway about 7 years ago
- Has duplicate Bug #19082: "Error destroying vpmXXX libvirt: QEMU Driver error : Domain not found" added
Updated by David Galloway about 7 years ago
- Subject changed from can't nuke vpm069 and vpm175 (hosts mira011 and mira079) to Downburst failing to complete sometimes
Updated by David Galloway about 7 years ago
AIUI, there are a few options here:
- Log downburst output so we can get an idea of what's happening / why downburst hangs during the provisioning state
- Not worry about the real problem and just have -nuke unlock VPSes since they don't need to be nuked anyway
downburst.py should also probably just unlock the VPS if it detects a provisioning failure so it doesn't show up in "-nuke --stale"
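A minimal sketch of that last idea, assuming hypothetical stand-in functions (`provision_vm`, `unlock`) rather than the real teuthology/downburst API:

```python
# Hedged sketch: release the lock when provisioning fails, so a
# locked-but-never-created VPS doesn't linger for "-nuke --stale".
# provision_vm() and unlock() are illustrative placeholders, not
# teuthology's actual call signatures.
import logging

log = logging.getLogger(__name__)


def provision_vm(name):
    """Placeholder for the downburst 'create' step."""
    raise RuntimeError('downburst create hung or failed')


def unlock(name, owner):
    """Placeholder for teuthology's lock-release call."""
    log.info('unlocked %s (owner %s)', name, owner)
    return True


def provision_or_unlock(name, owner):
    """Try to provision; on any failure, unlock instead of leaving
    a stale lock behind. Returns True only on successful provision."""
    try:
        provision_vm(name)
        return True
    except Exception:
        log.exception('provision failed for %s; unlocking', name)
        unlock(name, owner)
        return False
```

With the placeholder `provision_vm` above always failing, `provision_or_unlock('vpm069', 'scheduled_yuriw@teuthology')` returns `False` after unlocking.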
Updated by David Galloway about 7 years ago
This is interesting...
2017-02-28T20:26:44.170 INFO:teuthology.nuke:Checking targets against current locks
2017-02-28T20:26:54.458 ERROR:teuthology.provision.downburst:Error destroying vpm095.front.sepia.ceph.com: libvirt: QEMU Driver error : Requested operation is not valid: domain is not running
2017-02-28T20:26:54.459 ERROR:teuthology.lock:destroy failed for vpm095.front.sepia.ceph.com
2017-02-28T20:26:54.499 INFO:teuthology.lock:unlocked vpm095.front.sepia.ceph.com
So nuke is just unlocking when the VPS doesn't exist.
The problem arises when the downburst provision hangs for 12 hours and the job gets marked dead in the middle of it.
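One way to avoid the 12-hour hang would be to bound the provisioning call with a hard timeout. This is a sketch only; the 1800-second limit and the exact downburst command line are assumptions, not what teuthology actually does:

```python
# Hedged sketch: run a provisioning command with a hard timeout so a
# hung downburst fails fast instead of running until the job watchdog
# kills it 12 hours later. The 1800 s default and the downburst
# arguments shown in the usage comment are illustrative assumptions.
import subprocess


def run_with_timeout(cmd, timeout=1800):
    """Run cmd; return True on exit 0, False on failure or timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # Treat a hang like any other provisioning failure so the
        # caller can destroy and unlock the half-created VPS.
        return False

# e.g. run_with_timeout(['downburst', 'create', 'vpm099'])
```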
Dan made a valid point. Why is Yuri's manual -nuke failing?
Updated by Yuri Weinstein about 7 years ago
Latest log:
yuriw@teuthology ~ [23:22:44]> for owner in $(teuthology-lock --summary | grep -i scheduled | awk '{ print $4 }' | sort -u); do echo $owner; echo "========"; teuthology-nuke --stale --owner $owner --unlock; echo "========"; done
scheduled_dzafman@TrustyTahr
========
2017-02-28 23:22:47,286.286 INFO:teuthology.nuke:targets: {}
========
scheduled_jspray@teuthology
========
2017-02-28 23:22:48,038.038 INFO:teuthology.nuke:targets: {}
========
scheduled_kchai@teuthology
========
2017-02-28 23:22:48,782.782 INFO:teuthology.nuke:targets: {}
========
scheduled_teuthology@teuthology
========
2017-02-28 23:22:50,545.545 INFO:teuthology.nuke:targets:
vpm011.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD6mslB1uOWcylQkEGirT1T/m5km/omOsOQSF7bADHptFp1EaE4gxY2/NJ0GOcZNI44gnKQYkhlwKCJbXbZS578ag/zKl2Wwdt3Roh1htneeRImvm/jmWlQY4vIYTQ7d6lvtVGFpJoIXKl70/G6Pj9xNxI5r1cWI/UiAX7FIT/J1vzmKIRL5vVEPhdqqDgZR4t6/RlguQUfd9PiBSYhHZ/Xmb/1kWIiFLa1kIFGbwk/1qXO30d+tmZvyj3WMy+hrJ4nm8hIMttynmH7Vaah2+lH+7IsD4GFexKavw2O+jcem1tLRUkJQR5u7vr8wDWW27n64lic56lyIz1NJKpP64Jf
vpm073.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+/LNd4ophlrGy43bP66UhGsCbP2bmJpmPwHXcXlDGDaC2mkrjVgTw4ai/XX4JomkwXZ5HKVQ8Fy32/aEywwTRagn0HIeM0mhnf7txwe2NXq/RJpN/MZ5UmrJXBNMgSNdXD6Qrcu36iabKxtofS1zDQuuv01ZdqPELdmFzyGbyP2sr+pRrgBsJLRPFlP9RcyUe5oPCLxkV1acI0Hvzmbhjo4khd97cf3T4kfdRpWo7ImsG7dm45PIVoKMkyUjNpAJIuLxWMe62M22JTVxM6yS7XQQ4p3RraKDzVwG3PJt+vgAYCkDOAT5Jjgn61awx3rPv1rKQWQhtZ9wjb21at98H
2017-02-28 23:22:50,967.967 ERROR:teuthology.provision.downburst:Error destroying vpm073.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm073'
2017-02-28 23:22:50,967.967 ERROR:teuthology.lock.ops:destroy failed for vpm073.front.sepia.ceph.com
2017-02-28 23:22:51,072.072 ERROR:teuthology.provision.downburst:Error destroying vpm011.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm011'
2017-02-28 23:22:51,072.072 ERROR:teuthology.lock.ops:destroy failed for vpm011.front.sepia.ceph.com
========
scheduled_yuriw@teuthology
========
2017-02-28 23:22:52,212.212 INFO:teuthology.nuke:targets: {}
========
scheduled_zack@zwork.local
========
2017-02-28 23:22:52,950.950 INFO:teuthology.nuke:targets:
vpm159.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1WFx48NAjBUewIml77EbFogkf25+TNW3ewT4mS+ZOn9xOhKK05Dnz+ovWm6Vme8h5z4BOlN/G5briEtOyJA9wE3l6yRmOZGursZ+7jURGycQw79OI7GjNOhIwTt5OVmweBmg1FGGQ6koCOjK03C1gIdFSudcpqd8YYkXPgXk3S7+RRB74i5qXx2G9YPLVd/wmVky7sxhKX7/2Ymp+xF3TTdzHWlOW51ta+ORY3B2kIotk6HCK4rFSDok3SSZEiCe2iJ1mbEcSO57R9gx9VTPxBi/f2rnYRAGh5v56ChAYPjAPbdFKUPMC/hID9/pEUMP8yE7YXFvSAh8A4moKfwRV
2017-02-28 23:22:53,453.453 ERROR:teuthology.provision.downburst:Error destroying vpm159.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm159'
libvirt: Storage Driver error : cannot unlink file '/srv/libvirtpool/vpm159/vpm159.img': Input/output error
Traceback (most recent call last):
  File "/home/yuriw/downburst/virtualenv/bin/downburst", line 11, in <module>
    load_entry_point('downburst', 'console_scripts', 'downburst')()
  File "/home/yuriw/downburst/downburst/cli.py", line 64, in main
    return args.func(args)
  File "/home/yuriw/downburst/downburst/destroy.py", line 111, in destroy
    vol.delete(flags=0)
  File "/home/yuriw/downburst/virtualenv/local/lib/python2.7/site-packages/libvirt.py", line 3359, in delete
    if ret == -1: raise libvirtError ('virStorageVolDelete() failed', vol=self)
libvirt.libvirtError: cannot unlink file '/srv/libvirtpool/vpm159/vpm159.img': Input/output error
2017-02-28 23:22:53,453.453 ERROR:teuthology.lock.ops:destroy failed for vpm159.front.sepia.ceph.com
Updated by David Galloway about 7 years ago
Okay, don't do anything with those machines for now.
dgalloway@w541 ~ ()$ tl --brief vpm073
2017-02-28 19:04:49,762.762 INFO:teuthology.lock.cli:updating host keys for vpm073.front.sepia.ceph.com
vpm073.front.sepia.ceph.com locked scheduled_teuthology@teuthology "/home/teuthworker/archive/teuthology-2017-02-26_19:25:18-upgrade:kraken-x-master-distro-basic-vps/862356"
dgalloway@w541 ~ ()$ tl --brief vpm011
2017-02-28 19:05:11,892.892 INFO:teuthology.lock.cli:updating host keys for vpm011.front.sepia.ceph.com
vpm011.front.sepia.ceph.com locked scheduled_teuthology@teuthology "/home/teuthworker/archive/teuthology-2017-02-26_19:25:18-upgrade:kraken-x-master-distro-basic-vps/862356"
dgalloway@w541 ~ ()$ tl --brief vpm159
2017-02-28 19:05:38,780.780 INFO:teuthology.lock.cli:updating host keys for vpm159.front.sepia.ceph.com
vpm159.front.sepia.ceph.com locked scheduled_zack@zwork.local "/home/teuthworker/archive/zack-2017-02-27_16:24:39-ceph-deploy-master---basic-vps/866850"
Updated by David Galloway about 7 years ago
So aside from the downburst hang, there's also a problem with unlocking a VPS if the VM doesn't exist.
vpm159 was down because its disk died.
dgalloway@w541 ~ ()$ teuthology-lock -v --unlock --owner scheduled_zack@zwork.local vpm159
2017-03-03 12:15:46,653.653 ERROR:teuthology.provision.downburst:Error destroying vpm159.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm159'
2017-03-03 12:15:46,653.653 ERROR:teuthology.lock.ops:destroy failed for vpm159.front.sepia.ceph.com
dgalloway@w541 ~ ()$ teuthology-lock --brief vpm159
2017-03-03 12:16:06,156.156 INFO:teuthology.lock.cli:updating host keys for vpm159.front.sepia.ceph.com
vpm159.front.sepia.ceph.com locked scheduled_zack@zwork.local "locked for maintenance"
Oh... https://github.com/ceph/teuthology/commit/77185fdd2d19b286263a6c3ad24aaff7e9208715
Updated by Zack Cerza about 7 years ago
teuthology will now unlock if downburst's destroy failure is because the instance doesn't exist
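The logic could be sketched roughly as follows: if the destroy failed only because the domain never existed, there is nothing left to destroy, so unlocking is safe. This is a simplified, illustrative version, not the actual code from the commit linked above:

```python
# Hedged sketch of the fix's idea: a destroy that fails with
# "Domain not found" is treated as success for unlocking purposes,
# since the instance doesn't exist anyway. Function name and shape
# are illustrative, not teuthology's real implementation.
def destroy_succeeded(returncode, stderr):
    """Return True if the downburst destroy should count as a success."""
    if returncode == 0:
        return True
    # A missing libvirt domain means there was nothing to destroy.
    return 'Domain not found' in stderr
```

Other failures, such as the I/O error on vpm159's image file, would still (correctly) count as real destroy failures.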
Updated by David Galloway about 7 years ago
Zack Cerza wrote:
teuthology will now unlock if downburst's destroy failure is because the instance doesn't exist
This is good. We just need to keep an eye on jobs that fail or die on VPSes, now that we have better logging for the downburst process: https://github.com/ceph/teuthology/pull/1041
Updated by David Galloway almost 7 years ago
- Category set to Infrastructure Service
- Status changed from New to Closed
Unfortunately, we haven't seen this anymore. Will reopen if we start seeing it again and want to sink more time into VPSes.