Bug #19078


Downburst failing to complete sometimes

Added by Yuri Weinstein about 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Category:
Infrastructure Service
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

scheduled_yuriw@teuthology ========
2017-02-24 16:49:22,111.111 INFO:teuthology.nuke:targets:
vpm069.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9unY+2im/4Tj6LcOZ1FtOALrbHoL/GOrzaDeDPUeHT/GzihtWyqeTMF/9NyKXfgs2zPUbd+zy03UkbQuQa3n0zdJ0iIMiIAF5YIWuyu06OD7rsRV9gMTamJ9xS/I+JpCJiZyGEnlgetEaklqlDA5c94CSXLSLUsA+2ArebosCuCXVXj4JrbC/phqA1+zi6CSVI6aaQdxBM8l0eHfncAk86fRrj9u/D9siSHqGWzk5Kw6qcHw2nOGrg0rKWfry8i4Oz69H6mOXiqyvK68sR8T6l9U2QGXQxUoOC9rF6lBapieQHu6KJpL4PjASUiItIfBYzrEw2xCDJ1nlQ2kzSCC7
vpm175.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDFQfKUv0vsZmQMxwKYHkv59z0dgaDfUZCb/0IdI6jk77eW60MMyMBEH9EnMDl1gYqC7ThOyTFCfVOzErmieFlsqgs67VyGot8UjTjHEEZfT5vM5vuVfzhbq/CbVq+pE0LnDBnWh1o6cLjAJIgDjctLGX0u3ep3QTCtoJlglw4ZNTcf5SynkmGqoLcXlRNKp15dvPFE8KhLEqBhJiZFlFgAfyrFgV8G2i7rRPADy9jm9mVGVfgqSru13grEEOhH9mzLoINY6UeNms3B7uaknwzZVY6gU5OKy7lzt3nqwNObOQUcFEpVJbqXKJZ0POkDFNbQX0vFXFZBld3dItIPW0+F
2017-02-24 16:49:23,368.368 ERROR:teuthology.provision.downburst:Error destroying vpm069.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm069'

2017-02-24 16:49:23,368.368 ERROR:teuthology.lock.ops:destroy failed for vpm069.front.sepia.ceph.com
2017-02-24 16:49:23,447.447 ERROR:teuthology.provision.downburst:Error destroying vpm175.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm175'


Related issues: 1 (0 open, 1 closed)

Has duplicate: sepia - Bug #19082: "Error destroying vpmXXX libvirt: QEMU Driver error : Domain not found" (Closed, David Galloway, 02/24/2017)

Actions #1

Updated by Yuri Weinstein about 7 years ago

  • Description updated (diff)
Actions #2

Updated by Yuri Weinstein about 7 years ago

  • Description updated (diff)
Actions #3

Updated by David Galloway about 7 years ago

  • Description updated (diff)

Looks like they never got created...?

http://qa-proxy.ceph.com/teuthology/yuriw-2017-02-23_22:44:01-upgrade:hammer-x-jewel-distro-basic-vps/854374/teuthology.log

Log just stops here:

2017-02-24T00:01:59.387 INFO:teuthology.task.internal.lock_machines:waiting for more vps machines to be free (need 2 + 3, have 2)...
2017-02-24T00:02:09.920 DEBUG:teuthology.lock:lock_many request: {'count': 3, 'locked_by': 'scheduled_yuriw@teuthology', 'description': '/home/teuthworker/archive/yuriw-2017-02-23_22:44:01-upgrade:hammer-x-jewel-distro-basic-vps/854374', 'machine_type': 'vps'}
2017-02-24T00:02:09.950 DEBUG:teuthology.lock:locked ubuntu@vpm099.front.sepia.ceph.com, ubuntu@vpm175.front.sepia.ceph.com, ubuntu@vpm069.front.sepia.ceph.com
2017-02-24T00:02:10.013 INFO:teuthology.provision.downburst:Provisioning a ubuntu 14.04 vps

From the worker log:

2017-02-23T23:39:04.671 INFO:teuthology.worker:Job PID: 16704
2017-02-23T23:39:04.672 INFO:teuthology.worker:Running with watchdog
2017-02-23T23:41:04.673 DEBUG:teuthology.worker:Worker log: /home/teuthworker/archive/worker_logs/worker.vps.3022
2017-02-24T11:39:20.910 WARNING:teuthology.worker:Job ran longer than 43200s. Killing...
2017-02-24T11:39:21.186 INFO:teuthology.kill:Killing Pids: set([16704])
2017-02-24T11:39:23.837 INFO:teuthology.kill:No locked machines. Not nuking anything
2017-02-24T11:41:23.936 ERROR:teuthology.worker:Child exited with code -15

Why didn't it see it had the machines locked?

Actions #4

Updated by David Galloway about 7 years ago

Looks like vpm099 did come up... or at least tried.

From mira017:/var/log/libvirt/qemu/vpm099.log

2017-02-24 00:02:11.916+0000: starting up
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name vpm099 -S -machine pc-1.0,accel=kvm,usb=off -m 3892 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 4a2de258-c2d2-490e-92ea-c2d6bf22a0fb -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vpm099.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/srv/libvirtpool/vpm099/vpm099.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/srv/libvirtpool/vpm099/vpm099-2.img,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/srv/libvirtpool/vpm099/vpm099-3.img,if=none,id=drive-virtio-disk2,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2 -drive file=/srv/libvirtpool/vpm099/vpm099-4.img,if=none,id=drive-virtio-disk3,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk3,id=virtio-disk3 -drive file=/srv/libvirtpool/vpm099/cloud-init.vpm099.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=34,id=hostnet0,vhost=on,vhostfd=37 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:46:7b:2b,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
W: kvm binary is deprecated, please use qemu-system-x86_64 instead
char device redirected to /dev/pts/1 (label charserial0)
Domain id=3085 is tainted: custom-monitor
qemu: terminating on signal 15 from pid 1549
2017-02-24 16:04:59.972+0000: shutting down

Actions #5

Updated by Dan Mick about 7 years ago

downburst might need to download a new image from ubuntu.com to provision a vps. If that process hangs or fails, it may not recover well.

14.04 was just updated recently (today).

Actions #6

Updated by Dan Mick about 7 years ago

downburst only gets the newer image if --forcenew is passed to create, which it isn't. Still mystified.

Actions #7

Updated by David Galloway about 7 years ago

  • Has duplicate Bug #19082: "Error destroying vpmXXX libvirt: QEMU Driver error : Domain not found" added
Actions #8

Updated by David Galloway about 7 years ago

  • Subject changed from "can't nuke vpm069 and vpm175 (hosts mira011 and mira079)" to "Downburst failing to complete sometimes"
Actions #9

Updated by David Galloway about 7 years ago

AIUI, there are a few options here:

  1. Log downburst output so we can get an idea of what's happening / why downburst hangs during the provisioning state
  2. Don't worry about the real problem and just have -nuke unlock VPSes, since they don't need to be nuked anyway

downburst.py should also probably just unlock the VPS if it senses a failure to provision so it doesn't show up in "-nuke --stale"
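To illustrate, a minimal sketch of what option 1 combined with that unlock-on-failure behavior could look like. This is not teuthology's actual API; the wrapper and the unlock helper are hypothetical, and the timeout value is arbitrary:

import logging
import subprocess

log = logging.getLogger(__name__)

def unlock_one(name):
    # Stand-in for teuthology's lock-release call (hypothetical).
    log.info('unlocking %s', name)

def create_vps(name, timeout=1800):
    """Run `downburst create` with a hard timeout and capture its
    output, so a hung provision gets logged and released instead of
    silently eating the job's 12-hour watchdog."""
    cmd = ['downburst', 'create', name]
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,  # fail fast instead of hanging for hours
        )
    except subprocess.TimeoutExpired:
        log.error('downburst create %s timed out after %ss', name, timeout)
        unlock_one(name)
        return False
    log.info('downburst output for %s:\n%s', name, proc.stdout)
    if proc.returncode != 0:
        log.error('downburst create %s failed (rc=%d)', name, proc.returncode)
        unlock_one(name)
        return False
    return True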

Actions #10

Updated by David Galloway about 7 years ago

This is interesting...

From: http://qa-proxy.ceph.com/teuthology/teuthology-2017-02-28_19:25:04-upgrade:kraken-x-master-distro-basic-vps/869272/teuthology.log

2017-02-28T20:26:44.170 INFO:teuthology.nuke:Checking targets against current locks
2017-02-28T20:26:54.458 ERROR:teuthology.provision.downburst:Error destroying vpm095.front.sepia.ceph.com: libvirt: QEMU Driver error : Requested operation is not valid: domain is not running

2017-02-28T20:26:54.459 ERROR:teuthology.lock:destroy failed for vpm095.front.sepia.ceph.com
2017-02-28T20:26:54.499 INFO:teuthology.lock:unlocked vpm095.front.sepia.ceph.com

So nuke is just unlocking when the VPS doesn't exist.

The problem arises when the downburst provision hangs for 12 hours and the job gets marked dead in the middle of it.

Dan made a valid point. Why is Yuri's manual -nuke failing?

Actions #11

Updated by Yuri Weinstein about 7 years ago

Latest log:

  yuriw@teuthology ~ [23:22:44]> for owner in $(teuthology-lock --summary | grep -i scheduled | awk '{ print $4 }' | sort -u); do echo $owner; echo "========"; teuthology-nuke --stale --owner $owner --unlock; echo "======== "; done
scheduled_dzafman@TrustyTahr
========
2017-02-28 23:22:47,286.286 INFO:teuthology.nuke:targets:
  {}
======== 
scheduled_jspray@teuthology
========
2017-02-28 23:22:48,038.038 INFO:teuthology.nuke:targets:
  {}
======== 
scheduled_kchai@teuthology
========
2017-02-28 23:22:48,782.782 INFO:teuthology.nuke:targets:
  {}
======== 
scheduled_teuthology@teuthology
========
2017-02-28 23:22:50,545.545 INFO:teuthology.nuke:targets:
  vpm011.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD6mslB1uOWcylQkEGirT1T/m5km/omOsOQSF7bADHptFp1EaE4gxY2/NJ0GOcZNI44gnKQYkhlwKCJbXbZS578ag/zKl2Wwdt3Roh1htneeRImvm/jmWlQY4vIYTQ7d6lvtVGFpJoIXKl70/G6Pj9xNxI5r1cWI/UiAX7FIT/J1vzmKIRL5vVEPhdqqDgZR4t6/RlguQUfd9PiBSYhHZ/Xmb/1kWIiFLa1kIFGbwk/1qXO30d+tmZvyj3WMy+hrJ4nm8hIMttynmH7Vaah2+lH+7IsD4GFexKavw2O+jcem1tLRUkJQR5u7vr8wDWW27n64lic56lyIz1NJKpP64Jf
  vpm073.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+/LNd4ophlrGy43bP66UhGsCbP2bmJpmPwHXcXlDGDaC2mkrjVgTw4ai/XX4JomkwXZ5HKVQ8Fy32/aEywwTRagn0HIeM0mhnf7txwe2NXq/RJpN/MZ5UmrJXBNMgSNdXD6Qrcu36iabKxtofS1zDQuuv01ZdqPELdmFzyGbyP2sr+pRrgBsJLRPFlP9RcyUe5oPCLxkV1acI0Hvzmbhjo4khd97cf3T4kfdRpWo7ImsG7dm45PIVoKMkyUjNpAJIuLxWMe62M22JTVxM6yS7XQQ4p3RraKDzVwG3PJt+vgAYCkDOAT5Jjgn61awx3rPv1rKQWQhtZ9wjb21at98H
2017-02-28 23:22:50,967.967 ERROR:teuthology.provision.downburst:Error destroying vpm073.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm073'

2017-02-28 23:22:50,967.967 ERROR:teuthology.lock.ops:destroy failed for vpm073.front.sepia.ceph.com
2017-02-28 23:22:51,072.072 ERROR:teuthology.provision.downburst:Error destroying vpm011.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm011'

2017-02-28 23:22:51,072.072 ERROR:teuthology.lock.ops:destroy failed for vpm011.front.sepia.ceph.com
======== 
scheduled_yuriw@teuthology
========
2017-02-28 23:22:52,212.212 INFO:teuthology.nuke:targets:
  {}
======== 
scheduled_zack@zwork.local
========
2017-02-28 23:22:52,950.950 INFO:teuthology.nuke:targets:
  vpm159.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1WFx48NAjBUewIml77EbFogkf25+TNW3ewT4mS+ZOn9xOhKK05Dnz+ovWm6Vme8h5z4BOlN/G5briEtOyJA9wE3l6yRmOZGursZ+7jURGycQw79OI7GjNOhIwTt5OVmweBmg1FGGQ6koCOjK03C1gIdFSudcpqd8YYkXPgXk3S7+RRB74i5qXx2G9YPLVd/wmVky7sxhKX7/2Ymp+xF3TTdzHWlOW51ta+ORY3B2kIotk6HCK4rFSDok3SSZEiCe2iJ1mbEcSO57R9gx9VTPxBi/f2rnYRAGh5v56ChAYPjAPbdFKUPMC/hID9/pEUMP8yE7YXFvSAh8A4moKfwRV
2017-02-28 23:22:53,453.453 ERROR:teuthology.provision.downburst:Error destroying vpm159.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm159'
libvirt: Storage Driver error : cannot unlink file '/srv/libvirtpool/vpm159/vpm159.img': Input/output error
Traceback (most recent call last):
  File "/home/yuriw/downburst/virtualenv/bin/downburst", line 11, in <module>
    load_entry_point('downburst', 'console_scripts', 'downburst')()
  File "/home/yuriw/downburst/downburst/cli.py", line 64, in main
    return args.func(args)
  File "/home/yuriw/downburst/downburst/destroy.py", line 111, in destroy
    vol.delete(flags=0)
  File "/home/yuriw/downburst/virtualenv/local/lib/python2.7/site-packages/libvirt.py", line 3359, in delete
    if ret == -1: raise libvirtError ('virStorageVolDelete() failed', vol=self)
libvirt.libvirtError: cannot unlink file '/srv/libvirtpool/vpm159/vpm159.img': Input/output error

2017-02-28 23:22:53,453.453 ERROR:teuthology.lock.ops:destroy failed for vpm159.front.sepia.ceph.com

Actions #12

Updated by David Galloway about 7 years ago

Okay, don't do anything with those machines for now.

dgalloway@w541 ~ ()$ tl --brief vpm073
2017-02-28 19:04:49,762.762 INFO:teuthology.lock.cli:updating host keys for vpm073.front.sepia.ceph.com
vpm073.front.sepia.ceph.com   locked scheduled_teuthology@teuthology "/home/teuthworker/archive/teuthology-2017-02-26_19:25:18-upgrade:kraken-x-master-distro-basic-vps/862356" 

dgalloway@w541 ~ ()$ tl --brief vpm011
2017-02-28 19:05:11,892.892 INFO:teuthology.lock.cli:updating host keys for vpm011.front.sepia.ceph.com
vpm011.front.sepia.ceph.com   locked scheduled_teuthology@teuthology "/home/teuthworker/archive/teuthology-2017-02-26_19:25:18-upgrade:kraken-x-master-distro-basic-vps/862356" 

dgalloway@w541 ~ ()$ tl --brief vpm159
2017-02-28 19:05:38,780.780 INFO:teuthology.lock.cli:updating host keys for vpm159.front.sepia.ceph.com
vpm159.front.sepia.ceph.com   locked scheduled_zack@zwork.local "/home/teuthworker/archive/zack-2017-02-27_16:24:39-ceph-deploy-master---basic-vps/866850" 
Actions #13

Updated by David Galloway about 7 years ago

So aside from the downburst hang, there's also a problem with unlocking a VPS if the VM doesn't exist.

vpm159 was down because its disk died.

dgalloway@w541 ~ ()$ teuthology-lock -v --unlock --owner scheduled_zack@zwork.local vpm159
2017-03-03 12:15:46,653.653 ERROR:teuthology.provision.downburst:Error destroying vpm159.front.sepia.ceph.com: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'vpm159'

2017-03-03 12:15:46,653.653 ERROR:teuthology.lock.ops:destroy failed for vpm159.front.sepia.ceph.com

dgalloway@w541 ~ ()$ teuthology-lock --brief vpm159
2017-03-03 12:16:06,156.156 INFO:teuthology.lock.cli:updating host keys for vpm159.front.sepia.ceph.com
vpm159.front.sepia.ceph.com   locked scheduled_zack@zwork.local "locked for maintenance" 

Oh... https://github.com/ceph/teuthology/commit/77185fdd2d19b286263a6c3ad24aaff7e9208715

Actions #14

Updated by Zack Cerza about 7 years ago

teuthology will now unlock if downburst's destroy failure is because the instance doesn't exist

https://github.com/ceph/teuthology/pull/1040
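For reference, the gist of the change as a rough sketch (illustrative names only, not the actual PR diff): inspect the destroy failure, and treat libvirt's "Domain not found" as "already gone, safe to unlock":

import logging

log = logging.getLogger(__name__)

# The error libvirt emits when the domain is already gone.
DOMAIN_NOT_FOUND = 'Domain not found'

def destroy_ok(name, downburst_stderr):
    """Decide whether a failed `downburst destroy` should still
    allow unlocking: if the domain never existed or is already
    gone, the 'failure' is harmless and the lock can be released."""
    if DOMAIN_NOT_FOUND in downburst_stderr:
        log.info('%s does not exist; treating destroy as success', name)
        return True
    log.error('destroy failed for %s: %s', name, downburst_stderr)
    return False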

Actions #15

Updated by David Galloway about 7 years ago

Zack Cerza wrote:

teuthology will now unlock if downburst's destroy failure is because the instance doesn't exist

https://github.com/ceph/teuthology/pull/1040

This is good. So we just need to keep an eye on jobs that fail or die on VPSes now that we have better logging for the downburst process: https://github.com/ceph/teuthology/pull/1041

Actions #16

Updated by David Galloway almost 7 years ago

  • Category set to Infrastructure Service
  • Status changed from New to Closed

Unfortunately, we haven't seen this anymore. We'll reopen if we start seeing it again and want to sink more time into VPSes.

