Bug #7282

Updated by Florian Haas about 6 years ago

This isn't fully confirmed yet, because we haven't found a reliable way to reproduce. In short, it seems that if you have either an rbd-backed libvirt storage pool or an rbd-backed Qemu/KVM domain, and some borked objects in your RBD, then something seems to spin inside libvirtd, blocking *all* libvirtd connections on that host.

Which would mean that a handful of bad RADOS objects could take down an entire virtualization host potentially running scores or hundreds of guests, which would be sufficiently painful to merit reporting an issue even if not fully confirmed.

What's extremely painful is that the issue produces zero logs. Even strace is unhelpful.

Platform: Ubuntu 12.04.3
Ceph client tools including librados and librbd: Dumpling (0.67.5-1precise)
Libvirt: 1.0.2 (1.0.2-0ubuntu11.13.04.5~cloud1) from the Ubuntu Cloud Archive for Grizzly (can't run 1.1.1, being bitten by

Steps to (possibly) reproduce:

* Create a problematic state in your cluster; for example, have some unfound objects in a pool used by your guests' RBD volumes
* Create an rbd storage pool XML definition
* Fire up libvirtd
* Run @virsh@
* Observe that @list@ quickly responds with a list of currently running VMs
* Run @pool-define <path-to-definition.xml>@
* Observe that @list@ is still snappy
* Run @pool-start <name-of-pool>@
* Try @list@ again; if you hit the problem, it will block indefinitely
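For reference, a minimal rbd pool definition of the kind used in the @pool-define@ step might look like the following. The pool name, Ceph pool name, monitor host, and secret UUID are all placeholders, not the values from the affected system:

```xml
<pool type="rbd">
  <name>rbdpool</name>
  <source>
    <!-- Ceph pool backing the libvirt storage pool -->
    <name>libvirt-pool</name>
    <host name="mon1.example.com" port="6789"/>
    <auth username="libvirt" type="ceph">
      <secret uuid="00000000-0000-0000-0000-000000000000"/>
    </auth>
  </source>
</pool>
```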

The problem is not specific to the pool, though. Remove the pool and try the same with a domain definition using an RBD-backed drive: once you start a guest backed by a problematic RBD, virsh starts blocking in the same way.
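A minimal disk stanza for such a domain might look like this; the image name, monitor host, and auth details are again placeholders:

```xml
<disk type="network" device="disk">
  <driver name="qemu" type="raw"/>
  <!-- "libvirt-pool/vm-disk-1" stands in for the problematic RBD image -->
  <source protocol="rbd" name="libvirt-pool/vm-disk-1">
    <host name="mon1.example.com" port="6789"/>
  </source>
  <auth username="libvirt">
    <secret type="ceph" uuid="00000000-0000-0000-0000-000000000000"/>
  </auth>
  <target dev="vda" bus="virtio"/>
</disk>
```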

Of course, that also means that once you've hit the problem, a reboot is the only way out, because the only way to destroy the pool or domain would again be through a libvirtd connection, and those all block.

Because it's practically impossible for a user to enumerate all the RADOS objects in a PG, it's almost infeasible to determine ahead of time which RBD images might be affected by something like an unfound object, since unfound objects are always reported per PG.
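If you do manage to obtain a list of unfound object names (e.g. from @ceph pg <pgid> list_unfound@), you can at least match them against the per-image object prefixes that @rbd info@ reports as @block_name_prefix@. A rough sketch, where both the unfound-object list and the image-to-prefix mapping are assumed to have been gathered separately (all names below are made up):

```python
def affected_images(unfound_objects, image_prefixes):
    """Return the set of RBD image names whose data objects appear
    among the given unfound RADOS object names.

    unfound_objects: iterable of object names, e.g. collected from
        'ceph pg <pgid> list_unfound'
    image_prefixes: dict mapping image name -> block_name_prefix,
        as reported by 'rbd info <image>'
    """
    hit = set()
    for obj in unfound_objects:
        for image, prefix in image_prefixes.items():
            # Data objects are named "<block_name_prefix>.<object-number>"
            if obj.startswith(prefix + "."):
                hit.add(image)
    return hit

# Example with hypothetical prefixes and one unfound object:
prefixes = {
    "vm-disk-1": "rb.0.1234.2ae8944a",
    "vm-disk-2": "rb.0.5678.6b8b4567",
}
unfound = ["rb.0.1234.2ae8944a.000000000005"]
print(affected_images(unfound, prefixes))  # {'vm-disk-1'}
```

This at least narrows the blast radius to specific images, rather than having to guess which guests are at risk.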

All this may make the issue extremely tricky to reproduce, but when it hits, its impact is frightening. I'll leave this at severity 3 since it's not reliably confirmed.