Bug #5955

closed

qemu deadlock when librbd caching enabled (writethru or writeback).

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From Mike Dawson on the mailing list:

Logs are uploaded to cephdrop with the file name mikedawson-rbd-qemu-deadlock.

- At about 2013-08-05 19:46 or 19:47, we hit the issue and traffic dropped to 0
- At about 2013-08-05 19:53:51, ran a 'virsh screenshot'

Environment is:

- Ceph 0.61.7 (client is co-mingled with three OSDs)
- rbd cache = true and cache=writeback
- qemu 1.4.0 (package version 1.4.0+dfsg-1expubuntu4)
- Ubuntu Raring with 3.8.0-25-generic

This issue is reproducible in my environment, and I'm willing to run any wip
branch you need. What else can I provide to help?


We've had a similar situation occur. For about three months, we've run several
Windows 2008 R2 guests with virtio drivers that record video surveillance. We
have long suffered an issue where the guest appears to hang indefinitely (or
until we intervene). For the sake of this conversation, we call this state
"wedged", because it appears something (rbd, qemu, virtio, etc.) gets stuck on a
deadlock. When a guest gets wedged, we see the following:

- the guest will not respond to pings
- the qemu-system-x86_64 process drops to 0% cpu
- graphite graphs show the interface traffic dropping to 0bps
- the guest will stay wedged forever (or until we intervene)
- strace of qemu-system-x86_64 shows QEMU is making progress [1][2]
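
One of the symptoms above, the qemu-system-x86_64 process dropping to 0% cpu, could be checked from a script. The following is only a rough sketch, not something from the original report: it assumes a Linux host, reads the utime and stime fields from /proc/<pid>/stat, and treats "no CPU ticks consumed over the sampling interval" as a stand-in for 0% cpu. The function names and the 5-second interval are hypothetical, and an idle guest could also match, so this would at best flag candidates for a closer look.

```python
import time
from pathlib import Path


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The process name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    return int(rest[11]) + int(rest[12])


def looks_idle(pid: int, interval: float = 5.0) -> bool:
    """True if the process consumed no CPU ticks during `interval` seconds."""
    stat_path = Path(f"/proc/{pid}/stat")
    before = cpu_ticks_from_stat(stat_path.read_text())
    time.sleep(interval)
    after = cpu_ticks_from_stat(stat_path.read_text())
    return after == before
```

Calling looks_idle() with the pid of a guest's qemu-system-x86_64 process would return True when no CPU time was accounted to it during the interval.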

We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh
screenshot' command. After that, the guest resumes and runs as expected. At that
point we can examine the guest. Each time we'll see:

- No Windows error logs whatsoever while the guest is wedged
- A time sync typically occurs right after the guest gets un-wedged
- Scheduled tasks do not run while wedged
- Windows error logs do not show any evidence of suspend, sleep, etc

We had so many issues with guests becoming wedged that we wrote a script to run
'virsh screenshot' on them via cron. Then we installed some updates and had a month or so
of higher stability (wedging happened maybe 1/10th as often). Until today we
couldn't figure out why.
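
The cron workaround described above might look roughly like the sketch below. The original script was not posted, so this is an illustration only: the function names and the output file naming are hypothetical, and it assumes 'virsh list --name' and 'virsh screenshot <domain> <file>' are available on the host.

```python
import subprocess
from pathlib import Path
from typing import List


def screenshot_command(domain: str, out_dir: str) -> List[str]:
    """Build the 'virsh screenshot' command line for one guest."""
    # The output file name is a made-up convention; the image itself is
    # not needed, taking the screenshot is what un-wedges the guest.
    target = Path(out_dir) / f"{domain}.screenshot"
    return ["virsh", "screenshot", domain, str(target)]


def running_domains() -> List[str]:
    """Return the names of running guests as reported by 'virsh list --name'."""
    result = subprocess.run(
        ["virsh", "list", "--name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def nudge_all(out_dir: str) -> None:
    """Take a screenshot of every running guest, ignoring individual failures."""
    for domain in running_domains():
        subprocess.run(screenshot_command(domain, out_dir), check=False)
```

A cron entry would then call nudge_all() with a scratch directory at whatever interval is tolerable. This only masks the symptom; it does nothing about the underlying deadlock.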

Yesterday, I realized qemu was starting the instances without specifying
cache=writeback. We corrected that, and let them run overnight. With RBD
writeback re-enabled, wedging came back as often as we had seen in the past.
I've counted ~40 occurrences in the past 12-hour period. So I feel like
writeback caching in RBD certainly makes the deadlock more likely to occur.

Joshd asked us to gather RBD client logs:

"joshd> it could very well be the writeback cache not doing a callback at some
point - if you could gather logs of a vm getting stuck with debug rbd = 20,
debug ms = 1, and debug objectcacher = 30 that would be great"

We'll do that over the weekend. If you could as well, we'd love the help!

[1] http://www.gammacode.com/kvm/wedged-with-timestamps.txt
[2] http://www.gammacode.com/kvm/not-wedged.txt
