Bug #5919
closed
qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timeout_secs message and unresponsive qemu-process
Description
Hi,
we have had a number of tickets raised in which users reported problems in their VMs with the latest debian-7.[01] and kernel 3.2.x, and with Ubuntu 12 LTS and 3.2.0-51-amd.
The problem is currently observed with qemu 1.4.0 and onwards, incl. the latest qemu-1.6.0-rc2.
No problem with an upgraded kernel, 3.8 for example.
No problem with qemu-1.2.2.
No problem with qcow2.
The problem occurs with every combination tried: rbd_cache=false/true, aio=native/none, cache=writeback/none.
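To make the tested combinations explicit, here is a minimal sketch that prints one qemu `-drive` argument per combination. The pool/image name `rbd/vm-disk` and the function name `drive_matrix` are hypothetical, and the option values are copied from the report as written; they are not checked against what a given qemu version accepts.

```shell
#!/bin/sh
# Print one -drive argument per combination of the settings named in the report.
# "rbd/vm-disk" is a hypothetical pool/image name. The values for rbd_cache,
# aio and cache are taken verbatim from the report and are not validated here.
drive_matrix() {
    for rbd_cache in false true; do
        for aio in native none; do
            for cache in writeback none; do
                echo "format=raw,file=rbd:rbd/vm-disk:rbd_cache=${rbd_cache},if=virtio,cache=${cache},aio=${aio}"
            done
        done
    done
}

drive_matrix
```

This yields eight strings, one per combination, which could each be passed to `-drive` in a test run.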
A bold guess: something broke with the RBD-cache-aio/async patch, triggered by broken client kernel 3.2 handling virtio?
Reproducible with high load in the VM; effect: the 120-second hung_task_timeout message is seen on the console after a loop à la:
"while true; do apt-get install -y ntp libopts25; apt-get remove -y ntp libopts25; done"
+ executed in parallel:
"spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat"
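The two commands can be combined into one script to run inside the guest. This is only a sketch: it assumes apt-get and spew are installed, that it runs as root, and that the intended remove step is `apt-get remove -y ntp libopts25`. The function names and the `run` argument are hypothetical, and the 30-second delay merely approximates the gap between starting the loop and starting spew.

```shell
#!/bin/sh
# Sketch of the reproduction: package churn in the background, random 4k I/O
# in the foreground. Function names and the "run" argument are illustrative.

churn_packages() {
    # Install and remove the same packages endlessly to generate dpkg load.
    while true; do
        apt-get install -y ntp libopts25
        apt-get remove -y ntp libopts25
    done
}

run_spew() {
    # Same spew invocation as quoted in the report.
    spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
}

# Only start the load when explicitly asked to, e.g. "sh repro.sh run".
if [ "${1:-}" = "run" ]; then
    churn_packages &
    loop_pid=$!
    sleep 30          # roughly the delay between loop start and spew start
    run_spew
    kill "$loop_pid"  # stops the loop; an apt-get already running may finish first
fi
```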
The session with the loop gets stuck. The spew-test is still executable, though?!
Attached is a logfile with some debug-stuff enabled.
Some timestamps from observations:
14:18:29 => start loop
14:19:00 => start spew
~ 14:19:50 => loop stuck/no output
14:20:24 => spew stopped
14:23:05 => "120 sec" message on console
14:23:31 => tried to kill dpkg/apt
14:25:00 => "halt -p" -> qemu-session is stuck, had to kill process with SIGKILL
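The gaps between these observations can be worked out with a small helper. This is a sketch only; `to_secs` and `elapsed` are hypothetical helper names, and the timestamps are the ones listed above.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
to_secs() {
    IFS=: read -r h m s <<EOF
$1
EOF
    # Strip one leading zero so values like "05" are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Seconds elapsed between two timestamps on the same day.
elapsed() {
    echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

elapsed 14:18:29 14:19:50   # loop start -> loop stuck: 81
elapsed 14:19:00 14:20:24   # spew start -> spew stopped: 84
elapsed 14:19:50 14:23:05   # loop stuck -> "120 sec" message: 195
```

So the loop stalled about 81 seconds after it started, and the console message followed about 195 seconds after the stall.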
Reproducible in the lab with ceph-0.56.6-26... latest bobtail.
Hopefully I have not forgotten something.
Best regards,
Oliver.
Files