Bug #5919
qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph RBD, heavy I/O leads to kernel hung_task_timeout_secs messages and an unresponsive qemu process

Added by Oliver Francke over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sage Weil
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

Hi,

we have had a number of tickets raised in which users reported problems with the latest debian-7.[01] with kernel 3.2.x / Ubuntu 12 LTS with 3.2.0-51-amd in their VMs.
The problem is currently observed with qemu 1.4.0 and onwards, including the latest qemu-1.6.0-rc2.
No problem with an upgraded kernel 3.8, for example.
No problem with qemu-1.2.2.
No problem with qcow2.
The problem occurs with rbd_cache=false/true, aio=native/none, cache=writeback/none.

A brave assumption: something broke with the RBD cache aio/async patch, triggered by the broken virtio handling in the client's 3.2 kernel?

Reproducible with high load in the VM; the effect is the "120 sec" hung_task_timeout message on the console, after a loop like:
"while true; do apt-get install -y ntp libopts25; apt-get remove -y ntp libopts25; done"
executed in parallel with:
"spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat"

The session running the loop gets stuck; the spew test still keeps running, though?!
Attached is a logfile with some debug options enabled.

Some timestamps from observations:

14:18:29 => start loop
14:19:00 => start spew
~ 14:19:50 => loop stuck/no output
14:20:24 => spew stopped
14:23:05 => "120 sec" message on console
14:23:31 => tried to kill dpkg/apt
14:25:00 => "halt -p" -> qemu-session is stuck, had to kill process with SIGKILL

Reproducible in the lab with ceph-0.56.6-26..., the latest bobtail.

Hopefully I have not forgotten something.

Best regards,

Oliver.


Files

760_root.log.xz (4.79 MB) - Oliver Francke, 08/09/2013 02:07 AM
760_wo_rbd_cache.log.xz (3.02 MB) - Oliver Francke, 08/13/2013 02:01 AM
#1

Updated by Sage Weil over 10 years ago

  • Priority changed from Normal to Urgent
#2

Updated by Josh Durgin over 10 years ago

The log shows one aio_flush (that started at 2013-08-08 14:19:36.425555) never completing. This log is without caching, so it's probably a bug in librados' async flush handling. When caching is used, the code path is quite different, but there could be a similar bug in the caching module.

Could you get logs (debug rbd=20, debug objectcacher=30, debug ms=1) of this happening with caching enabled so we can see if it's the same kind of bug there?
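
For reference, those settings can be passed to qemu in the same colon-separated key=value form used in the drive section quoted later in this ticket (a sketch only; the log path is a placeholder, and the remaining drive options are assumed to match the existing setup):

    -drive format=raw,file=rbd:123/vm-760-disk-1.rbd:rbd_cache=true:debug_rbd=20:debug_objectcacher=30:debug_ms=1:log_file=/home/fcms/760_with_cache.log,cache=writeback,if=none,id=virtio0,media=disk,index=0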

#3

Updated by Oliver Francke over 10 years ago

Hi Josh,

even under high ceph load I was not able to break things with rbd_cache=true after 2 hours, whereas it fails after a couple of minutes without it. So there is a strong indication... ;)

Anything else I could/should try?

Regards,

Oliver.

#4

Updated by Josh Durgin over 10 years ago

Thanks Oliver, that's good to hear. If you use cache=writethrough instead of no cache, it will have the same write properties (writes wait until they reach the cluster) but should avoid this bug.

One more log would help narrow it down - with cache off, logs of it happening with debug ms = 1, debug rados = 20, debug rbd = 20. Thanks!
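
In qemu terms, the cache=writethrough suggestion amounts to a drive option along these lines (a sketch modeled on the drive section quoted later in this ticket; only the cache mode differs, and the remaining options are assumptions carried over from that setup):

    -drive format=raw,file=rbd:123/vm-760-disk-1.rbd,cache=writethrough,if=none,id=virtio0,media=disk,index=0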

#5

Updated by Josh Durgin over 10 years ago

I found one bug - there's a fix on top of the current bobtail branch in wip-librados-aio-flush. Could you see if that fixes the hangs?
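
If testing from source, one way to pick up that branch is roughly the following (a sketch; it assumes the branch is still present in ceph.git and that the usual bobtail-era autotools build dependencies are installed):

    git clone https://github.com/ceph/ceph.git
    cd ceph
    git checkout wip-librados-aio-flush
    git submodule update --init   # pull in submodules, if any
    ./autogen.sh && ./configure && make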

#6

Updated by Oliver Francke over 10 years ago

Hi Josh,

I updated/upgraded our cluster to ceph version 0.56.6-28-gf07447f (f07447f28c6e468c1dc76149f51d7958b10482c2), but no luck. I have attached the requested logfile with debug_rados enabled. Let me know if you need anything more...

The complete device/drive section looks like the following:

-device virtio-blk-pci,drive=virtio0 -drive format=raw,file=rbd:123/vm-760-disk-1.rbd:rbd_cache=false:rbd_cache_size=33554432:rbd_cache_max_dirty=16777216:rbd_cache_target_dirty=8388608:log_file=/home/fcms/760_wo_rbd_cache.log:admin_socket=/home/fcms/760_root.asok:debug_ms=1:debug_rbd=20:debug_rados=20:debug_objectcacher=0,cache=writeback,if=none,id=virtio0,media=disk,index=0
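
As an aside, with the admin_socket option set as above, the client's in-flight requests can be inspected from the host while the guest is hung, along these lines (a sketch; it assumes the ceph CLI is installed on the host and that the client admin socket exposes these commands):

    ceph --admin-daemon /home/fcms/760_root.asok objecter_requests
    ceph --admin-daemon /home/fcms/760_root.asok perf dump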

Thanks and regards,

Oliver.

#7

Updated by Sage Weil over 10 years ago

  • Status changed from New to 7
  • Assignee changed from Josh Durgin to Sage Weil
  • Source changed from other to Community (user)

Hi Oliver,

Okay, I think we spotted the (other) bug. I pushed another patch to the same branch; can you try again?

Thanks!

#8

Updated by Oliver Francke over 10 years ago

Hi Josh, Sage,

it looks very good right now; I could not get it to break, even with 4 more VMs producing I/O, deep-scrub, some OSD killing and remapping/backfilling (talking about a small 3-node lab cluster ;) )

Well, does that mean we should be able to deploy this load into the production environment without any hassle?

Thanks very much for that fix...

Oliver.

#9

Updated by Sage Weil over 10 years ago

  • Status changed from 7 to Pending Backport

Oliver Francke wrote:

Hi Josh, Sage,

it looks very good right now; I could not get it to break, even with 4 more VMs producing I/O, deep-scrub, some OSD killing and remapping/backfilling (talking about a small 3-node lab cluster ;) )

Well, does that mean we should be able to deploy this load into the production environment without any hassle?

Thanks very much for that fix...

Oliver.

Excellent! Yeah, go ahead and use it. I'll get the fix into mainline and backported to bobtail proper.

#10

Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved