Support #6455


dumpling cephFS on 3.12-rc1 qemu hung tasks

Added by Jens-Christian Fischer over 10 years ago. Updated about 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:

Description

We run 3.12-rc1 kernels, Ceph 0.67.3, and OpenStack Folsom with /var/lib/nova/instances on CephFS. The physical hosts share Ceph and OpenStack duties (i.e. they all run OSDs, some also run MONs, and they host nova-compute).

After the update to dumpling we see instability on all hosts; here's a capture from syslog on one of them:

Oct 2 15:45:57 h4 ntpd5944: peers refreshed
Oct 2 15:46:38 h4 kernel: [ 5595.799694] device vnet1 entered promiscuous mode
Oct 2 15:46:38 h4 kernel: [ 5595.823800] br100: port 3(vnet1) entered forwarding state
Oct 2 15:46:38 h4 kernel: [ 5595.823818] br100: port 3(vnet1) entered forwarding state
Oct 2 15:46:40 h4 ntpd5944: Listen normally on 17 vnet1 fe80::fc16:3eff:fe0d:2eb4 UDP 123
Oct 2 15:46:40 h4 ntpd5944: peers refreshed
Oct 2 15:46:40 h4 ntpd5944: new interface(s) found: waking up resolver
Oct 2 15:46:45 h4 kernel: [ 5602.945964] kvm [8232]: vcpu0 unhandled rdmsr: 0x345
Oct 2 15:46:45 h4 kernel: [ 5602.948088] kvm_set_msr_common: 187 callbacks suppressed
Oct 2 15:46:45 h4 kernel: [ 5602.948090] kvm [8232]: vcpu0 unhandled wrmsr: 0x680 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.950215] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c0 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.952322] kvm [8232]: vcpu0 unhandled wrmsr: 0x681 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.954439] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c1 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.956587] kvm [8232]: vcpu0 unhandled wrmsr: 0x682 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.958723] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c2 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.960855] kvm [8232]: vcpu0 unhandled wrmsr: 0x683 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.962973] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c3 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.965095] kvm [8232]: vcpu0 unhandled wrmsr: 0x684 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.967233] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c4 data 0
Oct 2 15:53:17 h4 kernel: [ 5994.406445] br100: port 3(vnet1) entered disabled state
Oct 2 15:53:17 h4 kernel: [ 5994.406604] device vnet1 left promiscuous mode
Oct 2 15:53:17 h4 kernel: [ 5994.406607] br100: port 3(vnet1) entered disabled state
Oct 2 15:53:19 h4 ntpd5944: Deleting interface #17 vnet1, fe80::fc16:3eff:fe0d:2eb4#123, interface stats: received=0, sent=0, dropped=0, active_time=399 secs
Oct 2 15:53:19 h4 ntpd5944: peers refreshed
Oct 2 15:53:51 h4 kernel: [ 6028.944544] libceph: osd4 down
Oct 2 15:53:55 h4 kernel: [ 6032.506465] libceph: osd34 down
Oct 2 15:53:55 h4 kernel: [ 6032.506471] libceph: osd36 down
Oct 2 15:53:59 h4 kernel: [ 6036.307228] libceph: osd34 up
Oct 2 15:53:59 h4 kernel: [ 6036.307246] libceph: osd5 down
Oct 2 15:53:59 h4 kernel: [ 6036.307247] libceph: osd7 down
Oct 2 15:54:21 h4 kernel: [ 6058.320486] libceph: osd6 down
Oct 2 15:58:52 h4 kernel: [ 6329.264386] libceph: osd4 weight 0x0 (out)
Oct 2 15:58:59 h4 kernel: [ 6336.980025] libceph: osd36 weight 0x0 (out)
Oct 2 15:59:19 h4 kernel: [ 6356.963857] libceph: osd5 weight 0x0 (out)
Oct 2 15:59:19 h4 kernel: [ 6356.963861] libceph: osd7 weight 0x0 (out)
Oct 2 15:59:19 h4 kernel: [ 6356.963896] libceph: osd6 weight 0x0 (out)
Oct 2 15:59:58 h4 kernel: [ 6395.581026] libceph: mon2 [2001:620:0:6::10c]:6789 socket closed (con state OPEN)
Oct 2 15:59:58 h4 kernel: [ 6395.581063] libceph: mon2 [2001:620:0:6::10c]:6789 session lost, hunting for new mon
Oct 2 16:00:27 h4 kernel: [ 6424.145194] libceph: mon0 [2001:620:0:6::106]:6789 socket closed (con state OPEN)
Oct 2 16:00:48 h4 kernel: [ 6445.262728] ceph: mds0 caps stale
Oct 2 16:00:48 h4 kernel: [ 6445.777796] libceph: mon1 [2001:620:0:6::108]:6789 session established
Oct 2 16:01:08 h4 kernel: [ 6465.248891] ceph: mds0 caps stale
Oct 2 16:03:19 h4 kernel: [ 6596.670226] INFO: task qemu-system-x86:10536 blocked for more than 120 seconds.
Oct 2 16:03:19 h4 kernel: [ 6596.672586] Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 2 16:03:19 h4 kernel: [ 6596.675103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 16:03:19 h4 kernel: [ 6596.677725] qemu-system-x86 D 0000000000000000 0 10536 1 0x00000000
Oct 2 16:03:19 h4 kernel: [ 6596.677733] ffff8809a76e1cf8 0000000000000002 0000000000000000 0000000000000000
Oct 2 16:03:19 h4 kernel: [ 6596.677739] ffff8809a76e1fd8 ffff8809a76e1fd8 ffff8809a76e1fd8 00000000000144c0
Oct 2 16:03:19 h4 kernel: [ 6596.677743] ffff88102978af40 ffff8807ea6e46e0 0000000000000000 ffff881f417fa4e8
Oct 2 16:03:19 h4 kernel: [ 6596.677748] Call Trace:
Oct 2 16:03:19 h4 kernel: [ 6596.677774] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 2 16:03:19 h4 kernel: [ 6596.677778] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 2 16:03:19 h4 kernel: [ 6596.677783] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 2 16:03:19 h4 kernel: [ 6596.677786] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 2 16:03:19 h4 kernel: [ 6596.677813] [<ffffffffa04adb3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 2 16:03:19 h4 kernel: [ 6596.677828] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 2 16:03:19 h4 kernel: [ 6596.677832] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 2 16:03:19 h4 kernel: [ 6596.677848] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.677854] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 2 16:03:19 h4 kernel: [ 6596.677857] INFO: task qemu-system-x86:10538 blocked for more than 120 seconds.
Oct 2 16:03:19 h4 kernel: [ 6596.680643] Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 2 16:03:19 h4 kernel: [ 6596.683622] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 16:03:19 h4 kernel: [ 6596.686821] qemu-system-x86 D 0000000000000000 0 10538 1 0x00000000
Oct 2 16:03:19 h4 kernel: [ 6596.686825] ffff881758123cf8 0000000000000002 ffffffff81157ece ffff88107fffbf00
Oct 2 16:03:19 h4 kernel: [ 6596.686829] ffff881758123fd8 ffff881758123fd8 ffff881758123fd8 00000000000144c0
Oct 2 16:03:19 h4 kernel: [ 6596.686833] ffff8810297797a0 ffff881f8a9e5e80 5350492074756f20 ffff881f417fa4e8
Oct 2 16:03:19 h4 kernel: [ 6596.686836] Call Trace:
Oct 2 16:03:19 h4 kernel: [ 6596.686842] [<ffffffff81157ece>] ? __alloc_pages_nodemask+0x18e/0xa30
Oct 2 16:03:19 h4 kernel: [ 6596.686846] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 2 16:03:19 h4 kernel: [ 6596.686857] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 2 16:03:19 h4 kernel: [ 6596.686862] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 2 16:03:19 h4 kernel: [ 6596.686867] [<ffffffff810a2251>] ? sched_slice.isra.41+0x51/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.686870] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 2 16:03:19 h4 kernel: [ 6596.686886] [<ffffffffa04adb3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 2 16:03:19 h4 kernel: [ 6596.686890] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 2 16:03:19 h4 kernel: [ 6596.686894] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 2 16:03:19 h4 kernel: [ 6596.686897] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.686907] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 2 16:03:19 h4 kernel: [ 6596.686909] INFO: task qemu-system-x86:10539 blocked for more than 120 seconds.
Oct 2 16:03:19 h4 kernel: [ 6596.690306] Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 2 16:03:19 h4 kernel: [ 6596.693823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 16:03:19 h4 kernel: [ 6596.697600] qemu-system-x86 D 0000000000000000 0 10539 1 0x00000000
Oct 2 16:03:19 h4 kernel: [ 6596.697604] ffff8807a5749cf8 0000000000000002 90666666fffffec3 c3f64b75021845f6
Oct 2 16:03:19 h4 kernel: [ 6596.697608] ffff8807a5749fd8 ffff8807a5749fd8 ffff8807a5749fd8 00000000000144c0
Oct 2 16:03:19 h4 kernel: [ 6596.697612] ffff88102978af40 ffff88096982de80 0000000000000000 ffff881f417fa4e8
Oct 2 16:03:19 h4 kernel: [ 6596.697615] Call Trace:
Oct 2 16:03:19 h4 kernel: [ 6596.697620] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 2 16:03:19 h4 kernel: [ 6596.697623] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 2 16:03:19 h4 kernel: [ 6596.697626] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 2 16:03:19 h4 kernel: [ 6596.697629] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 2 16:03:19 h4 kernel: [ 6596.697638] [<ffffffffa04adb3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 2 16:03:19 h4 kernel: [ 6596.697642] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 2 16:03:19 h4 kernel: [ 6596.697646] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 2 16:03:19 h4 kernel: [ 6596.697649] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.697652] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 2 16:03:30 h4 kernel: [ 6607.597118] ceph: mds0 reconnect start
Oct 2 16:04:19 h4 kernel: [ 6655.990924] ceph: mds0 reconnect start
Oct 2 16:04:19 h4 kernel: [ 6656.005471] ceph: mds0 reconnect success
Oct 2 16:05:05 h4 kernel: [ 6702.600585] ceph: mds0 recovery completed
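For reference, the relevant events can be pulled out of a capture like the one above with a small filter; `ceph_triage` is a hypothetical helper name, and the pattern is just a sketch covering the message types seen here:

```shell
# Filter a syslog capture for Ceph state changes and hung-task warnings.
# Matches: libceph osd/mon events, ceph mds messages, and the kernel's
# "blocked for more than N seconds" hung-task reports.
ceph_triage() {
  grep -E 'libceph: (osd[0-9]+|mon[0-9])|ceph: mds|blocked for more than' "$1"
}
```

Running it against the attached s0_syslog (for example, `ceph_triage s0_syslog`) should reduce the capture to the OSD/MON flaps, MDS reconnects, and hung-task reports without the surrounding noise.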

Is this kernel-specific (I have heard you recommend the 3.11 kernel with the VFS dentry backport) and solvable by downgrading the kernel, or is it Ceph-specific?

cheers
jc


Files

h4_dmesg.txt (242 KB) h4_dmesg.txt Jens-Christian Fischer, 10/02/2013 07:49 AM
s0_syslog (497 KB) s0_syslog Jens-Christian Fischer, 10/03/2013 02:13 AM
s0_osd_logs.zip (29.5 MB) s0_osd_logs.zip Jens-Christian Fischer, 10/03/2013 02:13 AM
