Support #6455

closed

dumpling CephFS on 3.12-rc1 qemu hung tasks

Added by Jens-Christian Fischer over 10 years ago. Updated about 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:

Description

We run 3.12-rc1 kernels, Ceph 0.67.3, and OpenStack Folsom with /var/lib/nova/instances on CephFS. The physical hosts share Ceph and OpenStack duties (i.e. they all run OSDs, some also run MONs, and they act as nova-compute nodes).
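
For reference, a kernel-client CephFS mount of the shared instances directory would look roughly like the sketch below; the monitor address, client name and secret file path are placeholders, not the values used here:

# Sketch only: mount CephFS with the kernel client on each compute host.
# Monitor address, client name and secretfile path are placeholders.
mount -t ceph 192.0.2.10:6789:/ /var/lib/nova/instances \
      -o name=admin,secretfile=/etc/ceph/admin.secret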

After the update to dumpling we see instability on all hosts; here's a capture from syslog on one of them:

Oct 2 15:45:57 h4 ntpd[5944]: peers refreshed
Oct 2 15:46:38 h4 kernel: [ 5595.799694] device vnet1 entered promiscuous mode
Oct 2 15:46:38 h4 kernel: [ 5595.823800] br100: port 3(vnet1) entered forwarding state
Oct 2 15:46:38 h4 kernel: [ 5595.823818] br100: port 3(vnet1) entered forwarding state
Oct 2 15:46:40 h4 ntpd[5944]: Listen normally on 17 vnet1 fe80::fc16:3eff:fe0d:2eb4 UDP 123
Oct 2 15:46:40 h4 ntpd[5944]: peers refreshed
Oct 2 15:46:40 h4 ntpd[5944]: new interface(s) found: waking up resolver
Oct 2 15:46:45 h4 kernel: [ 5602.945964] kvm [8232]: vcpu0 unhandled rdmsr: 0x345
Oct 2 15:46:45 h4 kernel: [ 5602.948088] kvm_set_msr_common: 187 callbacks suppressed
Oct 2 15:46:45 h4 kernel: [ 5602.948090] kvm [8232]: vcpu0 unhandled wrmsr: 0x680 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.950215] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c0 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.952322] kvm [8232]: vcpu0 unhandled wrmsr: 0x681 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.954439] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c1 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.956587] kvm [8232]: vcpu0 unhandled wrmsr: 0x682 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.958723] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c2 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.960855] kvm [8232]: vcpu0 unhandled wrmsr: 0x683 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.962973] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c3 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.965095] kvm [8232]: vcpu0 unhandled wrmsr: 0x684 data 0
Oct 2 15:46:45 h4 kernel: [ 5602.967233] kvm [8232]: vcpu0 unhandled wrmsr: 0x6c4 data 0
Oct 2 15:53:17 h4 kernel: [ 5994.406445] br100: port 3(vnet1) entered disabled state
Oct 2 15:53:17 h4 kernel: [ 5994.406604] device vnet1 left promiscuous mode
Oct 2 15:53:17 h4 kernel: [ 5994.406607] br100: port 3(vnet1) entered disabled state
Oct 2 15:53:19 h4 ntpd[5944]: Deleting interface #17 vnet1, fe80::fc16:3eff:fe0d:2eb4#123, interface stats: received=0, sent=0, dropped=0, active_time=399 secs
Oct 2 15:53:19 h4 ntpd[5944]: peers refreshed
Oct 2 15:53:51 h4 kernel: [ 6028.944544] libceph: osd4 down
Oct 2 15:53:55 h4 kernel: [ 6032.506465] libceph: osd34 down
Oct 2 15:53:55 h4 kernel: [ 6032.506471] libceph: osd36 down
Oct 2 15:53:59 h4 kernel: [ 6036.307228] libceph: osd34 up
Oct 2 15:53:59 h4 kernel: [ 6036.307246] libceph: osd5 down
Oct 2 15:53:59 h4 kernel: [ 6036.307247] libceph: osd7 down
Oct 2 15:54:21 h4 kernel: [ 6058.320486] libceph: osd6 down
Oct 2 15:58:52 h4 kernel: [ 6329.264386] libceph: osd4 weight 0x0 (out)
Oct 2 15:58:59 h4 kernel: [ 6336.980025] libceph: osd36 weight 0x0 (out)
Oct 2 15:59:19 h4 kernel: [ 6356.963857] libceph: osd5 weight 0x0 (out)
Oct 2 15:59:19 h4 kernel: [ 6356.963861] libceph: osd7 weight 0x0 (out)
Oct 2 15:59:19 h4 kernel: [ 6356.963896] libceph: osd6 weight 0x0 (out)
Oct 2 15:59:58 h4 kernel: [ 6395.581026] libceph: mon2 [2001:620:0:6::10c]:6789 socket closed (con state OPEN)
Oct 2 15:59:58 h4 kernel: [ 6395.581063] libceph: mon2 [2001:620:0:6::10c]:6789 session lost, hunting for new mon
Oct 2 16:00:27 h4 kernel: [ 6424.145194] libceph: mon0 [2001:620:0:6::106]:6789 socket closed (con state OPEN)
Oct 2 16:00:48 h4 kernel: [ 6445.262728] ceph: mds0 caps stale
Oct 2 16:00:48 h4 kernel: [ 6445.777796] libceph: mon1 [2001:620:0:6::108]:6789 session established
Oct 2 16:01:08 h4 kernel: [ 6465.248891] ceph: mds0 caps stale
Oct 2 16:03:19 h4 kernel: [ 6596.670226] INFO: task qemu-system-x86:10536 blocked for more than 120 seconds.
Oct 2 16:03:19 h4 kernel: [ 6596.672586] Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 2 16:03:19 h4 kernel: [ 6596.675103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 16:03:19 h4 kernel: [ 6596.677725] qemu-system-x86 D 0000000000000000 0 10536 1 0x00000000
Oct 2 16:03:19 h4 kernel: [ 6596.677733] ffff8809a76e1cf8 0000000000000002 0000000000000000 0000000000000000
Oct 2 16:03:19 h4 kernel: [ 6596.677739] ffff8809a76e1fd8 ffff8809a76e1fd8 ffff8809a76e1fd8 00000000000144c0
Oct 2 16:03:19 h4 kernel: [ 6596.677743] ffff88102978af40 ffff8807ea6e46e0 0000000000000000 ffff881f417fa4e8
Oct 2 16:03:19 h4 kernel: [ 6596.677748] Call Trace:
Oct 2 16:03:19 h4 kernel: [ 6596.677774] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 2 16:03:19 h4 kernel: [ 6596.677778] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 2 16:03:19 h4 kernel: [ 6596.677783] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 2 16:03:19 h4 kernel: [ 6596.677786] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 2 16:03:19 h4 kernel: [ 6596.677813] [<ffffffffa04adb3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 2 16:03:19 h4 kernel: [ 6596.677828] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 2 16:03:19 h4 kernel: [ 6596.677832] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 2 16:03:19 h4 kernel: [ 6596.677848] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.677854] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 2 16:03:19 h4 kernel: [ 6596.677857] INFO: task qemu-system-x86:10538 blocked for more than 120 seconds.
Oct 2 16:03:19 h4 kernel: [ 6596.680643] Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 2 16:03:19 h4 kernel: [ 6596.683622] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 16:03:19 h4 kernel: [ 6596.686821] qemu-system-x86 D 0000000000000000 0 10538 1 0x00000000
Oct 2 16:03:19 h4 kernel: [ 6596.686825] ffff881758123cf8 0000000000000002 ffffffff81157ece ffff88107fffbf00
Oct 2 16:03:19 h4 kernel: [ 6596.686829] ffff881758123fd8 ffff881758123fd8 ffff881758123fd8 00000000000144c0
Oct 2 16:03:19 h4 kernel: [ 6596.686833] ffff8810297797a0 ffff881f8a9e5e80 5350492074756f20 ffff881f417fa4e8
Oct 2 16:03:19 h4 kernel: [ 6596.686836] Call Trace:
Oct 2 16:03:19 h4 kernel: [ 6596.686842] [<ffffffff81157ece>] ? __alloc_pages_nodemask+0x18e/0xa30
Oct 2 16:03:19 h4 kernel: [ 6596.686846] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 2 16:03:19 h4 kernel: [ 6596.686857] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 2 16:03:19 h4 kernel: [ 6596.686862] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 2 16:03:19 h4 kernel: [ 6596.686867] [<ffffffff810a2251>] ? sched_slice.isra.41+0x51/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.686870] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 2 16:03:19 h4 kernel: [ 6596.686886] [<ffffffffa04adb3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 2 16:03:19 h4 kernel: [ 6596.686890] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 2 16:03:19 h4 kernel: [ 6596.686894] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 2 16:03:19 h4 kernel: [ 6596.686897] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.686907] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 2 16:03:19 h4 kernel: [ 6596.686909] INFO: task qemu-system-x86:10539 blocked for more than 120 seconds.
Oct 2 16:03:19 h4 kernel: [ 6596.690306] Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 2 16:03:19 h4 kernel: [ 6596.693823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 16:03:19 h4 kernel: [ 6596.697600] qemu-system-x86 D 0000000000000000 0 10539 1 0x00000000
Oct 2 16:03:19 h4 kernel: [ 6596.697604] ffff8807a5749cf8 0000000000000002 90666666fffffec3 c3f64b75021845f6
Oct 2 16:03:19 h4 kernel: [ 6596.697608] ffff8807a5749fd8 ffff8807a5749fd8 ffff8807a5749fd8 00000000000144c0
Oct 2 16:03:19 h4 kernel: [ 6596.697612] ffff88102978af40 ffff88096982de80 0000000000000000 ffff881f417fa4e8
Oct 2 16:03:19 h4 kernel: [ 6596.697615] Call Trace:
Oct 2 16:03:19 h4 kernel: [ 6596.697620] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 2 16:03:19 h4 kernel: [ 6596.697623] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 2 16:03:19 h4 kernel: [ 6596.697626] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 2 16:03:19 h4 kernel: [ 6596.697629] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 2 16:03:19 h4 kernel: [ 6596.697638] [<ffffffffa04adb3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 2 16:03:19 h4 kernel: [ 6596.697642] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 2 16:03:19 h4 kernel: [ 6596.697646] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 2 16:03:19 h4 kernel: [ 6596.697649] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 2 16:03:19 h4 kernel: [ 6596.697652] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 2 16:03:30 h4 kernel: [ 6607.597118] ceph: mds0 reconnect start

Oct 2 16:04:19 h4 kernel: [ 6655.990924] ceph: mds0 reconnect start
Oct 2 16:04:19 h4 kernel: [ 6656.005471] ceph: mds0 reconnect success
Oct 2 16:05:05 h4 kernel: [ 6702.600585] ceph: mds0 recovery completed

Is this something kernel specific (I have heard that you recommend the 3.11 kernel with the vfs direntry backport) that can be solved by downgrading the kernel, or is it Ceph specific?

cheers
jc


Files

h4_dmesg.txt (242 KB) Jens-Christian Fischer, 10/02/2013 07:49 AM
s0_syslog (497 KB) Jens-Christian Fischer, 10/03/2013 02:13 AM
s0_osd_logs.zip (29.5 MB) Jens-Christian Fischer, 10/03/2013 02:13 AM
Actions #1

Updated by Zheng Yan over 10 years ago

I need the following information to debug the issue.

dmesg -c >/dev/null; echo w >/proc/sysrq-trigger; dmesg
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc
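
A minimal sketch of capturing all of that in one go the next time tasks hang (assuming debugfs is mounted at /sys/kernel/debug and sysrq is enabled):

# Sketch: collect the requested debug output while the hang is happening.
dmesg -c > /dev/null           # clear the kernel ring buffer
echo w > /proc/sysrq-trigger   # dump stacks of blocked (D-state) tasks
dmesg > /tmp/hung-task-stacks.txt
for f in /sys/kernel/debug/ceph/*/mdsc /sys/kernel/debug/ceph/*/osdc; do
    echo "== $f ==" >> /tmp/ceph-inflight.txt
    cat "$f" >> /tmp/ceph-inflight.txt
done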

Actions #2

Updated by Jens-Christian Fischer over 10 years ago

dmesgs attached, the mdsc and osdc files are empty:

root@h4:~# ll /sys/kernel/debug/ceph/2fe1a358-7f88-4a1b-a31b-ba7501870c80.client202732/
total 0
drwxr-xr-x 2 root root 0 Oct 2 14:17 ./
drwxr-xr-x 3 root root 0 Oct 2 14:17 ../
lrwxrwxrwx 1 root root 0 Oct 2 14:17 bdi -> ../../bdi/ceph-3/
-r-------- 1 root root 0 Oct 2 14:17 caps
-rw------- 1 root root 0 Oct 2 14:17 dentry_lru
-rw------- 1 root root 0 Oct 2 14:17 mdsc
-rw------- 1 root root 0 Oct 2 14:17 mdsmap
-rw------- 1 root root 0 Oct 2 14:17 monc
-rw------- 1 root root 0 Oct 2 14:17 monmap
-rw------- 1 root root 0 Oct 2 14:17 osdc
-rw------- 1 root root 0 Oct 2 14:17 osdmap
-rw------- 1 root root 0 Oct 2 14:17 writeback_congestion_kb

Actions #3

Updated by Greg Farnum over 10 years ago

  • Tracker changed from Bug to Support
  • Project changed from Ceph to Linux kernel client

What's the status of the cluster when you see those hangs? Do you have any symptoms that let you detect the hang while it's in-progress and get the contents of those files then?
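
One way to get the contents of those files while a hang is in progress is a small watchdog that snapshots them periodically; a rough sketch (the 30-second interval and output directory are arbitrary):

# Sketch: snapshot the in-flight MDS/OSD request lists every 30 seconds so the
# state at the moment of a hang is preserved for later inspection.
mkdir -p /var/tmp/ceph-debug
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    for f in /sys/kernel/debug/ceph/*/mdsc /sys/kernel/debug/ceph/*/osdc; do
        [ -e "$f" ] && cp "$f" "/var/tmp/ceph-debug/$ts-$(basename "$f")"
    done
    sleep 30
done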

Actions #4

Updated by Zheng Yan over 10 years ago

dmesg shows there was no blocked task. It's likely the hung task message was caused by slow requests or lock contention.
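
If slow requests are the cause, they should also be visible on the cluster side while the qemu tasks are blocked; standard checks would be (osd.4 below is only an example id):

# Check for slow/blocked requests while the guests are hung.
ceph health detail      # reports slow request warnings and the OSDs involved, if any
ceph -w                 # watch the cluster log for "slow request" messages
# Per-OSD view via the admin socket (osd.4 is just an example):
ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_ops_in_flight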

Actions #5

Updated by Jens-Christian Fischer over 10 years ago

The cluster was quite busy yesterday (because we had a number of hosts that had to be rebooted). Right now it is HEALTH_OK, so we will see how it fares today.

We suspect OpenStack-related file/disk I/O issues (as I said, we have our VM images on CephFS, shared among the hosts).

Actions #6

Updated by Jens-Christian Fischer over 10 years ago

Some more information:

In a healthy cluster, I spun up 12 VMs that installed some software, compiled, and did CPU-intensive things (FS-intensive work would have been next). That ran fine, and I could create snapshots from some of those images. When I terminated those instances, things went bad and one of the physical servers locked up. Attached is the syslog of that server, which shows the problem. The problems start at 10:37:57.

On this server, we have two OSDs (OSD.2 and OSD.3) that generally take a lot longer to come up and/or don't come up without a lot of kicking. I have attached the log files of those two OSDs as well.

I will now try to reproduce that behaviour as soon as the cluster is clean again.
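
The workload is essentially a boot/snapshot/terminate cycle; roughly along these lines (the flavor, image id and counts are placeholders, not the exact values used):

# Sketch of the reproduction workload: boot a batch of VMs, snapshot a few, delete them all.
for i in $(seq 1 12); do
    nova boot --flavor m1.small --image <image-id> "stress-$i"
done
# ...let the guests install and compile for a while, then snapshot and terminate:
nova image-create stress-1 stress-1-snap
for i in $(seq 1 12); do
    nova delete "stress-$i"
done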

Actions #7

Updated by Jens-Christian Fischer over 10 years ago

Even more info: we did a bit of stress testing with OpenStack (without s0, the failed server), creating and terminating a lot of instances while the cluster was degraded, and things held up nicely. Bringing s0 (the host that failed) back into the OpenStack cluster and doing the same tests immediately killed s0 again.

Not sure if this is a hardware problem or a software problem (ceph -s on s0 causes the following failure):

root@s0:~# ceph -s
Traceback (most recent call last):
  File "/usr/bin/ceph", line 774, in <module>
    sys.exit(main())
  File "/usr/bin/ceph", line 559, in main
    conf_defaults=conf_defaults, conffile=conffile)
TypeError: __init__() got an unexpected keyword argument 'clustername'

I will now re-install ceph on s0, and if that doesn't help, take the server out in the backyard with a 2x4.
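
For what it's worth, a TypeError like the one above usually means /usr/bin/ceph and the installed Python bindings come from different Ceph versions, so comparing package versions before reinstalling may be worth it; a quick check on Debian/Ubuntu would be something like:

# Check that all Ceph-related packages on s0 are at the same version.
dpkg -l | grep -E 'ceph|rados|rbd'
# Confirm which python rados module the CLI would import:
python -c 'import rados; print(rados.__file__)'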

Actions #8

Updated by Jens-Christian Fischer over 10 years ago

more info:

Disabled Ceph on s0, but kept using it as an OpenStack compute node. Started 20 VMs and terminated them. s0 and two other servers went down:

Oct 3 15:18:38 s0 kernel: [ 6924.389640] BUG: unable to handle kernel paging request at 000060dfd9004518
Oct 3 15:18:38 s0 kernel: [ 6924.389652] IP: [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 3 15:18:38 s0 kernel: [ 6924.389666] PGD 0
Oct 3 15:18:38 s0 kernel: [ 6924.389670] Oops: 0000 [#1] SMP
Oct 3 15:18:38 s0 kernel: [ 6924.389675] Modules linked in: vhost_net vhost macvtap macvlan ebt_arp ebt_ip xt_state ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xfs ceph libceph joydev hid_generic usbhid hid bonding sp5100_tco 8021q garp stp mrp llc dm_multipath scsi_dh psmouse amd64_edac_mod edac_core serio_raw fam15h_power k10temp edac_mce_amd i2c_piix4 ohci_pci kvm_amd mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c pata_acpi igb ixgbe pata_atiixp i2c_algo_bit dca ahci ptp libahci pps_core mdio
Oct 3 15:18:38 s0 kernel: [ 6924.389789] CPU: 2 PID: 15882 Comm: kworker/2:1 Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 3 15:18:38 s0 kernel: [ 6924.389798] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.0 09/10/2012
Oct 3 15:18:38 s0 kernel: [ 6924.389807] Workqueue: events css_killed_work_fn
Oct 3 15:18:38 s0 kernel: [ 6924.389813] task: ffff8806bb2797a0 ti: ffff8816a6f9a000 task.ti: ffff8816a6f9a000
Oct 3 15:18:38 s0 kernel: [ 6924.389819] RIP: 0010:[<ffffffff811b1fba>] [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 3 15:18:38 s0 kernel: [ 6924.389830] RSP: 0018:ffff8816a6f9bc48 EFLAGS: 00010002
Oct 3 15:18:38 s0 kernel: [ 6924.389835] RAX: 0000000000000246 RBX: ffff881ffec5fe10 RCX: 00000000ffffffff
Oct 3 15:18:38 s0 kernel: [ 6924.389840] RDX: 0000000000000000 RSI: 000060dfd9004500 RDI: ffffc9001e64022c
Oct 3 15:18:38 s0 kernel: [ 6924.389846] RBP: ffff8816a6f9bca8 R08: ffffc9000c066000 R09: 0000000000000001
Oct 3 15:18:38 s0 kernel: [ 6924.389852] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea007d97f840
Oct 3 15:18:38 s0 kernel: [ 6924.389857] R13: 0000000000000001 R14: ffffc9001e640000 R15: ffffc9001e640000
Oct 3 15:18:38 s0 kernel: [ 6924.389864] FS: 00007fdbadc4d700(0000) GS:ffff880807c80000(0000) knlGS:0000000000000000
Oct 3 15:18:38 s0 kernel: [ 6924.389870] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 3 15:18:38 s0 kernel: [ 6924.389875] CR2: 000060dfd9004518 CR3: 0000000714196000 CR4: 00000000000407e0
Oct 3 15:18:38 s0 kernel: [ 6924.389881] Stack:
Oct 3 15:18:38 s0 kernel: [ 6924.389884] ffffea007d97f840 ffffea007d97f840 ffff8816a6f9bc00 ffffc9000c066000
Oct 3 15:18:38 s0 kernel: [ 6924.389902] ffff8816a6f9bc78 ffffc9001e64022c ffff8816a6f9bca8 ffffea007d97f840
Oct 3 15:18:38 s0 kernel: [ 6924.389916] ffffc9001e640000 0000000000000001 ffff881ffec5fe10 0000000000000000
Oct 3 15:18:38 s0 kernel: [ 6924.389930] Call Trace:
Oct 3 15:18:38 s0 kernel: [ 6924.389939] [<ffffffff811b21e3>] mem_cgroup_move_parent+0xd3/0x1a0
Oct 3 15:18:38 s0 kernel: [ 6924.389948] [<ffffffff811b2d1a>] mem_cgroup_force_empty_list+0xaa/0x130
Oct 3 15:18:38 s0 kernel: [ 6924.389957] [<ffffffff811b3425>] mem_cgroup_reparent_charges+0xb5/0x140
Oct 3 15:18:38 s0 kernel: [ 6924.389967] [<ffffffff811b3609>] mem_cgroup_css_offline+0x59/0xc0
Oct 3 15:18:38 s0 kernel: [ 6924.389975] [<ffffffff810e6aef>] css_killed_work_fn+0x4f/0xe0
Oct 3 15:18:38 s0 kernel: [ 6924.389984] [<ffffffff81083d0f>] process_one_work+0x17f/0x4d0
Oct 3 15:18:38 s0 kernel: [ 6924.389992] [<ffffffff81084f4b>] worker_thread+0x11b/0x3d0
Oct 3 15:18:38 s0 kernel: [ 6924.390000] [<ffffffff81084e30>] ? manage_workers.isra.20+0x1b0/0x1b0
Oct 3 15:18:38 s0 kernel: [ 6924.390010] [<ffffffff8108c0d0>] kthread+0xc0/0xd0
Oct 3 15:18:38 s0 kernel: [ 6924.390018] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 3 15:18:38 s0 kernel: [ 6924.390028] [<ffffffff8175876c>] ret_from_fork+0x7c/0xb0
Oct 3 15:18:38 s0 kernel: [ 6924.390036] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 3 15:18:38 s0 kernel: [ 6924.390042] Code: 45 c8 e8 8a d5 59 00 0f b6 55 b0 44 89 e9 4c 8b 45 b8 f7 d9 84 d2 75 37 41 8b 74 24 18 85 f6 78 2e 49 8b b6 30 02 00 00 45 89 e9 <4c> 39 4e 18 0f 8c af 00 00 00 49 8b b7 30 02 00 00 89 cf 65 48
Oct 3 15:18:38 s0 kernel: [ 6924.390097] RIP [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 3 15:18:38 s0 kernel: [ 6924.390105] RSP <ffff8816a6f9bc48>
Oct 3 15:18:38 s0 kernel: [ 6924.390108] CR2: 000060dfd9004518
Oct 3 15:18:38 s0 kernel: [ 6924.390116] ---[ end trace 6939d1043ae5891c ]---

This is from the host that had no ceph processes running...

As a kernel newbie, I read this as pointing towards a kernel problem. I'm now looking at downgrading the kernel to 3.11 and including the vfs_direntry patch...

Actions #9

Updated by Zheng Yan about 8 years ago

  • Status changed from New to Closed