Bug #6450

Kernel bugs in 3.12-rc1, taking 2 hosts (and 1 following) down

Added by Jens-Christian Fischer over 10 years ago. Updated over 10 years ago.

Status: Closed
Priority: High
% Done: 0%
Source: Community (user)
Severity: 1 - critical

Description

We are running 10 hosts with 74 OSDs on Ubuntu 13.04, Ceph 0.61.8, and kernel 3.12-rc1:

root@h5:~# ceph --version
ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
root@h5:~# uname -a
Linux h5 3.12.0-031200rc1-generic #201309161735 SMP Mon Sep 16 21:38:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

This morning I put all 10 hosts back under the control of Puppet, which caused 4 hosts to update from 0.61.7 to 0.61.8 and restarted all OSD processes (due to changes in ceph.conf that got distributed to all servers). The restarts were staggered, and I had a healthy cluster.
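
A rough sketch of what such a staggered restart amounts to (the OSD ids are hypothetical, and the actual sequencing was done by Puppet, not this loop):

# restart one OSD at a time, waiting for HEALTH_OK between daemons
for id in 12 13 14 15; do
    sudo service ceph restart osd.$id
    until ceph health | grep -q HEALTH_OK; do sleep 10; done
done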

I then started 50 VMs on OpenStack (backed by RBD devices, with CephFS as the shared storage for instance images). They started, ran for a while, and I terminated them. I saw Ceph doing up to 5000 IOPS during this run.
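
Roughly, the run was equivalent to this (a hypothetical sketch; the image and flavor names are made up):

for i in $(seq 1 50); do
    nova boot --image ubuntu-13.04 --flavor m1.small load-test-$i
done
ceph -w    # watch cluster I/O; this is where the ~5000 op/s showed up
for i in $(seq 1 50); do nova delete load-test-$i; done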

After this (maybe 15-30 minutes later), two of our hosts died. The syslog on one of them shows the following; it must have been around the time when I shut down the VMs:

Oct 1 11:24:59 h4 kernel: [325337.712260] ------------[ cut here ]------------
Oct 1 11:24:59 h4 kernel: [325337.712285] WARNING: CPU: 4 PID: 20343 at /home/apw/COD/linux/fs/ceph/inode.c:468 ceph_fill_file_size+0x1b2/0x1e0 [ceph]()
Oct 1 11:24:59 h4 kernel: [325337.712296] Modules linked in: xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph joydev hid_generic gpio_ich 8021q garp stp dm_multipath mrp scsi_dh llc ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt usbhid mei_me hid ses mei sb_edac enclosure edac_core shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c ixgbe mpt2sas isci mdio igb raid_class ahci libsas libahci i2c_algo_bit dca scsi_transport_sas ptp pps_core
Oct 1 11:24:59 h4 kernel: [325337.712359] CPU: 4 PID: 20343 Comm: kworker/4:2 Not tainted 3.12.0-031200rc1-generic #201309161735
Oct 1 11:24:59 h4 kernel: [325337.712361] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A20 01/25/2013
Oct 1 11:24:59 h4 kernel: [325337.712370] Workqueue: ceph-msgr con_work [libceph]
Oct 1 11:24:59 h4 kernel: [325337.712373] 00000000000001d4 ffff880be80f1aa8 ffffffff817420dd 0000000000000007
Oct 1 11:24:59 h4 kernel: [325337.712377] 0000000000000000 ffff880be80f1ae8 ffffffff8106759c ffff880be80f1af8
Oct 1 11:24:59 h4 kernel: [325337.712381] ffff8805212e1940 0000000000000002 0000000000000000 0000000000000000
Oct 1 11:24:59 h4 kernel: [325337.712384] Call Trace:
Oct 1 11:24:59 h4 kernel: [325337.712391] [<ffffffff817420dd>] dump_stack+0x46/0x58
Oct 1 11:24:59 h4 kernel: [325337.712397] [<ffffffff8106759c>] warn_slowpath_common+0x8c/0xc0
Oct 1 11:24:59 h4 kernel: [325337.712400] [<ffffffff810675ea>] warn_slowpath_null+0x1a/0x20
Oct 1 11:24:59 h4 kernel: [325337.712407] [<ffffffffa04a4da2>] ceph_fill_file_size+0x1b2/0x1e0 [ceph]
Oct 1 11:24:59 h4 kernel: [325337.712417] [<ffffffffa04b36fd>] handle_cap_trunc.isra.28+0x7d/0xe0 [ceph]
Oct 1 11:24:59 h4 kernel: [325337.712436] [<ffffffffa04b7e47>] ceph_handle_caps+0x347/0x440 [ceph]
Oct 1 11:24:59 h4 kernel: [325337.712450] [<ffffffffa04c345d>] dispatch+0xcd/0x180 [ceph]
Oct 1 11:24:59 h4 kernel: [325337.712460] [<ffffffffa0457d75>] process_message+0x95/0x190 [libceph]
Oct 1 11:24:59 h4 kernel: [325337.712470] [<ffffffffa045c4e0>] ? read_partial_message+0x170/0x4f0 [libceph]
Oct 1 11:24:59 h4 kernel: [325337.712476] [<ffffffff81625c16>] ? kernel_recvmsg+0x46/0x60
Oct 1 11:24:59 h4 kernel: [325337.712484] [<ffffffffa0458598>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Oct 1 11:24:59 h4 kernel: [325337.712494] [<ffffffffa045cb39>] try_read+0x2d9/0x5a0 [libceph]
Oct 1 11:24:59 h4 kernel: [325337.712503] [<ffffffffa045cedb>] con_work+0xdb/0x3d0 [libceph]
Oct 1 11:24:59 h4 kernel: [325337.712509] [<ffffffff81083d0f>] process_one_work+0x17f/0x4d0
Oct 1 11:24:59 h4 kernel: [325337.712514] [<ffffffff81084f4b>] worker_thread+0x11b/0x3d0
Oct 1 11:24:59 h4 kernel: [325337.712518] [<ffffffff81084e30>] ? manage_workers.isra.20+0x1b0/0x1b0
Oct 1 11:24:59 h4 kernel: [325337.712524] [<ffffffff8108c0d0>] kthread+0xc0/0xd0
Oct 1 11:24:59 h4 kernel: [325337.712529] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:24:59 h4 kernel: [325337.712535] [<ffffffff8175876c>] ret_from_fork+0x7c/0xb0
Oct 1 11:24:59 h4 kernel: [325337.712540] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:24:59 h4 kernel: [325337.712543] ---[ end trace 4d6281075935d94a ]---
Oct 1 11:25:00 h4 kernel: [325339.142250] device vnet7 entered promiscuous mode

followed by this:

Oct 1 11:52:16 h4 kernel: [326974.043752] br100: port 4(vnet2) entered disabled state
Oct 1 11:52:17 h4 kernel: [326974.406035] BUG: unable to handle kernel paging request at 000060df80006818
Oct 1 11:52:17 h4 kernel: [326974.408396] IP: [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 1 11:52:17 h4 kernel: [326974.410871] PGD 0
Oct 1 11:52:17 h4 kernel: [326974.413312] Oops: 0000 [#1] SMP
Oct 1 11:52:17 h4 kernel: [326974.415848] Modules linked in: xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph joydev hid_generic gpio_ich 8021q garp stp dm_multipath mrp scsi_dh llc ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt usbhid mei_me hid ses mei sb_edac enclosure edac_core shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c ixgbe mpt2sas isci mdio igb raid_class ahci libsas libahci i2c_algo_bit dca scsi_transport_sas ptp pps_core
Oct 1 11:52:17 h4 kernel: [326974.444715] CPU: 11 PID: 2084 Comm: kworker/11:8 Tainted: G W 3.12.0-031200rc1-generic #201309161735
Oct 1 11:52:17 h4 kernel: [326974.449973] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A20 01/25/2013
Oct 1 11:52:17 h4 kernel: [326974.455274] Workqueue: events css_killed_work_fn
Oct 1 11:52:17 h4 kernel: [326974.460634] task: ffff881a992717a0 ti: ffff881530e7c000 task.ti: ffff881530e7c000
Oct 1 11:52:17 h4 kernel: [326974.466264] RIP: 0010:[<ffffffff811b1fba>] [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 1 11:52:17 h4 kernel: [326974.472220] RSP: 0018:ffff881530e7dc48 EFLAGS: 00010046
Oct 1 11:52:17 h4 kernel: [326974.478345] RAX: 0000000000000246 RBX: ffff88103ca85070 RCX: 00000000ffffffff
Oct 1 11:52:17 h4 kernel: [326974.484650] RDX: 0000000000000000 RSI: 000060df80006800 RDI: ffffc9001ecee22c
Oct 1 11:52:17 h4 kernel: [326974.491101] RBP: ffff881530e7dca8 R08: ffffc9000c066000 R09: 0000000000000001
Oct 1 11:52:17 h4 kernel: [326974.497663] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea000d8141c0
Oct 1 11:52:17 h4 kernel: [326974.504404] R13: 0000000000000001 R14: ffffc9001ecee000 R15: ffffc9001ecee000
Oct 1 11:52:17 h4 kernel: [326974.511170] FS: 0000000000000000(0000) GS:ffff88207fca0000(0000) knlGS:0000000000000000
Oct 1 11:52:17 h4 kernel: [326974.518114] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 1 11:52:17 h4 kernel: [326974.525073] CR2: 000060df80006818 CR3: 0000000001c0d000 CR4: 00000000000427e0
Oct 1 11:52:17 h4 kernel: [326974.532255] Stack:
Oct 1 11:52:17 h4 kernel: [326974.539363] ffffea000d8141c0 ffffea000d8141c0 ffff881530e7dc00 ffffc9000c066000
Oct 1 11:52:17 h4 kernel: [326974.546831] ffff881530e7dc78 ffffc9001ecee22c ffff881530e7dca8 ffffea000d8141c0
Oct 1 11:52:17 h4 kernel: [326974.554354] ffffc9001ecee000 0000000000000001 ffff88103ca85070 0000000000000000
Oct 1 11:52:17 h4 kernel: [326974.562031] Call Trace:
Oct 1 11:52:17 h4 kernel: [326974.569700] [<ffffffff811b21e3>] mem_cgroup_move_parent+0xd3/0x1a0
Oct 1 11:52:17 h4 kernel: [326974.577633] [<ffffffff811b2d1a>] mem_cgroup_force_empty_list+0xaa/0x130
Oct 1 11:52:17 h4 kernel: [326974.585645] [<ffffffff811b3425>] mem_cgroup_reparent_charges+0xb5/0x140
Oct 1 11:52:17 h4 kernel: [326974.593732] [<ffffffff811b3609>] mem_cgroup_css_offline+0x59/0xc0
Oct 1 11:52:17 h4 kernel: [326974.601877] [<ffffffff810e6aef>] css_killed_work_fn+0x4f/0xe0
Oct 1 11:52:17 h4 kernel: [326974.610097] [<ffffffff81083d0f>] process_one_work+0x17f/0x4d0
Oct 1 11:52:17 h4 kernel: [326974.618254] [<ffffffff81084f4b>] worker_thread+0x11b/0x3d0
Oct 1 11:52:17 h4 kernel: [326974.626352] [<ffffffff81084e30>] ? manage_workers.isra.20+0x1b0/0x1b0
Oct 1 11:52:17 h4 kernel: [326974.634571] [<ffffffff8108c0d0>] kthread+0xc0/0xd0
Oct 1 11:52:17 h4 kernel: [326974.642832] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:52:17 h4 kernel: [326974.651237] [<ffffffff8175876c>] ret_from_fork+0x7c/0xb0
Oct 1 11:52:17 h4 kernel: [326974.659720] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:52:17 h4 kernel: [326974.668329] Code: 45 c8 e8 8a d5 59 00 0f b6 55 b0 44 89 e9 4c 8b 45 b8 f7 d9 84 d2 75 37 41 8b 74 24 18 85 f6 78 2e 49 8b b6 30 02 00 00 45 89 e9 <4c> 39 4e 18 0f 8c af 00 00 00 49 8b b7 30 02 00 00 89 cf 65 48
Oct 1 11:52:17 h4 kernel: [326974.686948] RIP [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 1 11:52:17 h4 kernel: [326974.696696] RSP <ffff881530e7dc48>
Oct 1 11:52:17 h4 kernel: [326974.706420] CR2: 000060df80006818
Oct 1 11:52:17 h4 kernel: [326974.791247] ---[ end trace 4d6281075935d94b ]---
Oct 1 11:52:18 h4 ntpd[27091]: Deleting interface #16 vnet3, fe80::fc16:3eff:fe6c:25c1#123, interface stats: received=0, sent=0, dropped=0, active_time=1748 secs

Oct 1 11:52:45 h4 kernel: [326974.957733] BUG: unable to handle kernel paging request at ffffffffffffffd8
Oct 1 11:52:45 h4 kernel: [326974.967334] IP: [<ffffffff8108c480>] kthread_data+0x10/0x20
Oct 1 11:52:45 h4 kernel: [326974.976848] PGD 1c10067 PUD 1c12067 PMD 0
Oct 1 11:52:45 h4 kernel: [326974.986433] Oops: 0000 [#2] SMP
Oct 1 11:52:45 h4 kernel: [326974.995768] Modules linked in: xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph joydev hid_generic gpio_ich 8021q garp stp dm_multipath mrp scsi_dh llc ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt usbhid mei_me hid ses mei sb_edac enclosure edac_core shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c ixgbe mpt2sas isci mdio igb raid_class ahci libsas libahci i2c_algo_bit dca scsi_transport_sas ptp pps_core
Oct 1 11:52:45 h4 kernel: [326975.063307] CPU: 11 PID: 2084 Comm: kworker/11:8 Tainted: G D W 3.12.0-031200rc1-generic #201309161735
Oct 1 11:52:45 h4 kernel: [326975.072867] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A20 01/25/2013
Oct 1 11:52:45 h4 kernel: [326975.082483] task: ffff881a992717a0 ti: ffff881530e7c000 task.ti: ffff881530e7c000
Oct 1 11:52:45 h4 kernel: [326975.092075] RIP: 0010:[<ffffffff8108c480>] [<ffffffff8108c480>] kthread_data+0x10/0x20
Oct 1 11:52:45 h4 kernel: [326975.101642] RSP: 0018:ffff881530e7d858 EFLAGS: 00010046
Oct 1 11:52:45 h4 kernel: [326975.111048] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81eb3f40
Oct 1 11:52:45 h4 kernel: [326975.120544] RDX: 0000000000000000 RSI: 000000000000000b RDI: ffff881a992717a0
Oct 1 11:52:45 h4 kernel: [326975.129937] RBP: ffff881530e7d858 R08: 0000000000000004 R09: ffffea0011c7e600
Oct 1 11:52:45 h4 kernel: [326975.139295] R10: 0000000000000000 R11: 00012961ca33c2a1 R12: 000000000000000b
Oct 1 11:52:45 h4 kernel: [326975.148653] R13: ffff881a99271bd0 R14: ffff881a992717a0 R15: 0000000000000046
Oct 1 11:52:45 h4 kernel: [326975.158031] FS: 0000000000000000(0000) GS:ffff88207fca0000(0000) knlGS:0000000000000000
Oct 1 11:52:45 h4 kernel: [326975.167495] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 1 11:52:45 h4 kernel: [326975.176860] CR2: 0000000000000028 CR3: 0000000001c0d000 CR4: 00000000000427e0
Oct 1 11:52:45 h4 kernel: [326975.186293] Stack:
Oct 1 11:52:45 h4 kernel: [326975.195534] ffff881530e7d878 ffffffff810858a6 ffff88207fcb44c0 000000000000000b
Oct 1 11:52:45 h4 kernel: [326975.204990] ffff881530e7d8f8 ffffffff8174ceef ffffea0011c7e400 ffff881a99271df0
Oct 1 11:52:45 h4 kernel: [326975.214428] ffff881530e7dfd8 ffff881530e7dfd8 ffff881530e7dfd8 00000000000144c0
Oct 1 11:52:45 h4 kernel: [326975.223844] Call Trace:
Oct 1 11:52:45 h4 kernel: [326975.233079] [<ffffffff810858a6>] wq_worker_sleeping+0x16/0x90
Oct 1 11:52:45 h4 kernel: [326975.242421] [<ffffffff8174ceef>] __schedule+0x5ff/0x730
Oct 1 11:52:45 h4 kernel: [326975.251762] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 11:52:45 h4 kernel: [326975.261128] [<ffffffff8106a375>] do_exit+0x2b5/0x460
Oct 1 11:52:45 h4 kernel: [326975.270438] [<ffffffff810bdfac>] ? kmsg_dump+0x9c/0xc0
Oct 1 11:52:45 h4 kernel: [326975.279749] [<ffffffff81750bc3>] oops_end+0xc3/0x160
Oct 1 11:52:45 h4 kernel: [326975.289049] [<ffffffff81733724>] no_context+0x1ab/0x1ba
Oct 1 11:52:45 h4 kernel: [326975.298337] [<ffffffff81733906>] __bad_area_nosemaphore+0x1d3/0x1f2
Oct 1 11:52:45 h4 kernel: [326975.307660] [<ffffffff81733938>] bad_area_nosemaphore+0x13/0x15
Oct 1 11:52:45 h4 kernel: [326975.316989] [<ffffffff81753b82>] __do_page_fault+0x3d2/0x580
Oct 1 11:52:45 h4 kernel: [326975.326298] [<ffffffff8101cfb3>] ? native_sched_clock+0x13/0x80
Oct 1 11:52:45 h4 kernel: [326975.335625] [<ffffffff8101d029>] ? sched_clock+0x9/0x10
Oct 1 11:52:45 h4 kernel: [326975.344895] [<ffffffff8109fd7d>] ? sched_clock_cpu+0xbd/0x110
Oct 1 11:52:45 h4 kernel: [326975.353956] [<ffffffff810a0a8a>] ? arch_vtime_task_switch+0x8a/0x90
Oct 1 11:52:45 h4 kernel: [326975.362876] [<ffffffff810a0acd>] ? vtime_common_task_switch+0x3d/0x50
Oct 1 11:52:45 h4 kernel: [326975.371671] [<ffffffff81099ab8>] ? finish_task_switch+0x108/0x170
Oct 1 11:52:45 h4 kernel: [326975.380388] [<ffffffff81753d4a>] do_page_fault+0x1a/0x70
Oct 1 11:52:45 h4 kernel: [326975.389137] [<ffffffff8174fe58>] page_fault+0x28/0x30
Oct 1 11:52:45 h4 kernel: [326975.397862] [<ffffffff811b1fba>] ? mem_cgroup_move_account+0xda/0x230
Oct 1 11:52:45 h4 kernel: [326975.406586] [<ffffffff811b1f96>] ? mem_cgroup_move_account+0xb6/0x230
Oct 1 11:52:45 h4 kernel: [326975.415139] [<ffffffff811b21e3>] mem_cgroup_move_parent+0xd3/0x1a0
Oct 1 11:52:45 h4 kernel: [326975.423657] [<ffffffff811b2d1a>] mem_cgroup_force_empty_list+0xaa/0x130
Oct 1 11:52:45 h4 kernel: [326975.432199] [<ffffffff811b3425>] mem_cgroup_reparent_charges+0xb5/0x140
Oct 1 11:52:45 h4 kernel: [326975.440517] [<ffffffff811b3609>] mem_cgroup_css_offline+0x59/0xc0
Oct 1 11:52:45 h4 kernel: [326975.448576] [<ffffffff810e6aef>] css_killed_work_fn+0x4f/0xe0
Oct 1 11:52:45 h4 kernel: [326975.456374] [<ffffffff81083d0f>] process_one_work+0x17f/0x4d0
Oct 1 11:52:45 h4 kernel: [326975.463956] [<ffffffff81084f4b>] worker_thread+0x11b/0x3d0
Oct 1 11:52:45 h4 kernel: [326975.471246] [<ffffffff81084e30>] ? manage_workers.isra.20+0x1b0/0x1b0
Oct 1 11:52:45 h4 kernel: [326975.478337] [<ffffffff8108c0d0>] kthread+0xc0/0xd0
Oct 1 11:52:45 h4 kernel: [326975.485184] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:52:45 h4 kernel: [326975.491889] [<ffffffff8175876c>] ret_from_fork+0x7c/0xb0
Oct 1 11:52:45 h4 kernel: [326975.498375] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:52:45 h4 kernel: [326975.504813] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 c0 03 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
Oct 1 11:52:45 h4 kernel: [326975.518699] RIP [<ffffffff8108c480>] kthread_data+0x10/0x20
Oct 1 11:52:45 h4 kernel: [326975.525353] RSP <ffff881530e7d858>
Oct 1 11:52:45 h4 kernel: [326975.531976] CR2: ffffffffffffffd8
Oct 1 11:52:45 h4 kernel: [326975.538542] ---[ end trace 4d6281075935d94c ]---
Oct 1 11:52:45 h4 kernel: [326975.550057] Fixing recursive fault but reboot is needed!
Oct 1 11:52:45 h4 kernel: [326997.151503] ------------[ cut here ]------------
Oct 1 11:52:45 h4 kernel: [326997.152170] WARNING: CPU: 11 PID: 2084 at /home/apw/COD/linux/kernel/watchdog.c:245 watchdog_overflow_callback+0x9a/0xc0()
Oct 1 11:52:45 h4 kernel: [326997.152832] Watchdog detected hard LOCKUP on cpu 11
Oct 1 11:52:45 h4 kernel: [326997.152850] Modules linked in: xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph joydev hid_generic gpio_ich 8021q garp stp dm_multipath mrp scsi_dh llc ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt usbhid mei_me hid ses mei sb_edac enclosure edac_core shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c ixgbe mpt2sas isci mdio igb raid_class ahci libsas libahci i2c_algo_bit dca scsi_transport_sas ptp pps_core
Oct 1 11:52:45 h4 kernel: [326997.158790] CPU: 11 PID: 2084 Comm: kworker/11:8 Tainted: G D W 3.12.0-031200rc1-generic #201309161735
Oct 1 11:52:45 h4 kernel: [326997.159594] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A20 01/25/2013
Oct 1 11:52:45 h4 kernel: [326997.160434] 00000000000000f5 ffff88207fca7ba8 ffffffff817420dd 0000000000000007
Oct 1 11:52:45 h4 kernel: [326997.161269] ffff88207fca7bf8 ffff88207fca7be8 ffffffff8106759c 0000000000000000
Oct 1 11:52:45 h4 kernel: [326997.162109] ffff882028878000 0000000000000000 ffff88207fca7d20 0000000000000000
Oct 1 11:52:45 h4 kernel: [326997.162954] Call Trace:
Oct 1 11:52:45 h4 kernel: [326997.163781] <NMI> [<ffffffff817420dd>] dump_stack+0x46/0x58
Oct 1 11:52:45 h4 kernel: [326997.164634] [<ffffffff8106759c>] warn_slowpath_common+0x8c/0xc0
Oct 1 11:52:45 h4 kernel: [326997.165474] [<ffffffff81067686>] warn_slowpath_fmt+0x46/0x50
Oct 1 11:52:45 h4 kernel: [326997.166315] [<ffffffff81107e9a>] watchdog_overflow_callback+0x9a/0xc0
Oct 1 11:52:45 h4 kernel: [326997.167163] [<ffffffff8114560c>] __perf_event_overflow+0x9c/0x230
Oct 1 11:52:45 h4 kernel: [326997.168023] [<ffffffff8102acb8>] ? x86_perf_event_set_period+0xd8/0x150
Oct 1 11:52:45 h4 kernel: [326997.168868] [<ffffffff81145f14>] perf_event_overflow+0x14/0x20
Oct 1 11:52:45 h4 kernel: [326997.169721] [<ffffffff8103250e>] intel_pmu_handle_irq+0x1ae/0x2a0
Oct 1 11:52:45 h4 kernel: [326997.170597] [<ffffffff81189261>] ? unmap_kernel_range_noflush+0x11/0x20
Oct 1 11:52:45 h4 kernel: [326997.171448] [<ffffffff8143066c>] ? ghes_copy_tofrom_phys+0x11c/0x220
Oct 1 11:52:45 h4 kernel: [326997.172303] [<ffffffff81751684>] perf_event_nmi_handler+0x34/0x60
Oct 1 11:52:45 h4 kernel: [326997.173152] [<ffffffff81750dda>] nmi_handle.isra.3+0x8a/0x1a0
Oct 1 11:52:45 h4 kernel: [326997.174022] [<ffffffff81431730>] ? ghes_print_estatus.constprop.10+0x70/0x70
Oct 1 11:52:45 h4 kernel: [326997.174883] [<ffffffff81750fd8>] default_do_nmi+0x58/0x240
Oct 1 11:52:45 h4 kernel: [326997.175745] [<ffffffff81751250>] do_nmi+0x90/0xd0
Oct 1 11:52:45 h4 kernel: [326997.176611] [<ffffffff817501c1>] end_repeat_nmi+0x1e/0x2e
Oct 1 11:52:45 h4 kernel: [326997.177487] [<ffffffff8174f72f>] ? _raw_spin_lock_irq+0x3f/0x60
Oct 1 11:52:45 h4 kernel: [326997.178356] [<ffffffff8174f72f>] ? _raw_spin_lock_irq+0x3f/0x60
Oct 1 11:52:45 h4 kernel: [326997.179209] [<ffffffff8174f72f>] ? _raw_spin_lock_irq+0x3f/0x60
Oct 1 11:52:45 h4 kernel: [326997.180052] <<EOE>> [<ffffffff8174c99a>] __schedule+0xaa/0x730
Oct 1 11:52:45 h4 kernel: [326997.180918] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 11:52:45 h4 kernel: [326997.181777] [<ffffffff8106a4f5>] do_exit+0x435/0x460
Oct 1 11:52:45 h4 kernel: [326997.182628] [<ffffffff81750bc3>] oops_end+0xc3/0x160
Oct 1 11:52:45 h4 kernel: [326997.183489] [<ffffffff81733724>] no_context+0x1ab/0x1ba
Oct 1 11:52:45 h4 kernel: [326997.184332] [<ffffffff81733906>] __bad_area_nosemaphore+0x1d3/0x1f2
Oct 1 11:52:45 h4 kernel: [326997.185154] [<ffffffff81732f5f>] ? pmd_offset+0x1a/0x20
Oct 1 11:52:45 h4 kernel: [326997.185956] [<ffffffff81733938>] bad_area_nosemaphore+0x13/0x15
Oct 1 11:52:45 h4 kernel: [326997.186764] [<ffffffff81753b82>] __do_page_fault+0x3d2/0x580
Oct 1 11:52:45 h4 kernel: [326997.187554] [<ffffffff8110b50a>] ? kfree_call_rcu+0x1a/0x30
Oct 1 11:52:45 h4 kernel: [326997.188348] [<ffffffff810a48f1>] ? update_curr+0x141/0x200
Oct 1 11:52:45 h4 kernel: [326997.189138] [<ffffffff81753d4a>] do_page_fault+0x1a/0x70
Oct 1 11:52:45 h4 kernel: [326997.189929] [<ffffffff8174fe58>] page_fault+0x28/0x30
Oct 1 11:52:45 h4 kernel: [326997.190712] [<ffffffff8108c480>] ? kthread_data+0x10/0x20
Oct 1 11:52:45 h4 kernel: [326997.191480] [<ffffffff810858a6>] wq_worker_sleeping+0x16/0x90
Oct 1 11:52:45 h4 kernel: [326997.192228] [<ffffffff8174ceef>] __schedule+0x5ff/0x730
Oct 1 11:52:45 h4 kernel: [326997.192957] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 11:52:45 h4 kernel: [326997.193653] [<ffffffff8106a375>] do_exit+0x2b5/0x460
Oct 1 11:52:45 h4 kernel: [326997.194328] [<ffffffff810bdfac>] ? kmsg_dump+0x9c/0xc0
Oct 1 11:52:45 h4 kernel: [326997.194977] [<ffffffff81750bc3>] oops_end+0xc3/0x160
Oct 1 11:52:45 h4 kernel: [326997.195609] [<ffffffff81733724>] no_context+0x1ab/0x1ba
Oct 1 11:52:45 h4 kernel: [326997.196218] [<ffffffff81733906>] __bad_area_nosemaphore+0x1d3/0x1f2
Oct 1 11:52:45 h4 kernel: [326997.196803] [<ffffffff81733938>] bad_area_nosemaphore+0x13/0x15
Oct 1 11:52:45 h4 kernel: [326997.197366] [<ffffffff81753b82>] __do_page_fault+0x3d2/0x580
Oct 1 11:52:45 h4 kernel: [326997.197926] [<ffffffff8101cfb3>] ? native_sched_clock+0x13/0x80
Oct 1 11:52:45 h4 kernel: [326997.198467] [<ffffffff8101d029>] ? sched_clock+0x9/0x10
Oct 1 11:52:45 h4 kernel: [326997.199018] [<ffffffff8109fd7d>] ? sched_clock_cpu+0xbd/0x110
Oct 1 11:52:45 h4 kernel: [326997.199556] [<ffffffff810a0a8a>] ? arch_vtime_task_switch+0x8a/0x90
Oct 1 11:52:45 h4 kernel: [326997.200094] [<ffffffff810a0acd>] ? vtime_common_task_switch+0x3d/0x50
Oct 1 11:52:45 h4 kernel: [326997.200630] [<ffffffff81099ab8>] ? finish_task_switch+0x108/0x170
Oct 1 11:52:45 h4 kernel: [326997.201162] [<ffffffff81753d4a>] do_page_fault+0x1a/0x70
Oct 1 11:52:45 h4 kernel: [326997.201690] [<ffffffff8174fe58>] page_fault+0x28/0x30
Oct 1 11:52:45 h4 kernel: [326997.202226] [<ffffffff811b1fba>] ? mem_cgroup_move_account+0xda/0x230
Oct 1 11:52:45 h4 kernel: [326997.202754] [<ffffffff811b1f96>] ? mem_cgroup_move_account+0xb6/0x230
Oct 1 11:52:45 h4 kernel: [326997.203279] [<ffffffff811b21e3>] mem_cgroup_move_parent+0xd3/0x1a0
Oct 1 11:52:45 h4 kernel: [326997.203809] [<ffffffff811b2d1a>] mem_cgroup_force_empty_list+0xaa/0x130
Oct 1 11:52:45 h4 kernel: [326997.204326] [<ffffffff811b3425>] mem_cgroup_reparent_charges+0xb5/0x140
Oct 1 11:52:45 h4 kernel: [326997.204837] [<ffffffff811b3609>] mem_cgroup_css_offline+0x59/0xc0
Oct 1 11:52:45 h4 kernel: [326997.205357] [<ffffffff810e6aef>] css_killed_work_fn+0x4f/0xe0
Oct 1 11:52:45 h4 kernel: [326997.205875] [<ffffffff81083d0f>] process_one_work+0x17f/0x4d0
Oct 1 11:52:45 h4 kernel: [326997.206383] [<ffffffff81084f4b>] worker_thread+0x11b/0x3d0
Oct 1 11:52:45 h4 kernel: [326997.206887] [<ffffffff81084e30>] ? manage_workers.isra.20+0x1b0/0x1b0
Oct 1 11:52:45 h4 kernel: [326997.207391] [<ffffffff8108c0d0>] kthread+0xc0/0xd0
Oct 1 11:52:45 h4 kernel: [326997.207895] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:52:45 h4 kernel: [326997.208411] [<ffffffff8175876c>] ret_from_fork+0x7c/0xb0
Oct 1 11:52:45 h4 kernel: [326997.208909] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 11:52:45 h4 kernel: [326997.209426] ---[ end trace 4d6281075935d94d ]---
Oct 1 11:52:45 h4 kernel: [326997.209951] perf samples too long (462690 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
Oct 1 11:52:45 h4 kernel: [326997.210501] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 58.992 msecs
Oct 1 11:52:45 h4 kernel: [327002.690411] libceph: osd56 down
Oct 1 11:52:45 h4 kernel: [327002.696142] libceph: osd58 down

Oct 1 11:52:48 h4 kernel: [327005.680087] BUG: soft lockup - CPU#2 stuck for 22s! [qemu-system-x86:26887]
Oct 1 11:52:48 h4 kernel: [327005.680710] Modules linked in: xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph joydev hid_generic gpio_ich 8021q garp stp dm_multipath mrp scsi_dh llc ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt usbhid mei_me hid ses mei sb_edac enclosure edac_core shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c ixgbe mpt2sas isci mdio igb raid_class ahci libsas libahci i2c_algo_bit dca scsi_transport_sas ptp pps_core
Oct 1 11:52:48 h4 kernel: [327005.685816] CPU: 2 PID: 26887 Comm: qemu-system-x86 Tainted: G D W 3.12.0-031200rc1-generic #201309161735
Oct 1 11:52:48 h4 kernel: [327005.686625] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A20 01/25/2013
Oct 1 11:52:48 h4 kernel: [327005.687418] task: ffff880d9d118000 ti: ffff880212d7e000 task.ti: ffff880212d7e000
Oct 1 11:52:48 h4 kernel: [327005.688226] RIP: 0010:[<ffffffff810d6ec2>] [<ffffffff810d6ec2>] generic_exec_single+0x82/0xb0
Oct 1 11:52:48 h4 kernel: [327005.689079] RSP: 0018:ffff880212d7fb48 EFLAGS: 00000202
Oct 1 11:52:48 h4 kernel: [327005.689914] RAX: 0000000000000100 RBX: 0000000000000296 RCX: 0000000000000010
Oct 1 11:52:48 h4 kernel: [327005.690773] RDX: 0000000000000010 RSI: 0000000000000100 RDI: 0000000000000296
Oct 1 11:52:48 h4 kernel: [327005.691606] RBP: ffff880212d7fb88 R08: ffff88103fc4dda0 R09: 0000000000000100
Oct 1 11:52:48 h4 kernel: [327005.692456] R10: ffff88103fc4ebb0 R11: 0000000000000002 R12: 0000000000000296
Oct 1 11:52:48 h4 kernel: [327005.693319] R13: ffff880212d7fb18 R14: ffff88103fd2dda0 R15: ffff88103fc4dd80
Oct 1 11:52:48 h4 kernel: [327005.694194] FS: 00007f7ee5b95700(0000) GS:ffff88103fc40000(0000) knlGS:0000000000000000
Oct 1 11:52:48 h4 kernel: [327005.695080] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 1 11:52:48 h4 kernel: [327005.695926] CR2: 00007fe443777000 CR3: 0000001b63557000 CR4: 00000000000427e0
Oct 1 11:52:48 h4 kernel: [327005.696817] Stack:
Oct 1 11:52:48 h4 kernel: [327005.697693] ffff881027927e00 ffff88103fd34f80 ffff880212d7fbc8 000000000000000f
Oct 1 11:52:48 h4 kernel: [327005.698615] ffffffffa02c6700 0000000000000002 ffffffff81d0e740 0000000000000001
Oct 1 11:52:48 h4 kernel: [327005.699543] ffff880212d7fbf8 ffffffff810d6fc5 0000000000000000 0000000000000000
Oct 1 11:52:48 h4 kernel: [327005.700472] Call Trace:
Oct 1 11:52:48 h4 kernel: [327005.701383] [<ffffffffa02c6700>] ? init_rmode_identity_map+0x120/0x120 [kvm_intel]
Oct 1 11:52:48 h4 kernel: [327005.702324] [<ffffffff810d6fc5>] smp_call_function_single+0xd5/0x160
Oct 1 11:52:48 h4 kernel: [327005.703289] [<ffffffffa02c6700>] ? init_rmode_identity_map+0x120/0x120 [kvm_intel]
Oct 1 11:52:48 h4 kernel: [327005.704244] [<ffffffffa02cac75>] vmx_vcpu_load+0x1a5/0x1c0 [kvm_intel]
Oct 1 11:52:48 h4 kernel: [327005.705208] [<ffffffff8109fd7d>] ? sched_clock_cpu+0xbd/0x110
Oct 1 11:52:48 h4 kernel: [327005.706188] [<ffffffffa026bb9c>] kvm_arch_vcpu_load+0x3c/0x1f0 [kvm]
Oct 1 11:52:48 h4 kernel: [327005.707216] [<ffffffffa02559f5>] kvm_sched_in+0x25/0x30 [kvm]
Oct 1 11:52:48 h4 kernel: [327005.708177] [<ffffffff81099a31>] finish_task_switch+0x81/0x170
Oct 1 11:52:48 h4 kernel: [327005.709129] [<ffffffff8174ccc5>] __schedule+0x3d5/0x730
Oct 1 11:52:48 h4 kernel: [327005.710074] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 11:52:48 h4 kernel: [327005.711099] [<ffffffffa025a13d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
Oct 1 11:52:48 h4 kernel: [327005.712076] [<ffffffff8108cc30>] ? add_wait_queue+0x60/0x60
Oct 1 11:52:48 h4 kernel: [327005.713054] [<ffffffffa026fdbd>] __vcpu_run+0xdd/0x2f0 [kvm]
Oct 1 11:52:48 h4 kernel: [327005.714026] [<ffffffffa027006d>] kvm_arch_vcpu_ioctl_run+0x9d/0x170 [kvm]
Oct 1 11:52:48 h4 kernel: [327005.715007] [<ffffffffa025830b>] kvm_vcpu_ioctl+0x43b/0x600 [kvm]
Oct 1 11:52:48 h4 kernel: [327005.715974] [<ffffffff811104ac>] ? acct_account_cputime+0x1c/0x20
Oct 1 11:52:48 h4 kernel: [327005.716932] [<ffffffff810a0249>] ? account_user_time+0x99/0xb0
Oct 1 11:52:48 h4 kernel: [327005.717894] [<ffffffff811cee8a>] do_vfs_ioctl+0x7a/0x2e0
Oct 1 11:52:48 h4 kernel: [327005.718833] [<ffffffff81022695>] ? syscall_trace_enter+0x165/0x280
Oct 1 11:52:48 h4 kernel: [327005.719757] [<ffffffff811cf181>] SyS_ioctl+0x91/0xb0
Oct 1 11:52:48 h4 kernel: [327005.720689] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 1 11:52:48 h4 kernel: [327005.721613] Code: 8b 55 08 49 89 5d 08 4c 89 2b 48 89 53 08 48 89 1a e8 73 87 67 00 4c 3b 6d c8 74 2b 45 85 ff 75 0a eb 0e 66 0f 1f 44 00 00 f3 90 <f6> 43 20 01 75 f8 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75
Oct 1 11:52:48 h4 kernel: [327005.879935] BUG: soft lockup - CPU#12 stuck for 22s! [qemu-system-x86:26366]
Oct 1 11:52:48 h4 kernel: [327005.881477] Modules linked in: xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph joydev hid_generic gpio_ich 8021q garp stp dm_multipath mrp scsi_dh llc ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt usbhid mei_me hid ses mei sb_edac enclosure edac_core shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel mac_hid kvm lp parport btrfs xor raid6_pq libcrc32c ixgbe mpt2sas isci mdio igb raid_class ahci libsas libahci i2c_algo_bit dca scsi_transport_sas ptp pps_core
Oct 1 11:52:48 h4 kernel: [327005.892524] CPU: 12 PID: 26366 Comm: qemu-system-x86 Tainted: G D W 3.12.0-031200rc1-generic #201309161735
Oct 1 11:52:48 h4 kernel: [327005.894091] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A20 01/25/2013
Oct 1 11:52:48 h4 kernel: [327005.895658] task: ffff8813ef2f17a0 ti: ffff8811a595c000 task.ti: ffff8811a595c000

A few minutes later, another host died:

Oct 1 11:52:54 h1 kernel: [332358.802657] libceph: osd56 down
Oct 1 11:52:54 h1 kernel: [332358.802662] libceph: osd58 down
Oct 1 11:52:54 h1 kernel: [332358.802709] libceph: osd13 down
Oct 1 11:52:54 h1 kernel: [332358.802710] libceph: osd59 down
Oct 1 11:52:54 h1 kernel: [332358.802711] libceph: osd60 down
Oct 1 11:52:54 h1 kernel: [332358.802712] libceph: osd63 down
Oct 1 11:52:54 h1 kernel: [332358.802732] libceph: osd12 down
Oct 1 11:52:54 h1 kernel: [332358.802733] libceph: osd15 down
Oct 1 11:52:59 h1 kernel: [332363.799463] libceph: osd68 down
Oct 1 11:52:59 h1 kernel: [332363.799593] libceph: osd14 down
Oct 1 11:52:59 h1 kernel: [332363.799594] libceph: osd62 down
Oct 1 11:52:59 h1 kernel: [332363.799653] libceph: osd61 down
Oct 1 11:52:59 h1 kernel: [332363.799654] libceph: osd64 down
Oct 1 11:53:04 h1 kernel: [332368.330879] libceph: osd73 down
Oct 1 11:53:09 h1 kernel: [332373.326771] libceph: osd57 down
Oct 1 11:53:26 h1 kernel: [332390.160136] libceph: osd59 up
Oct 1 11:53:36 h1 kernel: [332400.594259] libceph: osd59 down
Oct 1 11:55:01 h1 CRON[8080]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 1 11:57:45 h1 kernel: [332649.917471] libceph: osd56 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602766] libceph: osd13 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602770] libceph: osd15 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602772] libceph: osd58 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602773] libceph: osd60 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602787] libceph: osd12 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602788] libceph: osd63 weight 0x0 (out)
Oct 1 11:57:54 h1 kernel: [332658.602789] libceph: osd68 weight 0x0 (out)
Oct 1 11:57:58 h1 kernel: [332662.429372] libceph: osd14 weight 0x0 (out)
Oct 1 11:57:58 h1 kernel: [332662.429378] libceph: osd62 weight 0x0 (out)
Oct 1 11:58:02 h1 kernel: [332666.562412] libceph: osd61 weight 0x0 (out)
Oct 1 11:58:02 h1 kernel: [332666.562418] libceph: osd64 weight 0x0 (out)
Oct 1 11:58:02 h1 kernel: [332666.562421] libceph: osd73 weight 0x0 (out)
Oct 1 11:59:50 h1 kernel: [332774.560060] ceph: mds0 caps stale
Oct 1 12:00:10 h1 kernel: [332794.545750] ceph: mds0 caps stale
Oct 1 12:01:00 h1 kernel: [332844.602539] libceph: osd57 weight 0x0 (out)
Oct 1 12:01:00 h1 kernel: [332844.602543] libceph: osd59 weight 0x0 (out)
Oct 1 12:03:39 h1 kernel: [333002.780710] INFO: task qemu-system-x86:9475 blocked for more than 120 seconds.
Oct 1 12:03:39 h1 kernel: [333002.782852] Tainted: G W 3.12.0-031200rc1-generic #201309161735
Oct 1 12:03:39 h1 kernel: [333002.785163] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:03:39 h1 kernel: [333002.787664] qemu-system-x86 D 0000000000000000 0 9475 1 0x00000000
Oct 1 12:03:39 h1 kernel: [333002.787670] ffff88036e2e1cf8 0000000000000002 6ff774b222fb4e75 d3a27e99b2a52360
Oct 1 12:03:39 h1 kernel: [333002.787678] ffff88036e2e1fd8 ffff88036e2e1fd8 ffff88036e2e1fd8 00000000000144c0
Oct 1 12:03:39 h1 kernel: [333002.787682] ffff8810297a17a0 ffff880ea6bdaf40 8e1115bc2237e181 ffff881c072e8968
Oct 1 12:03:39 h1 kernel: [333002.787688] Call Trace:
Oct 1 12:03:39 h1 kernel: [333002.787701] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 12:03:39 h1 kernel: [333002.787706] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 1 12:03:39 h1 kernel: [333002.787711] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 1 12:03:39 h1 kernel: [333002.787715] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 1 12:03:39 h1 kernel: [333002.787739] [<ffffffffa0457b3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 1 12:03:39 h1 kernel: [333002.787746] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 1 12:03:39 h1 kernel: [333002.787750] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 1 12:03:39 h1 kernel: [333002.787752] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 1 12:03:39 h1 kernel: [333002.787757] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 1 12:03:39 h1 kernel: [333002.787761] INFO: task qemu-system-x86:9517 blocked for more than 120 seconds.
Oct 1 12:03:39 h1 kernel: [333002.790415] Tainted: G W 3.12.0-031200rc1-generic #201309161735
Oct 1 12:03:39 h1 kernel: [333002.793221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:03:39 h1 kernel: [333002.796261] qemu-system-x86 D 0000000000000000 0 9517 1 0x00000000
Oct 1 12:03:39 h1 kernel: [333002.796269] ffff880a8f6d7cf8 0000000000000002 21b02f900080ffff 000007b000000000
Oct 1 12:03:39 h1 kernel: [333002.796276] ffff880a8f6d7fd8 ffff880a8f6d7fd8 ffff880a8f6d7fd8 00000000000144c0
Oct 1 12:03:39 h1 kernel: [333002.796281] ffff8810297b2f40 ffff880d92a097a0 0000000000000007 ffff881c072e8968
Oct 1 12:03:39 h1 kernel: [333002.796286] Call Trace:
Oct 1 12:03:39 h1 kernel: [333002.796298] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 12:03:39 h1 kernel: [333002.796302] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 1 12:03:39 h1 kernel: [333002.796307] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 1 12:03:39 h1 kernel: [333002.796311] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 1 12:03:39 h1 kernel: [333002.796335] [<ffffffffa0457b3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 1 12:03:39 h1 kernel: [333002.796344] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 1 12:03:39 h1 kernel: [333002.796349] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 1 12:03:39 h1 kernel: [333002.796354] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 1 12:03:39 h1 kernel: [333002.796358] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 1 12:03:39 h1 kernel: [333002.796364] INFO: task qemu-system-x86:9359 blocked for more than 120 seconds.
Oct 1 12:03:39 h1 kernel: [333002.799584] Tainted: G W 3.12.0-031200rc1-generic #201309161735
Oct 1 12:03:39 h1 kernel: [333002.803001] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:03:39 h1 kernel: [333002.806664] qemu-system-x86 D 0000000000000000 0 9359 1 0x00000000
Oct 1 12:03:39 h1 kernel: [333002.806668] ffff880d2f5c9cf8 0000000000000002 0000000000000000 0000000000000000
Oct 1 12:03:39 h1 kernel: [333002.806674] ffff880d2f5c9fd8 ffff880d2f5c9fd8 ffff880d2f5c9fd8 00000000000144c0
Oct 1 12:03:39 h1 kernel: [333002.806678] ffff8810297b46e0 ffff881fce9a2f40 0000000000000000 ffff8809baba8968
Oct 1 12:03:39 h1 kernel: [333002.806686] Call Trace:
Oct 1 12:03:39 h1 kernel: [333002.806694] [<ffffffff8174dd59>] schedule+0x29/0x70
Oct 1 12:03:39 h1 kernel: [333002.806698] [<ffffffff8174e08e>] schedule_preempt_disabled+0xe/0x10
Oct 1 12:03:39 h1 kernel: [333002.806702] [<ffffffff8174c004>] __mutex_lock_slowpath+0x114/0x1b0
Oct 1 12:03:39 h1 kernel: [333002.806706] [<ffffffff8174c0c3>] mutex_lock+0x23/0x40
Oct 1 12:03:39 h1 kernel: [333002.806721] [<ffffffffa0457b3a>] ceph_aio_write+0x8a/0x4b0 [ceph]
Oct 1 12:03:39 h1 kernel: [333002.806728] [<ffffffff81097d5b>] ? perf_event_task_sched_out+0x8b/0xa0
Oct 1 12:03:39 h1 kernel: [333002.806734] [<ffffffff811bc00a>] do_sync_write+0x5a/0x90
Oct 1 12:03:39 h1 kernel: [333002.806738] [<ffffffff811bcb0e>] vfs_write+0xce/0x200
Oct 1 12:03:39 h1 kernel: [333002.806743] [<ffffffff811bd192>] SyS_pwrite64+0x92/0xa0
Oct 1 12:03:39 h1 kernel: [333002.806748] [<ffffffff81758a2f>] tracesys+0xe1/0xe6
Oct 1 12:03:42 h1 kernel: [333005.990282] ceph: mds0 caps renewed
Oct 1 12:05:01 h1 CRON[10995]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 1 12:05:10 h1 kernel: [333094.331099] ceph: mds0 caps stale
Oct 1 12:05:30 h1 kernel: [333114.316791] ceph: mds0 caps stale
Oct 1 12:05:34 h1 kernel: [333117.844735] ceph: mds0 caps renewed
Oct 1 12:06:30 h1 kernel: [333174.273870] ceph: mds0 caps stale
Oct 1 12:06:50 h1 kernel: [333194.259543] ceph: mds0 caps stale

and it died 3 minutes later:

Oct 1 12:09:27 h1 kernel: [333350.208067] BUG: unable to handle kernel paging request at 000060df80008868
Oct 1 12:09:27 h1 kernel: [333350.211816] IP: [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 1 12:09:27 h1 kernel: [333350.215718] PGD 0
Oct 1 12:09:27 h1 kernel: [333350.219626] Oops: 0000 [#1] SMP
Oct 1 12:09:27 h1 kernel: [333350.223645] Modules linked in: xt_multiport xt_state vhost_net vhost macvtap macvlan ebt_arp ebt_ip ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge nbd ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs ceph libceph gpio_ich 8021q dm_multipath ses garp enclosure scsi_dh stp mrp ast llc ttm drm_kms_helper drm sb_edac syscopyarea edac_core sysfillrect sysimgblt mei_me mei shpchp lpc_ich ipmi_si ipmi_msghandler kvm_intel kvm mac_hid lp parport btrfs xor raid6_pq libcrc32c igb ixgbe mpt2sas i2c_algo_bit raid_class isci mdio dca ahci libsas ptp libahci scsi_transport_sas pps_core
Oct 1 12:09:27 h1 kernel: [333350.261862] CPU: 10 PID: 10046 Comm: kworker/10:21 Tainted: G W 3.12.0-031200rc1-generic #201309161735
Oct 1 12:09:27 h1 kernel: [333350.268554] Hardware name: Quanta S210-X22RQ/S210-X22RQ, BIOS S2RQ3A19 10/26/2012
Oct 1 12:09:27 h1 kernel: [333350.275416] Workqueue: events css_killed_work_fn
Oct 1 12:09:27 h1 kernel: [333350.282358] task: ffff881be4918000 ti: ffff881d5849c000 task.ti: ffff881d5849c000
Oct 1 12:09:27 h1 kernel: [333350.289587] RIP: 0010:[<ffffffff811b1fba>] [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 1 12:09:27 h1 kernel: [333350.296978] RSP: 0018:ffff881d5849dc48 EFLAGS: 00010002
Oct 1 12:09:27 h1 kernel: [333350.304432] RAX: 0000000000000246 RBX: ffff88103f13dbf0 RCX: 00000000ffffffff
Oct 1 12:09:27 h1 kernel: [333350.312115] RDX: 0000000000000000 RSI: 000060df80008850 RDI: ffffc9001fd7622c
Oct 1 12:09:27 h1 kernel: [333350.319929] RBP: ffff881d5849dca8 R08: ffffc9000c066000 R09: 0000000000000001
Oct 1 12:09:27 h1 kernel: [333350.327893] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00012f6fc0
Oct 1 12:09:27 h1 kernel: [333350.335987] R13: 0000000000000001 R14: ffffc9001fd76000 R15: ffffc9001fd76000
Oct 1 12:09:27 h1 kernel: [333350.344135] FS: 0000000000000000(0000) GS:ffff88207fc80000(0000) knlGS:0000000000000000
Oct 1 12:09:27 h1 kernel: [333350.352491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 1 12:09:27 h1 kernel: [333350.360920] CR2: 000060df80008868 CR3: 0000000001c0d000 CR4: 00000000000427e0
Oct 1 12:09:27 h1 kernel: [333350.369536] Stack:
Oct 1 12:09:27 h1 kernel: [333350.378144] ffffea00012f6fc0 ffffea00012f6fc0 ffff881d5849dc00 ffffc9000c066000
Oct 1 12:09:27 h1 kernel: [333350.387016] ffff881d5849dc78 ffffc9001fd7622c ffff881d5849dca8 ffffea00012f6fc0
Oct 1 12:09:27 h1 kernel: [333350.395986] ffffc9001fd76000 0000000000000001 ffff88103f13dbf0 0000000000000000
Oct 1 12:09:27 h1 kernel: [333350.405021] Call Trace:
Oct 1 12:09:27 h1 kernel: [333350.414076] [<ffffffff811b21e3>] mem_cgroup_move_parent+0xd3/0x1a0
Oct 1 12:09:27 h1 kernel: [333350.423412] [<ffffffff811b2d1a>] mem_cgroup_force_empty_list+0xaa/0x130
Oct 1 12:09:27 h1 kernel: [333350.432841] [<ffffffff811b3425>] mem_cgroup_reparent_charges+0xb5/0x140
Oct 1 12:09:27 h1 kernel: [333350.442342] [<ffffffff811b3609>] mem_cgroup_css_offline+0x59/0xc0
Oct 1 12:09:27 h1 kernel: [333350.451898] [<ffffffff810e6aef>] css_killed_work_fn+0x4f/0xe0
Oct 1 12:09:27 h1 kernel: [333350.461532] [<ffffffff81083d0f>] process_one_work+0x17f/0x4d0
Oct 1 12:09:27 h1 kernel: [333350.471243] [<ffffffff81084f4b>] worker_thread+0x11b/0x3d0
Oct 1 12:09:27 h1 kernel: [333350.481044] [<ffffffff81084e30>] ? manage_workers.isra.20+0x1b0/0x1b0
Oct 1 12:09:27 h1 kernel: [333350.491068] [<ffffffff8108c0d0>] kthread+0xc0/0xd0
Oct 1 12:09:27 h1 kernel: [333350.500900] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 12:09:27 h1 kernel: [333350.505102] type=1400 audit(1380622167.061:80): apparmor="STATUS" operation="profile_remove" name="libvirt-bfb84fec-5b70-49e7-8477-bb08b88ef6a3" pid=14391 comm="apparmor_parser"
Oct 1 12:09:27 h1 kernel: [333350.530779] [<ffffffff8175876c>] ret_from_fork+0x7c/0xb0
Oct 1 12:09:27 h1 kernel: [333350.540755] [<ffffffff8108c010>] ? flush_kthread_worker+0xb0/0xb0
Oct 1 12:09:27 h1 kernel: [333350.550835] Code: 45 c8 e8 8a d5 59 00 0f b6 55 b0 44 89 e9 4c 8b 45 b8 f7 d9 84 d2 75 37 41 8b 74 24 18 85 f6 78 2e 49 8b b6 30 02 00 00 45 89 e9 <4c> 39 4e 18 0f 8c af 00 00 00 49 8b b7 30 02 00 00 89 cf 65 48
Oct 1 12:09:27 h1 kernel: [333350.571492] RIP [<ffffffff811b1fba>] mem_cgroup_move_account+0xda/0x230
Oct 1 12:09:27 h1 kernel: [333350.581639] RSP <ffff881d5849dc48>
Oct 1 12:09:27 h1 kernel: [333350.591698] CR2: 000060df80008868
Oct 1 12:09:27 h1 kernel: [333350.661304] ---[ end trace 63a4d12ac149f283 ]---

Luckily, there were no further cascading failures.

We lost 24 of 74 OSDs and 2 of 5 MONs.

Any ideas what caused this?

cheers
Jens-Christian

#1

Updated by Zheng Yan over 10 years ago

  • Category set to 1
  • Status changed from New to Need More Info

The first warning was caused by an MDS bug (you can try upgrading the MDS to 0.67.3). The rest of the BUGs do not look Ceph-related (an rc1 kernel is unsuitable for anything other than kernel hacking). I have no idea why these kernel bugs would cause OSDs to die. Did you run the OSDs on the rc1 kernel? Do you have the dead OSDs' logs or a coredump?
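
For reference, upgrading only the MDS might look something like this on Ubuntu (a sketch under assumptions: the ceph.com apt repository is used, the package split includes ceph-mds, and the daemon id "a" is illustrative):

echo deb http://ceph.com/debian-dumpling/ raring main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update
sudo apt-get install --only-upgrade ceph ceph-mds   # package names are assumptions
sudo service ceph restart mds.a                     # sysvinit wrapper shipped with Ceph
ceph mds stat                                       # confirm the MDS comes back up:active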

#2

Updated by Jens-Christian Fischer over 10 years ago

Can I run a 0.67.3 MDS with the rest of the infrastructure on 0.61.8?

We are using the rc1 kernels in order to run CephFS without ceph-fuse (which has caused us a few problems as well).
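
For context, the kernel client versus the FUSE client (the monitor address and secret file below are placeholders):

sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
# versus the userspace client we are trying to avoid:
sudo ceph-fuse -m 10.0.0.1:6789 /mnt/cephfs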

All OSDs are running on the 3.12-rc1 kernel. The OSDs are back up in the meantime (I was just counting what we lost during the outage).

Right now I'm struggling to stabilize the whole thing again (I had it clean for a moment, but one of the MONs didn't start completely). In the meantime, another host has died on us :) Talk about bleeding edge...

#3

Updated by Zheng Yan over 10 years ago

Jens-Christian Fischer wrote:

> Can I run a 0.67.3 MDS with the rest of the infrastructure on 0.61.8?

Yes, you can.

> We are using the rc1 kernels in order to run CephFS without ceph-fuse (which has caused us a few problems as well).

For CephFS, a 3.11 kernel plus commit 590fb51f1c (vfs: call d_op->d_prune() before unhashing dentry) should work well in most cases.
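
For reference, applying that commit on top of a 3.11 tree might look like this (a sketch only; the cherry-pick may need manual conflict resolution, and the build targets assume a Debian/Ubuntu host):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git checkout -b ceph-3.11 v3.11
git cherry-pick 590fb51f1c               # vfs: call d_op->d_prune() before unhashing dentry
cp /boot/config-"$(uname -r)" .config    # start from the running kernel's config
make olddefconfig                        # fill in new options with their defaults
make -j"$(nproc)" deb-pkg                # build installable .deb kernel packages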

> All OSDs are running on the 3.12-rc1 kernel. The OSDs are back up in the meantime (I was just counting what we lost during the outage).

There is no need to run OSDs and monitors on a bleeding-edge kernel.

> Right now I'm struggling to stabilize the whole thing again (I had it clean for a moment, but one of the MONs didn't start completely). In the meantime, another host has died on us :) Talk about bleeding edge...

#4

Updated by Jens-Christian Fischer over 10 years ago

We have now upgraded the complete cluster to Dumpling (0.67.3), also due to other problems we have experienced in the past.

CPU/thread usage has increased across all nodes, and we are still battling stability issues (mons coming in and out of service).

It will probably take a few days before we know if that has improved general stability.

Going back to a patched 3.11 kernel is not planned right now (depending on how the 3.12 kernel holds up).
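
For the record, the usual ways to inspect flapping monitors (the mon id "a" and the socket path are illustrative defaults):

ceph -s              # overall health, including the monitor quorum line
ceph quorum_status   # which monitors are currently in quorum
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status   # query one mon directly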

#5

Updated by Sage Weil over 10 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
#6

Updated by Sage Weil over 10 years ago

Jens-Christian Fischer wrote:

> We have now upgraded the complete cluster to Dumpling (0.67.3), also due to other problems we have experienced in the past.

> CPU/thread usage has increased across all nodes, and we are still battling stability issues (mons coming in and out of service).

I'd be curious to hear what the mon issues are!

> It will probably take a few days before we know if that has improved general stability.

> Going back to a patched 3.11 kernel is not planned right now (depending on how the 3.12 kernel holds up).

Any other news? I'm inclined to close out this bug.

#7

Updated by Jens-Christian Fischer over 10 years ago

We have separated the Ceph and OpenStack hosts and upgraded the Ceph hosts to 3.12-rc5, and so far things seem to hold up...

Feel free to close the bug.

/jc

#8

Updated by Zheng Yan over 10 years ago

  • Status changed from Need More Info to Closed
