Bug #18671

closed

kernel 4.8.15: BUG: soft lockup

Added by Burkhard Linke over 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

A machine running kernel 4.8.15 from the Ubuntu mainline PPA is stuck, and the kernel log reports repeated soft lockups:

[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [jellyfish:157790]
[Wed Jan 25 15:32:46 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:46 2017] CPU: 88 PID: 157790 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:46 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:46 2017] task: ffff8c4c95e11a00 task.stack: ffff8eb05c1c8000
[Wed Jan 25 15:32:46 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:46 2017] RSP: 0018:ffff8eb05c1cbb10 EFLAGS: 00000246
[Wed Jan 25 15:32:46 2017] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8c4f3fc99d40
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000011 RSI: 0000000000480000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] RBP: ffff8eb05c1cbb10 R08: 0000000001640000 R09: 0000000000000000
[Wed Jan 25 15:32:46 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] R13: 00000000ffffffff R14: ffff8c1a4c0b5f18 R15: ffff8ecea5f4bc00
[Wed Jan 25 15:32:46 2017] FS: 00007f6e68743700(0000) GS:ffff8c4f3fc80000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:46 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:46 2017] CR2: 00007f35feffd9d0 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:46 2017] Stack:
[Wed Jan 25 15:32:46 2017] ffff8eb05c1cbb20 ffffffffa7e833b0 ffff8eb05c1cbc50 ffffffffc0a1ce09
[Wed Jan 25 15:32:46 2017] ffff8ecea5f4bca8 ffff8c1a4c0b6260 ffff8c1a4c0b5f18 ffff8eb05c1cbbd8
[Wed Jan 25 15:32:46 2017] ffff8c1a4c0b5f28 0000000000000000 0000000000000000 0000000000000000
[Wed Jan 25 15:32:46 2017] Call Trace:
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1ce09>] ceph_check_caps+0x89/0xaa0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a196d5>] ? __cap_is_valid+0x25/0xc0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a196d5>] ? __cap_is_valid+0x25/0xc0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1bf64>] ? __ceph_caps_mds_wanted+0x54/0x80 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1afcb>] ? __ceph_caps_issued+0x7b/0xe0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a11bdb>] ceph_renew_caps+0xbb/0x1c0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1f32f>] ceph_get_caps+0x29f/0x3b0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:46 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:46 2017] Code: 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 <8b> 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02
[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [jellyfish:157787]
[Wed Jan 25 15:32:46 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:46 2017] CPU: 89 PID: 157787 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:46 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:46 2017] task: ffff8c4c95e14e00 task.stack: ffff8ece98b64000
[Wed Jan 25 15:32:46 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:46 2017] RSP: 0018:ffff8ece98b67bd8 EFLAGS: 00000246
[Wed Jan 25 15:32:46 2017] RAX: 0000000000000000 RBX: ffff8c1a4c0b5f28 RCX: ffff8c4f3fcd9d40
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000057 RSI: 0000000001600000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] RBP: ffff8ece98b67bd8 R08: 0000000001680000 R09: 0000000000000000
[Wed Jan 25 15:32:46 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f18
[Wed Jan 25 15:32:46 2017] R13: ffff8c1a4c0b6260 R14: ffff8ecea5f4bca8 R15: 0000000000000800
[Wed Jan 25 15:32:46 2017] FS: 00007f6e69f46700(0000) GS:ffff8c4f3fcc0000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:46 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:46 2017] CR2: 00007f36a6794ab8 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:46 2017] Stack:
[Wed Jan 25 15:32:46 2017] ffff8ece98b67be8 ffffffffa7e833b0 ffff8ece98b67c88 ffffffffc0a1c023
[Wed Jan 25 15:32:46 2017] ffff8ece98b67ce4 ffff8ece98b67ce0 ffff8ecea5f4bc00 0000040098b67c50
[Wed Jan 25 15:32:46 2017] ffffffffffffffff 00000000cd2db6ed ffff8c1a4c0b6260 ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] Call Trace:
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1c023>] try_get_cap_refs+0x93/0x5c0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1f1a5>] ceph_get_caps+0x115/0x3b0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:46 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:46 2017] Code: 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 <8b> 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02
[Wed Jan 25 15:32:50 2017] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [jellyfish:157830]
[Wed Jan 25 15:32:50 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:50 2017] CPU: 2 PID: 157830 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:50 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:50 2017] task: ffff8ec6c9f40d00 task.stack: ffff8eb05c388000
[Wed Jan 25 15:32:50 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:50 2017] RSP: 0018:ffff8eb05c38bbd8 EFLAGS: 00000246
[Wed Jan 25 15:32:50 2017] RAX: 0000000000000000 RBX: ffff8c1a4c0b5f28 RCX: ffff8c4f3f899d40
[Wed Jan 25 15:32:50 2017] RDX: 0000000000000058 RSI: 0000000001640000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:50 2017] RBP: ffff8eb05c38bbd8 R08: 00000000000c0000 R09: 0000000000000000
[Wed Jan 25 15:32:50 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f18
[Wed Jan 25 15:32:50 2017] R13: ffff8c1a4c0b6260 R14: ffff8ecea5f4bca8 R15: 0000000000000800
[Wed Jan 25 15:32:50 2017] FS: 00007f6e5471b700(0000) GS:ffff8c4f3f880000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:50 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:50 2017] CR2: 00007f35f67e7ea8 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:50 2017] Stack:
[Wed Jan 25 15:32:50 2017] ffff8eb05c38bbe8 ffffffffa7e833b0 ffff8eb05c38bc88 ffffffffc0a1c023
[Wed Jan 25 15:32:50 2017] ffff8eb05c38bce4 ffff8eb05c38bce0 ffff8ecea5f4bc00 000004005c38bc50
[Wed Jan 25 15:32:50 2017] ffffffffffffffff 0000000003927d82 ffff8c1a4c0b6260 ffff8c1a4c0b5f28
[Wed Jan 25 15:32:50 2017] Call Trace:
[Wed Jan 25 15:32:50 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:50 2017] [<ffffffffc0a1c023>] try_get_cap_refs+0x93/0x5c0 [ceph]
[Wed Jan 25 15:32:50 2017] [<ffffffffc0a1f1a5>] ceph_get_caps+0x115/0x3b0 [ceph]
[Wed Jan 25 15:32:50 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:50 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:50 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:50 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:50 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:50 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:50 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:50 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:50 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:50 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:50 2017] Code: 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 <8b> 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02
[Wed Jan 25 15:32:54 2017] NMI watchdog: BUG: soft lockup - CPU#91 stuck for 22s! [jellyfish:157783]
[Wed Jan 25 15:32:54 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:54 2017] CPU: 91 PID: 157783 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:54 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:54 2017] task: ffff8e3d245f2700 task.stack: ffff8eb47316c000
[Wed Jan 25 15:32:54 2017] RIP: 0010:[<ffffffffa76ceb47>] [<ffffffffa76ceb47>] native_queued_spin_lock_slowpath+0x117/0x1a0
[Wed Jan 25 15:32:54 2017] RSP: 0018:ffff8eb47316fbd8 EFLAGS: 00000246
[Wed Jan 25 15:32:54 2017] RAX: 0000000000000000 RBX: ffff8c1a4c0b5f28 RCX: ffff8ccebfad9d40
[Wed Jan 25 15:32:54 2017] RDX: 0000000000000063 RSI: 0000000001900000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:54 2017] RBP: ffff8eb47316fbd8 R08: 0000000001700000 R09: 0000000000000000
[Wed Jan 25 15:32:54 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f18
[Wed Jan 25 15:32:54 2017] R13: ffff8c1a4c0b6260 R14: ffff8ecea5f4bca8 R15: 0000000000000800
[Wed Jan 25 15:32:54 2017] FS: 00007f6e6bf4a700(0000) GS:ffff8ccebfac0000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:54 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:54 2017] CR2: 0000000000a80118 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:54 2017] Stack:
[Wed Jan 25 15:32:54 2017] ffff8eb47316fbe8 ffffffffa7e833b0 ffff8eb47316fc88 ffffffffc0a1c023
[Wed Jan 25 15:32:54 2017] ffff8eb47316fce4 ffff8eb47316fce0 ffff8ecea5f4bc00 000004007316fc50
[Wed Jan 25 15:32:54 2017] ffffffffffffffff 0000000034dd7540 ffff8c1a4c0b6260 ffff8c1a4c0b5f28
[Wed Jan 25 15:32:54 2017] Call Trace:
[Wed Jan 25 15:32:54 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:54 2017] [<ffffffffc0a1c023>] try_get_cap_refs+0x93/0x5c0 [ceph]
[Wed Jan 25 15:32:54 2017] [<ffffffffc0a1f1a5>] ceph_get_caps+0x115/0x3b0 [ceph]
[Wed Jan 25 15:32:54 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:54 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:54 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:54 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:54 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:54 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:54 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:54 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:54 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:54 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:54 2017] Code: 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
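
All four reports above show a jellyfish thread inside native_queued_spin_lock_slowpath, called via _raw_spin_lock from the ceph module, and RDI holds ffff8c1a4c0b5f28 in each of them. That suggests the threads are waiting on the same lock, assuming RDI still carries the lock pointer at the time of the dump. A minimal sketch for summarizing such a log follows; the function name summarize_lockups and the two regular expressions are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches e.g. "soft lockup - CPU#88 stuck for 22s! [jellyfish:157790]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
# Matches the RDI value in a register dump line.
RDI_RE = re.compile(r"\bRDI: ([0-9a-f]{16})")


def summarize_lockups(log_text):
    """Return (reports, rdi_counts) for the soft lockup reports in log_text.

    reports is a list of dicts (cpu, seconds, comm, pid), one per lockup header.
    rdi_counts counts each RDI value seen after the first lockup header.
    """
    reports = []
    rdi_counts = Counter()
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            reports.append({
                "cpu": int(m.group(1)),
                "seconds": int(m.group(2)),
                "comm": m.group(3),
                "pid": int(m.group(4)),
            })
            continue
        m = RDI_RE.search(line)
        if m and reports:
            rdi_counts[m.group(1)] += 1
    return reports, rdi_counts


# Two header/register line pairs copied from the log above.
sample = """\
[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [jellyfish:157790]
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000011 RSI: 0000000000480000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [jellyfish:157787]
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000057 RSI: 0000000001600000 RDI: ffff8c1a4c0b5f28
"""

reports, rdi_counts = summarize_lockups(sample)
print([r["cpu"] for r in reports])
print(rdi_counts.most_common(1))
```

If a single RDI value dominates the counts, the stuck CPUs are probably contending for one lock rather than for several unrelated ones.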

The machine hosts an LXC container for executing HPC jobs. The physical host is not accessible, but ssh to the LXC container succeeds after several minutes. The /sys/kernel/debug filesystem cannot be accessed from within the container. The machine currently has a high load, caused either by several kernel threads spinning on their locks or by user space applications:

  # cat /proc/loadavg
    430.48 431.23 430.97 263/2850 106806
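
For reference, the fields of /proc/loadavg are the 1, 5 and 15 minute load averages, the number of runnable versus total scheduling entities, and the most recently assigned PID. A small sketch that splits the line above into named fields (the LoadAvg type and parse_loadavg helper are illustrative names, not an existing API):

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float    # 1 minute load average
    load5: float    # 5 minute load average
    load15: float   # 15 minute load average
    runnable: int   # currently runnable scheduling entities
    total: int      # scheduling entities that currently exist
    last_pid: int   # PID of the most recently created process


def parse_loadavg(text):
    """Parse one line in /proc/loadavg format into a LoadAvg tuple."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


la = parse_loadavg("430.48 431.23 430.97 263/2850 106806")
print(la)
```

Linux counts tasks in uninterruptible sleep in the load average as well as runnable ones, so a load of about 430 with 263 runnable entities may mean that further tasks are blocked; this cannot be confirmed from this one line alone.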

Listing the user processes is not possible. The machine has been stuck in this state for about half a day; the MDS no longer lists an active session for it (checked using ceph daemon mds.XXX session ls). The kernel log further indicates that the machine had some trouble with mon and mds connections yesterday. Unfortunately, most of the kernel log entries are truncated.

We will have to reboot the machine (or find a better way to recover it), so we will be unable to provide more information in this case.
