Bug #18671

closed

kernel 4.8.15: BUG: soft lockup

Added by Burkhard Linke about 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

A machine running kernel 4.8.15 from the Ubuntu mainline PPA is stuck in a kernel soft lockup:

[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [jellyfish:157790]
[Wed Jan 25 15:32:46 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:46 2017] CPU: 88 PID: 157790 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:46 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:46 2017] task: ffff8c4c95e11a00 task.stack: ffff8eb05c1c8000
[Wed Jan 25 15:32:46 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:46 2017] RSP: 0018:ffff8eb05c1cbb10 EFLAGS: 00000246
[Wed Jan 25 15:32:46 2017] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8c4f3fc99d40
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000011 RSI: 0000000000480000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] RBP: ffff8eb05c1cbb10 R08: 0000000001640000 R09: 0000000000000000
[Wed Jan 25 15:32:46 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] R13: 00000000ffffffff R14: ffff8c1a4c0b5f18 R15: ffff8ecea5f4bc00
[Wed Jan 25 15:32:46 2017] FS: 00007f6e68743700(0000) GS:ffff8c4f3fc80000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:46 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:46 2017] CR2: 00007f35feffd9d0 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:46 2017] Stack:
[Wed Jan 25 15:32:46 2017] ffff8eb05c1cbb20 ffffffffa7e833b0 ffff8eb05c1cbc50 ffffffffc0a1ce09
[Wed Jan 25 15:32:46 2017] ffff8ecea5f4bca8 ffff8c1a4c0b6260 ffff8c1a4c0b5f18 ffff8eb05c1cbbd8
[Wed Jan 25 15:32:46 2017] ffff8c1a4c0b5f28 0000000000000000 0000000000000000 0000000000000000
[Wed Jan 25 15:32:46 2017] Call Trace:
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1ce09>] ceph_check_caps+0x89/0xaa0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a196d5>] ? __cap_is_valid+0x25/0xc0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a196d5>] ? __cap_is_valid+0x25/0xc0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1bf64>] ? __ceph_caps_mds_wanted+0x54/0x80 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1afcb>] ? __ceph_caps_issued+0x7b/0xe0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a11bdb>] ceph_renew_caps+0xbb/0x1c0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1f32f>] ceph_get_caps+0x29f/0x3b0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:46 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:46 2017] Code: 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 <8b> 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02
[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [jellyfish:157787]
[Wed Jan 25 15:32:46 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:46 2017] CPU: 89 PID: 157787 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:46 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:46 2017] task: ffff8c4c95e14e00 task.stack: ffff8ece98b64000
[Wed Jan 25 15:32:46 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:46 2017] RSP: 0018:ffff8ece98b67bd8 EFLAGS: 00000246
[Wed Jan 25 15:32:46 2017] RAX: 0000000000000000 RBX: ffff8c1a4c0b5f28 RCX: ffff8c4f3fcd9d40
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000057 RSI: 0000000001600000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] RBP: ffff8ece98b67bd8 R08: 0000000001680000 R09: 0000000000000000
[Wed Jan 25 15:32:46 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f18
[Wed Jan 25 15:32:46 2017] R13: ffff8c1a4c0b6260 R14: ffff8ecea5f4bca8 R15: 0000000000000800
[Wed Jan 25 15:32:46 2017] FS: 00007f6e69f46700(0000) GS:ffff8c4f3fcc0000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:46 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:46 2017] CR2: 00007f36a6794ab8 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:46 2017] Stack:
[Wed Jan 25 15:32:46 2017] ffff8ece98b67be8 ffffffffa7e833b0 ffff8ece98b67c88 ffffffffc0a1c023
[Wed Jan 25 15:32:46 2017] ffff8ece98b67ce4 ffff8ece98b67ce0 ffff8ecea5f4bc00 0000040098b67c50
[Wed Jan 25 15:32:46 2017] ffffffffffffffff 00000000cd2db6ed ffff8c1a4c0b6260 ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] Call Trace:
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1c023>] try_get_cap_refs+0x93/0x5c0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a1f1a5>] ceph_get_caps+0x115/0x3b0 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:46 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:46 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:46 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:46 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:46 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:46 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:46 2017] Code: 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 <8b> 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02
[Wed Jan 25 15:32:50 2017] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [jellyfish:157830]
[Wed Jan 25 15:32:50 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:50 2017] CPU: 2 PID: 157830 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:50 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:50 2017] task: ffff8ec6c9f40d00 task.stack: ffff8eb05c388000
[Wed Jan 25 15:32:50 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:50 2017] RSP: 0018:ffff8eb05c38bbd8 EFLAGS: 00000246
[Wed Jan 25 15:32:50 2017] RAX: 0000000000000000 RBX: ffff8c1a4c0b5f28 RCX: ffff8c4f3f899d40
[Wed Jan 25 15:32:50 2017] RDX: 0000000000000058 RSI: 0000000001640000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:50 2017] RBP: ffff8eb05c38bbd8 R08: 00000000000c0000 R09: 0000000000000000
[Wed Jan 25 15:32:50 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f18
[Wed Jan 25 15:32:50 2017] R13: ffff8c1a4c0b6260 R14: ffff8ecea5f4bca8 R15: 0000000000000800
[Wed Jan 25 15:32:50 2017] FS: 00007f6e5471b700(0000) GS:ffff8c4f3f880000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:50 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:50 2017] CR2: 00007f35f67e7ea8 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:50 2017] Stack:
[Wed Jan 25 15:32:50 2017] ffff8eb05c38bbe8 ffffffffa7e833b0 ffff8eb05c38bc88 ffffffffc0a1c023
[Wed Jan 25 15:32:50 2017] ffff8eb05c38bce4 ffff8eb05c38bce0 ffff8ecea5f4bc00 000004005c38bc50
[Wed Jan 25 15:32:50 2017] ffffffffffffffff 0000000003927d82 ffff8c1a4c0b6260 ffff8c1a4c0b5f28
[Wed Jan 25 15:32:50 2017] Call Trace:
[Wed Jan 25 15:32:50 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:50 2017] [<ffffffffc0a1c023>] try_get_cap_refs+0x93/0x5c0 [ceph]
[Wed Jan 25 15:32:50 2017] [<ffffffffc0a1f1a5>] ceph_get_caps+0x115/0x3b0 [ceph]
[Wed Jan 25 15:32:50 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:50 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:50 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:50 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:50 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:50 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:50 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:50 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:50 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:50 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:50 2017] Code: 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 <8b> 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02
[Wed Jan 25 15:32:54 2017] NMI watchdog: BUG: soft lockup - CPU#91 stuck for 22s! [jellyfish:157783]
[Wed Jan 25 15:32:54 2017] Modules linked in: ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc veth xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM openvswitch iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 xt_tcpudp nf_defrag_ipv4 nf_nat_ipv4 bridge iptable_filter ip_tables nf_defrag_ipv6 x_tables nf_nat nf_conntrack libcrc32c 8021q garp mrp stp llc bonding ipmi_ssif intel_powerclamp binfmt_misc coretemp ipmi_si joydev input_leds hpilo crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i7core_edac aesni_intel gpio_ich aes_x86_64 lrw glue_helper ablk_helper cryptd lpc_ich intel_cstate kvm_intel serio_raw ipmi_msghandler acpi_power_meter edac_core shpchp mac_hid kvm irqbypass autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt hid_generic fb_sys_fops usbhid hpsa psmouse drm hid pata_acpi scsi_transport_sas netxen_nic wmi fjes
[Wed Jan 25 15:32:54 2017] CPU: 91 PID: 157783 Comm: jellyfish Tainted: G L 4.8.15-040815-generic #201612151231
[Wed Jan 25 15:32:54 2017] Hardware name: HP ProLiant DL980 G7, BIOS P66 08/16/2015
[Wed Jan 25 15:32:54 2017] task: ffff8e3d245f2700 task.stack: ffff8eb47316c000
[Wed Jan 25 15:32:54 2017] RIP: 0010:[<ffffffffa76ceb47>] [<ffffffffa76ceb47>] native_queued_spin_lock_slowpath+0x117/0x1a0
[Wed Jan 25 15:32:54 2017] RSP: 0018:ffff8eb47316fbd8 EFLAGS: 00000246
[Wed Jan 25 15:32:54 2017] RAX: 0000000000000000 RBX: ffff8c1a4c0b5f28 RCX: ffff8ccebfad9d40
[Wed Jan 25 15:32:54 2017] RDX: 0000000000000063 RSI: 0000000001900000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:54 2017] RBP: ffff8eb47316fbd8 R08: 0000000001700000 R09: 0000000000000000
[Wed Jan 25 15:32:54 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c1a4c0b5f18
[Wed Jan 25 15:32:54 2017] R13: ffff8c1a4c0b6260 R14: ffff8ecea5f4bca8 R15: 0000000000000800
[Wed Jan 25 15:32:54 2017] FS: 00007f6e6bf4a700(0000) GS:ffff8ccebfac0000(0000) knlGS:0000000000000000
[Wed Jan 25 15:32:54 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 15:32:54 2017] CR2: 0000000000a80118 CR3: 000000bee402c000 CR4: 00000000000006e0
[Wed Jan 25 15:32:54 2017] Stack:
[Wed Jan 25 15:32:54 2017] ffff8eb47316fbe8 ffffffffa7e833b0 ffff8eb47316fc88 ffffffffc0a1c023
[Wed Jan 25 15:32:54 2017] ffff8eb47316fce4 ffff8eb47316fce0 ffff8ecea5f4bc00 000004007316fc50
[Wed Jan 25 15:32:54 2017] ffffffffffffffff 0000000034dd7540 ffff8c1a4c0b6260 ffff8c1a4c0b5f28
[Wed Jan 25 15:32:54 2017] Call Trace:
[Wed Jan 25 15:32:54 2017] [<ffffffffa7e833b0>] _raw_spin_lock+0x20/0x30
[Wed Jan 25 15:32:54 2017] [<ffffffffc0a1c023>] try_get_cap_refs+0x93/0x5c0 [ceph]
[Wed Jan 25 15:32:54 2017] [<ffffffffc0a1f1a5>] ceph_get_caps+0x115/0x3b0 [ceph]
[Wed Jan 25 15:32:54 2017] [<ffffffffa76c6cf0>] ? wake_atomic_t_function+0x60/0x60
[Wed Jan 25 15:32:54 2017] [<ffffffffc0a16b50>] ceph_filemap_fault+0xb0/0x460 [ceph]
[Wed Jan 25 15:32:54 2017] [<ffffffffa77d6514>] __do_fault+0x84/0x170
[Wed Jan 25 15:32:54 2017] [<ffffffffa76f58cc>] ? hrtimer_try_to_cancel+0x2c/0x120
[Wed Jan 25 15:32:54 2017] [<ffffffffa77dad8a>] handle_mm_fault+0xdba/0x13c0
[Wed Jan 25 15:32:54 2017] [<ffffffffa7e827d6>] ? do_nanosleep+0x96/0xf0
[Wed Jan 25 15:32:54 2017] [<ffffffffa76f657b>] ? hrtimer_nanosleep+0xdb/0x210
[Wed Jan 25 15:32:54 2017] [<ffffffffa766b37b>] __do_page_fault+0x1db/0x4d0
[Wed Jan 25 15:32:54 2017] [<ffffffffa766b692>] do_page_fault+0x22/0x30
[Wed Jan 25 15:32:54 2017] [<ffffffffa7e84898>] page_fault+0x28/0x30
[Wed Jan 25 15:32:54 2017] Code: 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 9d 01 00 48 03 04 d5 20 83 55 a8 48 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
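
In the four reports above the RIP is in native_queued_spin_lock_slowpath and RDI holds ffff8c1a4c0b5f28 each time, which would be consistent with all four tasks waiting on the same spinlock. The following is a minimal sketch for pulling that summary out of a saved kernel log. It assumes lines formatted like the ones pasted here; the function name, the field names and the trimmed SAMPLE text are illustrative only.

```python
import re
from collections import Counter

# Patterns for the three line types of interest in a soft lockup report.
LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]")
RIP = re.compile(r"RIP: .*\s([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")
RDI = re.compile(r"RDI: ([0-9a-f]{16})")

# Trimmed excerpt of the log above (two of the four reports).
SAMPLE = """\
[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#88 stuck for 22s! [jellyfish:157790]
[Wed Jan 25 15:32:46 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000011 RSI: 0000000000480000 RDI: ffff8c1a4c0b5f28
[Wed Jan 25 15:32:46 2017] NMI watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [jellyfish:157787]
[Wed Jan 25 15:32:46 2017] RIP: 0010:[<ffffffffa76ceb44>] [<ffffffffa76ceb44>] native_queued_spin_lock_slowpath+0x114/0x1a0
[Wed Jan 25 15:32:46 2017] RDX: 0000000000000057 RSI: 0000000001600000 RDI: ffff8c1a4c0b5f28
"""


def summarize_lockups(log_text):
    """Return one dict per soft lockup report found in log_text."""
    reports = []
    current = None
    for line in log_text.splitlines():
        m = LOCKUP.search(line)
        if m:
            # A new report starts; later RIP/RDI lines belong to it.
            current = {"cpu": int(m.group(1)), "secs": int(m.group(2)),
                       "comm": m.group(3), "pid": int(m.group(4)),
                       "rip": None, "rdi": None}
            reports.append(current)
            continue
        if current is None:
            continue
        m = RIP.search(line)
        if m and current["rip"] is None:
            current["rip"] = m.group(1)
        m = RDI.search(line)
        if m and current["rdi"] is None:
            current["rdi"] = m.group(1)
    return reports


if __name__ == "__main__":
    # Group reports by (function at RIP, value of RDI).
    groups = Counter((r["rip"], r["rdi"]) for r in summarize_lockups(SAMPLE))
    for (rip, rdi), count in groups.items():
        print(f"{count} report(s) at {rip}, RDI={rdi}")
```

Grouping by the RIP symbol and RDI value is only a heuristic: RDI is the first-argument register on x86-64, but it is not guaranteed to still hold the lock pointer at the sampled instruction.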

The machine hosts an LXC container for executing HPC jobs. The physical host is not accessible, but ssh to the LXC container succeeds after several minutes. Access to the /sys/kernel/debug filesystem is not possible from within the container. The machine currently has a high load, caused either by several kernel threads spinning on their locks or by user space applications:

# cat /proc/loadavg
430.48 431.23 430.97 263/2850 106806
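
For readers less familiar with the format: /proc/loadavg holds the 1, 5 and 15 minute load averages, then the number of currently runnable scheduling entities over the total number, then the most recently created PID. A small sketch that splits the line above into named fields (the LoadAvg type and its field names are made up for this illustration):

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float    # 1 minute load average
    load5: float    # 5 minute load average
    load15: float   # 15 minute load average
    runnable: int   # currently runnable scheduling entities
    total: int      # scheduling entities that currently exist
    last_pid: int   # PID of the most recently created process


def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


if __name__ == "__main__":
    la = parse_loadavg("430.48 431.23 430.97 263/2850 106806")
    print(la)
```

The Linux load average also counts tasks in uninterruptible sleep, so a load of about 430 next to 263 runnable entities may mean that a fair number of tasks are blocked rather than running. This is an inference from the numbers, not something verified on the machine.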

Listing the user processes is not possible. The machine has been stuck in this state for about half a day; the MDS no longer lists an active session for it (checked with ceph daemon mds.XXX session ls). The kernel log further indicates that the machine had some trouble with mon and mds connections yesterday. Unfortunately, most of the kernel log entries are truncated.

We will have to reboot the machine (or find a better way to recover it), so we will be unable to provide more information in this case.

Actions #1

Updated by Burkhard Linke about 7 years ago

We have a similar problem on another machine; in this case the host itself is accessible:

Kernel 4.9.2

[Wed Jan 25 16:22:59 2017] NMI watchdog: BUG: soft lockup - CPU#55 stuck for 22s! [perl:17021]
[Wed Jan 25 16:22:59 2017] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc veth binfmt_misc openvswitch nf_conntrack_ipv6 xt_CHECKSUM nf_nat_ipv6 nf_defrag_ipv6 iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge iptable_filter ip_tables x_tables 8021q garp mrp stp llc bonding intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd intel_cstate joydev input_leds intel_rapl_perf ipmi_ssif shpchp ioatdma lpc_ich mei_me mei mac_hid ipmi_si ipmi_msghandler ceph libceph libcrc32c fscache autofs4 hid_generic usbhid hid ixgbe dca ptp pps_core mdio
[Wed Jan 25 16:22:59 2017] uas ahci usb_storage libahci wmi fjes
[Wed Jan 25 16:22:59 2017] CPU: 55 PID: 17021 Comm: perl Tainted: G D W L 4.9.2-040902-generic #201701090331
[Wed Jan 25 16:22:59 2017] Hardware name: Supermicro X9QR7-TF+/X9QRi-F+/X9QR7-TF+/X9QRi-F+, BIOS 3.0 02/21/2014
[Wed Jan 25 16:22:59 2017] task: ffff9f318590ad80 task.stack: ffffb117f5024000
[Wed Jan 25 16:22:59 2017] RIP: 0010:[<ffffffffc02c049a>] [<ffffffffc02c049a>] __ceph_caps_file_wanted+0x1a/0x40 [ceph]
[Wed Jan 25 16:22:59 2017] RSP: 0018:ffffb117f5027b60 EFLAGS: 00000202
[Wed Jan 25 16:22:59 2017] RAX: 0000000000000005 RBX: ffff9f9191bb10d8 RCX: 0000000000000003
[Wed Jan 25 16:22:59 2017] RDX: 0000000000000008 RSI: 0000000000000001 RDI: ffff9f9191bb10c8
[Wed Jan 25 16:22:59 2017] RBP: ffffb117f5027bf8 R08: 0000000000000000 R09: ffffb117f5027c40
[Wed Jan 25 16:22:59 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f9191bb10c8
[Wed Jan 25 16:22:59 2017] R13: ffff9f9191bb1410 R14: ffff9f31a67f40a8 R15: 0000000000001000
[Wed Jan 25 16:22:59 2017] FS: 00007f92ed1e2700(0000) GS:ffff9f71bfdc0000(0000) knlGS:0000000000000000
[Wed Jan 25 16:22:59 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 16:22:59 2017] CR2: 0000000000604180 CR3: 0000005fead06000 CR4: 00000000000406e0
[Wed Jan 25 16:22:59 2017] Stack:
[Wed Jan 25 16:22:59 2017] ffffffffc02c102b ffffb117f5027c44 ffffb117f5027c40 ffff9f31a67f4000
[Wed Jan 25 16:22:59 2017] 00002000f5027bc0 000000000ab1c000 00000000c5695d66 ffff9f9191bb1410
[Wed Jan 25 16:22:59 2017] ffff9f9191bb10d8 ffff9f9100002000 ffff9f31a67f4000 00000000000032cd
[Wed Jan 25 16:22:59 2017] Call Trace:
[Wed Jan 25 16:22:59 2017] [<ffffffffc02c102b>] ? try_get_cap_refs+0x9b/0x5c0 [ceph]
[Wed Jan 25 16:22:59 2017] [<ffffffffc02c4183>] ceph_get_caps+0x113/0x390 [ceph]
[Wed Jan 25 16:22:59 2017] [<ffffffff91857df9>] ? generic_update_time+0x79/0xd0
[Wed Jan 25 16:22:59 2017] [<ffffffff91858078>] ? file_update_time+0xc8/0x110
[Wed Jan 25 16:22:59 2017] [<ffffffffc02b4c19>] ceph_write_iter+0x349/0xbe0 [ceph]
[Wed Jan 25 16:22:59 2017] [<ffffffff9182a121>] ? uncharge_list+0x111/0x120
[Wed Jan 25 16:22:59 2017] [<ffffffff91859b35>] ? touch_atime+0x35/0xd0
[Wed Jan 25 16:22:59 2017] [<ffffffff9183a8f5>] __vfs_write+0xe5/0x160
[Wed Jan 25 16:22:59 2017] [<ffffffff9183af75>] vfs_write+0xb5/0x1a0
[Wed Jan 25 16:22:59 2017] [<ffffffff9183c3d5>] SyS_write+0x55/0xc0
[Wed Jan 25 16:22:59 2017] [<ffffffff91603b6b>] do_syscall_64+0x5b/0xc0
[Wed Jan 25 16:22:59 2017] [<ffffffff91e91b6f>] entry_SYSCALL64_slow_path+0x25/0x25
[Wed Jan 25 16:22:59 2017] Code: 00 00 00 0f 45 c2 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 31 c9 31 c0 be 01 00 00 00 44 8b 84 8f d4 01 00 00 89 f2 d3 e2 <09> c2 45 85 c0 0f 45 c2 48 83 c1 01 48 83 f9 04 75 e2 85 c0 75
[Wed Jan 25 16:23:27 2017] NMI watchdog: BUG: soft lockup - CPU#55 stuck for 22s! [perl:17021]
[Wed Jan 25 16:23:27 2017] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc veth binfmt_misc openvswitch nf_conntrack_ipv6 xt_CHECKSUM nf_nat_ipv6 nf_defrag_ipv6 iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge iptable_filter ip_tables x_tables 8021q garp mrp stp llc bonding intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd intel_cstate joydev input_leds intel_rapl_perf ipmi_ssif shpchp ioatdma lpc_ich mei_me mei mac_hid ipmi_si ipmi_msghandler ceph libceph libcrc32c fscache autofs4 hid_generic usbhid hid ixgbe dca ptp pps_core mdio
[Wed Jan 25 16:23:27 2017] uas ahci usb_storage libahci wmi fjes
[Wed Jan 25 16:23:27 2017] CPU: 55 PID: 17021 Comm: perl Tainted: G D W L 4.9.2-040902-generic #201701090331
[Wed Jan 25 16:23:27 2017] Hardware name: Supermicro X9QR7-TF+/X9QRi-F+/X9QR7-TF+/X9QRi-F+, BIOS 3.0 02/21/2014
[Wed Jan 25 16:23:27 2017] task: ffff9f318590ad80 task.stack: ffffb117f5024000
[Wed Jan 25 16:23:27 2017] RIP: 0010:[<ffffffffc02c049a>] [<ffffffffc02c049a>] __ceph_caps_file_wanted+0x1a/0x40 [ceph]
[Wed Jan 25 16:23:27 2017] RSP: 0018:ffffb117f5027a98 EFLAGS: 00000202
[Wed Jan 25 16:23:27 2017] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
[Wed Jan 25 16:23:27 2017] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f9191bb10c8
[Wed Jan 25 16:23:27 2017] RBP: ffffb117f5027bc0 R08: 0000000000000000 R09: ffffb117f5027c40
[Wed Jan 25 16:23:27 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f9191bb10d8
[Wed Jan 25 16:23:27 2017] R13: 00000000ffffffff R14: ffff9f9191bb10c8 R15: ffff9f31a67f4000
[Wed Jan 25 16:23:27 2017] FS: 00007f92ed1e2700(0000) GS:ffff9f71bfdc0000(0000) knlGS:0000000000000000
[Wed Jan 25 16:23:27 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 16:23:27 2017] CR2: 0000000000604180 CR3: 0000005fead06000 CR4: 00000000000406e0
[Wed Jan 25 16:23:27 2017] Stack:
[Wed Jan 25 16:23:27 2017] ffffffffc02c1eb1 0000000000000000 0000000000000001 ffffb117f5027bc4
[Wed Jan 25 16:23:27 2017] ffff9f915da2dd28 ffffffffffffff10 ffffffff91e91960 0000000000000010
[Wed Jan 25 16:23:27 2017] 0000000000000246 ffff9f31a67f40a8 ffffffffc02e4dc8 ffffffffc02e4dc8
[Wed Jan 25 16:23:27 2017] Call Trace:
[Wed Jan 25 16:23:27 2017] [<ffffffffc02c1eb1>] ? ceph_check_caps+0x131/0xaa0 [ceph]
[Wed Jan 25 16:23:27 2017] [<ffffffff91e91960>] ? _raw_spin_lock+0x10/0x30
[Wed Jan 25 16:23:27 2017] [<ffffffffc02c0f64>] ? __ceph_caps_mds_wanted+0x54/0x80 [ceph]
[Wed Jan 25 16:23:27 2017] [<ffffffffc02bffcb>] ? __ceph_caps_issued+0x7b/0xe0 [ceph]
[Wed Jan 25 16:23:27 2017] [<ffffffffc02b6beb>] ceph_renew_caps+0xbb/0x1c0 [ceph]
[Wed Jan 25 16:23:27 2017] [<ffffffffc02c430e>] ceph_get_caps+0x29e/0x390 [ceph]
[Wed Jan 25 16:23:27 2017] [<ffffffff91857df9>] ? generic_update_time+0x79/0xd0
[Wed Jan 25 16:23:27 2017] [<ffffffff91858078>] ? file_update_time+0xc8/0x110
[Wed Jan 25 16:23:27 2017] [<ffffffffc02b4c19>] ceph_write_iter+0x349/0xbe0 [ceph]
[Wed Jan 25 16:23:27 2017] [<ffffffff9182a121>] ? uncharge_list+0x111/0x120
[Wed Jan 25 16:23:27 2017] [<ffffffff91859b35>] ? touch_atime+0x35/0xd0
[Wed Jan 25 16:23:27 2017] [<ffffffff9183a8f5>] __vfs_write+0xe5/0x160
[Wed Jan 25 16:23:27 2017] [<ffffffff9183af75>] vfs_write+0xb5/0x1a0
[Wed Jan 25 16:23:27 2017] [<ffffffff9183c3d5>] SyS_write+0x55/0xc0
[Wed Jan 25 16:23:27 2017] [<ffffffff91603b6b>] do_syscall_64+0x5b/0xc0
[Wed Jan 25 16:23:27 2017] [<ffffffff91e91b6f>] entry_SYSCALL64_slow_path+0x25/0x25
[Wed Jan 25 16:23:27 2017] Code: 00 00 00 0f 45 c2 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 31 c9 31 c0 be 01 00 00 00 44 8b 84 8f d4 01 00 00 89 f2 d3 e2 <09> c2 45 85 c0 0f 45 c2 48 83 c1 01 48 83 f9 04 75 e2 85 c0 75
[Wed Jan 25 16:23:55 2017] NMI watchdog: BUG: soft lockup - CPU#55 stuck for 22s! [perl:17021]
[Wed Jan 25 16:23:55 2017] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc veth binfmt_misc openvswitch nf_conntrack_ipv6 xt_CHECKSUM nf_nat_ipv6 nf_defrag_ipv6 iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge iptable_filter ip_tables x_tables 8021q garp mrp stp llc bonding intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd intel_cstate joydev input_leds intel_rapl_perf ipmi_ssif shpchp ioatdma lpc_ich mei_me mei mac_hid ipmi_si ipmi_msghandler ceph libceph libcrc32c fscache autofs4 hid_generic usbhid hid ixgbe dca ptp pps_core mdio
[Wed Jan 25 16:23:55 2017] uas ahci usb_storage libahci wmi fjes
[Wed Jan 25 16:23:55 2017] CPU: 55 PID: 17021 Comm: perl Tainted: G D W L 4.9.2-040902-generic #201701090331
[Wed Jan 25 16:23:55 2017] Hardware name: Supermicro X9QR7-TF+/X9QRi-F+/X9QR7-TF+/X9QRi-F+, BIOS 3.0 02/21/2014
[Wed Jan 25 16:23:55 2017] task: ffff9f318590ad80 task.stack: ffffb117f5024000
[Wed Jan 25 16:23:55 2017] RIP: 0010:[<ffffffffc02be6b0>] [<ffffffffc02be6b0>] __cap_is_valid+0x0/0xc0 [ceph]
[Wed Jan 25 16:23:55 2017] RSP: 0018:ffffb117f5027a58 EFLAGS: 00000282
[Wed Jan 25 16:23:55 2017] RAX: ffff9f9191bb1410 RBX: ffff9f31a5773f00 RCX: 0000000000000000
[Wed Jan 25 16:23:55 2017] RDX: 0000000000000000 RSI: ffffb117f5027b7c RDI: ffff9f31a5773f00
[Wed Jan 25 16:23:55 2017] RBP: ffffb117f5027a90 R08: 0000000000000000 R09: 0000000000000000
[Wed Jan 25 16:23:55 2017] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb117f5027b7c
[Wed Jan 25 16:23:55 2017] R13: 0000000000000000 R14: ffff9f9191bb10c8 R15: ffff9f31a5773f08
[Wed Jan 25 16:23:55 2017] FS: 00007f92ed1e2700(0000) GS:ffff9f71bfdc0000(0000) knlGS:0000000000000000
[Wed Jan 25 16:23:55 2017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jan 25 16:23:55 2017] CR2: 0000000000604180 CR3: 0000005fead06000 CR4: 00000000000406e0
[Wed Jan 25 16:23:55 2017] Stack:
[Wed Jan 25 16:23:55 2017] ffffffffc02bffa9 ffff9f9191bb1410 0000000000000000 ffff9f9191bb10d8
[Wed Jan 25 16:23:55 2017] 00000000ffffffff ffff9f9191bb10c8 00000000000032cd ffffb117f5027bc0
[Wed Jan 25 16:23:55 2017] ffffffffc02c1ecb 0000000000000000 0000000000000001 ffffb117f5027bc4
[Wed Jan 25 16:23:55 2017] Call Trace:
[Wed Jan 25 16:23:55 2017] [<ffffffffc02bffa9>] ? __ceph_caps_issued+0x59/0xe0 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffffc02c1ecb>] ceph_check_caps+0x14b/0xaa0 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffff91e91960>] ? _raw_spin_lock+0x10/0x30
[Wed Jan 25 16:23:55 2017] [<ffffffffc02c0f64>] ? __ceph_caps_mds_wanted+0x54/0x80 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffffc02bffcb>] ? __ceph_caps_issued+0x7b/0xe0 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffffc02b6beb>] ceph_renew_caps+0xbb/0x1c0 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffffc02c430e>] ceph_get_caps+0x29e/0x390 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffff91857df9>] ? generic_update_time+0x79/0xd0
[Wed Jan 25 16:23:55 2017] [<ffffffff91858078>] ? file_update_time+0xc8/0x110
[Wed Jan 25 16:23:55 2017] [<ffffffffc02b4c19>] ceph_write_iter+0x349/0xbe0 [ceph]
[Wed Jan 25 16:23:55 2017] [<ffffffff9182a121>] ? uncharge_list+0x111/0x120
[Wed Jan 25 16:23:55 2017] [<ffffffff91859b35>] ? touch_atime+0x35/0xd0
[Wed Jan 25 16:23:55 2017] [<ffffffff9183a8f5>] __vfs_write+0xe5/0x160
[Wed Jan 25 16:23:55 2017] [<ffffffff9183af75>] vfs_write+0xb5/0x1a0
[Wed Jan 25 16:23:55 2017] [<ffffffff9183c3d5>] SyS_write+0x55/0xc0
[Wed Jan 25 16:23:55 2017] [<ffffffff91603b6b>] do_syscall_64+0x5b/0xc0
[Wed Jan 25 16:23:55 2017] [<ffffffff91e91b6f>] entry_SYSCALL64_slow_path+0x25/0x25
[Wed Jan 25 16:23:55 2017] Code: ff 85 db 74 b1 48 8d 78 01 c6 00 46 89 de e8 18 f8 ff ff 4c 39 e0 75 a3 c6 00 2d 48 83 c0 01 c6 00 00 4c 89 e0 5b 41 5c 5d c3 90 <66> 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 89 fb 48 83 ec 08

The machine was rebooted this morning, so it was not affected by the MDS outage yesterday. There is no session listed on the MDS side for this host.

/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat caps
total 545
avail 463
used 80
reserved 2
min 8192
/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat client_options
name=volumes,secret=<hidden>
/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat mds_sessions
global_id 9537154
name "volumes"
mds.0 hung
/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat mdsc
92 mds0 getattr #100009de0f2
93 mds0 getattr #100009de0f2
/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat mdsmap
epoch 221104
root 0
session_timeout 60
session_autoclose 300
mds0 192.168.6.129:6824 (up:active)
/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat monmap
epoch 23
mon0 192.168.6.131:6789
mon1 192.168.6.133:6789
mon2 192.168.6.134:6789
/sys/kernel/debug/ceph/49098879-85ac-4c5d-aac0-e1a2658a680b.client9537154# cat osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
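
The debugfs output above can also be checked mechanically. The sketch below parses the caps counters and the mds_sessions listing exactly as pasted here. Two assumptions are built in: that total should equal avail + used + reserved, and that "open" is the state name of a healthy session. The helper names are hypothetical.

```python
CAPS = """\
total 545
avail 463
used 80
reserved 2
min 8192
"""

MDS_SESSIONS = """\
global_id 9537154
name "volumes"
mds.0 hung
"""


def parse_caps(text):
    """Parse 'name value' lines from the debugfs caps file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def caps_consistent(counters):
    """Assumed invariant: total == avail + used + reserved."""
    return counters["total"] == (counters["avail"] + counters["used"]
                                 + counters["reserved"])


def non_open_sessions(text):
    """Return {mds_rank: state} for sessions whose state is not 'open'."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("mds."):
            rank = parts[0][len("mds."):]
            if rank.isdigit() and parts[1] != "open":
                result[int(rank)] = parts[1]
    return result


if __name__ == "__main__":
    print("caps counters consistent:", caps_consistent(parse_caps(CAPS)))
    print("sessions not open:", non_open_sessions(MDS_SESSIONS))
```

With the values above the counters add up (463 + 80 + 2 = 545) and mds.0 is reported as hung, which matches the two getattr requests still pending in mdsc.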

Actions #2

Updated by Zheng Yan about 7 years ago

  • Status changed from New to 12

I think it's an infinite loop of ceph_renew_caps, caused by the __cap_is_valid check in __ceph_caps_mds_wanted.
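
To make the hypothesis concrete, here is a toy model of that kind of loop. It is not the kernel code and does not claim to show how the issue was fixed. It only mirrors the call chain visible in the traces (ceph_get_caps calling ceph_renew_caps, with __ceph_caps_mds_wanted and __cap_is_valid nearby) under the assumption that the validity check hides caps of a stale session and that renewing changes nothing. All classes and functions below are invented for illustration.

```python
class ToyCap:
    """Illustrative stand-in for one capability held from an MDS session."""

    def __init__(self, wanted, session_valid):
        self.wanted = set(wanted)          # cap bits the MDS knows we want
        self.session_valid = session_valid


def mds_wanted(caps, check_valid):
    """Union of wanted bits; with check_valid, caps of invalid sessions are skipped."""
    result = set()
    for cap in caps:
        if check_valid and not cap.session_valid:
            continue
        result |= cap.wanted
    return result


def renew_caps(caps):
    """Toy renewal: reports success (0) but leaves the session state untouched."""
    return 0


def get_caps(caps, need, check_valid, max_iterations=1000):
    """Retry until the needed bits are covered.

    Returns the iteration on which it succeeded, or None if the bound was
    hit. The bound exists only so that this sketch terminates.
    """
    for iteration in range(1, max_iterations + 1):
        if set(need) <= mds_wanted(caps, check_valid):
            return iteration
        if renew_caps(caps) != 0:
            raise RuntimeError("renew failed")
        # Renewal "succeeded", so try again with unchanged state.
    return None


if __name__ == "__main__":
    caps = [ToyCap({"Fw"}, session_valid=False)]
    print("with validity check:   ", get_caps(caps, {"Fw"}, check_valid=True))
    print("without validity check:", get_caps(caps, {"Fw"}, check_valid=False))
```

In this model the first call never converges, because every retry sees the same masked result, while the second returns on the first iteration. A task spinning like this while taking and dropping a spinlock on each pass would match the repeated soft lockup reports for perl:17021 on CPU#55.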

Actions #4

Updated by Zheng Yan about 7 years ago

  • Status changed from 12 to 7

Actions #5

Updated by Zheng Yan almost 7 years ago

  • Status changed from 7 to Resolved