Actions
Bug #9426
closedkcephfs: soft lockup in handle mds map
Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
[2346240.855916] ceph: mds0 reconnect start [2346267.396560] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:14794] [2346267.403473] Modules linked in: ipmi_devintf(E) ipmi_si(E) ipmi_msghandler(E) ip6table_filter(E) ip6_tables(E) ebtable_nat(E) ebtables(E) ipt_MASQUERADE(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_state(E) nf_conntrack(E) ipt_REJECT(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E) x_tables(E) bridge(E) stp(E) llc(E) ceph(E) libceph(E) fscache(E) psmouse(E) joydev(E) serio_raw(E) gpio_ich(E) xfs(E) i7core_edac(E) tpm_infineon(E) edac_core(E) tpm_tis(E) lpc_ich(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) btrfs(E) raid6_pq(E) e1000e(E) ahci(E) ptp(E) libahci(E) pps_core(E) arcmsr(E) xor(E) libcrc32c(E) [2346267.465985] irq event stamp: 0 [2346267.469307] hardirqs last enabled at (0): [< (null)>] (null) [2346267.477119] hardirqs last disabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0 [2346267.485662] softirqs last enabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0 [2346267.494151] softirqs last disabled at (0): [< (null)>] (null) [2346267.502007] CPU: 0 PID: 14794 Comm: kworker/0:1 Tainted: G E 3.16.0-ceph-00005-gd8f31c8 #1 [2346267.511566] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010 [2346267.518338] Workqueue: ceph-msgr con_work [libceph] [2346267.523475] task: ffff88003906c3e0 ti: ffff88011ff24000 task.ti: ffff88011ff24000 [2346267.531330] RIP: 0010:[<ffffffff810aebbe>] [<ffffffff810aebbe>] lock_acquire+0x9e/0x120 [2346267.539860] RSP: 0018:ffff88011ff279c8 EFLAGS: 00000246 [2346267.545576] RAX: ffff88003906c3e0 RBX: ffffffff8170de60 RCX: 0000000000000000 [2346267.553031] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880406d98028 [2346267.560494] RBP: ffff88011ff27a48 R08: 0000000000000001 R09: 0000000000000000 [2346267.567906] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011ff27938 [2346267.575394] R13: ffff88043fc13b40 R14: ffff88011ff24000 R15: 0000000000000001 [2346267.582799] FS: 0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000 [2346267.591170] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [2346267.597211] CR2: 00007f7699844000 CR3: 0000000001c11000 CR4: 00000000000007f0 [2346267.604800] Stack: [2346267.607007] 0000000000000000 ffffffffa038b657 ffffffff00000000 ffffffff8170de60 [2346267.614823] 0000000000000000 0000000000000246 0000000000000000 0000000000000001 [2346267.622733] 0000000000000000 ffff880406d98028 ffffffffa038b792 ffff880406d98010 [2346267.630494] Call Trace: [2346267.633205] [<ffffffffa038b657>] ? __choose_mds+0xd7/0x2b0 [ceph] [2346267.639639] [<ffffffff8170de60>] ? retint_restore_args+0xe/0xe [2346267.645919] [<ffffffffa038b792>] ? __choose_mds+0x212/0x2b0 [ceph] [2346267.652590] [<ffffffff8170c219>] _raw_spin_lock+0x39/0x50 [2346267.658329] [<ffffffffa038b657>] ? __choose_mds+0xd7/0x2b0 [ceph] [2346267.664839] [<ffffffff8170c9eb>] ? _raw_spin_unlock+0x2b/0x40 [2346267.671076] [<ffffffffa038b657>] __choose_mds+0xd7/0x2b0 [ceph] [2346267.677388] [<ffffffffa038d4b7>] __do_request+0xa7/0x2c0 [ceph] [2346267.683737] [<ffffffffa038d7bc>] __wake_requests+0x6c/0x80 [ceph] [2346267.690307] [<ffffffffa038f375>] ceph_mdsc_handle_map+0x435/0x4e0 [ceph] [2346267.697345] [<ffffffffa0375379>] extra_mon_dispatch+0x29/0x30 [ceph] [2346267.704142] [<ffffffffa03459e4>] dispatch+0x54/0x820 [libceph] [2346267.710317] [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10 [2346267.716572] [<ffffffffa034271d>] con_work+0x12ed/0x20f0 [libceph] [2346267.723008] [<ffffffff810867ef>] ? finish_task_switch+0x3f/0x120 [2346267.729452] [<ffffffff8107511f>] ? process_one_work+0x16f/0x570 [2346267.735750] [<ffffffff81075181>] process_one_work+0x1d1/0x570 [2346267.741837] [<ffffffff8107511f>] ? process_one_work+0x16f/0x570 [2346267.748189] [<ffffffff81075fbc>] worker_thread+0x11c/0x530 [2346267.754071] [<ffffffff81075ea0>] ? create_and_start_worker+0x50/0x50 [2346267.760731] [<ffffffff8107d2d4>] kthread+0xe4/0x100 [2346267.766005] [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130 [2346267.772632] [<ffffffff8170d02c>] ret_from_fork+0x7c/0xb0 [2346267.778276] [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130 [2346267.785122] Code: 74 24 08 4c 89 24 24 41 0f 94 c1 45 0f b6 c9 e8 99 dc ff ff 65 48 8b 04 25 c0 ba 00 00 c7 80 34 08 00 00 00 00 00 00 ff 75 a8 9d <48> 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 00 65 ff 04 [2346295.381621] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:14794] [2346295.388630] Modules linked in: ipmi_devintf(E) ipmi_si(E) ipmi_msghandler(E) ip6table_filter(E) ip6_tables(E) ebtable_nat(E) ebtables(E) ipt_MASQUERADE(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_state(E) nf_conntrack(E) ipt_REJECT(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E) x_tables(E) bridge(E) stp(E) llc(E) ceph(E) libceph(E) fscache(E) psmouse(E) joydev(E) serio_raw(E) gpio_ich(E) xfs(E) i7core_edac(E) tpm_infineon(E) edac_core(E) tpm_tis(E) lpc_ich(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) btrfs(E) raid6_pq(E) e1000e(E) ahci(E) ptp(E) libahci(E) pps_core(E) arcmsr(E) xor(E) libcrc32c(E) [2346295.450880] irq event stamp: 0 [2346295.454137] hardirqs last enabled at (0): [< (null)>] (null) [2346295.462043] hardirqs last disabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0 [2346295.470550] softirqs last enabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0 [2346295.479137] softirqs last disabled at (0): [< (null)>] (null) [2346295.487010] CPU: 0 PID: 14794 Comm: kworker/0:1 Tainted: G E 3.16.0-ceph-00005-gd8f31c8 #1 [2346295.496570] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010 [2346295.503366] Workqueue: ceph-msgr con_work [libceph] [2346295.508466] task: ffff88003906c3e0 ti: ffff88011ff24000 task.ti: ffff88011ff24000 [2346295.516278] RIP: 0010:[<ffffffffa038b598>] [<ffffffffa038b598>] __choose_mds+0x18/0x2b0 [ceph] [2346295.525288] RSP: 0018:ffff88011ff27a88 EFLAGS: 00000286 [2346295.530933] RAX: 0000000000000000 RBX: 000000001ff27a18 RCX: ffff88016cedeef8 [2346295.538434] RDX: ffff88042631f130 RSI: ffff88004c4bd000 RDI: ffff88042631f000 [2346295.545912] RBP: ffff88011ff27af8 R08: 0000000000000001 R09: 0000000000000000 [2346295.553424] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246 [2346295.560859] R13: ffffffff810ae91c R14: ffffffffffffff7d R15: ffff880400000000 [2346295.569364] FS: 0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000 [2346295.577808] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [2346295.583799] CR2: 00007f7699844000 CR3: 0000000001c11000 CR4: 00000000000007f0 [2346295.591336] Stack: [2346295.593579] 0000000000000000 0000000000000000 0000000000000000 0000000000000001 [2346295.601324] ffff88013b0d2960 00000000ad2ead2d 000000000000ad2e 0000000000000001 [2346295.609112] ffff880406d98010 ffffffffffffffbe ffff88004c4bd000 ffff88042631f000 [2346295.616812] Call Trace: [2346295.619499] [<ffffffffa038d4b7>] __do_request+0xa7/0x2c0 [ceph] [2346295.625829] [<ffffffffa038d7bc>] __wake_requests+0x6c/0x80 [ceph] [2346295.632220] [<ffffffffa038f375>] ceph_mdsc_handle_map+0x435/0x4e0 [ceph] [2346295.639228] [<ffffffffa0375379>] extra_mon_dispatch+0x29/0x30 [ceph] [2346295.645933] [<ffffffffa03459e4>] dispatch+0x54/0x820 [libceph] [2346295.652182] [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10 [2346295.658354] [<ffffffffa034271d>] con_work+0x12ed/0x20f0 [libceph] [2346295.664749] [<ffffffff810867ef>] ? finish_task_switch+0x3f/0x120 [2346295.671170] [<ffffffff8107511f>] ? process_one_work+0x16f/0x570 [2346295.677379] [<ffffffff81075181>] process_one_work+0x1d1/0x570 [2346295.683466] [<ffffffff8107511f>] ? process_one_work+0x16f/0x570 [2346295.689830] [<ffffffff81075fbc>] worker_thread+0x11c/0x530 [2346295.695750] [<ffffffff81075ea0>] ? create_and_start_worker+0x50/0x50 [2346295.702633] [<ffffffff8107d2d4>] kthread+0xe4/0x100 [2346295.707798] [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130 [2346295.714419] [<ffffffff8170d02c>] ret_from_fork+0x7c/0xb0 [2346295.720179] [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130 [2346295.726734] Code: c7 c7 33 63 39 a0 31 c0 e8 f7 37 37 e1 e9 53 ff ff ff 90 66 66 66 66 90 55 48 89 e5 48 83 ec 70 48 89 5d e0 4c 89 65 e8 48 89 fb <4c> 89 6d f0 4c 89 75 f8 8b 86 e8 02 00 00 44 8b a6 68 01 00 00 [2346300.882679] INFO: rcu_sched self-detected stall on CPU { 0} (t=15000 jiffies g=306343080 c=306343079 q=33044) [2346300.893194] sending NMI to all CPUs: [2346300.897078] NMI backtrace for cpu 4 [2346300.900774] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G E 3.16.0-ceph-00005-gd8f31c8 #1 [2346300.909692] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010 [2346300.916422] task: ffff88042966a1f0 ti: ffff880429670000 task.ti: ffff880429670000 [2346300.924120] RIP: 0010:[<ffffffff81403aac>] [<ffffffff81403aac>] intel_idle+0xcc/0x150 [2346300.932289] RSP: 0018:ffff880429673df8 EFLAGS: 00000046 [2346300.937803] RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001 [2346300.945152] RDX: 0000000000000000 RSI: ffff880429673fd8 RDI: ffffffff8170c9af [2346300.952505] RBP: ffff880429673e28 R08: 0000000000000001 R09: 0000000000000000 [2346300.959858] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 [2346300.967214] R13: 0000000000000020 R14: 0000000000000003 R15: 0000000000000004 [2346300.974568] FS: 0000000000000000(0000) GS:ffff88043fd00000(0000) knlGS:0000000000000000 [2346300.982877] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [2346300.988825] CR2: 00007f8f0cfa5000 CR3: 0000000001c11000 CR4: 00000000000007e0 [2346300.996180] Stack: [2346300.998395] ffffffff815b945f 00000004ffffffff ffff88043fd1b020 000857167696c0b4 [2346301.006098] ffffffff81ca42b8 ffffffff81ca4140 ffff880429673e88 ffffffff815b946f [2346301.013807] 0000000000000000 0000000000f1adf4 0000000000000000 0000000000f1adf4 [2346301.021516] Call Trace: [2346301.024173] [<ffffffff815b945f>] ? cpuidle_enter_state+0x3f/0xe0 [2346301.030468] [<ffffffff815b946f>] cpuidle_enter_state+0x4f/0xe0 [2346301.036593] [<ffffffff815b95c7>] cpuidle_enter+0x17/0x20 [2346301.042196] [<ffffffff810a2b1b>] cpu_startup_entry+0x32b/0x490 [2346301.048321] [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10 [2346301.054360] [<ffffffff81032892>] start_secondary+0x1a2/0x210 [2346301.060305] Code: 31 d2 65 48 8b 34 25 c8 ba 00 00 48 89 d1 48 8d 86 38 c0 ff ff 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 04 25 c8 ba 00 00 f0 80 a0 3a c0 ff ff df 0f ae f0 48 [2346301.081025] NMI backtrace for cpu 0 [2346301.081040] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 184.003 msecs [2346301.081061] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=15049 jiffies, g=306343080, c=306343079, q=33045) [2346301.106630] CPU: 0 PID: 14794 Comm: kworker/0:1 Tainted: G E 3.16.0-ceph-00005-gd8f31c8 #1 [2346301.116064] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010 [2346301.122806] Workqueue: ceph-msgr con_work [libceph] [2346301.127905] task: ffff88003906c3e0 ti: ffff88011ff24000 task.ti: ffff88011ff24000 [2346301.135602] RIP: 0010:[<ffffffff813a5e8f>] [<ffffffff813a5e8f>] delay_tsc+0x4f/0x80 [2346301.143589] RSP: 0018:ffff88043fc03c98 EFLAGS: 00000003 [2346301.149097] RAX: 00000000d4cf69b2 RBX: 0000000000002710 RCX: 0000000001354050 [2346301.156445] RDX: 00000000001520f8 RSI: 00000000d4cf696e RDI: 00000000002658f3 [2346301.163789] RBP: ffff88043fc03c98 R08: 0000000000000000 R09: 0000000000000000 [2346301.171134] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81c54f80 [2346301.178478] R13: 0000000000000000 R14: ffffffff81c54f80 R15: ffffffff81d10900 [2346301.185825] FS: 0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000 [2346301.194125] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [2346301.200073] CR2: 00007f7699844000 CR3: 0000000001c11000 CR4: 00000000000007f0 [2346301.207420] Stack: [2346301.209626] ffff88043fc03ca8 ffffffff813a5dbc ffff88043fc03cc8 ffffffff810362b2 [2346301.217323] 0000000000008114 ffff88043fc0d7c0 ffff88043fc03d28 ffffffff810cbcf2 [2346301.225023] 0000000000000082 0000000000008114 ffff88043fc03cf8 00000000810a9cad [2346301.232722] Call Trace: [2346301.235364] <IRQ> [2346301.237471] [<ffffffff813a5dbc>] __const_udelay+0x2c/0x30 [2346301.243392] [<ffffffff810362b2>] arch_trigger_all_cpu_backtrace+0x92/0x120 [2346301.250560] [<ffffffff810cbcf2>] rcu_check_callbacks+0x452/0x6b0 [2346301.256858] [<ffffffff81066e78>] update_process_times+0x48/0x80 [2346301.263069] [<ffffffff810d7753>] tick_sched_handle.isra.15+0x33/0x70 [2346301.269714] [<ffffffff810d7c3c>] tick_sched_timer+0x4c/0x80 [2346301.275577] [<ffffffff81080081>] __run_hrtimer+0x71/0x250 [2346301.281266] [<ffffffff81080895>] ? hrtimer_interrupt+0x75/0x260 [2346301.287470] [<ffffffff810d7bf0>] ? tick_sched_do_timer+0x40/0x40 [2346301.293765] [<ffffffff81080927>] hrtimer_interrupt+0x107/0x260 [2346301.299884] [<ffffffff810c366e>] ? handle_edge_irq+0x1e/0x110 [2346301.305921] [<ffffffff810404b6>] hpet_interrupt_handler+0x16/0x40 [2346301.312304] [<ffffffff810c08dd>] handle_irq_event_percpu+0x5d/0x200 [2346301.318864] [<ffffffff810c0ac8>] handle_irq_event+0x48/0x70 [2346301.324723] [<ffffffff810c36c7>] handle_edge_irq+0x77/0x110 [2346301.330581] [<ffffffff810053c2>] handle_irq+0x22/0x40 [2346301.335925] [<ffffffff8170fc8e>] do_IRQ+0x5e/0x110 [2346301.341006] [<ffffffff8170ddaf>] common_interrupt+0x6f/0x6f [2346301.346862] <EOI> [2346301.348970] [<ffffffffa038b792>] ? __choose_mds+0x212/0x2b0 [ceph] [2346301.355677] [<ffffffffa038d688>] ? __do_request+0x278/0x2c0 [ceph] [2346301.362152] [<ffffffffa038d4b7>] ? __do_request+0xa7/0x2c0 [ceph] [2346301.368538] [<ffffffffa038d7bc>] __wake_requests+0x6c/0x80 [ceph] [2346301.374927] [<ffffffffa038f375>] ceph_mdsc_handle_map+0x435/0x4e0 [ceph] [2346301.381919] [<ffffffffa0375379>] extra_mon_dispatch+0x29/0x30 [ceph] [2346301.388562] [<ffffffffa03459e4>] dispatch+0x54/0x820 [libceph] [2346301.394680] [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10
Updated by Zheng Yan over 9 years ago
looks like it was waiting for i_ceph_lock, did the soft lockup recover?
Updated by Zheng Yan over 9 years ago
maybe this can be fixed by: "ceph: protect kick_requests() with mdsc->mutex" and "ceph: make sure request isn't in any waiting list when kicking request" in testing branch
Actions