Project

General

Profile

Actions

Bug #9426

closed

kcephfs: soft lockup in handle mds map

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

[2346240.855916] ceph: mds0 reconnect start
[2346267.396560] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:14794]
[2346267.403473] Modules linked in: ipmi_devintf(E) ipmi_si(E) ipmi_msghandler(E) ip6table_filter(E) ip6_tables(E) ebtable_nat(E) ebtables(E) ipt_MASQUERADE(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_state(E) nf_conntrack(E) ipt_REJECT(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E) x_tables(E) bridge(E) stp(E) llc(E) ceph(E) libceph(E) fscache(E) psmouse(E) joydev(E) serio_raw(E) gpio_ich(E) xfs(E) i7core_edac(E) tpm_infineon(E) edac_core(E) tpm_tis(E) lpc_ich(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) btrfs(E) raid6_pq(E) e1000e(E) ahci(E) ptp(E) libahci(E) pps_core(E) arcmsr(E) xor(E) libcrc32c(E)
[2346267.465985] irq event stamp: 0
[2346267.469307] hardirqs last  enabled at (0): [<          (null)>]           (null)
[2346267.477119] hardirqs last disabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0
[2346267.485662] softirqs last  enabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0
[2346267.494151] softirqs last disabled at (0): [<          (null)>]           (null)
[2346267.502007] CPU: 0 PID: 14794 Comm: kworker/0:1 Tainted: G            E 3.16.0-ceph-00005-gd8f31c8 #1
[2346267.511566] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010
[2346267.518338] Workqueue: ceph-msgr con_work [libceph]
[2346267.523475] task: ffff88003906c3e0 ti: ffff88011ff24000 task.ti: ffff88011ff24000
[2346267.531330] RIP: 0010:[<ffffffff810aebbe>]  [<ffffffff810aebbe>] lock_acquire+0x9e/0x120
[2346267.539860] RSP: 0018:ffff88011ff279c8  EFLAGS: 00000246
[2346267.545576] RAX: ffff88003906c3e0 RBX: ffffffff8170de60 RCX: 0000000000000000
[2346267.553031] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880406d98028
[2346267.560494] RBP: ffff88011ff27a48 R08: 0000000000000001 R09: 0000000000000000
[2346267.567906] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011ff27938
[2346267.575394] R13: ffff88043fc13b40 R14: ffff88011ff24000 R15: 0000000000000001
[2346267.582799] FS:  0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
[2346267.591170] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2346267.597211] CR2: 00007f7699844000 CR3: 0000000001c11000 CR4: 00000000000007f0
[2346267.604800] Stack:
[2346267.607007]  0000000000000000 ffffffffa038b657 ffffffff00000000 ffffffff8170de60
[2346267.614823]  0000000000000000 0000000000000246 0000000000000000 0000000000000001
[2346267.622733]  0000000000000000 ffff880406d98028 ffffffffa038b792 ffff880406d98010
[2346267.630494] Call Trace:
[2346267.633205]  [<ffffffffa038b657>] ? __choose_mds+0xd7/0x2b0 [ceph]
[2346267.639639]  [<ffffffff8170de60>] ? retint_restore_args+0xe/0xe
[2346267.645919]  [<ffffffffa038b792>] ? __choose_mds+0x212/0x2b0 [ceph]
[2346267.652590]  [<ffffffff8170c219>] _raw_spin_lock+0x39/0x50
[2346267.658329]  [<ffffffffa038b657>] ? __choose_mds+0xd7/0x2b0 [ceph]
[2346267.664839]  [<ffffffff8170c9eb>] ? _raw_spin_unlock+0x2b/0x40
[2346267.671076]  [<ffffffffa038b657>] __choose_mds+0xd7/0x2b0 [ceph]
[2346267.677388]  [<ffffffffa038d4b7>] __do_request+0xa7/0x2c0 [ceph]
[2346267.683737]  [<ffffffffa038d7bc>] __wake_requests+0x6c/0x80 [ceph]
[2346267.690307]  [<ffffffffa038f375>] ceph_mdsc_handle_map+0x435/0x4e0 [ceph]
[2346267.697345]  [<ffffffffa0375379>] extra_mon_dispatch+0x29/0x30 [ceph]
[2346267.704142]  [<ffffffffa03459e4>] dispatch+0x54/0x820 [libceph]
[2346267.710317]  [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10
[2346267.716572]  [<ffffffffa034271d>] con_work+0x12ed/0x20f0 [libceph]
[2346267.723008]  [<ffffffff810867ef>] ? finish_task_switch+0x3f/0x120
[2346267.729452]  [<ffffffff8107511f>] ? process_one_work+0x16f/0x570
[2346267.735750]  [<ffffffff81075181>] process_one_work+0x1d1/0x570
[2346267.741837]  [<ffffffff8107511f>] ? process_one_work+0x16f/0x570
[2346267.748189]  [<ffffffff81075fbc>] worker_thread+0x11c/0x530
[2346267.754071]  [<ffffffff81075ea0>] ? create_and_start_worker+0x50/0x50
[2346267.760731]  [<ffffffff8107d2d4>] kthread+0xe4/0x100
[2346267.766005]  [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130
[2346267.772632]  [<ffffffff8170d02c>] ret_from_fork+0x7c/0xb0
[2346267.778276]  [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130
[2346267.785122] Code: 74 24 08 4c 89 24 24 41 0f 94 c1 45 0f b6 c9 e8 99 dc ff ff 65 48 8b 04 25 c0 ba 00 00 c7 80 34 08 00 00 00 00 00 00 ff 75 a8 9d <48> 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 00 65 ff 04 
[2346295.381621] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:14794]
[2346295.388630] Modules linked in: ipmi_devintf(E) ipmi_si(E) ipmi_msghandler(E) ip6table_filter(E) ip6_tables(E) ebtable_nat(E) ebtables(E) ipt_MASQUERADE(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_state(E) nf_conntrack(E) ipt_REJECT(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E) x_tables(E) bridge(E) stp(E) llc(E) ceph(E) libceph(E) fscache(E) psmouse(E) joydev(E) serio_raw(E) gpio_ich(E) xfs(E) i7core_edac(E) tpm_infineon(E) edac_core(E) tpm_tis(E) lpc_ich(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) btrfs(E) raid6_pq(E) e1000e(E) ahci(E) ptp(E) libahci(E) pps_core(E) arcmsr(E) xor(E) libcrc32c(E)
[2346295.450880] irq event stamp: 0
[2346295.454137] hardirqs last  enabled at (0): [<          (null)>]           (null)
[2346295.462043] hardirqs last disabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0
[2346295.470550] softirqs last  enabled at (0): [<ffffffff8105516d>] copy_process+0x78d/0x22f0
[2346295.479137] softirqs last disabled at (0): [<          (null)>]           (null)
[2346295.487010] CPU: 0 PID: 14794 Comm: kworker/0:1 Tainted: G            E 3.16.0-ceph-00005-gd8f31c8 #1
[2346295.496570] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010
[2346295.503366] Workqueue: ceph-msgr con_work [libceph]
[2346295.508466] task: ffff88003906c3e0 ti: ffff88011ff24000 task.ti: ffff88011ff24000
[2346295.516278] RIP: 0010:[<ffffffffa038b598>]  [<ffffffffa038b598>] __choose_mds+0x18/0x2b0 [ceph]
[2346295.525288] RSP: 0018:ffff88011ff27a88  EFLAGS: 00000286
[2346295.530933] RAX: 0000000000000000 RBX: 000000001ff27a18 RCX: ffff88016cedeef8
[2346295.538434] RDX: ffff88042631f130 RSI: ffff88004c4bd000 RDI: ffff88042631f000
[2346295.545912] RBP: ffff88011ff27af8 R08: 0000000000000001 R09: 0000000000000000
[2346295.553424] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246
[2346295.560859] R13: ffffffff810ae91c R14: ffffffffffffff7d R15: ffff880400000000
[2346295.569364] FS:  0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
[2346295.577808] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2346295.583799] CR2: 00007f7699844000 CR3: 0000000001c11000 CR4: 00000000000007f0
[2346295.591336] Stack:
[2346295.593579]  0000000000000000 0000000000000000 0000000000000000 0000000000000001
[2346295.601324]  ffff88013b0d2960 00000000ad2ead2d 000000000000ad2e 0000000000000001
[2346295.609112]  ffff880406d98010 ffffffffffffffbe ffff88004c4bd000 ffff88042631f000
[2346295.616812] Call Trace:
[2346295.619499]  [<ffffffffa038d4b7>] __do_request+0xa7/0x2c0 [ceph]
[2346295.625829]  [<ffffffffa038d7bc>] __wake_requests+0x6c/0x80 [ceph]
[2346295.632220]  [<ffffffffa038f375>] ceph_mdsc_handle_map+0x435/0x4e0 [ceph]
[2346295.639228]  [<ffffffffa0375379>] extra_mon_dispatch+0x29/0x30 [ceph]
[2346295.645933]  [<ffffffffa03459e4>] dispatch+0x54/0x820 [libceph]
[2346295.652182]  [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10
[2346295.658354]  [<ffffffffa034271d>] con_work+0x12ed/0x20f0 [libceph]
[2346295.664749]  [<ffffffff810867ef>] ? finish_task_switch+0x3f/0x120
[2346295.671170]  [<ffffffff8107511f>] ? process_one_work+0x16f/0x570
[2346295.677379]  [<ffffffff81075181>] process_one_work+0x1d1/0x570
[2346295.683466]  [<ffffffff8107511f>] ? process_one_work+0x16f/0x570
[2346295.689830]  [<ffffffff81075fbc>] worker_thread+0x11c/0x530
[2346295.695750]  [<ffffffff81075ea0>] ? create_and_start_worker+0x50/0x50
[2346295.702633]  [<ffffffff8107d2d4>] kthread+0xe4/0x100
[2346295.707798]  [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130
[2346295.714419]  [<ffffffff8170d02c>] ret_from_fork+0x7c/0xb0
[2346295.720179]  [<ffffffff8107d1f0>] ? flush_kthread_worker+0x130/0x130
[2346295.726734] Code: c7 c7 33 63 39 a0 31 c0 e8 f7 37 37 e1 e9 53 ff ff ff 90 66 66 66 66 90 55 48 89 e5 48 83 ec 70 48 89 5d e0 4c 89 65 e8 48 89 fb <4c> 89 6d f0 4c 89 75 f8 8b 86 e8 02 00 00 44 8b a6 68 01 00 00 
[2346300.882679] INFO: rcu_sched self-detected stall on CPU { 0}  (t=15000 jiffies g=306343080 c=306343079 q=33044)
[2346300.893194] sending NMI to all CPUs:
[2346300.897078] NMI backtrace for cpu 4
[2346300.900774] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G            E 3.16.0-ceph-00005-gd8f31c8 #1
[2346300.909692] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010
[2346300.916422] task: ffff88042966a1f0 ti: ffff880429670000 task.ti: ffff880429670000
[2346300.924120] RIP: 0010:[<ffffffff81403aac>]  [<ffffffff81403aac>] intel_idle+0xcc/0x150
[2346300.932289] RSP: 0018:ffff880429673df8  EFLAGS: 00000046
[2346300.937803] RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
[2346300.945152] RDX: 0000000000000000 RSI: ffff880429673fd8 RDI: ffffffff8170c9af
[2346300.952505] RBP: ffff880429673e28 R08: 0000000000000001 R09: 0000000000000000
[2346300.959858] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
[2346300.967214] R13: 0000000000000020 R14: 0000000000000003 R15: 0000000000000004
[2346300.974568] FS:  0000000000000000(0000) GS:ffff88043fd00000(0000) knlGS:0000000000000000
[2346300.982877] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2346300.988825] CR2: 00007f8f0cfa5000 CR3: 0000000001c11000 CR4: 00000000000007e0
[2346300.996180] Stack:
[2346300.998395]  ffffffff815b945f 00000004ffffffff ffff88043fd1b020 000857167696c0b4
[2346301.006098]  ffffffff81ca42b8 ffffffff81ca4140 ffff880429673e88 ffffffff815b946f
[2346301.013807]  0000000000000000 0000000000f1adf4 0000000000000000 0000000000f1adf4
[2346301.021516] Call Trace:
[2346301.024173]  [<ffffffff815b945f>] ? cpuidle_enter_state+0x3f/0xe0
[2346301.030468]  [<ffffffff815b946f>] cpuidle_enter_state+0x4f/0xe0
[2346301.036593]  [<ffffffff815b95c7>] cpuidle_enter+0x17/0x20
[2346301.042196]  [<ffffffff810a2b1b>] cpu_startup_entry+0x32b/0x490
[2346301.048321]  [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10
[2346301.054360]  [<ffffffff81032892>] start_secondary+0x1a2/0x210
[2346301.060305] Code: 31 d2 65 48 8b 34 25 c8 ba 00 00 48 89 d1 48 8d 86 38 c0 ff ff 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 04 25 c8 ba 00 00 f0 80 a0 3a c0 ff ff df 0f ae f0 48 
[2346301.081025] NMI backtrace for cpu 0
[2346301.081040] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 184.003 msecs
[2346301.081061] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=15049 jiffies, g=306343080, c=306343079, q=33045)
[2346301.106630] CPU: 0 PID: 14794 Comm: kworker/0:1 Tainted: G            E 3.16.0-ceph-00005-gd8f31c8 #1
[2346301.116064] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.1 05/27/2010
[2346301.122806] Workqueue: ceph-msgr con_work [libceph]
[2346301.127905] task: ffff88003906c3e0 ti: ffff88011ff24000 task.ti: ffff88011ff24000
[2346301.135602] RIP: 0010:[<ffffffff813a5e8f>]  [<ffffffff813a5e8f>] delay_tsc+0x4f/0x80
[2346301.143589] RSP: 0018:ffff88043fc03c98  EFLAGS: 00000003
[2346301.149097] RAX: 00000000d4cf69b2 RBX: 0000000000002710 RCX: 0000000001354050
[2346301.156445] RDX: 00000000001520f8 RSI: 00000000d4cf696e RDI: 00000000002658f3
[2346301.163789] RBP: ffff88043fc03c98 R08: 0000000000000000 R09: 0000000000000000
[2346301.171134] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81c54f80
[2346301.178478] R13: 0000000000000000 R14: ffffffff81c54f80 R15: ffffffff81d10900
[2346301.185825] FS:  0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
[2346301.194125] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[2346301.200073] CR2: 00007f7699844000 CR3: 0000000001c11000 CR4: 00000000000007f0
[2346301.207420] Stack:
[2346301.209626]  ffff88043fc03ca8 ffffffff813a5dbc ffff88043fc03cc8 ffffffff810362b2
[2346301.217323]  0000000000008114 ffff88043fc0d7c0 ffff88043fc03d28 ffffffff810cbcf2
[2346301.225023]  0000000000000082 0000000000008114 ffff88043fc03cf8 00000000810a9cad
[2346301.232722] Call Trace:
[2346301.235364]  <IRQ> 
[2346301.237471]  [<ffffffff813a5dbc>] __const_udelay+0x2c/0x30
[2346301.243392]  [<ffffffff810362b2>] arch_trigger_all_cpu_backtrace+0x92/0x120
[2346301.250560]  [<ffffffff810cbcf2>] rcu_check_callbacks+0x452/0x6b0
[2346301.256858]  [<ffffffff81066e78>] update_process_times+0x48/0x80
[2346301.263069]  [<ffffffff810d7753>] tick_sched_handle.isra.15+0x33/0x70
[2346301.269714]  [<ffffffff810d7c3c>] tick_sched_timer+0x4c/0x80
[2346301.275577]  [<ffffffff81080081>] __run_hrtimer+0x71/0x250
[2346301.281266]  [<ffffffff81080895>] ? hrtimer_interrupt+0x75/0x260
[2346301.287470]  [<ffffffff810d7bf0>] ? tick_sched_do_timer+0x40/0x40
[2346301.293765]  [<ffffffff81080927>] hrtimer_interrupt+0x107/0x260
[2346301.299884]  [<ffffffff810c366e>] ? handle_edge_irq+0x1e/0x110
[2346301.305921]  [<ffffffff810404b6>] hpet_interrupt_handler+0x16/0x40
[2346301.312304]  [<ffffffff810c08dd>] handle_irq_event_percpu+0x5d/0x200
[2346301.318864]  [<ffffffff810c0ac8>] handle_irq_event+0x48/0x70
[2346301.324723]  [<ffffffff810c36c7>] handle_edge_irq+0x77/0x110
[2346301.330581]  [<ffffffff810053c2>] handle_irq+0x22/0x40
[2346301.335925]  [<ffffffff8170fc8e>] do_IRQ+0x5e/0x110
[2346301.341006]  [<ffffffff8170ddaf>] common_interrupt+0x6f/0x6f
[2346301.346862]  <EOI> 
[2346301.348970]  [<ffffffffa038b792>] ? __choose_mds+0x212/0x2b0 [ceph]
[2346301.355677]  [<ffffffffa038d688>] ? __do_request+0x278/0x2c0 [ceph]
[2346301.362152]  [<ffffffffa038d4b7>] ? __do_request+0xa7/0x2c0 [ceph]
[2346301.368538]  [<ffffffffa038d7bc>] __wake_requests+0x6c/0x80 [ceph]
[2346301.374927]  [<ffffffffa038f375>] ceph_mdsc_handle_map+0x435/0x4e0 [ceph]
[2346301.381919]  [<ffffffffa0375379>] extra_mon_dispatch+0x29/0x30 [ceph]
[2346301.388562]  [<ffffffffa03459e4>] dispatch+0x54/0x820 [libceph]
[2346301.394680]  [<ffffffff810af68d>] ? trace_hardirqs_on+0xd/0x10

Actions #1

Updated by Zheng Yan over 9 years ago

looks like it was waiting for i_ceph_lock, did the soft lockup recover?

Actions #2

Updated by Zheng Yan over 9 years ago

maybe this can be fixed by: "ceph: protect kick_requests() with mdsc->mutex" and "ceph: make sure request isn't in any waiting list when kicking request" in testing branch

Actions #3

Updated by Sage Weil over 9 years ago

  • Status changed from New to 15
Actions #4

Updated by Sage Weil over 9 years ago

Added my Reviewed-by.

Actions #5

Updated by Zheng Yan over 9 years ago

  • Status changed from 15 to Resolved
Actions

Also available in: Atom PDF