Bug #2081

msgr: spinlock badness?

Added by Sage Weil about 12 years ago. Updated almost 12 years ago.

Status: Can't reproduce
Priority: High
% Done: 0%
Regression: No
Severity: 3 - minor

Description

Captured this console fragment from a crashed QA run:

[ 8598.813081] Process kworker/0:3 (pid: 25843, threadinfo ffff8800df23a000, task ffff880037eb1f70)
[ 8598.813082] Stack:
[ 8598.813084]  ffff8800df23b728 ffffffff8107f285 ffff8800df23b728 ffff880111c13b80
[ 8598.813087]  ffff8800df23b7c8 ffffffff81602db3 0000000000000000 ffffffff81067021
[ 8598.813090]  ffff880037eb1f70 0000000000013b80 ffff8800df23bfd8 ffff8800df23a010
[ 8598.813093] Call Trace:
[ 8598.813096]  [<ffffffff8107f285>] wq_worker_sleeping+0x15/0xa0
[ 8598.813099]  [<ffffffff81602db3>] __schedule+0x5c3/0x940
[ 8598.813102]  [<ffffffff81067021>] ? do_exit+0x551/0x880
[ 8598.813105]  [<ffffffff81065f1d>] ? release_task+0x1d/0x470
[ 8598.813108]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8598.813111]  [<ffffffff81067097>] do_exit+0x5c7/0x880
[ 8598.813113]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8598.813117]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8598.813120]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8598.813123]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8598.813127]  [<ffffffff81609b61>] ? do_page_fault+0x221/0x4b0
[ 8598.813130]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8598.813133]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8598.813136]  [<ffffffff814f04e4>] ? release_sock+0x144/0x190
[ 8598.813140]  [<ffffffff81545245>] ? tcp_sendpage+0xc5/0x590
[ 8598.813143]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8598.813147]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8598.813154]  [<ffffffffa048ec30>] ? try_write+0x570/0xef0 [libceph]
[ 8598.813158]  [<ffffffff814eb1b4>] ? kernel_recvmsg+0x44/0x60
[ 8598.813165]  [<ffffffffa048d858>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8598.813173]  [<ffffffffa04901b0>] con_work+0xc00/0x1b10 [libceph]
[ 8598.813176]  [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
[ 8598.813179]  [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
[ 8598.813186]  [<ffffffffa048f5b0>] ? try_write+0xef0/0xef0 [libceph]
[ 8598.813189]  [<ffffffff810821a3>] worker_thread+0x173/0x400
[ 8598.813192]  [<ffffffff81082030>] ? manage_workers+0x210/0x210
[ 8598.813195]  [<ffffffff810871e6>] kthread+0xb6/0xc0
[ 8598.813198]  [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
[ 8598.813201]  [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
[ 8598.813204]  [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
[ 8598.813207]  [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
[ 8598.813210]  [<ffffffff816103c0>] ? gs_change+0x13/0x13
[ 8598.813212] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 <48> 8b 40 f8 c9 c3 eb 08 90 90 90 90 90 90 90 90 55 48 89 e5 66 
[ 8598.813239] RIP  [<ffffffff81086bd0>] kthread_data+0x10/0x20
[ 8598.813242]  RSP <ffff8800df23b708>
[ 8598.813243] CR2: fffffffffffffff8
[ 8598.813245] ---[ end trace 96bfe67b43fe6e90 ]---
[ 8598.813246] Fixing recursive fault but reboot is needed!
[ 8599.827067] BUG: spinlock lockup on CPU#0, kworker/0:3/25843
[ 8599.827067]  lock: ffff880111c13b80, .magic: dead4ead, .owner: kworker/0:3/25843, .owner_cpu: 0
[ 8599.827067] Pid: 25843, comm: kworker/0:3 Tainted: G      D W    3.2.0-ceph-00161-g27772da #1
[ 8599.827067] Call Trace:
[ 8599.827067]  [<ffffffff8131bf18>] spin_dump+0x78/0xc0
[ 8599.827067]  [<ffffffff8131c13d>] do_raw_spin_lock+0xed/0x120
[ 8599.827067]  [<ffffffff81605835>] _raw_spin_lock_irq+0x45/0x50
[ 8599.827067]  [<ffffffff816028c7>] ? __schedule+0xd7/0x940
[ 8599.827067]  [<ffffffff816028c7>] __schedule+0xd7/0x940
[ 8599.827067]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8599.827067]  [<ffffffff81067304>] do_exit+0x834/0x880
[ 8599.827067]  [<ffffffff81063c65>] ? kmsg_dump+0x105/0x140
[ 8599.827067]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8599.827067]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8599.827067]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8599.827067]  [<ffffffffa01a9a19>] ? drm_mode_object_find+0x39/0x90 [drm]
[ 8599.827067]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8599.827067]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8599.827067]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8599.827067]  [<ffffffff81055d62>] ? cpuacct_charge+0x22/0xb0
[ 8599.827067]  [<ffffffff81055d62>] ? cpuacct_charge+0x22/0xb0
[ 8599.827067]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8599.827067]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8599.827067]  [<ffffffff81086bd0>] ? kthread_data+0x10/0x20
[ 8599.827067]  [<ffffffff8107f285>] wq_worker_sleeping+0x15/0xa0
[ 8599.827067]  [<ffffffff81602db3>] __schedule+0x5c3/0x940
[ 8599.827067]  [<ffffffff81067021>] ? do_exit+0x551/0x880
[ 8599.827067]  [<ffffffff81065f1d>] ? release_task+0x1d/0x470
[ 8599.827067]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8599.827067]  [<ffffffff81067097>] do_exit+0x5c7/0x880
[ 8599.827067]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8599.827067]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8599.827067]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8599.827067]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8599.827067]  [<ffffffff81609b61>] ? do_page_fault+0x221/0x4b0
[ 8599.827067]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8599.827067]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8599.827067]  [<ffffffff814f04e4>] ? release_sock+0x144/0x190
[ 8599.827067]  [<ffffffff81545245>] ? tcp_sendpage+0xc5/0x590
[ 8599.827067]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8599.827067]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8599.827067]  [<ffffffffa048ec30>] ? try_write+0x570/0xef0 [libceph]
[ 8599.827067]  [<ffffffff814eb1b4>] ? kernel_recvmsg+0x44/0x60
[ 8599.827067]  [<ffffffffa048d858>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8599.827067]  [<ffffffffa04901b0>] con_work+0xc00/0x1b10 [libceph]
[ 8599.827067]  [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
[ 8599.827067]  [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
[ 8599.827067]  [<ffffffffa048f5b0>] ? try_write+0xef0/0xef0 [libceph]
[ 8599.827067]  [<ffffffff810821a3>] worker_thread+0x173/0x400
[ 8599.827067]  [<ffffffff81082030>] ? manage_workers+0x210/0x210
[ 8599.827067]  [<ffffffff810871e6>] kthread+0xb6/0xc0
[ 8599.827067]  [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
[ 8599.827067]  [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
[ 8599.827067]  [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
[ 8599.827067]  [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
[ 8599.827067]  [<ffffffff816103c0>] ? gs_change+0x13/0x13
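One detail worth noting from the first oops: CR2 is fffffffffffffff8, which is -8 as a signed 64-bit value. That pattern is what a NULL pointer dereferenced at a small negative offset looks like, and it matches the faulting instruction in the Code: line (mov rax,[rax-0x8] inside kthread_data, right after loading a field from the task struct). Whether the loaded field was actually NULL is an inference, not confirmed from this log. A quick sanity check of the address arithmetic:

```python
def to_signed64(value):
    """Interpret an unsigned 64-bit value as a signed two's-complement integer."""
    return value - (1 << 64) if value >= (1 << 63) else value

# CR2 from the oops: the faulting virtual address.
cr2 = 0xFFFFFFFFFFFFFFF8
offset = to_signed64(cr2)
print(offset)  # -8: consistent with dereferencing NULL at offset -8
```

This is why the fault recurses: the oops path calls do_exit -> schedule, which calls back into wq_worker_sleeping/kthread_data for the dying worker and faults again, leading to the spinlock lockup report that follows.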

The job was:
ubuntu@teuthology:/a/nightly_coverage_2012-02-15-b/12139$ cat config.yaml 
kernel: &id001
  sha1: 27772dabb75b1072a81c0215b61b066bf8810f6c
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: e32668f8b83abad74e858d9e9fffbd456968a918
roles:
- - mon.a
  - mon.c
  - osd.0
- - mon.b
  - mds.a
  - osd.1
- - client.0
targets:
  ubuntu@sepia55.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDZ0HauoMdA5zgEmINOhrUYdVO1V3PDt4dp/dfm+AwB3iunb6Y/BXifGRDznlrqgIZ8+Sxl7U6lPsMsj0+MOysoWA1GPmpQqwQlDLF3SMOKvRXnGALB2x2spotMgQstoSYQWUAWK+6tC41oZq2R6ht0bljU0jiS8TP3qCZ+I5izt/p31vUjZUAvtCZBQkGr8vLZDrc+/R5dCPkGn/D+IoNA0P+l75WO1nHOpAfuYN7q0xCUN42NBJl13lCN9xZBaQZyDC4OyFp7sCsSEO16SkHpiLgALITUREksvWmS/OAcIk6ayvQNWr8mBCYpiiGgr8Arvdn826UB6PDCZ4Ym9J3H
  ubuntu@sepia56.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUTeKEAoW2ZB3Rl0YEdWLBCfL4PuA3QA6KC2BiPsR3UXfI92jFNnZ8Yr4Qz7pJQkI9ELdC1p4T9ro1zHv/fY7U5jtFeT4aJ1daTY2fw5R5dFuMTaBnvCnWczJshoFEaEgvwh8CHXyykPtpl9l329kWOK8VjNEyN9xCork9i7NLkrPXj6Mq44baZWw32IaFWnpcTfhL1NprMeu7pSLmZooRWElIXrT2WD8FJGp8t1FlSpvOwou5pIcW1EzBLNoR8YG8lVAl5r9xKsRI1zlOtmkKeKosasoNNx8PYmSGXR/ToiMlCpxVtqMxtd7bT1VQLqDEu7W/MMHnj2T9M8ZTF759
  ubuntu@sepia6.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDS1viQGCX9RNrDOwhu61WNR7UUIfR83pRvN2K2Xw8uWoUB5jEkpU6E3sYyvzEA+kEQo9TDbIZSsm8uYFciP7iy5ONj7aQYwqueec9QQS/v4ivRCg9SuD/s146ZiaIHyaOWqHVTleslUwAHRGEKRwFw8KoKIUKO119KrWfx57ahaeWa8ex9wq96TmScPcU4hB4RH2BQCYD9Qexn3JIxdf0L8dwNDHAqsKPV6g5ewlszfNw30cXip7WgL6sWkOPzv9A2xsUIPUkbTj9lLyk9JsUtyDEPTFziKEfsKbDREX2ayyr0d43H8SlhFkMzPRTrlRixPlebS82ZdEvvqYPrW63n
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- chef: null
- ceph: null
- rbd:
    all: null
- workunit:
    all:
    - kernel_untar_build.sh

History

#1 Updated by Sage Weil about 12 years ago

  • Assignee set to Alex Elder
  • Priority changed from Normal to High

#2 Updated by Sage Weil about 12 years ago

Hit the same trace on the console in ubuntu@teuthology:/a/nightly_coverage_2012-02-20-b/12984.

#3 Updated by Alex Elder almost 12 years ago

This:

ubuntu@teuthology:/a/nightly_coverage_2012-02-20-b/12984

does not provide very much useful information; however, it does indicate that
the kernel was at this revision:

cc050a5dab7beb893d72dbd41e8aa5c07f1d0427

That commit is a temporary one, so here's a list of the messages that lead
up to it in the log:

cc050a5dab7beb893d72dbd41e8aa5c07f1d0427 xfs: only take the ILOCK in xfs_reclaim
07fd42934a53b8486709f7f866346a9e4bb6d5ce ceph: fix overflow check in build_snap_
fe6aa081fffeb0712c2c1e45931f9be005219b67 libceph: fix overflow check in crush_de
27772dabb75b1072a81c0215b61b066bf8810f6c ceph: messenger: some simple changes
fa75088b4e92ee0b659ef2060f5c61cc6dcca66e ceph: messenger: small refactor in writ
e92a0d861f1c9910916978be647f5ebeed2183fe ceph: messenger: do crc calculations ou
b18a3fc5185c30094f39d20514bea77bfd2d8f34 ceph: messenger: separate CRC calculati
9817aa47f049db29b5315ee0a006681f0e48c42f ceph: messenger: use "do" in CRC-relate
1477b41675cbb134a3398a45e8108b72743b5b3d ceph: ensure Boolean options support bo

#4 Updated by Alex Elder almost 12 years ago

  • Status changed from New to Need More Info

Bugs 1793 and 2081 both have the signature of a page fault/bad memory reference
from process_one_work() -> con_work(), and they may well be the same issue.
I have therefore been treating them as such, and have been attempting to
reproduce them by running the kernel_untar_build.sh workunit repeatedly.
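The repeated-run attempts above amount to a simple retry loop. As a sketch (the real runs go through teuthology; WORKUNIT here defaults to a no-op placeholder rather than the actual kernel_untar_build.sh invocation, which is an assumption for illustration):

```shell
#!/bin/sh
# Run a workunit repeatedly, stopping at the first failure.
# WORKUNIT is a placeholder command; on a real client it would be the
# kernel_untar_build.sh script referenced in this ticket.
WORKUNIT="${WORKUNIT:-true}"
ITERATIONS="${ITERATIONS:-50}"

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    if ! $WORKUNIT; then
        echo "workunit failed on iteration $i"
        exit 1
    fi
    i=$((i + 1))
done
echo "completed $i iterations without error"
```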

For the last day or two I have been testing with the latest version of the
ceph-client code (c666601a93). One time I hit a kernel failure with a
signature different from either 1793 or 2081; it reported:

INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)

I believe this is bug 2174, and am adding an update to that to record
this occurrence.

I restarted testing and completed 50 more iterations of kernel_untar_build.sh
without error. I will try a bit more. I think we should keep these two bugs
in a sort of inactive status (1793 is already "Need More Info", but 2081 was
not until now) until we start seeing occurrences again. A kernel core dump
would also be helpful, once we get that in place (see bug 2127).

#5 Updated by Alex Elder almost 12 years ago

Another 100 iterations of kernel_untar_build.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) completed without
error. I'm going to quit testing this now and wait to see whether
our regular automated testing produces it again.

#6 Updated by Alex Elder almost 12 years ago

  • Status changed from Need More Info to Can't reproduce

Marking this Can't Reproduce. Will reopen if it happens again.
