Bug #2081
msgr: spinlock badness? (Closed)
Description
Captured this console fragment from a crashed QA run:
[ 8598.813081] Process kworker/0:3 (pid: 25843, threadinfo ffff8800df23a000, task ffff880037eb1f70)
[ 8598.813082] Stack:
[ 8598.813084]  ffff8800df23b728 ffffffff8107f285 ffff8800df23b728 ffff880111c13b80
[ 8598.813087]  ffff8800df23b7c8 ffffffff81602db3 0000000000000000 ffffffff81067021
[ 8598.813090]  ffff880037eb1f70 0000000000013b80 ffff8800df23bfd8 ffff8800df23a010
[ 8598.813093] Call Trace:
[ 8598.813096]  [<ffffffff8107f285>] wq_worker_sleeping+0x15/0xa0
[ 8598.813099]  [<ffffffff81602db3>] __schedule+0x5c3/0x940
[ 8598.813102]  [<ffffffff81067021>] ? do_exit+0x551/0x880
[ 8598.813105]  [<ffffffff81065f1d>] ? release_task+0x1d/0x470
[ 8598.813108]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8598.813111]  [<ffffffff81067097>] do_exit+0x5c7/0x880
[ 8598.813113]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8598.813117]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8598.813120]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8598.813123]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8598.813127]  [<ffffffff81609b61>] ? do_page_fault+0x221/0x4b0
[ 8598.813130]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8598.813133]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8598.813136]  [<ffffffff814f04e4>] ? release_sock+0x144/0x190
[ 8598.813140]  [<ffffffff81545245>] ? tcp_sendpage+0xc5/0x590
[ 8598.813143]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8598.813147]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8598.813154]  [<ffffffffa048ec30>] ? try_write+0x570/0xef0 [libceph]
[ 8598.813158]  [<ffffffff814eb1b4>] ? kernel_recvmsg+0x44/0x60
[ 8598.813165]  [<ffffffffa048d858>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8598.813173]  [<ffffffffa04901b0>] con_work+0xc00/0x1b10 [libceph]
[ 8598.813176]  [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
[ 8598.813179]  [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
[ 8598.813186]  [<ffffffffa048f5b0>] ? try_write+0xef0/0xef0 [libceph]
[ 8598.813189]  [<ffffffff810821a3>] worker_thread+0x173/0x400
[ 8598.813192]  [<ffffffff81082030>] ? manage_workers+0x210/0x210
[ 8598.813195]  [<ffffffff810871e6>] kthread+0xb6/0xc0
[ 8598.813198]  [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
[ 8598.813201]  [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
[ 8598.813204]  [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
[ 8598.813207]  [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
[ 8598.813210]  [<ffffffff816103c0>] ? gs_change+0x13/0x13
[ 8598.813212] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 <48> 8b 40 f8 c9 c3 eb 08 90 90 90 90 90 90 90 90 55 48 89 e5 66
[ 8598.813239] RIP  [<ffffffff81086bd0>] kthread_data+0x10/0x20
[ 8598.813242]  RSP <ffff8800df23b708>
[ 8598.813243] CR2: fffffffffffffff8
[ 8598.813245] ---[ end trace 96bfe67b43fe6e90 ]---
[ 8598.813246] Fixing recursive fault but reboot is needed!
[ 8599.827067] BUG: spinlock lockup on CPU#0, kworker/0:3/25843
[ 8599.827067]  lock: ffff880111c13b80, .magic: dead4ead, .owner: kworker/0:3/25843, .owner_cpu: 0
[ 8599.827067] Pid: 25843, comm: kworker/0:3 Tainted: G D W 3.2.0-ceph-00161-g27772da #1
[ 8599.827067] Call Trace:
[ 8599.827067]  [<ffffffff8131bf18>] spin_dump+0x78/0xc0
[ 8599.827067]  [<ffffffff8131c13d>] do_raw_spin_lock+0xed/0x120
[ 8599.827067]  [<ffffffff81605835>] _raw_spin_lock_irq+0x45/0x50
[ 8599.827067]  [<ffffffff816028c7>] ? __schedule+0xd7/0x940
[ 8599.827067]  [<ffffffff816028c7>] __schedule+0xd7/0x940
[ 8599.827067]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8599.827067]  [<ffffffff81067304>] do_exit+0x834/0x880
[ 8599.827067]  [<ffffffff81063c65>] ? kmsg_dump+0x105/0x140
[ 8599.827067]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8599.827067]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8599.827067]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8599.827067]  [<ffffffffa01a9a19>] ? drm_mode_object_find+0x39/0x90 [drm]
[ 8599.827067]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8599.827067]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8599.827067]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8599.827067]  [<ffffffff81055d62>] ? cpuacct_charge+0x22/0xb0
[ 8599.827067]  [<ffffffff81055d62>] ? cpuacct_charge+0x22/0xb0
[ 8599.827067]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8599.827067]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8599.827067]  [<ffffffff81086bd0>] ? kthread_data+0x10/0x20
[ 8599.827067]  [<ffffffff8107f285>] wq_worker_sleeping+0x15/0xa0
[ 8599.827067]  [<ffffffff81602db3>] __schedule+0x5c3/0x940
[ 8599.827067]  [<ffffffff81067021>] ? do_exit+0x551/0x880
[ 8599.827067]  [<ffffffff81065f1d>] ? release_task+0x1d/0x470
[ 8599.827067]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8599.827067]  [<ffffffff81067097>] do_exit+0x5c7/0x880
[ 8599.827067]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8599.827067]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8599.827067]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8599.827067]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8599.827067]  [<ffffffff81609b61>] ? do_page_fault+0x221/0x4b0
[ 8599.827067]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8599.827067]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8599.827067]  [<ffffffff814f04e4>] ? release_sock+0x144/0x190
[ 8599.827067]  [<ffffffff81545245>] ? tcp_sendpage+0xc5/0x590
[ 8599.827067]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8599.827067]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8599.827067]  [<ffffffffa048ec30>] ? try_write+0x570/0xef0 [libceph]
[ 8599.827067]  [<ffffffff814eb1b4>] ? kernel_recvmsg+0x44/0x60
[ 8599.827067]  [<ffffffffa048d858>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8599.827067]  [<ffffffffa04901b0>] con_work+0xc00/0x1b10 [libceph]
[ 8599.827067]  [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
[ 8599.827067]  [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
[ 8599.827067]  [<ffffffffa048f5b0>] ? try_write+0xef0/0xef0 [libceph]
[ 8599.827067]  [<ffffffff810821a3>] worker_thread+0x173/0x400
[ 8599.827067]  [<ffffffff81082030>] ? manage_workers+0x210/0x210
[ 8599.827067]  [<ffffffff810871e6>] kthread+0xb6/0xc0
[ 8599.827067]  [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
[ 8599.827067]  [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
[ 8599.827067]  [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
[ 8599.827067]  [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
[ 8599.827067]  [<ffffffff816103c0>] ? gs_change+0x13/0x13
The job was:
ubuntu@teuthology:/a/nightly_coverage_2012-02-15-b/12139$ cat config.yaml
kernel: &id001
  sha1: 27772dabb75b1072a81c0215b61b066bf8810f6c
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: e32668f8b83abad74e858d9e9fffbd456968a918
roles:
- - mon.a
  - mon.c
  - osd.0
- - mon.b
  - mds.a
  - osd.1
- - client.0
targets:
  ubuntu@sepia55.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDZ0HauoMdA5zgEmINOhrUYdVO1V3PDt4dp/dfm+AwB3iunb6Y/BXifGRDznlrqgIZ8+Sxl7U6lPsMsj0+MOysoWA1GPmpQqwQlDLF3SMOKvRXnGALB2x2spotMgQstoSYQWUAWK+6tC41oZq2R6ht0bljU0jiS8TP3qCZ+I5izt/p31vUjZUAvtCZBQkGr8vLZDrc+/R5dCPkGn/D+IoNA0P+l75WO1nHOpAfuYN7q0xCUN42NBJl13lCN9xZBaQZyDC4OyFp7sCsSEO16SkHpiLgALITUREksvWmS/OAcIk6ayvQNWr8mBCYpiiGgr8Arvdn826UB6PDCZ4Ym9J3H
  ubuntu@sepia56.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUTeKEAoW2ZB3Rl0YEdWLBCfL4PuA3QA6KC2BiPsR3UXfI92jFNnZ8Yr4Qz7pJQkI9ELdC1p4T9ro1zHv/fY7U5jtFeT4aJ1daTY2fw5R5dFuMTaBnvCnWczJshoFEaEgvwh8CHXyykPtpl9l329kWOK8VjNEyN9xCork9i7NLkrPXj6Mq44baZWw32IaFWnpcTfhL1NprMeu7pSLmZooRWElIXrT2WD8FJGp8t1FlSpvOwou5pIcW1EzBLNoR8YG8lVAl5r9xKsRI1zlOtmkKeKosasoNNx8PYmSGXR/ToiMlCpxVtqMxtd7bT1VQLqDEu7W/MMHnj2T9M8ZTF759
  ubuntu@sepia6.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDS1viQGCX9RNrDOwhu61WNR7UUIfR83pRvN2K2Xw8uWoUB5jEkpU6E3sYyvzEA+kEQo9TDbIZSsm8uYFciP7iy5ONj7aQYwqueec9QQS/v4ivRCg9SuD/s146ZiaIHyaOWqHVTleslUwAHRGEKRwFw8KoKIUKO119KrWfx57ahaeWa8ex9wq96TmScPcU4hB4RH2BQCYD9Qexn3JIxdf0L8dwNDHAqsKPV6g5ewlszfNw30cXip7WgL6sWkOPzv9A2xsUIPUkbTj9lLyk9JsUtyDEPTFziKEfsKbDREX2ayyr0d43H8SlhFkMzPRTrlRixPlebS82ZdEvvqYPrW63n
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- chef: null
- ceph: null
- rbd:
    all: null
- workunit:
    all:
    - kernel_untar_build.sh
Updated by Sage Weil about 12 years ago
- Assignee set to Alex Elder
- Priority changed from Normal to High
Updated by Sage Weil about 12 years ago
ubuntu@teuthology:/a/nightly_coverage_2012-02-20-b/12984 hit the same trace on the console.
Updated by Alex Elder about 12 years ago
This:
ubuntu@teuthology:/a/nightly_coverage_2012-02-20-b/12984
does not provide much useful information; however, it does indicate that
the kernel was at this revision:
cc050a5dab7beb893d72dbd41e8aa5c07f1d0427
That commit is a temporary one, so here's a list of the commits that lead
up to it in the log:
cc050a5dab7beb893d72dbd41e8aa5c07f1d0427 xfs: only take the ILOCK in xfs_reclaim
07fd42934a53b8486709f7f866346a9e4bb6d5ce ceph: fix overflow check in build_snap_
fe6aa081fffeb0712c2c1e45931f9be005219b67 libceph: fix overflow check in crush_de
27772dabb75b1072a81c0215b61b066bf8810f6c ceph: messenger: some simple changes
fa75088b4e92ee0b659ef2060f5c61cc6dcca66e ceph: messenger: small refactor in writ
e92a0d861f1c9910916978be647f5ebeed2183fe ceph: messenger: do crc calculations ou
b18a3fc5185c30094f39d20514bea77bfd2d8f34 ceph: messenger: separate CRC calculati
9817aa47f049db29b5315ee0a006681f0e48c42f ceph: messenger: use "do" in CRC-relate
1477b41675cbb134a3398a45e8108b72743b5b3d ceph: ensure Boolean options support bo
Updated by Alex Elder about 12 years ago
- Status changed from New to Need More Info
Bugs 1793 and 2081 have a signature of a page fault/bad memory reference
from process_one_work() -> con_work(), and they may well be the same issue.
So I have been treating them as such, and have been attempting to reproduce
them by running the kernel_untar_build.sh workunit repeatedly.
The last day or two I have been testing with the latest version of the
ceph-client code (c666601a93). One time I hit a kernel failure with a
signature different from either 1793 or 2081; it reported:
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)
I believe this is bug 2174, and am adding an update to that to record
this occurrence.
I restarted testing and completed 50 more iterations of kernel_untar_build.sh
without error. I will try a bit more. I think we should keep these two bugs
in a sort of inactive status (1793 is already "Need More Info", but 2081 was
not until now) until we start seeing occurrences again. A kernel core dump
would also be helpful once we get that in place (see bug 2127).
Updated by Alex Elder about 12 years ago
Another 100 iterations of kernel_untar_build.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) without
error. I'm going to quit testing this now and wait to see if
our regular automated testing produces it again.
Updated by Alex Elder about 12 years ago
- Status changed from Need More Info to Can't reproduce
Marking this Can't Reproduce. Will reopen if it happens again.