Bug #2081

msgr: spinlock badness?

Added by Sage Weil about 12 years ago. Updated almost 12 years ago.

Status: Can't reproduce
Priority: High
% Done: 0%
Regression: No
Severity: 3 - minor

Description

Captured this console fragment from a crashed QA run:

[ 8598.813081] Process kworker/0:3 (pid: 25843, threadinfo ffff8800df23a000, task ffff880037eb1f70)
[ 8598.813082] Stack:
[ 8598.813084]  ffff8800df23b728 ffffffff8107f285 ffff8800df23b728 ffff880111c13b80
[ 8598.813087]  ffff8800df23b7c8 ffffffff81602db3 0000000000000000 ffffffff81067021
[ 8598.813090]  ffff880037eb1f70 0000000000013b80 ffff8800df23bfd8 ffff8800df23a010
[ 8598.813093] Call Trace:
[ 8598.813096]  [<ffffffff8107f285>] wq_worker_sleeping+0x15/0xa0
[ 8598.813099]  [<ffffffff81602db3>] __schedule+0x5c3/0x940
[ 8598.813102]  [<ffffffff81067021>] ? do_exit+0x551/0x880
[ 8598.813105]  [<ffffffff81065f1d>] ? release_task+0x1d/0x470
[ 8598.813108]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8598.813111]  [<ffffffff81067097>] do_exit+0x5c7/0x880
[ 8598.813113]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8598.813117]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8598.813120]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8598.813123]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8598.813127]  [<ffffffff81609b61>] ? do_page_fault+0x221/0x4b0
[ 8598.813130]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8598.813133]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8598.813136]  [<ffffffff814f04e4>] ? release_sock+0x144/0x190
[ 8598.813140]  [<ffffffff81545245>] ? tcp_sendpage+0xc5/0x590
[ 8598.813143]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8598.813147]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8598.813154]  [<ffffffffa048ec30>] ? try_write+0x570/0xef0 [libceph]
[ 8598.813158]  [<ffffffff814eb1b4>] ? kernel_recvmsg+0x44/0x60
[ 8598.813165]  [<ffffffffa048d858>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8598.813173]  [<ffffffffa04901b0>] con_work+0xc00/0x1b10 [libceph]
[ 8598.813176]  [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
[ 8598.813179]  [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
[ 8598.813186]  [<ffffffffa048f5b0>] ? try_write+0xef0/0xef0 [libceph]
[ 8598.813189]  [<ffffffff810821a3>] worker_thread+0x173/0x400
[ 8598.813192]  [<ffffffff81082030>] ? manage_workers+0x210/0x210
[ 8598.813195]  [<ffffffff810871e6>] kthread+0xb6/0xc0
[ 8598.813198]  [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
[ 8598.813201]  [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
[ 8598.813204]  [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
[ 8598.813207]  [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
[ 8598.813210]  [<ffffffff816103c0>] ? gs_change+0x13/0x13
[ 8598.813212] Code: 66 66 66 90 65 48 8b 04 25 40 c4 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 66 90 55 48 89 e5 66 66 66 66 90 48 8b 87 48 03 00 00 <48> 8b 40 f8 c9 c3 eb 08 90 90 90 90 90 90 90 90 55 48 89 e5 66 
[ 8598.813239] RIP  [<ffffffff81086bd0>] kthread_data+0x10/0x20
[ 8598.813242]  RSP <ffff8800df23b708>
[ 8598.813243] CR2: fffffffffffffff8
[ 8598.813245] ---[ end trace 96bfe67b43fe6e90 ]---
[ 8598.813246] Fixing recursive fault but reboot is needed!
[ 8599.827067] BUG: spinlock lockup on CPU#0, kworker/0:3/25843
[ 8599.827067]  lock: ffff880111c13b80, .magic: dead4ead, .owner: kworker/0:3/25843, .owner_cpu: 0
[ 8599.827067] Pid: 25843, comm: kworker/0:3 Tainted: G      D W    3.2.0-ceph-00161-g27772da #1
[ 8599.827067] Call Trace:
[ 8599.827067]  [<ffffffff8131bf18>] spin_dump+0x78/0xc0
[ 8599.827067]  [<ffffffff8131c13d>] do_raw_spin_lock+0xed/0x120
[ 8599.827067]  [<ffffffff81605835>] _raw_spin_lock_irq+0x45/0x50
[ 8599.827067]  [<ffffffff816028c7>] ? __schedule+0xd7/0x940
[ 8599.827067]  [<ffffffff816028c7>] __schedule+0xd7/0x940
[ 8599.827067]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8599.827067]  [<ffffffff81067304>] do_exit+0x834/0x880
[ 8599.827067]  [<ffffffff81063c65>] ? kmsg_dump+0x105/0x140
[ 8599.827067]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8599.827067]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8599.827067]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8599.827067]  [<ffffffffa01a9a19>] ? drm_mode_object_find+0x39/0x90 [drm]
[ 8599.827067]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8599.827067]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8599.827067]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8599.827067]  [<ffffffff81055d62>] ? cpuacct_charge+0x22/0xb0
[ 8599.827067]  [<ffffffff81055d62>] ? cpuacct_charge+0x22/0xb0
[ 8599.827067]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8599.827067]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8599.827067]  [<ffffffff81086bd0>] ? kthread_data+0x10/0x20
[ 8599.827067]  [<ffffffff8107f285>] wq_worker_sleeping+0x15/0xa0
[ 8599.827067]  [<ffffffff81602db3>] __schedule+0x5c3/0x940
[ 8599.827067]  [<ffffffff81067021>] ? do_exit+0x551/0x880
[ 8599.827067]  [<ffffffff81065f1d>] ? release_task+0x1d/0x470
[ 8599.827067]  [<ffffffff8160345f>] schedule+0x3f/0x60
[ 8599.827067]  [<ffffffff81067097>] do_exit+0x5c7/0x880
[ 8599.827067]  [<ffffffff81063bd5>] ? kmsg_dump+0x75/0x140
[ 8599.827067]  [<ffffffff81607100>] oops_end+0xb0/0xf0
[ 8599.827067]  [<ffffffff8103f88d>] no_context+0xfd/0x270
[ 8599.827067]  [<ffffffff8103fb45>] __bad_area_nosemaphore+0x145/0x230
[ 8599.827067]  [<ffffffff81609b61>] ? do_page_fault+0x221/0x4b0
[ 8599.827067]  [<ffffffff8103fc43>] bad_area_nosemaphore+0x13/0x20
[ 8599.827067]  [<ffffffff81609c8e>] do_page_fault+0x34e/0x4b0
[ 8599.827067]  [<ffffffff814f04e4>] ? release_sock+0x144/0x190
[ 8599.827067]  [<ffffffff81545245>] ? tcp_sendpage+0xc5/0x590
[ 8599.827067]  [<ffffffff81315d5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 8599.827067]  [<ffffffff81606535>] page_fault+0x25/0x30
[ 8599.827067]  [<ffffffffa048ec30>] ? try_write+0x570/0xef0 [libceph]
[ 8599.827067]  [<ffffffff814eb1b4>] ? kernel_recvmsg+0x44/0x60
[ 8599.827067]  [<ffffffffa048d858>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 8599.827067]  [<ffffffffa04901b0>] con_work+0xc00/0x1b10 [libceph]
[ 8599.827067]  [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
[ 8599.827067]  [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
[ 8599.827067]  [<ffffffffa048f5b0>] ? try_write+0xef0/0xef0 [libceph]
[ 8599.827067]  [<ffffffff810821a3>] worker_thread+0x173/0x400
[ 8599.827067]  [<ffffffff81082030>] ? manage_workers+0x210/0x210
[ 8599.827067]  [<ffffffff810871e6>] kthread+0xb6/0xc0
[ 8599.827067]  [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
[ 8599.827067]  [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
[ 8599.827067]  [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
[ 8599.827067]  [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
[ 8599.827067]  [<ffffffff816103c0>] ? gs_change+0x13/0x13
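One detail worth noting from the first oops: CR2 is fffffffffffffff8, which is -8 as a signed 64-bit value. That pattern is what a NULL pointer dereferenced at a small negative offset looks like, and it matches the faulting instruction in the Code: line (mov rax,[rax-0x8] inside kthread_data, right after loading a field from the task struct). Whether the loaded field was actually NULL is an inference, not confirmed from this log. A quick sanity check of the address arithmetic:

```python
def to_signed64(value):
    """Interpret an unsigned 64-bit value as a signed two's-complement integer."""
    return value - (1 << 64) if value >= (1 << 63) else value

# CR2 from the oops: the faulting virtual address.
cr2 = 0xFFFFFFFFFFFFFFF8
offset = to_signed64(cr2)
print(offset)  # -8: consistent with dereferencing NULL at offset -8
```

This is why the fault recurses: the oops path calls do_exit -> schedule, which calls back into wq_worker_sleeping/kthread_data for the dying worker and faults again, leading to the spinlock lockup report that follows.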

The job was:
ubuntu@teuthology:/a/nightly_coverage_2012-02-15-b/12139$ cat config.yaml 
kernel: &id001
  sha1: 27772dabb75b1072a81c0215b61b066bf8810f6c
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: e32668f8b83abad74e858d9e9fffbd456968a918
roles:
- - mon.a
  - mon.c
  - osd.0
- - mon.b
  - mds.a
  - osd.1
- - client.0
targets:
  ubuntu@sepia55.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDZ0HauoMdA5zgEmINOhrUYdVO1V3PDt4dp/dfm+AwB3iunb6Y/BXifGRDznlrqgIZ8+Sxl7U6lPsMsj0+MOysoWA1GPmpQqwQlDLF3SMOKvRXnGALB2x2spotMgQstoSYQWUAWK+6tC41oZq2R6ht0bljU0jiS8TP3qCZ+I5izt/p31vUjZUAvtCZBQkGr8vLZDrc+/R5dCPkGn/D+IoNA0P+l75WO1nHOpAfuYN7q0xCUN42NBJl13lCN9xZBaQZyDC4OyFp7sCsSEO16SkHpiLgALITUREksvWmS/OAcIk6ayvQNWr8mBCYpiiGgr8Arvdn826UB6PDCZ4Ym9J3H
  ubuntu@sepia56.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUTeKEAoW2ZB3Rl0YEdWLBCfL4PuA3QA6KC2BiPsR3UXfI92jFNnZ8Yr4Qz7pJQkI9ELdC1p4T9ro1zHv/fY7U5jtFeT4aJ1daTY2fw5R5dFuMTaBnvCnWczJshoFEaEgvwh8CHXyykPtpl9l329kWOK8VjNEyN9xCork9i7NLkrPXj6Mq44baZWw32IaFWnpcTfhL1NprMeu7pSLmZooRWElIXrT2WD8FJGp8t1FlSpvOwou5pIcW1EzBLNoR8YG8lVAl5r9xKsRI1zlOtmkKeKosasoNNx8PYmSGXR/ToiMlCpxVtqMxtd7bT1VQLqDEu7W/MMHnj2T9M8ZTF759
  ubuntu@sepia6.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDS1viQGCX9RNrDOwhu61WNR7UUIfR83pRvN2K2Xw8uWoUB5jEkpU6E3sYyvzEA+kEQo9TDbIZSsm8uYFciP7iy5ONj7aQYwqueec9QQS/v4ivRCg9SuD/s146ZiaIHyaOWqHVTleslUwAHRGEKRwFw8KoKIUKO119KrWfx57ahaeWa8ex9wq96TmScPcU4hB4RH2BQCYD9Qexn3JIxdf0L8dwNDHAqsKPV6g5ewlszfNw30cXip7WgL6sWkOPzv9A2xsUIPUkbTj9lLyk9JsUtyDEPTFziKEfsKbDREX2ayyr0d43H8SlhFkMzPRTrlRixPlebS82ZdEvvqYPrW63n
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- chef: null
- ceph: null
- rbd:
    all: null
- workunit:
    all:
    - kernel_untar_build.sh

History

#1 Updated by Sage Weil about 12 years ago

  • Assignee set to Alex Elder
  • Priority changed from Normal to High

#2 Updated by Sage Weil about 12 years ago

Hit the same trace on the console in ubuntu@teuthology:/a/nightly_coverage_2012-02-20-b/12984.

#3 Updated by Alex Elder almost 12 years ago

This:

ubuntu@teuthology:/a/nightly_coverage_2012-02-20-b/12984

does not provide very much useful information; however, it does indicate that
the kernel was at this revision:

cc050a5dab7beb893d72dbd41e8aa5c07f1d0427

That commit is a temporary one, so here's a list of the messages that lead
up to it in the log:

cc050a5dab7beb893d72dbd41e8aa5c07f1d0427 xfs: only take the ILOCK in xfs_reclaim
07fd42934a53b8486709f7f866346a9e4bb6d5ce ceph: fix overflow check in build_snap_
fe6aa081fffeb0712c2c1e45931f9be005219b67 libceph: fix overflow check in crush_de
27772dabb75b1072a81c0215b61b066bf8810f6c ceph: messenger: some simple changes
fa75088b4e92ee0b659ef2060f5c61cc6dcca66e ceph: messenger: small refactor in writ
e92a0d861f1c9910916978be647f5ebeed2183fe ceph: messenger: do crc calculations ou
b18a3fc5185c30094f39d20514bea77bfd2d8f34 ceph: messenger: separate CRC calculati
9817aa47f049db29b5315ee0a006681f0e48c42f ceph: messenger: use "do" in CRC-relate
1477b41675cbb134a3398a45e8108b72743b5b3d ceph: ensure Boolean options support bo

#4 Updated by Alex Elder almost 12 years ago

  • Status changed from New to Need More Info

Bugs 1793 and 2081 both have the signature of a page fault/bad memory reference
from process_one_work() -> con_work(), and they may well be the same issue.
I have therefore been treating them as such, and have been attempting to
reproduce them by running the kernel_untar_build.sh workunit repeatedly.
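The repeated-run attempts above amount to a simple retry loop. As a sketch (the real runs go through teuthology; WORKUNIT here defaults to a no-op placeholder rather than the actual kernel_untar_build.sh invocation, which is an assumption for illustration):

```shell
#!/bin/sh
# Run a workunit repeatedly, stopping at the first failure.
# WORKUNIT is a placeholder command; on a real client it would be the
# kernel_untar_build.sh script referenced in this ticket.
WORKUNIT="${WORKUNIT:-true}"
ITERATIONS="${ITERATIONS:-50}"

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    if ! $WORKUNIT; then
        echo "workunit failed on iteration $i"
        exit 1
    fi
    i=$((i + 1))
done
echo "completed $i iterations without error"
```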

For the last day or two I have been testing with the latest version of the
ceph-client code (c666601a93). One time I hit a kernel failure with a
signature different from either 1793 or 2081; it reported:

INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)

I believe this is bug 2174, and am adding an update to that to record
this occurrence.

I restarted testing and completed 50 more iterations of kernel_untar_build.sh
without error. I will try a bit more. I think we should keep these two bugs
in a sort of inactive status (1793 is already "Need More Info", but 2081 was
not until now) until we start seeing occurrences again. A kernel core dump
would also be helpful, once we get that in place (see bug 2127).

#5 Updated by Alex Elder almost 12 years ago

Another 100 iterations of kernel_untar_build.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) completed without
error. I'm going to quit testing this now and wait to see whether
our regular automated testing produces it again.

#6 Updated by Alex Elder almost 12 years ago

  • Status changed from Need More Info to Can't reproduce

Marking this Can't Reproduce. Will reopen if it happens again.
