Project

General

Profile

Actions

Bug #3342

closed

con_work crash on bad con pointer or mutex (kclient ffsb + thrashing)

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
libceph
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

[0]kdb> rd
ax: 0000000000000000  bx: ffff880223db6830  cx: 0000000000000000
dx: ffff880223db69f0  si: ffff880223db69f0  di: 0000000000000090
bp: ffff88020d7efc40  sp: ffff88020d7efb50  r8: 2222222222222222
r9: 2222222222222222  r10: 2222222222222222  r11: 0000000000000000
r12: 0000000000000000  r13: 0000000000000000  r14: ffff880223db69b0
r15: ffff880223db6c20  ip: ffffffffa047e7ed  flags: 00010293  cs: 00000010
ss: 00000018  ds: 00000018  es: 00000018  fs: 00000018  gs: 00000018
[0]kdb> bt
Stack traceback for pid 62
0xffff88020d793ec0       62        2  1    0   R  0xffff88020d794320 *kworker/0:2
 ffff88020d7efb50 0000000000000018 0000000000000000 0000000000000246
 0000000000000000 0000000000000000 0000000000000000 ffff880223db6a18
 0000000000000000 ffff880223db69b0 ffff880223db69b0 0000000000000246
Call Trace:
 [<ffffffff8163274c>] ? mutex_lock_nested+0x25c/0x320
 [<ffffffffa047eedf>] ? con_work+0x3f/0x16b0 [libceph]
 [<ffffffffa047efd3>] ? con_work+0x133/0x16b0 [libceph]
 [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
 [<ffffffff810717e0>] ? process_one_work+0x1a0/0x5f0
 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
 [<ffffffffa047eea0>] ? try_read+0x1920/0x1920 [libceph]
 [<ffffffff810735fd>] ? worker_thread+0x18d/0x4f0
 [<ffffffff81073470>] ? manage_workers+0x320/0x320
 [<ffffffff810791fe>] ? kthread+0xae/0xc0
 [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8163f3c4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13
 [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0
 [<ffffffff8163f3c0>] ? gs_change+0x13/0x13

that was

kernel: &id001
  kdb: true
  sha1: 6285bc231277419255f3498d3eb5ddc9f8e7fe79
nuke-on-error: true
overrides:
  ceph:
    conf:
      mds:
        debug mds: 1/20
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 2528b5ee105b16352c91af064af5c0b5a7d45d7c
  s3tests:
    branch: testing
  workunit:
    sha1: 2528b5ee105b16352c91af064af5c0b5a7d45d7c
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana72.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCwa+tIQskKvsvQ1/J9QOYGeunl0M+mIAvxqGeKFawdMKoXCNNb5YLRL6Fr9y/smJVCBIGugSb8LBrjVWF14gUOXAk1j16qJ6rsvIF8176L8tsIMxhXx3dw7dCacaGEHCrmK6KO8YhFm6ky1qftGCu7amfyzeJTj8kvrY4tl1ifwH0sv2M0iEzLXx63xn2UpAMAIvo5v+eqPjo+1w2PFXe+r2dViDN5wjlwQTKOXAPFRewmDy2o6K/rW/iRRg0tHLO4atCZr5Y8XlXYQkIBQVlXrL9CRwe8rmrxHyH8wnFqYcvpzLFk1oKw93sFNauy4mCxVntvIm3WT1S8nQuEE6HJ
  ubuntu@plana73.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxoJnvRI1V0OJuQI9SosOedC7mj9O627LjoQWPKilJiBbHduPe1byBaKrgwTeEghl43VNf+EBs1+MwVH7zlDolnwN4tAlW9bRpC2SzURJfhZskp2CSQY3l8ca7a5f0J3hdOhx47oSSapN7O2cqmPzwlL/+MrFKGi+ITT613nUtzCjduZRPdhjyqZ0cQWeb0p1neDw5hbDBKd+HAH+ek/E6DK2PaqN6YAtmIgP76q0fQ85Omd0oDlmGXpKe3jlxlPT0W/5KD1+mpobPsh/EF2qar7IG/WqHHJ6NZAcXbdZ4KiMf9erP+Pk4KkD5SJ+e3GF7OEOwXtahKIIR1An4P2GD
  ubuntu@plana74.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbl3n0OItF2T7uE/9SNA9YV+pq3nFmmr3EBjf2aMo7Bkyy6e9MC3kaK64vzlZxppbpxEhbrgWo7GsVHxphZqwtMt5Ek3RDbLUpQ8dPckbhOZWBhoZe9S8uCynDIZaF3Yq/b1//OBufSdM0TRYK4Vf4rtQnpyWPEn48HaairNhQvKCxY6XUImZc7Lh4CIaMkT0yyZkBlksJeAkTs4UgxhkSoSc9IAlAxwwPK5JpRL7GOowT+hsqYCVJM5+NVqIOGS66Vam/m9xHqecQtfYIEFoVFZiF/xi+b2kstxJzD39UTNP/XB5J2CtJOujW9J1jsdPKSz1L9XoDIkAFHzQjdhyx
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh

ubuntu@teuthology:/a/teuthology-2012-10-17_02:00:05-regression-testing-master-basic/2759

Files

dmesg.txt (261 KB) dmesg.txt Sage Weil, 10/23/2012 10:17 AM
Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12

again on master branch, ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-10-19_02:00:08-regression-testing-master-basic/3975.

maybe this is only affecting master and not testing?

Actions #2

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Sage Weil over 11 years ago

was able to reproduce on kernel master after 7 iterations of hammer.sh.

Actions #4

Updated by Sage Weil over 11 years ago

also reproduced on testing after 10 iterations.

Actions #5

Updated by Sage Weil over 11 years ago

crossing my fingers that

rbd: reset BACKOFF if unable to re-queue

will fix it

Actions #6

Updated by Sage Weil over 11 years ago

wip-msgr didn't help

this appears to be triggered by osd reset, which isn't exercised by the socket errors. going to instrument the osd to periodically stall clients.

<3>[  456.934680] libceph: osd3 10.214.131.24:6800 socket closed
<3>[  459.426482] libceph: osd0 10.214.131.26:6800 socket closed
<6>[  470.795235] libceph: osd3 weight 0x0 (out)
<3>[  472.204997] libceph: osd3 10.214.131.24:6800 socket closed
<6>[  486.303070] libceph: osd3 weight 0x10000 (in)
<6>[  494.319882] ceph: mds0 caps stale
<4>[  495.817269] libceph:  tid 5237 timed out on osd2, will reset osd
<4>[  495.851868] libceph:  tid 13007 timed out on osd1, will reset osd
<1>[  495.952780] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
<1>[  496.010904] IP: [<ffffffffa08b3b5d>] con_work+0x257d/0x2f40 [libceph]
<4>[  496.043558] PGD 0 
<4>[  496.070990] Oops: 0002 [#1] SMP 
[4]kdb> 
[4]kdb>  
[4]kdb> bt
Stack traceback for pid 3288
0xffff88020ba21f60     3288        2  1    4   R  0xffff88020ba223c0 *kworker/u:6
 ffff88020c9b1c00 0000000000000018 ffff88020c9b1d1c ffff880227214580
 0000000000000000 ffffffff81092a73 0000000000000002 ffff88020ba21f60
 ffff88020c9b1cc0 0000000000000046 0000000000000000 ffff880227294580
Call Trace:
 [<ffffffff81092a73>] ? idle_balance+0x133/0x180
 [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110
 [<ffffffff810717e0>] ? process_one_work+0x1a0/0x5f0
 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
 [<ffffffffa08b15e0>] ? ceph_msg_new+0x2e0/0x2e0 [libceph]
 [<ffffffff810735fd>] ? worker_thread+0x18d/0x4f0
 [<ffffffff81073470>] ? manage_workers+0x320/0x320
 [<ffffffff810791fe>] ? kthread+0xae/0xc0
 [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8163f3c4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13
 [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0
 [<ffffffff8163f3c0>] ? gs_change+0x13/0x13

Actions #7

Updated by Sage Weil over 11 years ago

ooh, captured libceph logging leading up to this!

Actions #8

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to 7

libceph: avoid NULL kref_put when osd reset races with alloc_msg

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved
Actions

Also available in: Atom PDF