Actions
Bug #3342
closedcon_work crash on bad con pointer or mutex (kclient ffsb + thrashing)
Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
libceph
Target version:
-
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
[0]kdb> rd ax: 0000000000000000 bx: ffff880223db6830 cx: 0000000000000000 dx: ffff880223db69f0 si: ffff880223db69f0 di: 0000000000000090 bp: ffff88020d7efc40 sp: ffff88020d7efb50 r8: 2222222222222222 r9: 2222222222222222 r10: 2222222222222222 r11: 0000000000000000 r12: 0000000000000000 r13: 0000000000000000 r14: ffff880223db69b0 r15: ffff880223db6c20 ip: ffffffffa047e7ed flags: 00010293 cs: 00000010 ss: 00000018 ds: 00000018 es: 00000018 fs: 00000018 gs: 00000018 [0]kdb> bt Stack traceback for pid 62 0xffff88020d793ec0 62 2 1 0 R 0xffff88020d794320 *kworker/0:2 ffff88020d7efb50 0000000000000018 0000000000000000 0000000000000246 0000000000000000 0000000000000000 0000000000000000 ffff880223db6a18 0000000000000000 ffff880223db69b0 ffff880223db69b0 0000000000000246 Call Trace: [<ffffffff8163274c>] ? mutex_lock_nested+0x25c/0x320 [<ffffffffa047eedf>] ? con_work+0x3f/0x16b0 [libceph] [<ffffffffa047efd3>] ? con_work+0x133/0x16b0 [libceph] [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0 [<ffffffff810717e0>] ? process_one_work+0x1a0/0x5f0 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0 [<ffffffffa047eea0>] ? try_read+0x1920/0x1920 [libceph] [<ffffffff810735fd>] ? worker_thread+0x18d/0x4f0 [<ffffffff81073470>] ? manage_workers+0x320/0x320 [<ffffffff810791fe>] ? kthread+0xae/0xc0 [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8163f3c4>] ? kernel_thread_helper+0x4/0x10 [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13 [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0 [<ffffffff8163f3c0>] ? gs_change+0x13/0x13
that was
kernel: &id001 kdb: true sha1: 6285bc231277419255f3498d3eb5ddc9f8e7fe79 nuke-on-error: true overrides: ceph: conf: mds: debug mds: 1/20 fs: btrfs log-whitelist: - slow request sha1: 2528b5ee105b16352c91af064af5c0b5a7d45d7c s3tests: branch: testing workunit: sha1: 2528b5ee105b16352c91af064af5c0b5a7d45d7c roles: - - mon.a - mon.c - osd.0 - osd.1 - osd.2 - - mon.b - mds.a - osd.3 - osd.4 - osd.5 - - client.0 targets: ubuntu@plana72.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCwa+tIQskKvsvQ1/J9QOYGeunl0M+mIAvxqGeKFawdMKoXCNNb5YLRL6Fr9y/smJVCBIGugSb8LBrjVWF14gUOXAk1j16qJ6rsvIF8176L8tsIMxhXx3dw7dCacaGEHCrmK6KO8YhFm6ky1qftGCu7amfyzeJTj8kvrY4tl1ifwH0sv2M0iEzLXx63xn2UpAMAIvo5v+eqPjo+1w2PFXe+r2dViDN5wjlwQTKOXAPFRewmDy2o6K/rW/iRRg0tHLO4atCZr5Y8XlXYQkIBQVlXrL9CRwe8rmrxHyH8wnFqYcvpzLFk1oKw93sFNauy4mCxVntvIm3WT1S8nQuEE6HJ ubuntu@plana73.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxoJnvRI1V0OJuQI9SosOedC7mj9O627LjoQWPKilJiBbHduPe1byBaKrgwTeEghl43VNf+EBs1+MwVH7zlDolnwN4tAlW9bRpC2SzURJfhZskp2CSQY3l8ca7a5f0J3hdOhx47oSSapN7O2cqmPzwlL/+MrFKGi+ITT613nUtzCjduZRPdhjyqZ0cQWeb0p1neDw5hbDBKd+HAH+ek/E6DK2PaqN6YAtmIgP76q0fQ85Omd0oDlmGXpKe3jlxlPT0W/5KD1+mpobPsh/EF2qar7IG/WqHHJ6NZAcXbdZ4KiMf9erP+Pk4KkD5SJ+e3GF7OEOwXtahKIIR1An4P2GD ubuntu@plana74.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbl3n0OItF2T7uE/9SNA9YV+pq3nFmmr3EBjf2aMo7Bkyy6e9MC3kaK64vzlZxppbpxEhbrgWo7GsVHxphZqwtMt5Ek3RDbLUpQ8dPckbhOZWBhoZe9S8uCynDIZaF3Yq/b1//OBufSdM0TRYK4Vf4rtQnpyWPEn48HaairNhQvKCxY6XUImZc7Lh4CIaMkT0yyZkBlksJeAkTs4UgxhkSoSc9IAlAxwwPK5JpRL7GOowT+hsqYCVJM5+NVqIOGS66Vam/m9xHqecQtfYIEFoVFZiF/xi+b2kstxJzD39UTNP/XB5J2CtJOujW9J1jsdPKSz1L9XoDIkAFHzQjdhyx tasks: - internal.lock_machines: 3 - internal.save_config: null - internal.check_lock: null - internal.connect: null - internal.check_conflict: null - kernel: *id001 - internal.base: null - internal.archive: null - internal.coredump: null - internal.syslog: null - internal.timer: null - chef: null - clock: null - ceph: log-whitelist: - wrongly marked me down - objects unfound and apparently lost - thrashosds: null - kclient: null - workunit: clients: all: - suites/ffsb.sh
ubuntu@teuthology:/a/teuthology-2012-10-17_02:00:05-regression-testing-master-basic/2759
Files
Updated by Sage Weil over 11 years ago
- Status changed from New to 12
again on master branch, ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-10-19_02:00:08-regression-testing-master-basic/3975.
maybe this is only affecting master and not testing?
Updated by Sage Weil over 11 years ago
was able to reproduce on kernel master after 7 iterations of hammer.sh.
Updated by Sage Weil over 11 years ago
also reproduced on testing after 10 iterations.
Updated by Sage Weil over 11 years ago
crossing my fingers that
rbd: reset BACKOFF if unable to re-queue
will fix it
Updated by Sage Weil over 11 years ago
wip-msgr didn't help
this appears to be triggered by osd reset, which isn't exercised by the socket errors. going to instrument the osd to periodically stall clients.
<3>[ 456.934680] libceph: osd3 10.214.131.24:6800 socket closed <3>[ 459.426482] libceph: osd0 10.214.131.26:6800 socket closed <6>[ 470.795235] libceph: osd3 weight 0x0 (out) <3>[ 472.204997] libceph: osd3 10.214.131.24:6800 socket closed <6>[ 486.303070] libceph: osd3 weight 0x10000 (in) <6>[ 494.319882] ceph: mds0 caps stale <4>[ 495.817269] libceph: tid 5237 timed out on osd2, will reset osd <4>[ 495.851868] libceph: tid 13007 timed out on osd1, will reset osd <1>[ 495.952780] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090 <1>[ 496.010904] IP: [<ffffffffa08b3b5d>] con_work+0x257d/0x2f40 [libceph] <4>[ 496.043558] PGD 0 <4>[ 496.070990] Oops: 0002 [#1] SMP [4]kdb> [4]kdb> [4]kdb> bt Stack traceback for pid 3288 0xffff88020ba21f60 3288 2 1 4 R 0xffff88020ba223c0 *kworker/u:6 ffff88020c9b1c00 0000000000000018 ffff88020c9b1d1c ffff880227214580 0000000000000000 ffffffff81092a73 0000000000000002 ffff88020ba21f60 ffff88020c9b1cc0 0000000000000046 0000000000000000 ffff880227294580 Call Trace: [<ffffffff81092a73>] ? idle_balance+0x133/0x180 [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110 [<ffffffff810717e0>] ? process_one_work+0x1a0/0x5f0 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0 [<ffffffffa08b15e0>] ? ceph_msg_new+0x2e0/0x2e0 [libceph] [<ffffffff810735fd>] ? worker_thread+0x18d/0x4f0 [<ffffffff81073470>] ? manage_workers+0x320/0x320 [<ffffffff810791fe>] ? kthread+0xae/0xc0 [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8163f3c4>] ? kernel_thread_helper+0x4/0x10 [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13 [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0 [<ffffffff8163f3c0>] ? gs_change+0x13/0x13
Updated by Sage Weil over 11 years ago
- Status changed from 12 to 7
libceph: avoid NULL kref_put when osd reset races with alloc_msg
Actions