Bug #3342
con_work crash on bad con pointer or mutex (kclient ffsb + thrashing)
Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
libceph
Target version:
-
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
[0]kdb> rd
ax: 0000000000000000  bx: ffff880223db6830  cx: 0000000000000000
dx: ffff880223db69f0  si: ffff880223db69f0  di: 0000000000000090
bp: ffff88020d7efc40  sp: ffff88020d7efb50  r8: 2222222222222222
r9: 2222222222222222  r10: 2222222222222222  r11: 0000000000000000
r12: 0000000000000000  r13: 0000000000000000  r14: ffff880223db69b0
r15: ffff880223db6c20  ip: ffffffffa047e7ed  flags: 00010293
cs: 00000010  ss: 00000018  ds: 00000018  es: 00000018  fs: 00000018  gs: 00000018
[0]kdb> bt
Stack traceback for pid 62
0xffff88020d793ec0       62        2  1    0   R  0xffff88020d794320 *kworker/0:2
 ffff88020d7efb50 0000000000000018 0000000000000000 0000000000000246
 0000000000000000 0000000000000000 0000000000000000 ffff880223db6a18
 0000000000000000 ffff880223db69b0 ffff880223db69b0 0000000000000246
Call Trace:
 [<ffffffff8163274c>] ? mutex_lock_nested+0x25c/0x320
 [<ffffffffa047eedf>] ? con_work+0x3f/0x16b0 [libceph]
 [<ffffffffa047efd3>] ? con_work+0x133/0x16b0 [libceph]
 [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110
 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
 [<ffffffff810717e0>] ? process_one_work+0x1a0/0x5f0
 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
 [<ffffffffa047eea0>] ? try_read+0x1920/0x1920 [libceph]
 [<ffffffff810735fd>] ? worker_thread+0x18d/0x4f0
 [<ffffffff81073470>] ? manage_workers+0x320/0x320
 [<ffffffff810791fe>] ? kthread+0xae/0xc0
 [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8163f3c4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13
 [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0
 [<ffffffff8163f3c0>] ? gs_change+0x13/0x13
that was
kernel: &id001
  kdb: true
  sha1: 6285bc231277419255f3498d3eb5ddc9f8e7fe79
nuke-on-error: true
overrides:
  ceph:
    conf:
      mds:
        debug mds: 1/20
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 2528b5ee105b16352c91af064af5c0b5a7d45d7c
  s3tests:
    branch: testing
  workunit:
    sha1: 2528b5ee105b16352c91af064af5c0b5a7d45d7c
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana72.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCwa+tIQskKvsvQ1/J9QOYGeunl0M+mIAvxqGeKFawdMKoXCNNb5YLRL6Fr9y/smJVCBIGugSb8LBrjVWF14gUOXAk1j16qJ6rsvIF8176L8tsIMxhXx3dw7dCacaGEHCrmK6KO8YhFm6ky1qftGCu7amfyzeJTj8kvrY4tl1ifwH0sv2M0iEzLXx63xn2UpAMAIvo5v+eqPjo+1w2PFXe+r2dViDN5wjlwQTKOXAPFRewmDy2o6K/rW/iRRg0tHLO4atCZr5Y8XlXYQkIBQVlXrL9CRwe8rmrxHyH8wnFqYcvpzLFk1oKw93sFNauy4mCxVntvIm3WT1S8nQuEE6HJ
  ubuntu@plana73.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxoJnvRI1V0OJuQI9SosOedC7mj9O627LjoQWPKilJiBbHduPe1byBaKrgwTeEghl43VNf+EBs1+MwVH7zlDolnwN4tAlW9bRpC2SzURJfhZskp2CSQY3l8ca7a5f0J3hdOhx47oSSapN7O2cqmPzwlL/+MrFKGi+ITT613nUtzCjduZRPdhjyqZ0cQWeb0p1neDw5hbDBKd+HAH+ek/E6DK2PaqN6YAtmIgP76q0fQ85Omd0oDlmGXpKe3jlxlPT0W/5KD1+mpobPsh/EF2qar7IG/WqHHJ6NZAcXbdZ4KiMf9erP+Pk4KkD5SJ+e3GF7OEOwXtahKIIR1An4P2GD
  ubuntu@plana74.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbl3n0OItF2T7uE/9SNA9YV+pq3nFmmr3EBjf2aMo7Bkyy6e9MC3kaK64vzlZxppbpxEhbrgWo7GsVHxphZqwtMt5Ek3RDbLUpQ8dPckbhOZWBhoZe9S8uCynDIZaF3Yq/b1//OBufSdM0TRYK4Vf4rtQnpyWPEn48HaairNhQvKCxY6XUImZc7Lh4CIaMkT0yyZkBlksJeAkTs4UgxhkSoSc9IAlAxwwPK5JpRL7GOowT+hsqYCVJM5+NVqIOGS66Vam/m9xHqecQtfYIEFoVFZiF/xi+b2kstxJzD39UTNP/XB5J2CtJOujW9J1jsdPKSz1L9XoDIkAFHzQjdhyx
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh
ubuntu@teuthology:/a/teuthology-2012-10-17_02:00:05-regression-testing-master-basic/2759
History
#1 Updated by Sage Weil almost 11 years ago
- Status changed from New to 12
again on master branch, ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-10-19_02:00:08-regression-testing-master-basic/3975.
maybe this is only affecting master and not testing?
#2 Updated by Sage Weil almost 11 years ago
- Priority changed from Normal to Urgent
#3 Updated by Sage Weil almost 11 years ago
was able to reproduce on kernel master after 7 iterations of hammer.sh.
#4 Updated by Sage Weil almost 11 years ago
also reproduced on testing after 10 iterations.
#5 Updated by Sage Weil almost 11 years ago
crossing my fingers that
rbd: reset BACKOFF if unable to re-queue
will fix it
#6 Updated by Sage Weil almost 11 years ago
wip-msgr didn't help
this appears to be triggered by osd reset, which isn't exercised by the socket errors. going to instrument the osd to periodically stall clients.
<3>[ 456.934680] libceph: osd3 10.214.131.24:6800 socket closed
<3>[ 459.426482] libceph: osd0 10.214.131.26:6800 socket closed
<6>[ 470.795235] libceph: osd3 weight 0x0 (out)
<3>[ 472.204997] libceph: osd3 10.214.131.24:6800 socket closed
<6>[ 486.303070] libceph: osd3 weight 0x10000 (in)
<6>[ 494.319882] ceph: mds0 caps stale
<4>[ 495.817269] libceph: tid 5237 timed out on osd2, will reset osd
<4>[ 495.851868] libceph: tid 13007 timed out on osd1, will reset osd
<1>[ 495.952780] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
<1>[ 496.010904] IP: [<ffffffffa08b3b5d>] con_work+0x257d/0x2f40 [libceph]
<4>[ 496.043558] PGD 0
<4>[ 496.070990] Oops: 0002 [#1] SMP
[4]kdb>
[4]kdb>
[4]kdb> bt
Stack traceback for pid 3288
0xffff88020ba21f60     3288        2  1    4   R  0xffff88020ba223c0 *kworker/u:6
 ffff88020c9b1c00 0000000000000018 ffff88020c9b1d1c ffff880227214580
 0000000000000000 ffffffff81092a73 0000000000000002 ffff88020ba21f60
 ffff88020c9b1cc0 0000000000000046 0000000000000000 ffff880227294580
Call Trace:
 [<ffffffff81092a73>] ? idle_balance+0x133/0x180
 [<ffffffff810848a8>] ? finish_task_switch+0x48/0x110
 [<ffffffff810717e0>] ? process_one_work+0x1a0/0x5f0
 [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
 [<ffffffffa08b15e0>] ? ceph_msg_new+0x2e0/0x2e0 [libceph]
 [<ffffffff810735fd>] ? worker_thread+0x18d/0x4f0
 [<ffffffff81073470>] ? manage_workers+0x320/0x320
 [<ffffffff810791fe>] ? kthread+0xae/0xc0
 [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8163f3c4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13
 [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0
 [<ffffffff8163f3c0>] ? gs_change+0x13/0x13
#7 Updated by Sage Weil almost 11 years ago
ooh, captured libceph logging leading up to this!
#8 Updated by Sage Weil almost 11 years ago
- Status changed from 12 to 7
libceph: avoid NULL kref_put when osd reset races with alloc_msg
#9 Updated by Sage Weil almost 11 years ago
- Status changed from 7 to Resolved