Bug #1382

kclient: crash on resending osd ops

Added by Brian Chrisman about 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

During performance testing with the SCST iSCSI driver on top of RBD (we'll switch to LIO at some point, but probably not soon), this crash occurs after a few hours of benchmarking.
I'm investigating whether we can run the same block I/O tests directly on top of RBD, but the tests may use Windows initiators (checking into that).

I'm attaching the kernel log. Unfortunately, I turned off ceph logging (ms=0) for this run to make sure logging wasn't affecting performance.

I can probably reproduce it with ms=1 if that would help.

I've included the kernel messages from the run-up to the crash (collected via rsyslogd over UDP), though most of them are SCST output.

crashmsgs (326 KB) Brian Chrisman, 08/09/2011 10:34 AM

objdump_libceph_ko (1.46 MB) Brian Chrisman, 08/09/2011 11:30 AM

History

#1 Updated by Brian Chrisman about 9 years ago

It looks like the issue stems from several OSDs going out.
I'm not certain why these OSDs fail, but note that in this setup the rbd client and the OSD servers run on the same nodes.

[brianchrisman ~] $ grep libceph crashmsgs
Aug 9 01:19:14 10.200.98.109 libceph
Aug 9 01:30:48 10.200.98.109 libceph
Aug 9 01:31:10 10.200.98.109 libceph: osd7 192.168.98.110:6810 socket closed
Aug 9 01:31:10 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed
Aug 9 01:31:11 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed
Aug 9 01:31:12 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed
Aug 9 01:31:14 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed
Aug 9 01:31:18 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed
Aug 9 01:31:26 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed
Aug 9 01:31:30 10.200.98.109 libceph: osd7 down
Aug 9 01:31:31 10.200.98.109 libceph: get_reply unknown tid 5852851 from osd11
Aug 9 01:31:31 10.200.98.109 libceph: get_reply unknown tid 5852850 from osd11
Aug 9 01:31:31 10.200.98.109 libceph: get_reply unknown tid 5852849 from osd11
Aug 9 01:31:31 10.200.98.109 libceph: get_reply unknown tid 5852848 from osd11
Aug 9 01:31:31 10.200.98.109 libceph: get_reply unknown tid 5852847 from osd11
Aug 9 01:31:32 10.200.98.109 libceph: get_reply unknown tid 5852846 from osd11
Aug 9 01:31:32 10.200.98.109 libceph: get_reply unknown tid 5852845 from osd11
Aug 9 01:31:36 10.200.98.109 libceph: get_reply unknown tid 5852841 from osd11
Aug 9 01:31:41 10.200.98.109 libceph: get_reply unknown tid 5852840 from osd11
Aug 9 01:31:46 10.200.98.109 libceph: get_reply unknown tid 5852839 from osd11
Aug 9 01:31:51 10.200.98.109 libceph: get_reply unknown tid 5852838 from osd11
Aug 9 01:31:57 10.200.98.109 libceph: get_reply unknown tid 5852837 from osd11
Aug 9 01:32:02 10.200.98.109 libceph: get_reply unknown tid 5852836 from osd11
Aug 9 01:32:07 10.200.98.109 libceph: get_reply unknown tid 5852835 from osd11
Aug 9 01:32:12 10.200.98.109 libceph: get_reply unknown tid 5852834 from osd11
Aug 9 01:32:17 10.200.98.109 libceph: get_reply unknown tid 5852833 from osd11
Aug 9 01:32:22 10.200.98.109 libceph: get_reply unknown tid 5852832 from osd11
Aug 9 01:32:27 10.200.98.109 libceph: get_reply unknown tid 5852831 from osd11
Aug 9 01:32:32 10.200.98.109 libceph: tid 5852892 timed out on osd11, will reset osd
Aug 9 01:34:27 10.200.98.109 libceph: tid 5862269 timed out on osd8, will reset osd
Aug 9 01:34:32 10.200.98.109 libceph: tid 5862490 timed out on osd6, will reset osd
Aug 9 01:34:32 10.200.98.109 libceph: tid 5862853 timed out on osd0, will reset osd
Aug 9 01:35:27 10.200.98.109 libceph: tid 5862876 timed out on osd8, will reset osd
Aug 9 01:36:27 10.200.98.109 libceph: tid 5862269 timed out on osd8, will reset osd
Aug 9 01:36:31 10.200.98.109 libceph: osd7 weight 0x0 (out)
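
To make a grep listing like the one above easier to scan, the libceph messages can be tallied per OSD and per event type. The following is only a minimal sketch: the function names (classify, summarize) and the event labels are made up for illustration, and the matching assumes the message wording shown in this log.

```python
import re
from collections import Counter

# Assumption: lines look like the syslog output in this report, e.g.
#   "Aug 9 01:31:10 10.200.98.109 libceph: osd7 192.168.98.110:6810 connection failed"
LINE_RE = re.compile(r"libceph: (?P<msg>.*)$")
OSD_RE = re.compile(r"\bosd(\d+)\b")


def classify(msg):
    """Map a libceph message to a coarse event label (labels are illustrative)."""
    if "connection failed" in msg:
        return "connection failed"
    if "socket closed" in msg:
        return "socket closed"
    if msg.startswith("get_reply unknown tid"):
        return "unknown tid"
    if "timed out" in msg:
        return "timed out"
    if msg.endswith(" down"):
        return "down"
    if "(out)" in msg:
        return "out"
    return "other"


def summarize(lines):
    """Count (osd id, event label) pairs; lines without an OSD id are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        if not m:
            continue  # not a libceph message, or a truncated one
        msg = m.group("msg")
        osd = OSD_RE.search(msg)
        if not osd:
            continue
        counts[(int(osd.group(1)), classify(msg))] += 1
    return counts


if __name__ == "__main__":
    # "crashmsgs" is the attachment name from this report.
    with open("crashmsgs") as f:
        for (osd, event), n in sorted(summarize(f).items()):
            print(f"osd{osd:<3} {event:<18} {n}")
```

Applied to the excerpt above, this would report six "connection failed" lines for osd7 and 18 "unknown tid" replies from osd11.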

#3 Updated by Sage Weil about 9 years ago

  • Assignee set to Sage Weil
  • Target version set to v0.34

#4 Updated by Sage Weil about 9 years ago

  • Target version changed from v0.34 to v0.35

#5 Updated by Sage Weil about 9 years ago

  • Priority changed from Normal to High

Need to set up a teuthology job with rbd + thrasher and a suitable long-running workload.

#6 Updated by Sage Weil about 9 years ago

  • Target version changed from v0.35 to v0.36

#7 Updated by Sage Weil about 9 years ago

Martin Mailand is also hitting this (see ceph-devel):

[  182.721180] libceph: osd2 192.168.42.114:6800 socket closed
[  182.732642] libceph: osd2 192.168.42.114:6800 connection failed
[  183.040233] libceph: osd2 192.168.42.114:6800 connection failed
[  184.040204] libceph: osd2 192.168.42.114:6800 connection failed
[  186.040244] libceph: osd2 192.168.42.114:6800 connection failed
[  190.060233] libceph: osd2 192.168.42.114:6800 connection failed
[  198.060214] libceph: osd2 192.168.42.114:6800 connection failed
[  213.964994] ------------[ cut here ]------------
[  213.974288] kernel BUG at net/ceph/messenger.c:2193!
[  213.974470] invalid opcode: 0000 [#1] SMP
[  213.974470] CPU 0
[  213.974470] Modules linked in: rbd libceph libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables x_tables nv_tco bridge stp kvm_amd kvm radeon lp psmouse shpchp parport i2c_nforce2 amd64_edac_mod ttm drm_kms_helper drm edac_core i2c_algo_bit edac_mce_amd serio_raw k10temp ses enclosure aacraid forcedeth
[  213.974470]
[  213.974470] Pid: 10, comm: kworker/0:1 Not tainted 3.1.0-rc5-custom #3 Supermicro H8DM8-2/H8DM8-2
[  213.974470] RIP: 0010:[<ffffffffa02cf3f1>]  [<ffffffffa02cf3f1>] ceph_con_send+0x111/0x120 [libceph]
[  213.974470] RSP: 0018:ffff880405cddbd0  EFLAGS: 00010283
[  213.974470] RAX: ffff880403e93c78 RBX: ffff880803f97030 RCX: ffff8808034d2e50
[  213.974470] RDX: ffff880405cddfd8 RSI: ffff880403e93c00 RDI: ffff880803f971a8
[  213.974470] RBP: ffff880405cddbf0 R08: ffff88040fc0de40 R09: 000000000000fffb
[  213.974470] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880803f971a8
[  213.974470] R13: ffff880403e93c00 R14: ffff8808034d2e60 R15: ffff8808034d2e50
[  213.974470] FS:  00007f5909978720(0000) GS:ffff88040fc00000(0000) knlGS:0000000000000000
[  213.974470] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  213.974470] CR2: ffffffffff600400 CR3: 0000000404e6f000 CR4: 00000000000006f0
[  213.974470] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  213.974470] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  213.974470] Process kworker/0:1 (pid: 10, threadinfo ffff880405cdc000, task ffff880405cb5bc0)
[  213.974470] Stack:
[  213.974470]  ffff880405cddbf0 ffff880403e0ac00 ffff8808034d2e30 ffff8808034d2da8
[  213.974470]  ffff880405cddc40 ffffffffa02d490d ffff8808034d2c80 ffff8808034d2e00
[  213.974470]  ffff880405cddc40 ffff8804041d1c91 ffff8808034d2da8 0000000000000000
[  213.974470] Call Trace:
[  213.974470]  [<ffffffffa02d490d>] send_queued+0xed/0x130 [libceph]
[  213.974470]  [<ffffffffa02d6d91>] ceph_osdc_handle_map+0x261/0x3b0 [libceph]
[  213.974470]  [<ffffffffa02d331f>] dispatch+0x10f/0x580 [libceph]
[  213.974470]  [<ffffffffa02d154f>] con_work+0x214f/0x21d0 [libceph]
[  213.974470]  [<ffffffffa02cf400>] ? ceph_con_send+0x120/0x120 [libceph]
[  213.974470]  [<ffffffff8108110d>] process_one_work+0x11d/0x430
[  213.974470]  [<ffffffff81081c69>] worker_thread+0x169/0x360
[  213.974470]  [<ffffffff81081b00>] ? manage_workers.clone.21+0x240/0x240
[  213.974470]  [<ffffffff81086496>] kthread+0x96/0xa0
[  213.974470]  [<ffffffff815e5bb4>] kernel_thread_helper+0x4/0x10
[  213.974470]  [<ffffffff81086400>] ? flush_kthread_worker+0xb0/0xb0
[  213.974470]  [<ffffffff815e5bb0>] ? gs_change+0x13/0x13
[  213.974470] Code: 65 f0 4c 8b 6d f8 c9 c3 66 90 48 8d be 88 00 00 00 48 c7 c6 70 18 2d a0 e8 dd 2c 01 e1 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 c9 c3 <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57
[  213.974470] RIP  [<ffffffffa02cf3f1>] ceph_con_send+0x111/0x120 [libceph]
[  213.974470]  RSP <ffff880405cddbd0>
[  214.640753] ---[ end trace 837698aee31a73fc ]---
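
The frames in a trace like this one are given as symbol+offset/size, which is the form needed to look up the faulting instruction in a disassembly of the module (for example the objdump_libceph_ko attachment). The following is a rough sketch of pulling those frames out of the oops text. The regular expression and the parse_frames helper are illustrative assumptions based on the line format shown here, not part of any kernel tooling.

```python
import re

# Assumption: frames look like the ones in this report, e.g.
#   "[<ffffffffa02d490d>] send_queued+0xed/0x130 [libceph]"
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\? )?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<mod>\w+)\])?"
)


def parse_frames(text, module=None):
    """Return the frames found in oops text, optionally limited to one module."""
    frames = []
    for m in FRAME_RE.finditer(text):
        if module is not None and m.group("mod") != module:
            continue
        frames.append({
            "addr": int(m.group("addr"), 16),
            "symbol": m.group("sym"),
            "offset": int(m.group("off"), 16),  # byte offset into the function
            "size": int(m.group("size"), 16),   # total size of the function
            "module": m.group("mod"),           # None for built-in kernel code
            "reliable": m.group("guess") is None,
        })
    return frames


if __name__ == "__main__":
    import sys

    # Reads oops text on stdin and prints only the libceph frames.
    for fr in parse_frames(sys.stdin.read(), module="libceph"):
        mark = " " if fr["reliable"] else "?"
        print(f"{mark} {fr['symbol']}+{fr['offset']:#x}/{fr['size']:#x}")
```

Searching the whole text rather than line by line means the sketch still matches a frame whose symbol has been wrapped onto the following line.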

#8 Updated by Sage Weil about 9 years ago

  • Subject changed from RBD messenger error to libceph: crash on resending osd ops

#9 Updated by Sage Weil about 9 years ago

  • Subject changed from libceph: crash on resending osd ops to kclient: crash on resending osd ops

#10 Updated by Sage Weil about 9 years ago

Possibly the same crash, hit by Martin Mailand on ceph-devel: http://pastebin.com/9CNJk0Pw

#11 Updated by Sage Weil about 9 years ago

  • Status changed from New to Resolved
