
Bug #2260

libceph: null pointer dereference at try_write+0x638/0xfb0

Added by Alex Elder almost 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
libceph
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

It's not an exact match but it's close enough that I wanted to reopen
bug 1793 or 1866, but found myself unable to. So here's a new one...

I hit this while running xfstests, test 49 I believe. That may be a test
I crashed on once a few weeks ago, and now that I've got things running
pretty reliably I'll see if it is a reliable reproducer for this problem.

Here's some info from /var/log/syslog with a little context.

[86710.152267] XFS (loop0): Ending clean mount
[86712.305386] libceph: osd1 10.214.132.32:6800 socket closed
[86712.695642] libceph: osd0 10.214.132.31:6800 socket closed
[86712.725309] libceph: osd1 10.214.132.32:6800 socket closed
[86715.310218] libceph: osd1 10.214.132.32:6800 socket closed
[86715.337194] libceph: osd0 10.214.132.31:6800 socket closed
[86717.356958] libceph: osd1 10.214.132.32:6800 socket closed
[86717.381901] libceph: osd0 10.214.132.31:6800 socket closed
[86719.222524] libceph: osd1 10.214.132.32:6800 socket closed
[86719.246804] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[86719.289091] IP: [<ffffffffa01c8468>] try_write+0x638/0xfb0 [libceph]
[86719.314297] PGD 0
[86719.334655] Oops: 0000 [#1] SMP
[86719.356339] CPU 0
[86719.358597] Modules linked in: xfs exportfs aesni_intel cryptd aes_x86_64 aes_generic rbd libceph ipmi_devintf ipmi_si ipmi_msghandler lp dcdbas parport i7core_edac edac_core joydev serio_raw hed usbhid hid ixgbe mptsas mptscsih mptbase dca mdio scsi_transport_sas bnx2 btrfs zlib_deflate crc32c libcrc32c
[86719.486831]
[86719.509731] Pid: 15025, comm: kworker/0:1 Not tainted 3.3.0-ceph-00067-gafede88 #1 Dell Inc. PowerEdge R410/01V648
[86719.566377] RIP: 0010:[<ffffffffa01c8468>] [<ffffffffa01c8468>] try_write+0x638/0xfb0 [libceph]
[86719.624149] RSP: 0018:ffff88021c9b1b20 EFLAGS: 00010246
[86719.654956] RAX: 000000000000c000 RBX: ffff880222155030 RCX: 0000000000000000
[86719.688686] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88020d893f00
[86719.722829] RBP: ffff88021c9b1c20 R08: 0000000000000000 R09: 0000000000000000
[86719.757110] R10: 0000000000000000 R11: 0000000000000003 R12: ffffea0008124580
[86719.790899] R13: 0000000000001000 R14: ffff88021d6b5000 R15: 000000000001a000
[86719.825199] FS: 0000000000000000(0000) GS:ffff880227200000(0000) knlGS:0000000000000000
[86719.889006] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[86719.924552] CR2: 0000000000000048 CR3: 0000000001c05000 CR4: 00000000000006f0
[86719.962732] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[86720.001125] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[86720.039250] Process kworker/0:1 (pid: 15025, threadinfo ffff88021c9b0000, task ffff88020d893f00)
[86720.112205] Stack:
[86720.145509] ffff88021c9b1bd0 0000000000000000 ffff88021c9b1b40 ffff8802221551a8
[86720.216362] ffff88021c9b1c00 ffff88021c9b1bb0 ffffffffe698179f 0000000000000000
[86720.289080] ffff880222155230 ffff88020ca4c000 ffff880222155240 ffff880222155410
[86720.365199] Call Trace:
[86720.400789] [<ffffffffa01c7328>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[86720.441250] [<ffffffffa01ca0a0>] con_work+0xbe0/0x1b40 [libceph]
[86720.480431] [<ffffffff81615dc0>] ? _raw_spin_unlock_irq+0x30/0x40
[86720.519519] [<ffffffff8106b256>] process_one_work+0x1a6/0x520
[86720.557594] [<ffffffff8106b1e7>] ? process_one_work+0x137/0x520
[86720.595152] [<ffffffffa01c94c0>] ? ceph_con_revoke_message+0x130/0x130 [libceph]
[86720.663868] [<ffffffff8106d593>] worker_thread+0x173/0x400
[86720.700509] [<ffffffff8106d420>] ? manage_workers+0x210/0x210
[86720.737008] [<ffffffff8107280e>] kthread+0xbe/0xd0
[86720.772363] [<ffffffff8161f5f4>] kernel_thread_helper+0x4/0x10
[86720.808985] [<ffffffff81616134>] ? retint_restore_args+0x13/0x13
[86720.845766] [<ffffffff81072750>] ? __init_kthread_worker+0x70/0x70
[86720.882830] [<ffffffff8161f5f0>] ? gs_change+0x13/0x13
[86720.918532] Code: 84 36 fb ff ff e9 fa fa ff ff 49 83 be 90 00 00 00 00 90 0f 84 65 04 00 00 49 63 96 a0 00 00 00 49 8b 8e 98 00 00 00 48 c1 e2 04 <48> 03 51 48 4c 8b 22 8b 4a 0c 44 8b 6a 08 e9 d8 fb ff ff 49 8d
[86721.031469] RIP [<ffffffffa01c8468>] try_write+0x638/0xfb0 [libceph]
[86721.069757] RSP <ffff88021c9b1b20>
[86721.105146] CR2: 0000000000000048
[86721.186060] ---[ end trace f990acc1e958a213 ]---

That was followed by some other faults as well--recursive faults by
the same process.


Related issues

Related to Linux kernel client - Bug #2287: rbd: crashes with 10Gbit network and fio Resolved 04/13/2012

History

#1 Updated by Alex Elder almost 12 years ago

Here's a disassembled block of the code where the fault occurred.
The address listed corresponds to offset 3468 below:

3468:       48 03 51 48             add    0x48(%rcx),%rdx

And that correlates with the fault address which is 0x0000000000000048,
so register rcx must have held a null pointer. I'll try to figure
out where that came from, shortly.

ceph_tcp_sendmsg():
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:312
3428: 45 31 e4 xor %r12d,%r12d
write_partial_kvec():
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:779
342b: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # 3432 <try_write+0x602>
3432: 0f 85 64 06 00 00 jne 3a9c <try_write+0xc6c>
try_write():
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:1874
3438: 41 83 fc 01 cmp $0x1,%r12d
343c: 0f 84 36 fb ff ff je 2f78 <try_write+0x148>
3442: e9 fa fa ff ff jmpq 2f41 <try_write+0x111>
write_partial_msg_pages():
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:865
3447: 49 83 be 90 00 00 00 cmpq $0x0,0x90(%r14)
344e: 00
344f: 90 nop
3450: 0f 84 65 04 00 00 je 38bb <try_write+0xa8b>
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:868
3456: 49 63 96 a0 00 00 00 movslq 0xa0(%r14),%rdx
345d: 49 8b 8e 98 00 00 00 mov 0x98(%r14),%rcx
3464: 48 c1 e2 04 shl $0x4,%rdx
3468: 48 03 51 48 add 0x48(%rcx),%rdx
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:869
346c: 4c 8b 22 mov (%rdx),%r12
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:870
346f: 8b 4a 0c mov 0xc(%rdx),%ecx
/srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:871
3472: 44 8b 6a 08 mov 0x8(%rdx),%r13d
3476: e9 d8 fb ff ff jmpq 3053 <try_write+0x223>
atomic_sub_and_test():
/srv/autobuild-ceph/gitbuilder.git/build/arch/x86/include/asm/atomic.h:82
347b: 49 8d be 88 00 00 00 lea 0x88(%r14),%rdi
/srv/autobuild-ceph/gitbuilder.git/build/arch/x86/include/asm/atomic.h:81
3482: f0 41 83 ae 88 00 00 lock subl $0x1,0x88(%r14)
3489: 00 01
348b: 0f 94 c0 sete %al
kref_sub():
/srv/autobuild-ceph/gitbuilder.git/build/include/linux/kref.h:68

#2 Updated by Alex Elder almost 12 years ago

I'm going to have to look at this again in the morning, but I think
we're in this block of code:

#ifdef CONFIG_BLOCK
	} else if (msg->bio) {
		struct bio_vec *bv;

		bv = bio_iovec_idx(msg->bio_iter, msg->bio_seg);
		page = bv->bv_page;
		bio_offset = bv->bv_offset;
		max_write = bv->bv_len;
#endif

And I think we're faulting on this line:

	page = bv->bv_page;

Note:

	#define bio_iovec_idx(bio, idx) (&((bio)->bi_io_vec[(idx)]))

I believe it's because msg->bio_iter must be a null pointer (and
msg->bio_seg is 0). Unless I'm counting wrong, msg->bio_iter->bi_io_vec
is found at offset 0x48 from the base of struct bio.

The null pointer and 0 segment number are the state right after
initialization of those fields by init_bio_iter() in
write_partial_msg_pages() or iter_bio_next().

#3 Updated by Alex Elder almost 12 years ago

  • Status changed from New to 12
  • Assignee set to Alex Elder

Looks like xfstests #49 is a reproducer for this problem, at least
after running the tests that lead up to it first (1, 2, ..., 47, 48).

I'll try once more to see if test 49 alone is enough to trigger the
failure.

#4 Updated by Alex Elder almost 12 years ago

FYI, xfstests 49 tests running XFS on a loop device. I have to wait for a
reboot in order to see if I can tell at what point things had trouble, but
the test does this:

- Creates an ext2 filesystem on the underlying (rbd) scratch device,
and mounts it.
- Creates a 40MB file on that ext2 filesystem.
- Creates an XFS filesystem using that file as storage.
- Does a loop mount of that XFS filesystem.
- Runs fsstress on that XFS filesystem.

Then:
- Creates a 10MB file on that loop-mounted XFS filesystem
- Creates an ext2 filesystem on that 10MB file and loop mounts it
- Runs fsstress on the loop-mounted ext2 filesystem

And finally unmounts each of the loop mounts in sequence.

#5 Updated by Alex Elder almost 12 years ago

Looks like the problem arose while running fsstress on the xfs loop
mount on top of a file on the ext2 filesystem.

I have run test 49 alone (using XFS on top of rbd--that's the scenario).
It hit the problem fairly quickly--within a minute or so.

#6 Updated by Nick Bartos almost 12 years ago

I believe we are seeing the same problem here. I have been able to reproduce it each time I have tried. The hardware is 3 x Dell R710 servers with 2 x Intel E5645, 96GB of RAM, and 3 x 200GB Dell enterprise SSDs (one osd per ssd) using btrfs, with 10gbit networking between all nodes. To recreate the crash, I create a 300GB rbd and put xfs on it, then run 'bonnie++ -f'. It is interesting to note that using a smaller 100GB rbd and telling bonnie++ to use a smaller test filesize (e.g. bonnie++ -f -r 0 -n 10G) does not result in a crash. I confirmed this with ceph 0.45 + linux 3.2.14, ceph 0.45 + linux 3.3.1, and an older ceph 0.41 + linux 3.2.5 kernel.

kernel: [ 2801.395575] libceph: osd50 down
kernel: [ 2801.662729] libceph: osd50 down
kernel: [ 2803.423801] libceph: osd50 down
kernel: [ 2801.567474] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
kernel: [ 2801.567918] IP: [<ffffffff81616426>] write_partial_msg_pages+0x191/0x42a
kernel: [ 2801.568152] PGD 0
kernel: [ 2801.568370] Oops: 0000 [#1] SMP
kernel: [ 2801.568596] CPU 0
kernel: [ 2801.568608] Modules linked in: ebtable_nat ebtables nbd kvm_intel kvm bridge 8021q garp stp llc ipt_LOG xt_state xt_NOTRACK iptable_raw iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables sd_mod crc_t10dif sg sr_mod cdrom ata_generic dcdbas ixgbe serio_raw iTCO_wdt ata_piix iTCO_vendor_support ioatdma i7core_edac edac_core dca mdio mptsas mptscsih mptbase hed scsi_transport_sas netconsole configfs
kernel: [ 2801.570403]
kernel: [ 2801.570616] Pid: 4870, comm: kworker/0:2 Tainted: G W 3.2.14 #1 Dell Inc. PowerEdge R710/00NH4P
kernel: [ 2801.571052] RIP: 0010:[<ffffffff81616426>] [<ffffffff81616426>] write_partial_msg_pages+0x191/0x42a
kernel: [ 2801.571479] RSP: 0018:ffff88177f0cfcb0 EFLAGS: 00010246
kernel: [ 2801.571694] RAX: 0000000000000000 RBX: ffff8817e105c830 RCX: 0000000000000080
kernel: [ 2801.571912] RDX: 0000000000002000 RSI: ffff88144affe860 RDI: ffff88144affe858
kernel: [ 2801.572130] RBP: ffff88177f0cfd40 R08: 0000000000000000 R09: 00000000000005a8
kernel: [ 2801.572351] R10: ffff88177f0cfb98 R11: 000000003896b172 R12: ffff88144affe7c0
kernel: [ 2801.572571] R13: ffffea004a156090 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2801.572800] FS: 0000000000000000(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000
kernel: [ 2801.573229] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: [ 2801.573449] CR2: 0000000000000048 CR3: 0000000001a05000 CR4: 00000000000006f0
kernel: [ 2801.573673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 2801.573894] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: [ 2801.574118] Process kworker/0:2 (pid: 4870, threadinfo ffff88177f0ce000, task ffff8817de05cba0)
kernel: [ 2801.574548] Stack:
kernel: [ 2801.574763] ffff88177f0cfd10 ffffffff81616139 0000000000000000 0000000000000000
kernel: [ 2801.575210] ffff88144affe858 ffff88144affe860 0008000000000000 0000000000080000
kernel: [ 2801.575655] 000800000000c040 000000000007e000 0000000000000000 000000003896b172
kernel: [ 2801.576100] Call Trace:
kernel: [ 2801.576313] [<ffffffff81616139>] ? ceph_tcp_sendmsg+0x5d/0x7e
kernel: [ 2801.576534] [<ffffffff81617ec6>] try_write+0x285/0x469
kernel: [ 2801.576757] [<ffffffff8104170e>] ? finish_task_switch+0x9e/0xcd
kernel: [ 2801.576974] [<ffffffff81618f03>] con_work+0x1cd/0x226
kernel: [ 2801.577190] [<ffffffff81618d36>] ? try_read+0x700/0x700
kernel: [ 2801.577412] [<ffffffff8106c3b2>] process_one_work+0x19a/0x2aa
kernel: [ 2801.577636] [<ffffffff8106db21>] worker_thread+0x113/0x1a9
kernel: [ 2801.577858] [<ffffffff8106da0e>] ? manage_workers.clone.15+0xcc/0xcc
kernel: [ 2801.578085] [<ffffffff8106da0e>] ? manage_workers.clone.15+0xcc/0xcc
kernel: [ 2801.578312] [<ffffffff8106da0e>] ? manage_workers.clone.15+0xcc/0xcc
kernel: [ 2801.578539] [<ffffffff8107286d>] kthread+0x91/0x99
kernel: [ 2801.578766] [<ffffffff816449b4>] kernel_thread_helper+0x4/0x10
kernel: [ 2801.578990] [<ffffffff810727dc>] ? kthread_worker_fn+0x167/0x167
kernel: [ 2801.579212] [<ffffffff816449b0>] ? gs_change+0xb/0xb
kernel: [ 2801.579433] Code: 89 ef e8 5e e5 ff ff e9 95 00 00 00 49 83 bc 24 90 00 00 00 00 74 48 4d 63 84 24 a0 00 00 00 49 8b 84 24 98 00 00 00 49 c1 e0 04 <4c> 03 40 48 31 c0 45 85 f6 4d 8b 28 41 8b 48 0c 74 1c 4c 89 ef
kernel: [ 2801.580552] RIP [<ffffffff81616426>] write_partial_msg_pages+0x191/0x42a
kernel: [ 2801.580776] RSP <ffff88177f0cfcb0>
kernel: [ 2801.580991] CR2: 0000000000000048
kernel: [ 2801.582588] ---[ end trace d292e72dd6352966 ]---
kernel: [ 2801.583276] BUG: unable to handle kernel paging request at fffffffffffffff8
kernel: [ 1835.871092] libceph: osd0 down
kernel: [ 1837.468491] libceph: osd0 down
kernel: [ 1835.989632] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
kernel: [ 1835.990078] IP: [<ffffffff8161ea76>] write_partial_msg_pages+0x191/0x42a
kernel: [ 1835.990313] PGD 0
kernel: [ 1835.990529] Oops: 0000 [#1] SMP
kernel: [ 1835.990751] CPU 0
kernel: [ 1835.990760] Modules linked in: iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ebtable_nat ebtables nbd kvm_intel kvm bridge 8021q garp stp llc sd_mod crc_t10dif sg sr_mod cdrom ata_generic dcdbas pcspkr serio_raw iTCO_wdt iTCO_vendor_support ata_piix i7core_edac ixgbe ioatdma edac_core dca mdio hed mptsas mptscsih mptbase scsi_transport_sas netconsole configfs
kernel: [ 1835.992228]
kernel: [ 1835.992446] Pid: 5213, comm: kworker/0:3 Tainted: G W 3.2.5 #1 Dell Inc. PowerEdge R710/00NH4P
kernel: [ 1835.992893] RIP: 0010:[<ffffffff8161ea76>] [<ffffffff8161ea76>] write_partial_msg_pages+0x191/0x42a
kernel: [ 1835.993335] RSP: 0018:ffff881395885cb0 EFLAGS: 00010246
kernel: [ 1835.993557] RAX: 0000000000000000 RBX: ffff8817d7d57830 RCX: 0000000000000080
kernel: [ 1835.993780] RDX: 000000000007a000 RSI: ffff880bf5587e60 RDI: ffff880bf5587e58
kernel: [ 1835.994006] RBP: ffff881395885d40 R08: 0000000000000000 R09: 00000000a2300000
kernel: [ 1835.994231] R10: ffffffffffffff8d R11: 0000000039d3dd66 R12: ffff880bf5587dc0
kernel: [ 1835.994455] R13: ffffea002d147928 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 1835.994680] FS: 0000000000000000(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000
kernel: [ 1835.995111] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: [ 1835.995330] CR2: 0000000000000048 CR3: 0000000bd4e51000 CR4: 00000000000006f0
kernel: [ 1835.995551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 1835.995776] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: [ 1835.996002] Process kworker/0:3 (pid: 5213, threadinfo ffff881395884000, task ffff8817f6b1ea60)
kernel: [ 1835.996427] Stack:
kernel: [ 1835.996636] ffff881395885d10 ffffffff8161e789 0000000000000000 0000000000000000
kernel: [ 1835.997075] ffff880bf5587e58 ffff880bf5587e60 0008000000000000 0000000000080000
kernel: [ 1835.997511] 000800000000c040 0000000000006000 ffff881395885d40 0000000039d3dd66
kernel: [ 1835.997949] Call Trace:
kernel: [ 1835.998164] [<ffffffff8161e789>] ? ceph_tcp_sendmsg+0x5d/0x7e
kernel: [ 1835.998389] [<ffffffff81620516>] try_write+0x285/0x469
kernel: [ 1835.998612] [<ffffffff81621553>] con_work+0x1cd/0x226
kernel: [ 1835.998834] [<ffffffff81621386>] ? try_read+0x700/0x700
kernel: [ 1835.999056] [<ffffffff8106c0ca>] process_one_work+0x19a/0x2aa
kernel: [ 1835.999281] [<ffffffff8106d839>] worker_thread+0x113/0x1a9
kernel: [ 1835.999503] [<ffffffff8106d726>] ? manage_workers.clone.15+0xcc/0xcc
kernel: [ 1835.999725] [<ffffffff8106d726>] ? manage_workers.clone.15+0xcc/0xcc
kernel: [ 1835.999947] [<ffffffff8106d726>] ? manage_workers.clone.15+0xcc/0xcc
kernel: [ 1836.000170] [<ffffffff81072585>] kthread+0x91/0x99
kernel: [ 1836.000392] [<ffffffff8164cd74>] kernel_thread_helper+0x4/0x10
kernel: [ 1836.000615] [<ffffffff810724f4>] ? kthread_worker_fn+0x167/0x167
kernel: [ 1836.000837] [<ffffffff8164cd70>] ? gs_change+0xb/0xb
kernel: [ 1836.001054] Code: 89 ef e8 5e e5 ff ff e9 95 00 00 00 49 83 bc 24 90 00 00 00 00 74 48 4d 63 84 24 a0 00 00 00 49 8b 84 24 98 00 00 00 49 c1 e0 04 <4c> 03 40 48 31 c0 45 85 f6 4d 8b 28 41 8b 48 0c 74 1c 4c 89 ef
kernel: [ 1836.001952] RIP [<ffffffff8161ea76>] write_partial_msg_pages+0x191/0x42a
kernel: [ 1836.002180] RSP <ffff881395885cb0>
kernel: [ 1836.002395] CR2: 0000000000000048
kernel: [ 1836.003075] ---[ end trace bb3d925a0bd0c57a ]---
kernel: [ 1836.003388] BUG: unable to handle kernel paging request at fffffffffffffff8
kernel: [ 1836.003744] IP: [<ffffffff81072867>] kthread_data+0x30/0x3d
kernel: [ 1836.004050] PGD 1a07067 PUD 1a08067 PMD 0
kernel: [ 1836.004443] Oops: 0000 [#2] SMP

#7 Updated by Alex Elder almost 12 years ago

Another report, likely related:
http://tracker.newdream.net/issues/2287

I don't understand it well enough yet though to mark either
as a duplicate of the other.

#8 Updated by Alex Elder almost 12 years ago

Yay!!! I reproduced this with KDB enabled. Now to start poking around on the
machine to see what I can learn.

#9 Updated by Alex Elder almost 12 years ago

I meant to update this yesterday morning... KDB on the system definitely
works, which is really great. But since its integration with kgdb it has
lost its disassembler, and it seems to be missing some features I am
accustomed to (which may have been provided by add-ons I didn't realize
I was using). So it's a little primitive, but it does provide access
to the contents of memory.

I got a little side-tracked trying to verify what I was looking at.
The try_write address that was being reported did not appear to be
in try_write at all, but in something well beyond that; I believe
try_write was the next symbol kdb knew about that was at a lower
address than the program counter, so it identified it relative to
that.

Also, for some reason the preamble of the function I wanted was
not exactly matching the assembly I had from the kernel module (on
a separate machine). The difference was the use of a different
register on one instruction, but I took some time to try to figure
out why the difference was there. I finally abandoned that.

So then I started digging through the stack, along with some assembly,
and found the address of the ceph_connection structure (or rather,
the work_struct embedded in the ceph_connection) that was passed to
con_work.

The next step was to decode that to see if there was anything that
could be learned.

#10 Updated by Alex Elder almost 12 years ago

After some false starts I decoded the beginning of the connection
structure, and I found a problem, but I have not yet followed that
through to diagnose why that particular issue caused the crash, nor
what the possible paths might be that might cause it.

The problem is that 7 bytes got dropped somewhere in either
sending or receiving. The problem in this instance showed up as
bytes in the peer_addr_for_me field (and possibly beyond) shifted
by seven bytes. I have a hunch about what's causing it--it's been
the source of difficulty as I try to decode this structure. More
detail on that in another entry to this bug.

I don't know if it happened during sending or receiving--or
whether it's just a mismatch between what the two sides are
doing in communicating this structure (i.e., they differ but
it's not clear which one is "right.")

Anyway, here's what I find. As an example of what things should
look like, here is the content of the peer_addr field in this
ceph_connection:
struct ceph_entity_addr peer_addr = {
	type = 00000000
	nonce = 0x00004f67 = 20327
	struct sockaddr_in in_addr = {
		sin_family = 0x0002 (network order) = AF_INET
		sin_port = 0x901a (network order) = 6800
		sin_addr = {
			s_addr = 0a d6 83 0d = 10.214.131.13 = plana27.front.sepia.ceph.com
		} (network order)
		__pad8 = { 00, 00, 00, 00, 00, 00, 00, 00 }
		(pad112) = { 00, ..., 00 }
	}
}

The next field, peer_name, is fine (plana27.front.sepia.ceph.com was
serving "osd.1"):
struct ceph_entity_name peer_name = {
	type = 0x04 = CEPH_ENTITY_TYPE_OSD
	num = 0x0000000000000001
}

After that is the problem field, peer_addr_for_me:
struct ceph_entity_addr peer_addr_for_me = {
	type = e5000200 = CRAP (as is the rest)
	nonce = 83d60a81
	struct sockaddr_in in_addr = {
		sin_family = 0x000c (network order) = AF_DECnet
		sin_port = 0x0000 (network order) = 0 ???
		sin_addr = {
			s_addr = 0x00000000 = 00 00 00 00 = 0.0.0.0
		} (network order)
		__pad8 = { 0c, 00, 00, 00, 00, 00, 00, 00 }
		(pad112) = {
			00, ..., 00, ff 7f 00 00 00 00 00 00
		}
	}
}

The thing to notice, however, is that the bytes in the nonce and type
field are fairly similar looking to the bytes in the sin_addr field of
the peer_address. Which suggests the whole thing might be correct data
but placed in the wrong memory locations.

Here's what it looks like if you shift everything 7 bytes:
struct ceph_entity_addr peer_addr_for_me = {
	type = 00000000
	nonce = 00000000
	struct sockaddr_in in_addr = {
		sin_family = 0x0002 (network order) = AF_INET
		sin_port = 0x81e5 (network order) = 58753 (?)
		sin_addr = {
			s_addr = 0x0c83d60a -> 0a d6 83 0c = 10.214.131.12 = plana28.front.sepia.ceph.com
		} (network order)
		__pad8 = { 00, 00, 00, 00, 00, 00, 00, 00 }
		(pad112) = { 00, ..., 00 }
	}
}

So it looks like either the remote end of the connection neglected to
send 7 bytes, or (more likely) the local end skipped over 7 bytes as
it was interpreting what it received.

#11 Updated by Alex Elder almost 12 years ago

Here is my hunch about what's going on, and what I mentioned was a
cause for trouble interpreting the memory content.

struct ceph_entity_name is defined like this:

	struct ceph_entity_name {
		u8 type;	/* CEPH_ENTITY_TYPE_* */
		__le64 num;
	} __attribute__ ((packed));

The "packed" attribute means the structure isn't padded out to
(in this case) an 8-byte aligned size. My first reaction to
seeing this was "WHY?" Packing is fine, but why would anyone
define a 9-byte type? Especially when it's not clear that we'd
ever need 2^64 bits worth of entity numbers. Ah, but enough of
the commentary.

When types are laid out in a structure, the structure is normally
padded so its overall size is a multiple of the alignment of the
largest component of that structure. (Objects of that type will
normally be placed at addresses that are aligned to the type's
alignment as well.) That's why the above type would normally be
padded out to 16 bytes if it weren't for the "packed" attribute.

In this case though, this packed 9-byte structure is itself
embedded in the definition of struct ceph_connection. So at
first glance--which was colored by what I was seeing in the memory
dump--I wasn't sure whether the field following the peer_name
(peer_addr_for_me) would be aligned normally (on an 8-byte
boundary) or would be lined up based on the packing of its
predecessor in the type. I was pretty sure it should have been
8-byte aligned, but that's not what it looked like in the memory
dump, so it took a bit to really verify I understood what I
was looking at.

It turns out that, yes, the ceph_connection structure's fields
are aligned normally, since that structure is not defined with
the "packed" attribute.

And this packed/nopacked problem is I think the source of this
error. I'll explore that in another update.

#12 Updated by Alex Elder almost 12 years ago

The skipped 7 bytes could be explained this way.

The sender sends out over the wire the sub-components of the
structure one-by-one, sending exactly as many bytes as the
component's defined size. I.e., a ceph_entity_name will be
sent using exactly 9 bytes, even though in memory it consumes 16.

On the receiving side, if the code receiving the header data
for the ceph connection assumed there would be pad bytes in
the message (which might happen if the over-the-wire data
were loaded directly into the fields in the structure, or
perhaps if sizeof(field) were used in determining where to
place received data into the in-memory structure), we might
end up with this result--7 bytes would be dropped, or possibly
placed in the wrong spot--overwriting the hidden pad between
the peer_name and the peer_addr_for_me fields in the ceph
connection structure.

In the instance I'm looking at, the bytes in question are all
0x00, and those could be valid (shifted) data, so it's hard
to tell which of those two options it could be.

But for now I'm going to run with this assumption about what
happened, and will look at the code that operates on these
fields to see if I can find code that confirms this as root cause.

#13 Updated by Alex Elder almost 12 years ago

Clarifying something I described earlier, with respect to alignment
and sizing of fields, including those that have the "packed" attribute.

- The alignment of an integral type is the size of that type:
char -> 1, short -> 2, int -> 4, and long -> 8 (for x86_64).
The alignment of an array is the same as that of each of its
members.
- Alignment for a struct or union will match the largest alignment
among its member fields (recursively).
- The size of a struct object will be an even multiple of the
object's alignment.
- The size of a union object is the size of its largest member field.
- The base address of an object, or the offset of a field with a
struct or union type, will be an even multiple of the alignment
for the object's type.
- The effect of the "packed" attribute is to make the alignment
for the affected struct or union 1, regardless of its makeup.

This means that field layout behaves as follows:
- The offset of the first field in a structure is always 0,
packed or not.
- The offset of a packed field is the offset of its predecessor
plus the size of its predecessor. So if its predecessor is
also packed, it may start on an "unusual" alignment. But if
its predecessor is not packed, the start offset of the packed
field will be aligned on a "usual" boundary (small power of 2,
which is the alignment of the predecessor field).
- The offset of a non-packed field will be aligned normally,
meaning there is implicit padding following its predecessor
if it was defined to be packed.
- A packed structure will honor the non-packed nature of its
fields--any pad bytes at the end of a non-packed field will
remain, but no implicit padding will be present between the
end of a packed field and the start of another one.

Still working my way through how various things are laid out in
memory on this one crash. It's a bit confusing with these misaligned
things, making sure they're sitting where they belong in memory.

#14 Updated by Alex Elder almost 12 years ago

Revisiting my decoding from before, in light of my new clarity on
padding and field layout, things in the message structure are
looking OK after all. I'm still going to finish going through
this exercise, then will restart and reproduce again to both
correlate my findings with this one, and run with more debug
messaging turned on (but I'm unsure how well that will work with
my serial console enabled).

#15 Updated by Alex Elder almost 12 years ago

I've completed decoding all of the fields of the ceph_connection
structure, along with some other stuff dereferenced by it. I did
find one thing that seems anomalous but at this point I've switched
gears and am trying to see exactly where we were in the code when
the problem hit. I've been holding off on that because the symbolic
information available is not giving very precise information.

Anyway, I believe I've narrowed down the exact bit of code, and
as long as I'm updating this I'll record it here. I have not
yet decoded what exactly is going on but my hunch is the auth client
passed in is a bogus (probably null) pointer.

Here is the code, at the top of ceph_auth_none_create_authorizer():

	if (!ai->built_authorizer) {
		p = au->buf;
		end = p + sizeof(au->buf);
		ceph_encode_8(&p, 1);	<-------- I believe this is the spot
		ret = ceph_entity_name_encode(ac->name, &p, end - 8);
		if (ret < 0)
			goto bad;

That spot happens to be the first instruction that writes to the output
buffer.

Again, a little more digging ought to tell me how we got there in the
first place.

#16 Updated by Alex Elder almost 12 years ago

Last night I continued to decode assembly, working through making
sense of the particular instructions and the state of the registers
and stack as I went.

But I finally concluded this approach was going nowhere. Or perhaps
it was addling my brain at a pace that was becoming more noticeable.

So I rebooted the crashed system and will be trying again, this time
with additional debug enabled. I've been a little concerned this will
lead to problems with the serial kernel attached but I'm going to give
it a try anyway.

I may also try with a version of messenger.c with all the "static" scope
qualifiers removed from functions, and with optimization turned off (I
haven't determined if that's necessary), to make better information
available in KDB about the path through the code at the time of a crash.

#17 Updated by Alex Elder almost 12 years ago

@Wish I knew about the "pahole" utility (part of the "dwarves" package)
last week.

% pahole libceph.ko -C ceph_connection
struct ceph_connection {
        void *                     private;              /*     0     8 */
        atomic_t                   nref;                 /*     8     4 */

        /* XXX 4 bytes hole, try to pack */

        const struct ceph_connection_operations  * ops;  /*    16     8 */
        struct ceph_messenger *    msgr;                 /*    24     8 */
        struct socket *            sock;                 /*    32     8 */
        long unsigned int          state;                /*    40     8 */
        const char *               error_msg;            /*    48     8 */
        struct ceph_entity_addr    peer_addr;            /*    56   136 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        struct ceph_entity_name    peer_name;            /*   192     9 */
        struct ceph_entity_addr    peer_addr_for_me;     /*   201   136 */

        /* XXX 3 bytes hole, try to pack */

        /* --- cacheline 5 boundary (320 bytes) was 20 bytes ago --- */
        unsigned int               peer_features;        /*   340     4 */
        u32                        connect_seq;          /*   344     4 */
        u32                        peer_global_seq;      /*   348     4 */
        int                        auth_retry;           /*   352     4 */

        /* XXX 4 bytes hole, try to pack */

        void *                     auth_reply_buf;       /*   360     8 */
        int                        auth_reply_buf_len;   /*   368     4 */

        /* XXX 4 bytes hole, try to pack */

        struct mutex               mutex;                /*   376   136 */
        /* --- cacheline 8 boundary (512 bytes) --- */
        struct list_head           out_queue;            /*   512    16 */
        struct list_head           out_sent;             /*   528    16 */
        u64                        out_seq;              /*   544     8 */
        u64                        in_seq;               /*   552     8 */
        u64                        in_seq_acked;         /*   560     8 */
        char                       in_banner[30];        /*   568    30 */
        /* --- cacheline 9 boundary (576 bytes) was 22 bytes ago --- */
        union {
                struct {
                        struct ceph_msg_connect out_connect;      /*   598    33 */
                        struct ceph_msg_connect_reply in_reply;   /*   631    26 */
                };                                                /*          59 */
                struct {
                        struct ceph_msg_connect in_connect;       /*   598    33 */
                        struct ceph_msg_connect_reply out_reply;  /*   631    26 */
                };                                                /*          59 */
        };                                               /*   598    59 */
        /* --- cacheline 10 boundary (640 bytes) was 17 bytes ago --- */
        struct ceph_entity_addr    actual_peer_addr;     /*   657   136 */

        /* XXX 7 bytes hole, try to pack */

        /* --- cacheline 12 boundary (768 bytes) was 32 bytes ago --- */
        struct ceph_msg *          out_msg;              /*   800     8 */
        bool                       out_msg_done;         /*   808     1 */

        /* XXX 3 bytes hole, try to pack */

        struct ceph_msg_pos        out_msg_pos;          /*   812    16 */

        /* XXX 4 bytes hole, try to pack */

        /* --- cacheline 13 boundary (832 bytes) --- */
        struct kvec                out_kvec[8];          /*   832   128 */
        /* --- cacheline 15 boundary (960 bytes) --- */
        struct kvec *              out_kvec_cur;         /*   960     8 */
        int                        out_kvec_left;        /*   968     4 */
        int                        out_skip;             /*   972     4 */
        int                        out_kvec_bytes;       /*   976     4 */
        bool                       out_kvec_is_msg;      /*   980     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        out_more;             /*   984     4 */

        /* XXX 4 bytes hole, try to pack */

        __le64                     out_temp_ack;         /*   992     8 */
        struct ceph_msg_header     in_hdr;               /*  1000    53 */

        /* XXX 3 bytes hole, try to pack */

        /* --- cacheline 16 boundary (1024 bytes) was 32 bytes ago --- */
        struct ceph_msg *          in_msg;               /*  1056     8 */
        struct ceph_msg_pos        in_msg_pos;           /*  1064    16 */
        u32                        in_front_crc;         /*  1080     4 */
        u32                        in_middle_crc;        /*  1084     4 */
        /* --- cacheline 17 boundary (1088 bytes) --- */
        u32                        in_data_crc;          /*  1088     4 */
        char                       in_tag;               /*  1092     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        in_base_pos;          /*  1096     4 */

        /* XXX 4 bytes hole, try to pack */

        __le64                     in_temp_ack;          /*  1104     8 */
        struct delayed_work        work;                 /*  1112   176 */
        /* --- cacheline 20 boundary (1280 bytes) was 8 bytes ago --- */
        long unsigned int          delay;                /*  1288     8 */

        /* size: 1296, cachelines: 21, members: 47 */
        /* sum members: 1250, holes: 12, sum holes: 46 */
        /* last cacheline: 16 bytes */
};
%


#18 Updated by Alex Elder almost 12 years ago

Well, after getting my dout() lines to produce useful information
I have a new theory. Basically I think this error is occurring
some time shortly (like, almost immediately) after closing the
socket on a connection. So I suspect something is using the
socket pointer (or something else that's become invalid) after
this point.

It has been seriously difficult to figure out what exactly is
going on in this code given the state after a crash... I'm sure
part of this is because it's my first time going through this
sort of exercise with ceph code. But, well, hopefully this can
be made easier somehow for future debugging.

#19 Updated by Yehuda Sadeh almost 12 years ago

Not sure if that's a problem, and I need to dig deeper into it, but note that in the OPENING case we're not jumping to done; we continue to work on the connection:

        if (test_bit(CLOSED, &con->state)) { /* e.g. if we are replaced */
                dout("con_work CLOSED\n");
                con_close_socket(con);
                goto done;
        }
        if (test_and_clear_bit(OPENING, &con->state)) {
                /* reopen w/ new peer */
                dout("con_work OPENING\n");
                con_close_socket(con);
        }
        if (test_and_clear_bit(SOCK_CLOSED, &con->state))
                goto fault;

        ret = try_read(con);

#20 Updated by Alex Elder almost 12 years ago

Sorry I haven't updated this for some time.

In response to Yehuda: yes, I noticed that, and my instinct when
first looking at that little block of code was that it just didn't
look right. However now that I've been working with this messenger
code for a while I think that may have been OK. I think what was
happening there was that if we were marked OPENING, we were supposed
to start a new connection sequence, and in that case we wanted to
be sure the existing connection had been closed. Once it got closed,
the try_write() call would find a null socket pointer (because it had
been closed) and would initiate the connection.

#21 Updated by Alex Elder almost 12 years ago

I've been working for a week or two on refactoring the messaging
code in hopes to make it be based on a much cleaner and simpler
state machine model. It's ending up changing things around quite
a bit, but I believe it's an improvement. Along the way I have
noticed a few things I think were bugs, and after some discussion
am fixing them.

One thing I found this morning is that we are dropping the final
reference to a connection (via ceph_con_put()) when the connection
has a non-null socket pointer. There is a BUG_ON() there that
caught this.

This could easily lead to use of the socket pointer after the
connection that referred to it was freed. However the BUG_ON()
call is old (dates to commit 71eceda dated 11/18/09), and is
unconditional, so would have killed the system before it could
dereference the bad pointer.

Still, this indicates some flawed logic and could point to
an area of code that could cause some of these failures. So
I thought I'd report it here and hope for the best.

#22 Updated by Alex Elder over 11 years ago

Lots of work on the messenger client, but still not completely
clear this particular bug is fixed. There are a few significant
bug fixes along the way though, and a few minor ones as well.

At this point I think we're stepping back to reassess the
situation, to see if we've managed to improve things or whether
this is still a persistent problem.

#23 Updated by Sage Weil over 11 years ago

  • Assignee deleted (Alex Elder)

#24 Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Resolved

This is either #2867 or a similar issue that has since been resolved.
