Bug #5760

libceph: osdc_build_request(): BUG_ON(p > msg->front.iov_base + msg->front.iov_len);

Added by Josh Durgin over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/003105.html:

Hi,

I have a bug in the 3.10 kernel under Debian, whether it is a self-compiled linux-stable from git (built with make-kpkg) or the sid package.

I'm using format-2 images (ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)) to make snapshots and clones of a database for development purposes. I replay the database's logs on a ceph volume and take snapshots at fixed points in time: mount -> recover database until a given time -> umount -> snapshot -> go back to step 1.

In both cases, it works for a while (several mount/umount cycles), and after some time I get the following error on mount:

Jul 25 15:20:46 **host** kernel: [14623.808604] ------------[ cut here ]------------
Jul 25 15:20:46 **host** kernel: [14623.808622] kernel BUG at /build/linux-dT6LW0/linux-3.10.1/net/ceph/osd_client.c:2103!
Jul 25 15:20:46 **host** kernel: [14623.808641] invalid opcode: 0000 [#1] SMP
Jul 25 15:20:46 **host** kernel: [14623.808657] Modules linked in: cbc rbd libceph nfsd auth_rpcgss oid_registry nfs_acl nfs lockd sunrpc sha256_generic hmac nls_utf8 cifs dns_resolver fscache bridge stp llc xfs loop coretemp kvm_intel kvm crc32c_intel psmouse serio_raw snd_pcm snd_page_alloc snd_timer snd soundcore iTCO_wdt iTCO_vendor_support i2c_i801 i7core_edac microcode pcspkr lpc_ich mfd_core joydev ioatdma evdev edac_core acpi_cpufreq mperf button processor thermal_sys ext4 crc16 jbd2 mbcache btrfs xor zlib_deflate raid6_pq crc32c libcrc32c raid1 ohci_hcd hid_generic usbhid hid sr_mod sg cdrom sd_mod crc_t10dif dm_mod md_mod ata_generic ata_piix libata uhci_hcd ehci_pci ehci_hcd scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Jul 25 15:20:46 **host** kernel: [14623.809005] CPU: 6 PID: 9583 Comm: mount Not tainted 3.10-1-amd64 #1 Debian 3.10.1-1
Jul 25 15:20:46 **host** kernel: [14623.809024] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1b       12/30/2011
Jul 25 15:20:46 **host** kernel: [14623.809041] task: ffff88082dfa2840 ti: ffff88080e2c2000 task.ti: ffff88080e2c2000
Jul 25 15:20:46 **host** kernel: [14623.809059] RIP: 0010:[<ffffffffa05d08ff>]  [<ffffffffa05d08ff>] ceph_osdc_build_request+0x370/0x3e9 [libceph]
Jul 25 15:20:46 **host** kernel: [14623.809092] RSP: 0018:ffff88080e2c39b8  EFLAGS: 00010216
Jul 25 15:20:46 **host** kernel: [14623.809120] RAX: ffff88082e589a80 RBX: ffff88082e589b72 RCX: 0000000000000007
Jul 25 15:20:46 **host** kernel: [14623.809151] RDX: ffff88082e589b6f RSI: ffff88082afd9078 RDI: ffff88082b308258
Jul 25 15:20:46 **host** kernel: [14623.809182] RBP: 0000000000001000 R08: ffff88082e10a400 R09: ffff88082afd9000
Jul 25 15:20:46 **host** kernel: [14623.809213] R10: ffff8806bfb1cd60 R11: ffff88082d153c01 R12: ffff88080e88e988
Jul 25 15:20:46 **host** kernel: [14623.809243] R13: 0000000000000001 R14: ffff88080eb874d8 R15: ffff88080eb875b8
Jul 25 15:20:46 **host** kernel: [14623.809275] FS:  00007f2c893b77e0(0000) GS:ffff88083fc40000(0000) knlGS:0000000000000000
Jul 25 15:20:46 **host** kernel: [14623.809322] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 25 15:20:46 **host** kernel: [14623.809350] CR2: ffffffffff600400 CR3: 00000006bfbc6000 CR4: 00000000000007e0
Jul 25 15:20:46 **host** kernel: [14623.809381] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 25 15:20:46 **host** kernel: [14623.809413] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 25 15:20:46 **host** kernel: [14623.809442] Stack:
Jul 25 15:20:46 **host** kernel: [14623.814598]  0000000000002201 ffff88080e2c3a30 0000000000001000 ffff88042edf2240
Jul 25 15:20:46 **host** kernel: [14623.814656]  00000024a05cbb01 0000000000000000 ffff88082e84f348 ffff88080e2c3a58
Jul 25 15:20:46 **host** kernel: [14623.814710]  ffff88080eb874d8 ffff88080e9aa290 ffff88027abc6000 0000000000001000
Jul 25 15:20:46 **host** kernel: [14623.814765] Call Trace:
Jul 25 15:20:46 **host** kernel: [14623.814793]  [<ffffffffa05bb7f3>] ? rbd_osd_req_format_write+0x81/0x8c [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814827]  [<ffffffffa05bea1c>] ? rbd_img_request_fill+0x679/0x74f [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814865]  [<ffffffff8105f670>] ? should_resched+0x5/0x23
Jul 25 15:20:46 **host** kernel: [14623.814896]  [<ffffffffa05bf3d1>] ? rbd_request_fn+0x180/0x226 [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814929]  [<ffffffff811a819c>] ? __blk_run_queue_uncond+0x1e/0x26
Jul 25 15:20:46 **host** kernel: [14623.814960]  [<ffffffff811a905f>] ? blk_queue_bio+0x299/0x2e8
Jul 25 15:20:46 **host** kernel: [14623.814990]  [<ffffffff811a7523>] ? generic_make_request+0x96/0xd5
Jul 25 15:20:46 **host** kernel: [14623.815021]  [<ffffffff811a810f>] ? submit_bio+0x10a/0x13b
Jul 25 15:20:46 **host** kernel: [14623.815053]  [<ffffffff8112fe3d>] ? bio_alloc_bioset+0xd0/0x172
Jul 25 15:20:46 **host** kernel: [14623.815083]  [<ffffffff8112d36a>] ? _submit_bh+0x1b7/0x1d4
Jul 25 15:20:46 **host** kernel: [14623.815117]  [<ffffffff8112d4e9>] ? __sync_dirty_buffer+0x4e/0x7b
Jul 25 15:20:46 **host** kernel: [14623.815164]  [<ffffffffa03053b6>] ? ext4_commit_super+0x192/0x1db [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815206]  [<ffffffffa0306cfe>] ? ext4_setup_super+0xff/0x146 [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815248]  [<ffffffffa03094e2>] ? ext4_fill_super+0x1c55/0x2500 [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815282]  [<ffffffff811c7194>] ? string.isra.3+0x36/0x99
Jul 25 15:20:46 **host** kernel: [14623.815322]  [<ffffffffa030788d>] ? ext4_calculate_overhead+0x2a5/0x2a5 [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815371]  [<ffffffff8110b721>] ? sget+0x460/0x478
Jul 25 15:20:46 **host** kernel: [14623.815410]  [<ffffffffa030788d>] ? ext4_calculate_overhead+0x2a5/0x2a5 [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815457]  [<ffffffff8110b8ed>] ? mount_bdev+0x143/0x1a5
Jul 25 15:20:46 **host** kernel: [14623.815490]  [<ffffffff810f9857>] ? __kmalloc_track_caller+0xd5/0xe5
Jul 25 15:20:46 **host** kernel: [14623.815522]  [<ffffffff8110c08d>] ? mount_fs+0x5f/0x140
Jul 25 15:20:46 **host** kernel: [14623.815554]  [<ffffffff8111e70f>] ? vfs_kern_mount+0x60/0xe1
Jul 25 15:20:46 **host** kernel: [14623.815585]  [<ffffffff8112078b>] ? do_mount+0x678/0x7f2
Jul 25 15:20:46 **host** kernel: [14623.815615]  [<ffffffff810d47be>] ? memdup_user+0x36/0x5b
Jul 25 15:20:46 **host** kernel: [14623.815645]  [<ffffffff81120983>] ? SyS_mount+0x7e/0xb7
Jul 25 15:20:46 **host** kernel: [14623.815676]  [<ffffffff813938e9>] ? system_call_fastpath+0x16/0x1b
Jul 25 15:20:46 **host** kernel: [14623.815705] Code: 00 00 00 8b 54 24 28 66 89 50 22 49 8b 86 c0 00 00 00 8b 54 24 10 89 50 1e 49 8b 44 24 48 48 89 c2 49 03 54 24 50 48 39 d3 76 02 <0f> 0b 48 29 c3 49 89 5c 24 50 41 89 5c 24 16 eb 59 66 81 fd 01
Jul 25 15:20:46 **host** kernel: [14623.815934] RIP  [<ffffffffa05d08ff>] ceph_osdc_build_request+0x370/0x3e9 [libceph]
Jul 25 15:20:46 **host** kernel: [14623.815987]  RSP <ffff88080e2c39b8>
Jul 25 15:20:46 **host** kernel: [14623.816398] ---[ end trace 556a473d0b86002e ]---

It seems that if I roll back to the previous snapshot I can mount the image again, but I have to reboot the machine every time :'(

History

#1 Updated by Josh Durgin over 10 years ago

A short inspection suggests ceph_osdc_alloc_request is setting iov_len 4 bytes too large, but there may be some later data taking up more space than was allocated for it.
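
To make the failing check concrete, here is a small userspace sketch of the condition in the bug title. It is not the kernel code: `struct front_buf`, `put_u64` and `cursor_overran` are hypothetical names standing in for the message front buffer, an unchecked encode helper, and the `BUG_ON` expression. The point is only that the encoder advances a cursor `p` without bounds checks, and the overrun is detected afterwards by comparing `p` with `iov_base + iov_len`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the message front buffer: base pointer plus length. */
struct front_buf {
    uint8_t *iov_base;
    size_t   iov_len;
};

/* Append a 64-bit value at *p and advance the cursor.
 * Like the helper it models, it performs no bounds check itself. */
static void put_u64(uint8_t **p, uint64_t v)
{
    memcpy(*p, &v, sizeof(v));
    *p += sizeof(v);
}

/* The condition from the bug title: true once the cursor has moved
 * past the end of the buffer (exactly full is still fine). */
static bool cursor_overran(const struct front_buf *front, const uint8_t *p)
{
    return p > front->iov_base + front->iov_len;
}
```

With a `front_buf` whose `iov_len` is 16, two `put_u64` calls leave the cursor exactly at the end and the check stays false; a third call pushes it 8 bytes past the end and the check becomes true, which is the situation the `BUG_ON` reports. In this sketch the backing storage must be larger than `iov_len` for the third write to be safe to run.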

#2 Updated by Ian Colle over 10 years ago

  • Assignee set to Josh Durgin
  • Priority changed from Urgent to High

#4 Updated by Josh Durgin over 10 years ago

  • Status changed from New to Fix Under Review

The fix is in branch wip-rbd-bugs in ceph-client.git; the test is in wip-krbd-workunits for ceph.git.

#5 Updated by Mikaël Cluseau over 10 years ago

git cherry-pick 103673bf04c8207c92c3286005dfaa2d259ac9b6 68d253bc92e5fd780869b1fb31dd8e49267b8d4e
from v3.10.9 (0a4b6d4ff200a553951f77f765971cb3e4c91ec0)

So far I have replayed ~6 GB of database work flawlessly, so it seems to work well :)

#6 Updated by Mikaël Cluseau over 10 years ago

More than 20 GB later, still no bug, so for my part it's solved. Thanks! :)

#7 Updated by Olivier Bonvalet over 10 years ago

Same thing here: I cherry-picked your commit (a9fb92762883e2522fc4d1dcd403c5d888264746: rbd: fix buffer size for writes to images with snapshots) and applied it to a 3.10.10 kernel; since then it works fine (the VM was hanging very soon after start, so it seems fixed).

Thanks!
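
The commit title points at a buffer sized without enough room for snapshot data. The toy model below illustrates that kind of mismatch; it is an assumption-laden sketch, not the actual request layout. `HDR_BYTES` is a made-up fixed header size, `request_bytes` and `overrun_bytes` are hypothetical helpers, and the only format detail assumed is that each snapshot ID occupies 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed header size; the real request has more fields. */
#define HDR_BYTES 64u

/* Bytes needed for a request carrying num_snaps 64-bit snapshot IDs. */
static size_t request_bytes(uint32_t num_snaps)
{
    return HDR_BYTES + sizeof(uint64_t) * (size_t)num_snaps;
}

/* How many bytes the encoder would write past a buffer that was sized
 * for alloc_snaps snapshots when encoded_snaps are actually written.
 * Returns 0 when the buffer is large enough. */
static size_t overrun_bytes(uint32_t alloc_snaps, uint32_t encoded_snaps)
{
    size_t have = request_bytes(alloc_snaps);
    size_t need = request_bytes(encoded_snaps);
    return need > have ? need - have : 0;
}
```

In this model, sizing the buffer for zero snapshots and then encoding three gives `overrun_bytes(0, 3) == 24`, while an image without snapshots never overruns. That would be consistent with the report that the failure appears only after several snapshots have been taken, though the sketch does not claim to reproduce the exact cause.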

#8 Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Resolved

#9 Updated by Olivier Bonvalet over 10 years ago

@Sage: is this fix present upstream? In which kernel version?
