Bug #3291

rbd: BUG in bio from rbd_rq_fn

Added by Josh Durgin almost 11 years ago. Updated almost 11 years ago.

Target version:
% Done:


Community (user)
3 - minor
Affected Versions:
Crash signature (v1):
Crash signature (v2):


From #ceph, with linux 3.6.1:

(01:01:04 PM) tziOm: I do a " dd if=/dev/zero of=foo bs=1024k count=512" on a rbd
(01:01:14 PM) tziOm: and if hangs totally.. cant even kill dd
[42691.654211] ------------[ cut here ]------------
[42691.656012] kernel BUG at fs/bio.c:1504!
[42691.656012] invalid opcode: 0000 [#1] SMP 
[42691.656012] Modules linked in: rbd libceph fuse bonding raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx md_mod loop radeon lpc_ich mfd_core ttm drm_kms_helper drm joydev i2c_algo_bit i2c_core hid_generic rng_core e752x_edac tpm_tis tpm tpm_bios edac_core shpchp snd_pcm dcdbas snd_timer snd evdev soundcore snd_page_alloc processor video microcode thermal_sys psmouse pcspkr serio_raw button ext3 jbd mbcache btrfs zlib_deflate crc32c libcrc32c sr_mod sg cdrom sd_mod crc_t10dif ata_generic usbhid hid pata_sil680 megaraid_mbox floppy ata_piix uhci_hcd megaraid_mm ehci_hcd libata scsi_mod e1000 usbcore usb_common
[42691.656012] CPU 1 
[42691.656012] Pid: 3456, comm: btrfs-submit-1 Not tainted 3.6.1 #1 Dell Computer Corporation PowerEdge 2850/0XC320
[42691.656012] RIP: 0010:[<ffffffff81135ece>]  [<ffffffff81135ece>] bio_split+0x81/0x1af
[42691.656012] RSP: 0018:ffff88001ecefb80  EFLAGS: 00010212
[42691.656012] RAX: 0000000000000000 RBX: ffff880034140580 RCX: 0000000000000200
[42691.656012] RDX: 00000000000001e8 RSI: ffff88003db61c00 RDI: ffffffff810caa72
[42691.656012] RBP: 0000000000000180 R08: 0000000000011200 R09: dead000000200200
[42691.656012] R10: dead000000100100 R11: 0000000000000000 R12: ffff88003db61c00
[42691.656012] R13: 0000000000000000 R14: ffff88007a7d1038 R15: 000000000017a000
[42691.656012] FS:  0000000000000000(0000) GS:ffff88007fc40000(0000) knlGS:0000000000000000
[42691.656012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[42691.656012] CR2: 00007fa5a5c47c18 CR3: 000000000160b000 CR4: 00000000000007e0
[42691.656012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42691.656012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[42691.656012] Process btrfs-submit-1 (pid: 3456, threadinfo ffff88001ecee000, task ffff88007ad7ade0)
[42691.768103] Stack:
[42691.768103]  0000000000000000 ffff8800091e6000 0000000000030000 ffff88002d97d348
[42691.768103]  0000000000000000 ffff880063df45c0 ffff880034140580 ffffffffa047e77c
[42691.768103]  0000000000000001 ffffffff811c0fa4 0000000000000000 00000000338a49c6
[42691.768103] Call Trace:
[42691.768103]  [<ffffffffa047e77c>] ? rbd_rq_fn+0x253/0x42b [rbd]
[42691.768103]  [<ffffffff811c0fa4>] ? cfq_add_rq_rb+0x77/0x8a
[42691.768103]  [<ffffffff811c28fb>] ? cfq_insert_request+0x1f1/0x3a6
[42691.768103]  [<ffffffff811adeae>] ? queue_unplugged+0x7f/0x93
[42691.768103]  [<ffffffff811ae5a3>] ? blk_flush_plug_list+0x1ca/0x1e3
[42691.768103]  [<ffffffff811ae5c7>] ? blk_finish_plug+0xb/0x2a
[42691.768103]  [<ffffffffa01d0870>] ? run_scheduled_bios+0x422/0x434 [btrfs]
[42691.768103]  [<ffffffffa01d72ca>] ? worker_loop+0x18c/0x4fe [btrfs]
[42691.768103]  [<ffffffffa01d713e>] ? btrfs_queue_worker+0x260/0x260 [btrfs]
[42691.768103]  [<ffffffffa01d713e>] ? btrfs_queue_worker+0x260/0x260 [btrfs]
[42691.768103]  [<ffffffff8105b451>] ? kthread+0x81/0x89
[42691.768103]  [<ffffffff813865c4>] ? kernel_thread_helper+0x4/0x10
[42691.768103]  [<ffffffff8105b3d0>] ? kthread_freezable_should_stop+0x53/0x53
[42691.768103]  [<ffffffff813865c0>] ? gs_change+0x13/0x13
[42691.768103] Code: 02 55 00 4d 85 ed 74 1c 49 8b 7d 08 44 89 f9 48 89 da 4c 89 f6 41 ff 55 00 49 83 c5 10 49 83 7d 00 00 eb e2 66 83 7b 28 01 74 04 <0f> 0b eb fe 66 83 7b 2a 00 74 04 0f 0b eb fe 41 c7 84 24 e0 01 
[42691.768103] RIP  [<ffffffff81135ece>] bio_split+0x81/0x1af
[42691.768103]  RSP <ffff88001ecefb80>
[42691.968358] ---[ end trace 4e48bf3d2ec08273 ]---

btrfs_3291_fix.patch View - Josef's btrfs fix (4.99 KB) Alex Elder, 10/19/2012 01:54 PM


#1 Updated by Alex Elder almost 11 years ago

This may be a duplicate of
and even if not, it's probably related.

#2 Updated by Sage Weil almost 11 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Alex Elder almost 11 years ago

  • Status changed from New to In Progress
  • Assignee set to Alex Elder

Additional notes on the configuration that reproduced this:
- 4 OSD's, 3 monitors, (1 MDS)
- 1 GB RBD image

#4 Updated by Alex Elder almost 11 years ago

  • Status changed from In Progress to 12

Reproduced it using stock kernel v3.6.1 (which was what was
reported used by tziOm).

#5 Updated by Alex Elder almost 11 years ago

After poking around with kdb this afternoon I have narrowed it down to
bio_split() being called with something other than exactly one page in
it. In this case, there are 0x17 = 23 pages in the bio to be split.

I'm now going to try to work backward from there to determine how
it could happen.

#6 Updated by Alex Elder almost 11 years ago

Just adding a little debug notes for my own reference.

Here is a dump of the bio involved in a crash using the base code:
0xffff880213477740 00000000000abd00 0000000000000000 ................
0xffff880213477750 ffff88020d0a2c00 4000000000000009 .,.............@
0xffff880213477760 0000000000000001 0000001700000080 ................
0xffff880213477770 0000300000080000 0000008000009000 .....0..........
0xffff880213477780 0000000000000001 ffff88020d4a0000 ..........J.....
0xffff880213477790 ffffffffa0068c30 ffff88020cbf1420 0....... .......
0xffff8802134777a0 0000000000000000 ffffffff811b2d50 ........P-......
0xffff8802134777b0 ffffea00085c11c0 0000000000001000 ..\.............

And here's one with my code with modified rbd_merge_bvec() function:
[1]kdb> md8c32 ffff88020bb2db00
0xffff88020bb2db00 00000000000abd00 0000000000000000 ................
0xffff88020bb2db10 ffff88020d0f8b00 4000000000000009 ...............@
0xffff88020bb2db20 0000000000000001 0000003c00000080 ............<...
0xffff88020bb2db30 0000200000080000 0000008000002000 ..... ... ......
0xffff88020bb2db40 0000000000000001 ffff8802205c8000 ..........\ ....
0xffff88020bb2db50 ffffffffa0068c30 ffff88020bdcfba0 0...............
0xffff88020bb2db60 0000000000000000 ffffffff811b2d50 ........P-......
0xffff88020bb2db70 ffffffff81c2a260 ffffffff81c2a2c0 `...............
0xffff88020bb2db80 ffff88020ca0de00 ffff8802245852c0 .........RX$....
0xffff88020bb2db90 ffffffff81080350 0000000000000000 P...............
0xffff88020bb2dba0 0000000000000000 0000000000000000 ................
0xffff88020bb2dbb0 0000000000000000 0000000000000000 ................
0xffff88020bb2dbc0 ffff88020bb2d8c0 0000000000000000 ................
0xffff88020bb2dbd0 ffff88020d0f8b00 f000000000000009 ................
0xffff88020bb2dbe0 0000000000000000 0000000100000001 ................
0xffff88020bb2dbf0 0000100000000000 0000000400001000 ................

Note that the problem occurs at the same sector offset,
0x000abd00 = 703744, and involves a bio with 0x80 = 128
full pages (4096 bytes apiece) in it.

#7 Updated by Alex Elder almost 11 years ago

I'm think I might know what's happening here.

Although rbd_merge_bvec() is a little odd in how it works, I
believe it is producing a result that will work the way its
callers expect. (I have a patch to clean this up though.)

The problem is in bio_chain_clone(), which is called only
by bio_rq_fn() in order to divide a single I/O request's
chain of bio's into a set of cloned chains of bio's. Each
"segment" (osd object) of the range of bytes in an rbd image
covered by the request will get its own cloned bio chain.

When bio_rq_fn() has a request to process, it allocates a
collection structure (rbd_req_coll) to hold the response
codes for each cloned chain submitted until they are all
complete. It then determines the end of the rbd segment
that contains the first byte in the request. It clones the
beginning of the bio chain up to the end of that segment,
using bio_chain_clone(), and passes the result (ultimately)
to rbd_do_request(), which submits it to the OSD containing
the target object. This continues until the full request
has been processed.

When each of these sub-requests completes, it calls its
callback function rbd_req_cb(), which handles zeroing
target pages for read operations as appropriate, recording
completion status, dropping references to the collection,
bio, and request, and freeing other data associated with
this cloned operation.

So bio_chain_clone() is responsible for splitting off a
cloned chain of bio's, up to a fixed offset. For the
most part this process is a simple matter of building a
new chain by cloning each bio in the original chain.
The only interesting case occurs at the boundary of
an rbd image segment, where a single bio might need
to be divided into two cloned bio's, one covering the
end of one segment and one covering the beginning of
the next.

In that case, bio_chain_clone() calls bio_split(), which
is a generic routine meant to handle exactly this sort
of division of a bio into two cloned pieces.

Now one of the requirements of using bio_split is that
the only splittable bio is one containing only a single
I/O vector--therefore affecting only one page. And the
BUG call that's being tripped here is occurring because
bio_split() is being passed a bio structure that describes
I/O to more than one page.

(OK, I'm going to save this and continue in another post.)

#8 Updated by Alex Elder almost 11 years ago

The way things are supposed to work for bio_split() is that
any bio that will be subject to a split will have already
been guaranteed to involve only a single page. This is
supposed to happen through the use of a callback function
tied to the block queue--set via blk_queue_merge_bvec().

Whenever an I/O vector (a bio_vec) is added to an existing
bio in a block queue, the generic code calls this callback
to see if there's any reason it cannot add it. This is
intended for stacking/layered drivers, which might have
certain boundaries or other characteristics (such as
different-sized blocks over different ranges of a logical
device) that might result in a bio that can't be processed
correctly once the request gets issued. If it's OK, the
bio_iov gets added to the existing bio; if it's not, the
existing bio gets closed out, more or less, and a new one
is allocated and used to contain the bio_iov in question.

An rbd image is exactly that type of device--its backing
storage is divided at fixed boundaries (based on its object
order, 4MB by default), and if a new bio_vec would cross
such a boundary the callback (rbd_merge_bvec()) should
indicate that the bio_vec should not be added.

So that's the way it all works... What's happening here?

Now that I've described all this I'm having doubts about
my suspicion, so I'll describe it in yet another post after
I think through it again.

#9 Updated by Alex Elder almost 11 years ago

I can easily reproduce this problem by doing this in my UML

rbd create image1 --size=1024
rbd map image1 # let's assume we get /dev/rbd1
sleep 1 # for udev
mkfs.btrfs /dev/rbd1
mkdir -p /b
mount /dev/rbd1 /b
  1. note that count=8 below will not crash
    dd if=/dev/zero of=/b/foo bs=1024k count=9
    sync # push stuff to disk

#10 Updated by Alex Elder almost 11 years ago

Whoops, should have previewed that. Here's the same thing,
formatted as I intended.

I can easily reproduce this problem by doing this in my UML

    rbd create image1 --size=1024
    rbd map image1     # let's assume we get /dev/rbd1
    sleep 1            # for udev
    mkfs.btrfs /dev/rbd1
    mkdir -p /b
    mount /dev/rbd1 /b
    # note that count=8 below will not crash
    dd if=/dev/zero of=/b/foo bs=1024k count=9
    sync               # push stuff to disk

#11 Updated by Alex Elder almost 11 years ago

I added a printk() inside our merge function, rbd_merge_bvec().
It never gets called when I reproduce the problem as described
above.  The failure occurs when attempting to split a bio that
has these attributes:
    bi_sector:  703744
    bi_rw:      1 (write)
    bi_vcnt:    128
    bi_idx:     0
    bi_size:    524288

If I attempt to do a similar I/O touching the same offsets
directly to the rbd device using dd, I *do* see the printk()
lines, meaning we are calling the function.

    dd if=/dev/zero of=/dev/rbd1 bs=512 count=5 seek=703740 oflag=direct

Currently trying to track down why I'm seeing this.

#12 Updated by Alex Elder almost 11 years ago

After a long collaborative session on the btrfs IRC channel I think we may
have figured out the cause of this problem.

The return value from bio_add_page() will be less then the number of bytes
requested if the underlying block device finds the proposed addition not
acceptable.  In that case, the caller should not add to the bio and should
start a new one instead.

Although btrfs consistently uses the bio_add_page() interface, it does so
with the bi_sector field encoded using "logical" rather than "physical" 
sector offsets.  In this context, the logical offset has to do with an
internal btrfs extent map (I think), while the physical offset is the
sector offsets as understood by the underlying block device.  Only later,
just before submitting I/O to the backing store, will these logical
offsets be translated into physical ones.

As a result, the information used by the merge function is not correct,
so the "answer" to the "is adding this data to this bio OK?" question
therefore won't be correct.

#13 Updated by Alex Elder almost 11 years ago

The fix therefore needs to be in btrfs.  It somehow needs to
translate the bio so its bi_sector field is a physical sector
offset before any call to bio_add_page().  When that is done,
the rbd (and every other) merge function can provide the correct
return value, and we should not end up with a multi-page bio
when we need to split a bio in bio_chain_clone().

#14 Updated by Alex Elder almost 11 years ago

I just tested a btrfs patch provided by Josef Bacik and
I no longer am hitting this problem. I will attach the
patch to this bug for reference, but we'll want to use
whatever ends up getting committed after review.

#15 Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Resolved

Also available in: Atom PDF