Bug #4646

kcephfs: writeback pagevec pool size vs stripe unit limit

Added by Alex Elder about 11 years ago. Updated over 9 years ago.

Status:                Need More Info
Priority:              Normal
Assignee:              -
Category:              -
Target version:        -
% Done:                0%
Source:                Development
Tags:
Backport:
Regression:
Severity:              3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

This was described here:
http://tracker.ceph.com/issues/4603
But that issue has been closed after fixing the most
obvious and simple bugs it described.

The following may still be an issue though, so I'm copying
the description I made in that issue here so it can
be looked at and either dealt with or ignored.

----------------
I forgot to mention, the other problem is that the number
of pages required in ceph_writepages_start() is computed
as (essentially):

    max_size = 1 << inode->i_blkbits;
    if (fsc->mount_options->wsize && fsc->mount_options->wsize < max_size)
        max_size = fsc->mount_options->wsize;
    if (max_size < PAGE_CACHE_SIZE)
        max_size = PAGE_CACHE_SIZE;
    pages_required = max_size >> PAGE_CACHE_SHIFT;

So as mentioned above, we need at least one page. But in addition
if the "wsize" mount option has value 0 (which is the default), then
the number of pages required could be as much as

    (1 << inode->i_blkbits) >> PAGE_CACHE_SHIFT

And that is initialized in fs/ceph/inode.c:fill_inode() as:

        inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
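
For concreteness, here's a worked example of what that can mean, as a
sketch (the values are hypothetical, and it assumes 4 KB pages, i.e.
PAGE_CACHE_SHIFT == 12):

    u32 stripe_unit = 4 * 1024 * 1024;               /* a 4 MB stripe unit */
    int blkbits = fls(stripe_unit) - 1;              /* 23 - 1 = 22 */
    int npages = (1 << blkbits) >> PAGE_CACHE_SHIFT; /* 1024 page pointers */
    /* ...while the pool objects were sized to hold just one. */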

The layout comes from the mds, but it looks to me like the
stripe unit therein can be set without limit by an ioctl() via
CEPH_MDS_OP_SETLAYOUT or CEPH_MDS_OP_SETDIRLAYOUT.

So bottom line of all this is that inside ceph_writepages_start()
if we are going to fall back to using the memory pool to satisfy
our request for a page pointer array, the size of the array in
the pool needs to be sufficient to hold the maximum sized stripe
unit value (which is apparently 0xffffffff).

My fix addresses the more simple issues described initially. I
do not intend to address this broader problem, I'll leave it to
someone else (or at least some other time) to fix that.

Actions #1

Updated by Sage Weil almost 11 years ago

  • Project changed from Ceph to rbd
Actions #2

Updated by Ian Colle almost 11 years ago

  • Assignee set to Alex Elder
Actions #3

Updated by Alex Elder almost 11 years ago

On the MDS side, it looks to me like CEPH_MDS_OP_SETLAYOUT uses
ceph_file_layout_is_valid() to verify that the layout supplied
is valid. The constraints defined for the stripe unit
there are:
- it must be non-zero; and
- it must be a multiple of CEPH_MIN_STRIPE_UNIT=65536

Since the stripe unit is represented in 32 bits, that
means the maximum valid stripe unit is 65535 * 65536,
i.e. 0xffff0000.
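
For reference, the stripe-unit part of that check looks roughly like
the following (a paraphrase of the constraint, not the verbatim source):

    /* Paraphrased constraint, not verbatim kernel source. */
    u32 su = le32_to_cpu(layout->fl_stripe_unit);
    if (su == 0 || (su % CEPH_MIN_STRIPE_UNIT) != 0)
        return 0;       /* invalid layout */
    /* The largest 32-bit multiple of 65536 is 65535 * 65536 = 0xffff0000. */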

Actions #4

Updated by Alex Elder almost 11 years ago

I think an easy fix for now is just to allocate the pagevec_pool
to have objects sufficient to hold pages that would cover a
maximally-sized stripe unit.
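
A rough sketch of what that sizing might look like when the pool is
created (CEPH_MAX_STRIPE_UNIT is my own name here, standing in for
0xffff0000; the mempool call is the generic API, not the final patch):

    size_t size = CEPH_MAX_STRIPE_UNIT >> PAGE_CACHE_SHIFT;  /* ~1M slots */
    fsc->wb_pagevec_pool =
        mempool_create_kmalloc_pool(10, size * sizeof(struct page *));
    if (!fsc->wb_pagevec_pool)
        return -ENOMEM;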

Note that this also seems to show that the "wsize" mount
option is ignored if the stripe unit is set larger than
that via ioctl(). (Maybe that's a new bug to fix.)

Actions #5

Updated by Alex Elder almost 11 years ago

I implemented a fix for this, and got all the way to
the end of describing it, when I realized the math
makes this impractical. Basically, the largest stripe
unit is about 4GB, or about a million 4K pages. That's 8MB
per pagevec set aside for this out-of-memory case.
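
Spelling that estimate out (assuming 4 KB pages and 8-byte page
pointers):

    0xffff0000 bytes / 4096 bytes per page = 1,048,560 page pointers
    1,048,560 pointers * 8 bytes           = ~8 MB per pagevec
    8 MB * 10 pool objects                 = ~80 MB held in reserve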

I don't think that's practical, so maybe a memory
failure in the midst of writeback should put the
file system in a failure mode and give up.

I can post the patch for review, but at this point I'd
like to have someone argue in favor of it before
doing so.

I'm going to post my explanation below.

From: Alex Elder <>
Subject: ceph: size writeback pagevec pool appropriately

There is a memory pool set aside for pagevecs to be used in the
writeback path when a kmalloc() fails, in ceph_writepages_start().
The size of the objects in the pool is based on the "wsize" mount
option. But that is not adequate, and this patch fixes that.

The pool is used if a malloc for a page vector fails deep inside
ceph_writepages_start(). In that case, the number of pages needed
in the vector depends on the inode block size (1 << inode->i_blkbits),
or, if one was specified at mount time, on the file system's "wsize"
parameter instead (but always at least a page).

The inode block size is defined as the stripe unit value in the
inode's layout, and that can be set to a fairly arbitrary value
using a CEPH_IOC_SET_LAYOUT ioctl() call.

If the "wsize" mount option is supplied, the existing code is OK,
because ceph_writepages_start() limits the amount written to that
amount if it is smaller than the inode's block size.

But the default "wsize" value is 0, and for that case, the pool will
be created with pagevec objects sufficient to hold just one page.
An earlier fix made sure we used a non-zero size here, but it turns
out that's not enough either. If no "wsize" mount option was
supplied, ceph_writepages_start() will choose to write a whole
stripe unit's worth of data. And that is likely to be more than
just one page.

Because an inode's layout can be set with a stripe unit that's not
explicitly limited, to be safe we need to make the pagevecs in
this memory pool be big enough for the biggest stripe unit possible.
It must be represented in a 32-bit unsigned, and it must be a
non-zero multiple of CEPH_MIN_STRIPE_UNIT (which is 65536). So
the size is 2^32 - 65536 or 0xffff0000.

Change the size of the objects allocated in the file system's
writeback pagevec pool to be big enough to hold pages covering that
maximal size.

And now that that's all laid out...

One last (deadly) note: For 4KB pages, this means about a million
page slots per page vector, or about 8MB. These pools are sized to
have 10 objects in them, which doesn't seem like a number that
was determined scientifically. Scale that back to just 2,
to reduce the memory used. (This still multiplies the memory used
in the default case by a factor of half a million or so.)

This resolves:
http://tracker.ceph.com/issues/4646

Signed-off-by: Alex Elder <>

Actions #6

Updated by Alex Elder almost 11 years ago

  • Status changed from New to Need More Info

I'd like someone (like Sage) to determine whether
we should just mark this "won't fix."

Actions #7

Updated by Ian Colle over 10 years ago

  • Assignee deleted (Alex Elder)
Actions #8

Updated by Ilya Dryomov over 9 years ago

  • Project changed from rbd to Linux kernel client
  • Subject changed from ceph: writeback pagevec pool size vs stripe unit limit to kcephfs: writeback pagevec pool size vs stripe unit limit