ceph: writeback pagevec pool is created incorrectly
In ceph_writepages_start() if there are any pages to write back
an array of page pointers is needed, sufficient to hold pointes
to as many pages as are required to satisfy the minimum I/O size
for the inode (or the file system containing it).
If a "normal" kmalloc() request fails to allocate the array,
a memory pool associated with the file system client is used.
The memory pool is created in create_fs_client(), and its size
parameter is computed incorrectly in two ways:
- it uses fsc->mount_options->wsize >> PAGE_CACHE_SHIFT, to
determine how many pages should be supported for a request; and
- the size supplied is supposed to be a multiple of the size
of a page pointer, but it's the count of pages.
The first can be a problem because the wsize parameter could
be less than PAGE_SIZE (including 0), but we will always need
to support writing at least one page. This might not be a
real problem in practice.
The second problem is serious though. The argument value needs
to be the product of the page count and the size of a page pointer.
I have a fix for this, I'm just documenting the problem.
#1 Updated by Alex Elder about 8 years ago
I forgot to mention, the other problem is that the number
of pages required in ceph_writepages_start() is computed
max_size = 1 << inode->i_blkbits; if (fsc->mount_options->wsize && fsc->mount_options->wsize < max_size) max_size = fsc->mount_options->wsize; if (max_size < PAGE_CACHE_SIZE) max_size = PAGE_CACHE_SIZE; pages_required = max_size >> PAGE_CACHE_SHIFT;
So as mentioned above, we need at least one page. But in addition
if the "wsize" mount option has value 0 (which is the default), then
the number of pages required could be as much as
inode->i_blkbits >> PAGE_CACHE_SHIFT
And that is initialized in fs/ceph/inode.c:fill_inode() as:
inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
The layout comes from the mds, but it looks to me like the
stripe unit therein can be set without limit by an ioctl() via
CEPH_MDS_OP_SETLAYOUT or CEPH_MDS_OP_SETDIRLAYOUT.
So bottom line of all this is that inside ceph_writepages_start()
if we are going to fall back to using the memory pool to satisfy
our request for a page pointer array, the size of the array in
the pool needs to be sufficient to hold the maximum sized stripe
unit value (which is apparently 0xffffffff).
My fix addresses the more simple issues described initially. I
do not intend to address this broader problem, I'll leave it to
someone else (or at least some other time) to fix that.
#5 Updated by Alex Elder almost 8 years ago
On the osd, it looks to me like CEPH_MDS_OP_SETLAYOUT uses
ceph_file_layout_is_valid() to verify the layout supplied
is valid. The constraints defined for the stripe unit
- it must be non-zero; and
- it must be a multiple of CEPH_MIN_STRIPE_UNIT=65536
Since the stripe unit is represented in 32 bits, that
means the maximum valid stripe unit is 65535 or 0xffff.