Bug #63634

[RFC] limit iov structures to 1024 while performing async I/O

Added by Dhairya Parmar 5 months ago. Updated 5 months ago.

Status:
Triaged
Priority:
Normal
Category:
Correctness/Safety
Target version:
v19.0.0
% Done:

0%

Source:
Development
Tags:
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This tracker is just to gather initial thoughts/opinions on limiting the number of iovec structures that can be read/written in one go while performing async I/O. According to the man page of pwritev:

NOTES
       POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can
       advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On
       modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.

So POSIX doesn't say the count is strictly limited; it merely "allows" an implementation to place a limit. While testing the async I/O libcephfs calls, I found that we can happily breach the limit of 1024: there is no check for IOV_MAX in the client code, although I do see plenty of code paths within Ceph handling IOV_MAX:

blk/kernel/KernelDevice.cc:  if ((!buffered || bl.get_num_buffers() >= IOV_MAX) &&
blk/kernel/KernelDevice.cc:      bl.rebuild_aligned_size_and_memory(block_size, block_size, IOV_MAX)) {
blk/kernel/KernelDevice.cc:  if ((!buffered || bl.get_num_buffers() >= IOV_MAX) &&
blk/kernel/KernelDevice.cc:      bl.rebuild_aligned_size_and_memory(block_size, block_size, IOV_MAX)) {
client/fuse_ll.cc:    if (bl.get_num_buffers() > IOV_MAX)
common/buffer.cc:  iovec iov[IOV_MAX];
common/buffer.cc:    if (iovlen == IOV_MAX ||
common/buffer.cc:  iovec iov[IOV_MAX];
common/buffer.cc:    uint64_t size = std::min<uint64_t>(left_pbrs, IOV_MAX);
common/buffer.cc:  iov_vec_t iovs{_num / IOV_MAX + 1};
common/buffer.cc:       std::min(_num - IOV_MAX * nr_iov_created, (size_t)IOV_MAX));
common/buffer.cc:    if (++index == IOV_MAX) {
include/buffer.h:      ceph_assert(_num <= IOV_MAX);
include/buffer.h:      assert(_num <= IOV_MAX);
include/compat.h:#define IOV_MAX 1024
msg/async/AsyncConnection.h:static const int ASYNC_IOV_MAX = (IOV_MAX >= 1024 ? IOV_MAX / 4 : IOV_MAX);
msg/async/PosixStack.cc:      struct iovec msgvec[IOV_MAX];
msg/async/PosixStack.cc:      uint64_t size = std::min<uint64_t>(left_pbrs, IOV_MAX);
msg/async/PosixStack.cc:      WSABUF msgvec[IOV_MAX];
msg/async/PosixStack.cc:      uint64_t size = std::min<uint64_t>(left_pbrs, IOV_MAX);
os/bluestore/BlueStore.cc:  bl.rebuild_aligned_size_and_memory(BDEV_LABEL_BLOCK_SIZE, BDEV_LABEL_BLOCK_SIZE, IOV_MAX);
test/bufferlist.cc:  for (unsigned i = 0; i < IOV_MAX * 2; i++) {
test/bufferlist.cc:  EXPECT_EQ(IOV_MAX * 2, st.st_size);
test/bufferlist.cc:  for (unsigned i = 0; i < IOV_MAX * 2; i++) {
test/bufferlist.cc:  EXPECT_EQ(IOV_MAX * 2 + offset, (unsigned)st.st_size);

The kernel allocates resources and performs operations based on assumptions tied to IOV_MAX; exceeding this limit might lead to resource allocation issues, such as exhausting available kernel memory or hitting buffer overflows, thus impacting system stability and performance. My feeling is that, if not now, then sooner or later we will have to handle this in the client code too.
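A minimal sketch of what such a guard could look like at the API boundary, assuming a pwritev-style entry point (the function name and shape below are illustrative, not the actual libcephfs API):

#include <sys/uio.h>   // struct iovec
#include <limits.h>    // IOV_MAX
#include <cerrno>      // EINVAL

// Hypothetical async-write entry point, shaped like pwritev(2) for
// illustration only.
int client_async_writev(const struct iovec *iov, int iovcnt) {
  // Mirror pwritev(2), which fails with EINVAL when iovcnt is less than
  // zero or greater than the permitted maximum.
  if (iovcnt < 0 || iovcnt > IOV_MAX)
    return -EINVAL;
  // ... submit the async I/O as usual ...
  (void)iov;
  return 0;
}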

Actions #1

Updated by Venky Shankar 5 months ago

Dhairya Parmar wrote:

This tracker is just to gather initial thoughts/opinions on limiting the number of iovec structures that can be read/written in one go while performing async I/O. According to the man page of pwritev:

[...]

So POSIX doesn't say the count is strictly limited; it merely "allows" an implementation to place a limit. While testing the async I/O libcephfs calls, I found that we can happily breach the limit of 1024: there is no check for IOV_MAX in the client code, although I do see plenty of code paths within Ceph handling IOV_MAX:

Right. That's what POSIX says, and it's meant to be a standard for any app to follow so as to be portable across POSIX-compatible systems.

[...]

The kernel allocates resources and performs operations based on assumptions tied to IOV_MAX; exceeding this limit might lead to resource allocation issues, such as exhausting available kernel memory or hitting buffer overflows, thus impacting system stability and performance. My feeling is that, if not now, then sooner or later we will have to handle this in the client code too.

It isn't totally incorrect to exceed IOV_MAX - it's just a standard ensuring that apps written with the limit in mind work across POSIX-compatible systems. I don't think there is merit in catching such calls and erroring out. Also, the client could process the vectors in chunks of IOV_MAX (maybe the client already does this - have you checked?).
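For illustration, such chunking could look roughly like the sketch below (not existing client code; submit_chunk stands in for whatever actually issues the I/O):

#include <sys/uio.h>   // struct iovec
#include <limits.h>    // IOV_MAX
#include <algorithm>   // std::min
#include <cstddef>     // size_t
#include <cstdio>

// Hypothetical stand-in for the routine that actually submits one
// bounded writev-style batch.
static void submit_chunk(const struct iovec *iov, int iovcnt) {
  std::printf("submitting %d iovecs starting at %p\n",
              iovcnt, static_cast<const void *>(iov));
}

// Walk an arbitrarily long iovec array in IOV_MAX-sized batches so that
// no single submission exceeds the kernel's advertised limit.
static void submit_all(const struct iovec *iov, size_t iovcnt) {
  for (size_t off = 0; off < iovcnt; off += IOV_MAX) {
    const size_t n = std::min<size_t>(iovcnt - off, IOV_MAX);
    submit_chunk(iov + off, static_cast<int>(n));
  }
}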

Actions #2

Updated by Frank Filz 5 months ago

The limit of IOV_MAX should not be a problem for the intended consumer, Ganesha. At the moment, Ganesha doesn't submit any vectors longer than 1... Our zero-copy efforts will introduce longer vectors, and at that time we will want to respect IOV_MAX within Ganesha. So I think it would be fair to return an error at any API entry point where the vector passed is longer.

Actions #3

Updated by Venky Shankar 5 months ago

  • Tracker changed from Fix to Bug
  • Status changed from New to Triaged
  • Assignee set to Dhairya Parmar
  • Target version set to v19.0.0
  • Backport set to quincy,reef
  • Regression set to No
  • Severity set to 3 - minor

Frank Filz wrote:

The limit of IOV_MAX should not be a problem for the intended consumer, Ganesha. At the moment, Ganesha doesn't submit any vectors longer than 1... Our zero-copy efforts will introduce longer vectors, and at that time we will want to respect IOV_MAX within Ganesha. So I think it would be fair to return an error at any API entry point where the vector passed is longer.

Fair enough.
