Bug #63634
Status: Open
[RFC] limit iov structures to 1024 while performing async I/O
Description
This tracker is just to get initial thoughts/opinions about limiting the number of iovec structures that can be read/written in one go while performing async I/O. According to the man page of pwritev:
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
So POSIX doesn't say the count is strictly limited, but it "allows" an implementation to impose a limit. While testing the async I/O libcephfs call, I found that we can happily breach the limit of 1024: there is no check for IOV_MAX in the client code, but I do see a lot of code paths within Ceph handling IOV_MAX:
blk/kernel/KernelDevice.cc: if ((!buffered || bl.get_num_buffers() >= IOV_MAX) &&
blk/kernel/KernelDevice.cc: bl.rebuild_aligned_size_and_memory(block_size, block_size, IOV_MAX)) {
blk/kernel/KernelDevice.cc: if ((!buffered || bl.get_num_buffers() >= IOV_MAX) &&
blk/kernel/KernelDevice.cc: bl.rebuild_aligned_size_and_memory(block_size, block_size, IOV_MAX)) {
client/fuse_ll.cc: if (bl.get_num_buffers() > IOV_MAX)
common/buffer.cc: iovec iov[IOV_MAX];
common/buffer.cc: if (iovlen == IOV_MAX ||
common/buffer.cc: iovec iov[IOV_MAX];
common/buffer.cc: uint64_t size = std::min<uint64_t>(left_pbrs, IOV_MAX);
common/buffer.cc: iov_vec_t iovs{_num / IOV_MAX + 1};
common/buffer.cc: std::min(_num - IOV_MAX * nr_iov_created, (size_t)IOV_MAX));
common/buffer.cc: if (++index == IOV_MAX) {
include/buffer.h: ceph_assert(_num <= IOV_MAX);
include/buffer.h: assert(_num <= IOV_MAX);
include/compat.h:#define IOV_MAX 1024
msg/async/AsyncConnection.h:static const int ASYNC_IOV_MAX = (IOV_MAX >= 1024 ? IOV_MAX / 4 : IOV_MAX);
msg/async/PosixStack.cc: struct iovec msgvec[IOV_MAX];
msg/async/PosixStack.cc: uint64_t size = std::min<uint64_t>(left_pbrs, IOV_MAX);
msg/async/PosixStack.cc: WSABUF msgvec[IOV_MAX];
msg/async/PosixStack.cc: uint64_t size = std::min<uint64_t>(left_pbrs, IOV_MAX);
os/bluestore/BlueStore.cc: bl.rebuild_aligned_size_and_memory(BDEV_LABEL_BLOCK_SIZE, BDEV_LABEL_BLOCK_SIZE, IOV_MAX);
test/bufferlist.cc: for (unsigned i = 0; i < IOV_MAX * 2; i++) {
test/bufferlist.cc: EXPECT_EQ(IOV_MAX * 2, st.st_size);
test/bufferlist.cc: for (unsigned i = 0; i < IOV_MAX * 2; i++) {
test/bufferlist.cc: EXPECT_EQ(IOV_MAX * 2 + offset, (unsigned)st.st_size);
The kernel allocates resources and performs operations based on assumptions tied to IOV_MAX. Exceeding this limit might lead to resource allocation issues, such as exhausting available kernel memory or encountering buffer overflows, thus impacting system stability and performance. My feeling is that if not now, then eventually we will have to handle this in the client code too.
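To make the proposal concrete, here is a minimal sketch of the kind of up-front check the client code could add. The function name `validate_iovcnt` and the choice of `-EINVAL` are illustrative assumptions, not an existing libcephfs API; `IOV_MAX` comes from `<limits.h>` (with Ceph's `include/compat.h` providing a fallback of 1024).

```cpp
#include <cerrno>
#include <climits>
#include <sys/uio.h>

// Hypothetical sketch: reject an oversized iovec array before the async
// I/O request is queued, instead of letting it breach IOV_MAX downstream.
int validate_iovcnt(int iovcnt) {
  if (iovcnt < 0 || iovcnt > IOV_MAX)
    return -EINVAL;  // vector too long (or negative count)
  return 0;          // within the POSIX-advertised limit
}
```

A call site would check the return value and fail the request early, mirroring how the kernel itself returns EINVAL when iovcnt exceeds IOV_MAX.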
Updated by Venky Shankar 5 months ago
Dhairya Parmar wrote:
This tracker is just to get initial thoughts/opinions about limiting the number of iovec structures that can be read/written in one go while performing async I/O. According to the man page of pwritev:
[...]
So POSIX doesn't say the count is strictly limited, but it "allows" an implementation to impose a limit. While testing the async I/O libcephfs call, I found that we can happily breach the limit of 1024: there is no check for IOV_MAX in the client code, but I do see a lot of code paths within Ceph handling IOV_MAX:
Right. That's what POSIX says, and it's meant to be a standard for any app to follow so as to be portable across POSIX-compatible systems.
[...]
The kernel allocates resources and performs operations based on assumptions tied to IOV_MAX. Exceeding this limit might lead to resource allocation issues, such as exhausting available kernel memory or encountering buffer overflows, thus impacting system stability and performance. My feeling is that if not now, then eventually we will have to handle this in the client code too.
It isn't totally incorrect to exceed IOV_MAX; it's just a standard that POSIX-compatible systems should honor so that apps written with it in mind work across systems. I don't think there is merit in catching such calls and erroring out. Also, the client could process chunks of IOV_MAX vectors (maybe the client already does this; have you checked?).
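A rough sketch of the chunking alternative suggested above: split an arbitrarily long iovec array into slices of at most IOV_MAX entries, each of which can then be submitted as its own writev/readv call. The helper `chunk_iovecs` is hypothetical and only illustrates the slicing; it is not an existing Ceph function (Ceph's `common/buffer.cc` does similar splitting internally, per the grep output above).

```cpp
#include <algorithm>
#include <climits>
#include <sys/uio.h>
#include <utility>
#include <vector>

// Hypothetical helper: slice (iov, iovcnt) into (pointer, count) chunks,
// each holding at most IOV_MAX entries, preserving order.
std::vector<std::pair<const iovec*, int>>
chunk_iovecs(const iovec* iov, int iovcnt) {
  std::vector<std::pair<const iovec*, int>> chunks;
  for (int off = 0; off < iovcnt; off += IOV_MAX) {
    int n = std::min<int>(IOV_MAX, iovcnt - off);  // last chunk may be short
    chunks.emplace_back(iov + off, n);
  }
  return chunks;
}
```

The trade-off versus erroring out is that chunking silently changes atomicity: a single oversized request becomes several syscalls, each individually atomic but not atomic as a whole.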
Updated by Frank Filz 5 months ago
The limit of IOV_MAX should not be a problem for the intended consumer, Ganesha. At the moment, Ganesha doesn't submit any vectors longer than 1... Our zero-copy efforts will introduce longer vectors, and at that time, within Ganesha we will want to respect IOV_MAX. So I think it would be fair to return an error at any API interface point when the vector passed is longer.
Updated by Venky Shankar 5 months ago
- Tracker changed from Fix to Bug
- Status changed from New to Triaged
- Assignee set to Dhairya Parmar
- Target version set to v19.0.0
- Backport set to quincy,reef
- Regression set to No
- Severity set to 3 - minor
Frank Filz wrote:
The limit of IOV_MAX should not be a problem for the intended consumer, Ganesha. At the moment, Ganesha doesn't submit any vectors longer than 1... Our zero-copy efforts will introduce longer vectors, and at that time, within Ganesha we will want to respect IOV_MAX. So I think it would be fair to return an error at any API interface point when the vector passed is longer.
Fair enough.