Bug #47210

open

librbd:rbd.OSError: [errno 2147483648] error reading myimage 0~2147483648

Added by dongdong tao over 3 years ago. Updated over 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

rbd.pyx reports the error "librbd:rbd.OSError: [errno 2147483648] error reading..." when we try to read
2GB+ of data from an image. The test case to reproduce it is:

https://pastebin.ubuntu.com/p/qBhhDFB6wf/

In this test case I first created a 5GB image, then tried to read 2GB from it via image.read(0, 2 * 1024**3),
but it reports the above error message; it works fine if I change the length to 2GB - 1.
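
In case the pastebin becomes unavailable, here is a minimal reproducer sketch along the lines described above; the conf path, pool name, and image name are assumptions, not the original test case:

# A minimal reproducer sketch (not the original pastebin): the conf path,
# pool name, and image name below are assumptions.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
try:
    rbd.RBD().create(ioctx, 'myimage', 5 * 1024**3)   # 5 GiB image
    with rbd.Image(ioctx, 'myimage') as image:
        image.read(0, 2 * 1024**3 - 1)  # 2 GiB - 1: succeeds
        image.read(0, 2 * 1024**3)      # 2 GiB: raises rbd.OSError [errno 2147483648]
finally:
    ioctx.close()
    cluster.shutdown()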

The issue here is an integer overflow.
https://github.com/ceph/ceph/blob/luminous/src/librbd/librbd.cc#L3356
As we can see from the above line of code, an int r is used to store a return value that can be ssize_t (64-bit).
But that is not the only place that is wrong: if we chase deeper, we find that C_SaferCond's rval is an "int", yet we
use it to store the total number of bytes read by the rbd client for one rbd read operation, and that total can
exceed the maximum "int" value when we specify a read length >= 2GB.
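
That also explains the odd errno in the error message: a requested length of 2 GiB is exactly 2^31 bytes, which wraps to INT_MIN (-2147483648) when narrowed to a 32-bit int; the negative value is then treated as an error code, and the Python binding presumably reports its absolute value, 2147483648, as the errno. A standalone illustration of the narrowing (plain Python, not librbd code):

# Emulating the C narrowing of the 64-bit byte count into a 32-bit signed int.
import struct

length = 2 * 1024**3                                          # 2 GiB == 2**31
as_int32 = struct.unpack('<i', struct.pack('<I', length))[0]
print(as_int32)    # -2147483648: the byte count wrapped to INT_MIN
print(-as_int32)   # 2147483648: the "errno" shown in the error message
# With length = 2 * 1024**3 - 1 the value still fits in an int, hence no error.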

This prevents users from reading >= 2GB of data at one time, e.g. when cinder-backup wants to read a 2GB chunk at a time.

Actions #1

Updated by dongdong tao over 3 years ago

This code defect still appears in "master". Can we change the affected version? I can't edit it on my side.

Actions #2

Updated by Jason Dillaman over 3 years ago

  • Status changed from New to Need More Info

Is that a configuration setting where cinder-backup is attempting to read 2GiB+ chunks? That seems like a terrible default and we have no plans to change the API to support that large of reads. At best we can fix it to throw an immediate error if someone attempts something like that.

Actions #3

Updated by dongdong tao over 3 years ago

That's correct, cinder-backup can change a configuration setting to read 2GiB+ chunks.
The default setting is less than 2GiB.

So it looks like this is a limitation and Ceph has no plan to support it.
Is there any documentation noting this limitation?

Actions #4

Updated by Jason Dillaman over 3 years ago

dongdong tao wrote:

Is there any documentation noting this limitation?

Negative. I've never heard of an attempt to issue such large IOs before, and I suspect you haven't evaluated the effects on your cluster and client nodes of performing them. It's important to remember that when you manually tweak the software outside the normal configuration and bounds, you are effectively running a unique software configuration, and you no longer get the stability assurances that come from knowing thousands of other deployments run the same configuration.

Actions #5

Updated by Ponnuvel P over 3 years ago

Jason Dillaman wrote:

[..] we have no plans to change the API to support that large of reads.

I am not sure this would amount to an API change. The function in question, rbd_read2, actually returns ssize_t. The value returned by ictx->io_work_queue->read, which is an ssize_t, is assigned to an int. When returning, it's converted back to ssize_t. The following sequence happens:

ssize_t -> int -> ssize_t

(the noted overflow occurs during the first conversion if the value is greater than INT_MAX).

So this limitation doesn't appear to be a deliberate decision. Regardless of the merits/demerits of I/Os larger than 2GiB, I think it could be fixed without API breakage, if that's the only concern.

Actions #6

Updated by Jason Dillaman over 3 years ago

Ponnuvel P wrote:

So this limitation doesn't appear to be a deliberate decision. Regardless of the merits/demerits of I/Os larger than 2GiB, I think it could be fixed without API breakage, if that's the only concern.

That really isn't the concern. Your single >2GiB IO request will get translated internally as 512 concurrent requests to the OSDs consuming up to 4GiB of RAM. It's really not efficient.
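
For scale, assuming the default 4 MiB RBD object size, the numbers above work out as follows (a rough sanity check; the RAM figure is an interpretation, not a measurement):

# Rough arithmetic behind the numbers above, assuming the default 4 MiB RBD object size.
request_len = 2 * 1024**3        # a single 2 GiB read
object_size = 4 * 1024**2        # default rbd object size
print(request_len // object_size)   # 512 object requests fanned out to the OSDs
print(2 * request_len / 1024**3)    # ~4.0 GiB if the in-flight object buffers and
                                    # the assembled result buffer are both resident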

Actions #7

Updated by dongdong tao over 3 years ago

@Jason Dillaman, I got your concern about the performance/RAM issue and that makes sense,
but it doesn't explain why we allow a 64-bit "size_t" length to be passed to rbd_read2 in the first place.
I mean, if we pass a 2GB length to the current rbd_read2, it does read that much data back, but the issue here
is that the return value, which represents the size of the data read, overflows because we are using an int to store it.

I think another concern is at the code level. Having tried to make a patch to fix the integer overflow,
I found that there are some fundamental code issues here; it's non-trivial to fix, and the fix will likely make the code ugly.

So, we are using C_SaferCond's rval (an "int") to store the size of the data read, which I don't think is a good choice, but I can understand it.
I don't believe we want to change C_SaferCond's rval type to int64_t, as this context class is used everywhere in Ceph.
Another way would be to introduce a completely new callback context just for this kind of 64-bit rval,
but that would require a big effort, and I'm not sure upstream wants to do it, since it might not bring any win, given your concerns above.

Actions #8

Updated by dongdong tao over 3 years ago

Jason Dillaman wrote:

Ponnuvel P wrote:

So this limitation doesn't appear to be a deliberate decision. Regardless of the merits/demerits of I/Os larger than 2GiB, I think it could be fixed without API breakage, if that's the only concern.

That really isn't the concern. Your single >2GiB IO request will get translated internally as 512 concurrent requests to the OSDs consuming up to 4GiB of RAM. It's really not efficient.

Hi Jason,

would really appreciate your thoughts on my last comment at https://tracker.ceph.com/issues/47210#note-7

Thanks!

Actions #9

Updated by Jason Dillaman over 3 years ago

dongdong tao wrote:

would really appreciate your thoughts on my last comment at https://tracker.ceph.com/issues/47210#note-7

Comment on what, specifically? librbd would need lots of internal refactoring if we wanted to support large IOs and I personally don't see the worth. I think the important thing to think about is why you think it's a better design to be unique and switch your Cinder backup configuration to use 2GiB IOs?

Actions #10

Updated by dongdong tao over 3 years ago

Jason Dillaman wrote:

dongdong tao wrote:

would really appreciate your thoughts on my last comment at https://tracker.ceph.com/issues/47210#note-7

Comment on what, specifically? librbd would need lots of internal refactoring if we wanted to support large IOs and I personally don't see the worth. I think the important thing to think about is why you think it's a better design to be unique and switch your Cinder backup configuration to use 2GiB IOs?

Thank you, Jason! You've just addressed my concern; I wasn't a hundred percent sure whether there is a way to fix this without refactoring lots of the rbd code.
Since you also think it would need lots of code refactoring to support large IOs,
and it won't bring much of a win for us, while likely raising RAM and performance concerns on the Ceph cluster,

your first comment, "At best we can fix it to throw an immediate error if someone attempts something like that.", does make sense to me.
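
For anyone hitting this in the meantime, a caller-side workaround is to keep each individual read well below INT_MAX and stitch larger requests together from smaller chunks. A hedged sketch (read_in_chunks and the chunk size are not part of the rbd API):

# Caller-side workaround sketch: keep each read well below INT_MAX so librbd's
# 32-bit return value cannot overflow; read_in_chunks is a hypothetical helper.
INT_MAX = 2**31 - 1

def read_in_chunks(image, offset, length, chunk_size=64 * 1024**2):
    """Read `length` bytes from `offset`, issuing reads of at most `chunk_size`
    bytes (64 MiB here, an arbitrary choice well under INT_MAX)."""
    assert 0 < chunk_size <= INT_MAX
    parts = []
    remaining = length
    while remaining > 0:
        step = min(chunk_size, remaining)
        parts.append(image.read(offset, step))
        offset += step
        remaining -= step
    return b''.join(parts)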
