Bug #20616
closed
pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reach
Added by Mehdi Abaakouk almost 7 years ago.
Updated over 6 years ago.
Category:
Correctness/Safety
Component(RADOS):
librados
Description
Hi,
In Gnocchi, we use the python-rados API, and we recently encountered data corruption when "rados_osd_op_timeout" is set.
After digging, we found that aio_read() doesn't return the expected data and doesn't report any error.
The issue on the Gnocchi side: https://github.com/gnocchixyz/gnocchi/pull/190
This has been worked around by calling read() instead of aio_read().
The Ceph version was 10.2.7, but I can reproduce it on many other versions.
I have attached a script to reproduce it; its actual output is:
no timeout read(): 'my fancy blob' : True
with timeout read(): 'my fancy blob' : True
no timeout aio_read(): 'my fancy blob' (length or errno: 13): True
with timeout aio_read(): 'exc_traceback' (length or errno: 13): False
The last line shows that aio_read doesn't return the expected blob.
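For readers without the attachment, here is a rough sketch of what such a reproducer could look like with python-rados. It is not the attached script: the pool name "rbd", the object key, the conffile path, the timeout value and the helper names report(), run() and main() are all assumptions made for illustration, and the exact repr of the data differs between Python 2 and Python 3.

```python
import threading

BLOB = b"my fancy blob"


def report(label, data, ret=None, expected=BLOB):
    """Format one result line in the style of the output quoted above."""
    suffix = " " if ret is None else " (length or errno: %d)" % ret
    return "%s: %r%s: %s" % (label, data, suffix, data == expected)


def run(label, conf, pool="rbd", key="bug20616-test",
        conffile="/etc/ceph/ceph.conf"):
    """Write BLOB, then read it back with read() and with aio_read().

    pool, key and conffile are assumed values; adjust them to the cluster.
    """
    import rados  # python-rados, imported lazily so report() works without it

    cluster = rados.Rados(conffile=conffile, conf=conf)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full(key, BLOB)
            lines = [report(label + " read()", ioctx.read(key))]

            result = {}
            done = threading.Event()

            def on_complete(completion, data):
                # aio_read passes the completion and the bytes it read
                result["data"] = data
                done.set()

            comp = ioctx.aio_read(key, len(BLOB), 0, on_complete)
            comp.wait_for_complete()
            done.wait(10)  # give the Python-side callback time to run
            lines.append(report(label + " aio_read()", result.get("data"),
                                comp.get_return_value()))
            return lines
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()


def main():
    # Config values are passed as strings; 5 seconds is an arbitrary choice
    # that is long enough for the timeout not to fire.
    for label, conf in (("no timeout", {}),
                        ("with timeout", {"rados_osd_op_timeout": "5"})):
        for line in run(label, conf):
            print(line)
```

Calling main() on a machine that can reach a test cluster should print four lines comparable to the ones above; on an affected version the last one is expected to end in False.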
This can't be reproduced with 12.1.0, so it has been fixed in the meantime.
- Project changed from Ceph to RADOS
- Subject changed from aio_read doesn't return expected data with rados_osd_op_timeout is set. to pre-luminous: aio_read returns success on rados_osd_op_timeout?
- Subject changed from pre-luminous: aio_read returns success on rados_osd_op_timeout? to pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reach
- Status changed from New to 12
- Priority changed from Normal to Urgent
i am able to reproduce this issue with the latest jewel, but not master.
reverting 126d0b30e990519b8f845f99ba893fdcd56de447 fixes this issue. i am going to pull together a pure C++ reproducer.
- Category set to Correctness/Safety
- Status changed from 12 to Fix Under Review
- Assignee set to Kefu Chai
- Release set to jewel
- Component(RADOS) librados added
this only happens if "rados_osd_op_timeout > 0", in which case the rx_buffer optimization is disabled due to #9582. the reply message's data field is then claimed by the return buf, so the data is never memcpy'ed into the raw buf passed in by the librados client.
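The mechanism described above can be illustrated with a toy model in pure Python. This is not Ceph code: ReadOp, deliver_reply() and the copy_back flag are hypothetical names, and the copy-back branch only shows the kind of step whose absence the analysis points to, not necessarily what the linked pull request does.

```python
class ReadOp:
    """Toy stand-in for a pending read: the caller's buffer plus a return buf."""

    def __init__(self, caller_buf):
        self.caller_buf = caller_buf  # raw buffer handed in by the client
        self.out = None               # buffer that ends up owning the reply


def deliver_reply(op, payload, rx_buffer_enabled, copy_back):
    """Model how reply data reaches (or fails to reach) the caller's buffer."""
    n = len(payload)
    if rx_buffer_enabled:
        # Optimized path: the data lands directly in the caller's memory.
        op.caller_buf[:n] = payload
        op.out = op.caller_buf
    else:
        # Timeout path: the return buf claims the reply data, no copy is made.
        op.out = payload
        if copy_back:
            # The missing step: copy the claimed data into the caller's buffer.
            op.caller_buf[:n] = payload
    return n  # the length is reported as success in every case


stale = b"exc_traceback"  # whatever happened to be in the caller's memory
reply = b"my fancy blob"

buggy = ReadOp(bytearray(stale))
ret = deliver_reply(buggy, reply, rx_buffer_enabled=False, copy_back=False)
print(ret, bytes(buggy.caller_buf))  # success code, but stale contents

fixed = ReadOp(bytearray(stale))
deliver_reply(fixed, reply, rx_buffer_enabled=False, copy_back=True)
print(bytes(fixed.caller_buf))
```

In the model, the buggy path returns the full length while the caller's buffer keeps its old contents, which matches the symptom in the report: a return value of 13 alongside unrelated bytes.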
https://github.com/ceph/ceph/pull/17594
- Severity changed from 2 - major to 1 - critical
- Status changed from Fix Under Review to Pending Backport
- Backport set to jewel
- Copied to Backport #21308: jewel: pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reach added
- Status changed from Pending Backport to Resolved