Bug #20616

pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reached

Added by Mehdi Abaakouk about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 07/13/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): librados
Pull request ID:

Description

Hi,

In Gnocchi, we use the python-rados API, and we recently encountered data corruption when "rados_osd_op_timeout" is set.
After digging, we found that aio_read() doesn't return the expected data and doesn't return any error.

The issue on Gnocchi side: https://github.com/gnocchixyz/gnocchi/pull/190
This has been worked around by using read() instead of aio_read().

The Ceph version was 10.2.7, but I can reproduce it on many other versions.

I have attached a script to reproduce it. Its actual output is:

no timeout read(): 'my fancy blob' : True
with timeout read(): 'my fancy blob' : True
no timeout aio_read(): 'my fancy blob' (length or errno: 13): True
with timeout aio_read(): 'exc_traceback' (length or errno: 13): False

The last line shows that aio_read doesn't return the expected blob.
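For readers without the attachment, the comparison can be sketched roughly as below. This is not the attached script, only a minimal approximation: the pool name "rbd", the object name "aio_read_check", the conffile path, and the timeout value of 5 are assumptions, and the helper name aio_read_matches is made up for illustration. It relies on python-rados passing the bytes read as the second argument of the aio_read() callback.

```python
import threading


def aio_read_matches(ioctx, name, expected, wait_seconds=10.0):
    """Return (ok, data); ok is True when aio_read() yields `expected`."""
    done = threading.Event()
    result = {}

    def on_complete(completion, data_read):
        # The callback receives the completion and the bytes that were read.
        result["data"] = data_read
        done.set()

    ioctx.aio_read(name, len(expected), 0, on_complete)
    if not done.wait(wait_seconds):
        raise RuntimeError("aio_read callback did not fire")
    return result["data"] == expected, result["data"]


def main():
    import rados  # needs python-rados and a reachable cluster

    blob = b"my fancy blob"
    cases = (("no timeout", {}), ("with timeout", {"rados_osd_op_timeout": "5"}))
    for label, conf in cases:
        # conffile path is an assumption; adjust for the local setup.
        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf", conf=conf)
        cluster.connect()
        try:
            ioctx = cluster.open_ioctx("rbd")  # hypothetical pool name
            try:
                ioctx.write_full("aio_read_check", blob)
                print(label, "read():", ioctx.read("aio_read_check") == blob)
                print(label, "aio_read():",
                      aio_read_matches(ioctx, "aio_read_check", blob))
            finally:
                ioctx.close()
        finally:
            cluster.shutdown()
```

Calling main() on a host with access to an affected cluster should show the aio_read() check failing only in the "with timeout" case, if the behaviour matches the report.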

Attachment: ceph_aio_read_timeout_bug.py (1.63 KB), Mehdi Abaakouk, 07/13/2017 01:42 PM


Related issues

Copied to RADOS - Backport #21308: jewel: pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reach Resolved

History

#1 Updated by Mehdi Abaakouk about 2 years ago

This can't be reproduced with 12.1.0, so it has been fixed in the meantime.

#2 Updated by Greg Farnum about 2 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from aio_read doesn't return expected data with rados_osd_op_timeout is set. to pre-luminous: aio_read returns success on rados_osd_op_timeout?

#3 Updated by Mehdi Abaakouk about 2 years ago

  • Subject changed from pre-luminous: aio_read returns success on rados_osd_op_timeout? to pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reach

#4 Updated by Sage Weil almost 2 years ago

  • Status changed from New to Verified
  • Priority changed from Normal to Urgent

#5 Updated by Kefu Chai almost 2 years ago

I am able to reproduce this issue with the latest jewel, but not with master.

Reverting 126d0b30e990519b8f845f99ba893fdcd56de447 fixes this issue. I am going to put together a pure C++ reproducer.

#6 Updated by Kefu Chai almost 2 years ago

  • Category set to Correctness/Safety
  • Status changed from Verified to Need Review
  • Assignee set to Kefu Chai
  • Release set to jewel
  • Component(RADOS) librados added

This only happens if "rados_osd_op_timeout > 0", in which case the rx_buffer optimization is disabled because of #9582. In that case, the reply message's data field is claimed by the return buffer, so the data is never memcpy'ed into the raw buffer passed in by the librados client.
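The mechanism can be illustrated with a toy model. This is plain Python, not librados code: the class ToyCompletion and its methods are invented for illustration, and the copy-out step is only one plausible remedy, since this ticket does not spell out what the linked pull request changes.

```python
class ToyCompletion:
    """Toy stand-in for an async read completion; not librados code."""

    def __init__(self, size):
        self.user_buf = bytearray(size)  # raw buffer supplied by the caller
        self.result = self.user_buf      # object that holds the reply data

    def finish_in_place(self, payload):
        # rx_buffer-style path: the reply lands directly in the caller's buffer.
        self.user_buf[:len(payload)] = payload
        self.result = self.user_buf

    def finish_claimed(self, payload, copy_out):
        # Timeout path: the reply's own data buffer is claimed as the result,
        # so the caller's buffer stays untouched unless it is copied out.
        self.result = bytearray(payload)
        if copy_out and self.result is not self.user_buf:
            n = min(len(self.result), len(self.user_buf))
            self.user_buf[:n] = self.result[:n]


def demo():
    payload = b"my fancy blob"
    outcomes = {}

    c = ToyCompletion(len(payload))
    c.finish_in_place(payload)
    outcomes["in place"] = bytes(c.user_buf) == payload

    c = ToyCompletion(len(payload))
    c.finish_claimed(payload, copy_out=False)
    outcomes["claimed, no copy"] = bytes(c.user_buf) == payload

    c = ToyCompletion(len(payload))
    c.finish_claimed(payload, copy_out=True)
    outcomes["claimed, copied out"] = bytes(c.user_buf) == payload

    return outcomes
```

In this model only the "claimed, no copy" case leaves the caller's buffer without the payload, which mirrors the symptom in the description: the call reports success and a length, but the caller's buffer holds whatever was there before.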

https://github.com/ceph/ceph/pull/17594

#7 Updated by Kefu Chai almost 2 years ago

  • Severity changed from 2 - major to 1 - critical

#8 Updated by Nathan Cutler almost 2 years ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

#9 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #21308: jewel: pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but not reach added

#10 Updated by Kefu Chai over 1 year ago

  • Status changed from Pending Backport to Resolved
