Bug #53698 (closed): a slow reader of a large object receives corrupt object contents from rgw with civetweb frontend

Added by Jaka Močnik over 2 years ago. Updated over 2 years ago.

Status: Won't Fix - EOL
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

First, the real-life scenario:

We have been copying a number of large objects from the Swift interface of rgw on a Mimic Ceph cluster to another storage system (non-Ceph, never mind the vendor) with an S3 interface.

The copying works as follows (a minimal sketch of the loop follows the list):
- open an HTTP connection to the Swift interface on rgw and issue a GET request,
- read 50 MB from that connection,
- start writing those 50 MB to the destination,
- start reading the next 50 MB from rgw,
- and so on until the whole object is copied,
- close the HTTP connection.
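
For illustration, here is a minimal Python sketch of that loop, assuming a placeholder Swift URL, an already-obtained auth token, and a hypothetical upload_chunk() helper standing in for the slow write to the S3 destination:

```python
import requests

CHUNK = 50 * 1024 * 1024  # 50 MB, the read size used in the copy job


def upload_chunk(data: bytes) -> None:
    """Hypothetical stand-in for the (slow) write to the S3 destination."""
    raise NotImplementedError


def copy_object(swift_url: str, token: str) -> None:
    # One long-lived GET against the Swift interface on rgw; stream=True
    # defers reading the body until we iterate over it.
    with requests.get(swift_url, headers={"X-Auth-Token": token},
                      stream=True) as resp:
        resp.raise_for_status()
        for data in resp.iter_content(chunk_size=CHUNK):
            # While the destination write runs, the connection to rgw
            # sits idle; a long enough pause here is what eventually
            # trips the server-side timeout described below.
            upload_chunk(data)
```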

Since the destination storage can be quite slow at times, reading from Ceph often pauses for a few seconds (waiting for the previous chunk to be written to the destination). If this pause is long enough, all object contents read from rgw from that point on are "garbage": they appear to be the last 4 MB tail object read before the pause, sent over and over again until the complete object size in bytes has been sent.

The issue seems to be request_timeout_ms, which defaults to 30000 ms: if the pause in reading is longer than 30 s, the behaviour described above occurs.
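
For reference, this timeout is part of the civetweb frontend line in ceph.conf; a sketch of raising it, with a hypothetical rgw instance name and an example port:

```ini
# ceph.conf on the rgw host -- instance name and port are examples
[client.rgw.gateway1]
rgw frontends = civetweb port=7480 request_timeout_ms=60000
```

As described below, raising the value only moves the threshold; it does not remove the failure mode.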

Here is a minimal test case (a scripted sketch follows the list):
- upload a sufficiently large object to rgw via the Swift API: say 100 MB, although imho anything long enough to have at least a few tail objects within rgw should suffice,
- open an HTTP GET connection to fetch this object via the Swift API and send all the HTTP request headers,
- do not read any data; instead, sleep for 35 seconds,
- read the object.
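
As an illustration, a minimal Python sketch of these steps (the endpoint, token and checksum are placeholders; the object is assumed to have been uploaded beforehand):

```python
import hashlib
import time

import requests

# Placeholders: substitute your own rgw Swift endpoint, auth token and
# object path; EXPECTED_SHA256 is the checksum recorded at upload time.
SWIFT_URL = "http://rgw.example.com:7480/swift/v1/testcontainer/bigobject"
TOKEN = "AUTH_tk..."
EXPECTED_SHA256 = "..."

# Send the request; stream=True reads only the response headers here,
# i.e. "open the connection, do not read any data yet".
resp = requests.get(SWIFT_URL, headers={"X-Auth-Token": TOKEN}, stream=True)
resp.raise_for_status()

# Sleep past civetweb's default request_timeout_ms of 30 s.
time.sleep(35)

# Now read the whole body and compare it against the known checksum.
digest = hashlib.sha256()
for chunk in resp.iter_content(chunk_size=4 * 1024 * 1024):
    digest.update(chunk)

print("intact" if digest.hexdigest() == EXPECTED_SHA256 else "corrupt or truncated")
```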

Expected result: the object contents, or a server-side broken connection due to the timeout.

Actual result: garbage (the first 4 MB chunk of the real object, repeated over and over again). This is 100% reproducible.

Increasing civetweb's request_timeout_ms to 60000 means a 35 s sleep is handled OK, but a 70 s sleep results in the aforementioned bug again.

We made a docker-based test case to allow testing this on different Ceph versions:

https://github.com/bancek/ceph-corrupted-test

According to the results from this, Nautilus and above are OK. The Beast frontend, tested on three real-life clusters (one Pacific and two Octopus), is also not affected by this bug: it does have a similar timeout, but it closes the connection to the client on timing out, resulting in an EOF on the client side, which is the correct behaviour.
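
For clusters where switching the frontend is an option, a ceph.conf line along these lines (instance name and port are again just examples) selects Beast instead of civetweb:

```ini
# ceph.conf on the rgw host -- instance name and port are examples
[client.rgw.gateway1]
rgw frontends = beast port=7480
```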

I am aware that Mimic is past EOL, but since it still seems to be widely used in the wild, I wanted to put this here for people who might encounter this bug in the future. :)

#1 - Updated by Casey Bodley over 2 years ago

  • Status changed from New to Won't Fix - EOL

I am aware that Mimic is past EOL, but since it still seems to be widely used in the wild, I wanted to put this here for people who might encounter this bug in the future. :)

Makes sense, thanks. I'll close this as Won't Fix, but it should remain visible in searches.
