Bug #63206

open

beast: S3 download stalls without useful logs upon encountering an invalid RADOS object

Added by Alexander Patrakov 7 months ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear developers,

A customer has a Ceph cluster that was recently upgraded to Quincy. The RADOS gateway was switched from civetweb to beast in the process.

They started complaining that downloads of certain S3 objects stall: the transfer delivers, say, 8 MB out of 19 MB, and then the download speed drops to zero. Nothing relevant appears in the logs, even with `--debug-rgw=20`. Upon inspection, every problematic S3 object turned out to have at least one corrupted underlying RADOS object. The corruption is that the affected RADOS object (with the word "shadow" in its name) has a length of zero instead of 4 megabytes. I don't know (and have no hope of ever finding out) why these objects got corrupted, but that is not the subject of this bug. The subject is how the RADOS gateway behaves when it encounters pre-existing RADOS object corruption.

You can simulate this kind of pre-existing corruption on your own cluster by overwriting one of the RADOS objects with an empty file:

touch empty-file
rados put -p default.rgw.buckets.data 'a30e6c45-9d9d-4c29-b7dc-43b0c8f9f36a.1704468.1__shadow_positron-veux-20230825.zip.2~EGs6uKCytGBgClNB1O9k57bO2BMZnIx.1_2' empty-file
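
To confirm that the simulated corruption took effect, the object's size can be read back with `rados stat`. The following is a minimal sketch, assuming the `rados stat` output line ends in `, size <bytes>`; the helper names `stat_size` and `rados_object_size` are hypothetical and not part of Ceph.

```shell
# Hypothetical helper: extract the byte size from one line of `rados stat` output.
# Assumes the line ends with ", size <bytes>"; prints nothing if it does not.
stat_size() {
    sed -n 's/.*, size \([0-9][0-9]*\)$/\1/p'
}

# Hypothetical wrapper: print the size in bytes of one RADOS object.
# $1 = pool name, $2 = RADOS object name
rados_object_size() {
    rados -p "$1" stat "$2" | stat_size
}
```

After the `rados put` above, calling `rados_object_size default.rgw.buckets.data '<object name>'` with the overwritten object's name would be expected to print 0.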

The customer's complaints at this point are:

  • Upon encountering an invalid (e.g. wrong-length) RADOS object, the RADOS gateway should log something, but currently it logs nothing;
  • The transfer stalls for at least 30 seconds instead of ending immediately with a premature EOF from the server;
  • There should be a radosgw-admin subcommand that checks all S3 objects for this kind of problem.
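
Until such a subcommand exists, a rough check for the third point can be approximated from the shell. This is only a sketch under stated assumptions: it lists every object in the data pool with `rados ls`, reads each size with `rados stat` (assuming the output ends in `, size <bytes>`), and reports objects whose name contains `__shadow_` and whose length is zero. The function names are hypothetical, object names containing tabs or newlines are not handled, and a full pool listing may be very slow on a large cluster. It detects only the zero-length case described above; other wrong lengths would require comparing against the S3 object's manifest.

```shell
# Hypothetical filter: read "size<TAB>name" lines on stdin and print the
# names of zero-length objects whose name contains "__shadow_".
zero_length_shadow() {
    awk -F '\t' '$1 == "0" && $2 ~ /__shadow_/ { print $2 }'
}

# Hypothetical scan: stat every object in the given pool and report
# suspicious zero-length shadow objects.
# $1 = pool name, e.g. default.rgw.buckets.data
scan_pool() {
    pool=$1
    rados -p "$pool" ls | while IFS= read -r obj; do
        # Keep stdin of the inner command away from the object list.
        size=$(rados -p "$pool" stat "$obj" < /dev/null |
            sed -n 's/.*, size \([0-9][0-9]*\)$/\1/p')
        printf '%s\t%s\n' "$size" "$obj"
    done | zero_length_shadow
}
```

Running `scan_pool default.rgw.buckets.data` would print one candidate RADOS object name per line; each hit still needs manual confirmation, since this heuristic knows nothing about the expected stripe sizes.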

I have also tested Ceph Pacific with the now-removed "civetweb" module and found that civetweb, upon encountering a corrupted RADOS object, closes the connection immediately - i.e., the second complaint does not apply there.


Related issues 1 (1 open, 0 closed)

Related to rgw - Bug #46770: rgw incorrect http status on RADOS i/o error (Status: In Progress, Assignee: Or Friedmann)
