Bug #47451

open

RGW appends control character to etags in bucket index

Added by Nick Janus over 3 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
etag
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've encountered a semi-rare bug where RGW appends a control character to objects' etag fields in the bucket index. When S3 clients list such an object, the control character invalidates the XML response. We run both Nautilus and Luminous clusters, and this only seems to be happening on Nautilus (14.2.5 and 14.2.8).
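To illustrate why the stray byte breaks clients: control characters like 0x0e are not legal in XML 1.0 documents, so any conforming parser rejects the listing response outright. A minimal sketch, using a simplified stand-in for an S3 ListBucketResult entry and the etag value from the hex dump below:

```python
import xml.etree.ElementTree as ET

# Stand-in for one <Contents> entry of a ListBucketResult. The trailing
# \x0e byte mimics the control character RGW appended to the etag.
bad_etag = "7c9e2b96cc7ec615978e5fca93515a3a-1\x0e"
doc = "<Contents><ETag>%s</ETag></Contents>" % bad_etag

try:
    ET.fromstring(doc)
    print("parsed")
except ET.ParseError as err:
    # 0x0e is outside the XML 1.0 character range, so parsing fails here.
    print("parse failed:", err)
```

This is the same failure mode S3 client libraries hit when deserializing the bucket listing.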

As far as mitigation goes, `radosgw-admin bucket check --fix` removes the control character from the bucket index entry. However, I don't know how to reproduce this condition; it seems to affect at most 1 in a million objects.
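For spotting affected entries before a client trips over them, a simple validity check on etag values is enough, since a well-formed RGW etag is a hex MD5 digest optionally followed by `-<part count>` for multipart uploads. A sketch (the sample values are taken from the hex dumps below):

```python
def etag_is_clean(etag: str) -> bool:
    """Return True if the etag contains only printable ASCII.

    A valid RGW etag is a 32-char hex MD5 digest, optionally
    followed by '-<part count>' for multipart uploads, so any
    control character marks a corrupted index entry.
    """
    return all(0x20 <= ord(c) < 0x7f for c in etag)

# Etag from the "before" hex dump, with the stray 0x0e byte appended:
assert not etag_is_clean("7c9e2b96cc7ec615978e5fca93515a3a-1\x0e")
# Same etag after `radosgw-admin bucket check --fix`:
assert etag_is_clean("7c9e2b96cc7ec615978e5fca93515a3a-1")
```

A check like this could be run over the `ETag` fields of a bucket listing (e.g. via boto3) to count affected objects, though the listing itself may fail to parse for exactly the reason reported here.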

Example hex of the etag before:

00000090  25 0f 23 00 00 00 37 63  39 65 32 62 39 36 63 63  |%.#...7c9e2b96cc|
000000a0  37 65 63 36 31 35 39 37  38 65 35 66 63 61 39 33  |7ec615978e5fca93|
000000b0  35 31 35 61 33 61 2d 31  0e 07 00 00 00 36 34 34  |515a3a-1.....644|

And after running bucket check:

00000090  25 0f 22 00 00 00 37 63  39 65 32 62 39 36 63 63  |%."...7c9e2b96cc|
000000a0  37 65 63 36 31 35 39 37  38 65 35 66 63 61 39 33  |7ec615978e5fca93|
000000b0  35 31 35 61 33 61 2d 31  07 00 00 00 36 34 34 39  |515a3a-1....6449|

This now-closed report about a Luminous bug seems like it might be related: https://tracker.ceph.com/issues/23188

Actions #1

Updated by André Cruz almost 3 years ago

Was this issue ever fixed?

I have encountered this while trying to upgrade a Luminous cluster to Nautilus. We noticed it when we introduced a Nautilus OSD and RGW. The problem seems to have gone away after we disabled the Nautilus RGW, but kept the Nautilus OSD.

There is also a reference to the same issue in the mailing list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/OWVCUXPO6U6EWKHBSGW7W5DQ6ANXT6GM/

Actions #2

Updated by Nick Janus almost 3 years ago

André, I haven't seen this issue for some time since upgrading. I suspect some backwards incompatibility during the upgrade inserted these control characters. I don't know if a fix has been implemented; I haven't spent much time digging into the root cause.

Actions #3

Updated by André Cruz almost 3 years ago

Hey Nick.

Originally you mentioned that you were seeing the issue on Nautilus. Which upgrade ended up fixing the issue?

Thanks.

Actions #4

Updated by Nick Janus almost 3 years ago

We stayed on those versions of Nautilus for 6+ months in various clusters, but our users stopped reporting the issue a couple weeks after the upgrades completed. Given the timing, I'm guessing the control characters were only written during the upgrade.

Actions #5

Updated by Ilsoo Byun over 2 years ago

I had the same issue. I found that reading the 'user.rgw.etag' xattr directly from the rados object returns the etag at its correct length; the control character was appended only when listing a bucket.

Actions #6

Updated by André Cruz over 2 years ago

Ilsoo Byun wrote:

I had the same issue. I found that reading the 'user.rgw.etag' xattr directly from the rados object returns the etag at its correct length; the control character was appended only when listing a bucket.

I am still having this issue when I introduce a Nautilus RGW (beast or civetweb) into a cluster whose OSDs, MGRs and MONs are already Nautilus (v14.2.16).

The object metadata returned by radosgw does not show anything out of the ordinary, but listing the bucket using goamz client library fails due to the invalid char in the etag. This only happens on the one Nautilus RGW (albeit rarely) and never on Luminous RGWs.

Actions #7

Updated by Casey Bodley over 2 years ago

  • Assignee set to Marcus Watts
  • Tags set to etag
Actions #8

Updated by André Cruz over 2 years ago

I just want to add that the issue only happened while Luminous and Nautilus RGWs were coexisting on the same cluster. We were upgrading the cluster in phases. We ended up switching all the RGWs to Nautilus at once, and the issue hasn't happened since.
