Bug #50056: Newly created bucket randomly unavailable for a few minutes - rgw - Ceph

Added by Benoît Knecht about 3 years ago. Updated about 3 years ago.

Status: Closed

Priority: Normal

Severity: 3 - minor

Affected Versions: Ceph - v14.2.16
Description

We run three RadosGW 14.2.16 instances behind HAProxy (as deployed by ceph-ansible 4.0). After creating a bucket with `s3cmd mb s3://my-bucket`, requests such as `s3cmd multipart s3://my-bucket` fail about a third of the time with `WARNING: Retrying failed request: /?uploads (500 (UnknownError))`.

Upon further investigation, we noticed that `s3cmd info s3://my-bucket` returns different data on successive calls (note the `ACL` line):

```
$ s3cmd info s3://my-bucket
s3://my-bucket/ (bucket):
Location: us-east-1
Payer: BucketOwner
Expiration Rule: none
Policy: none
CORS: none
ACL: none
$ s3cmd info s3://my-bucket
s3://my-bucket/ (bucket):
Location: us-east-1
Payer: BucketOwner
Expiration Rule: none
Policy: none
CORS: none
ACL: User Name: FULL_CONTROL
```

After 5-10 minutes, everything works as expected: the errors stop and the `s3cmd info` output becomes consistent.

The culprit seems to be some form of caching in RadosGW. `s3cmd mb` first issues a `GET /my-bucket/?location` and then a `PUT /my-bucket/`. If load balancing sends the two requests to different RadosGW instances, the instance that received the `GET /my-bucket/?location` request apparently continues to believe that the bucket doesn't exist until some cache entry expires.
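The suspected mechanism can be illustrated with a toy model. This is only a sketch of the hypothesis above, not RadosGW's actual code: the `SharedStore` and `Gateway` classes, the caching of negative lookups, and the 600-second TTL (chosen to match the observed 5-10 minutes) are all assumptions made for illustration.

```python
class SharedStore:
    """Stands in for the cluster-wide bucket metadata (hypothetical)."""

    def __init__(self):
        self.buckets = set()


class Gateway:
    """Toy gateway with a private lookup cache that also stores misses."""

    def __init__(self, store, ttl):
        self.store = store
        self.ttl = ttl      # assumed cache lifetime in seconds
        self.cache = {}     # bucket name -> (exists, time the entry was cached)

    def bucket_exists(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # answered from the local cache, possibly stale
        exists = name in self.store.buckets
        self.cache[name] = (exists, now)
        return exists

    def create_bucket(self, name, now):
        self.store.buckets.add(name)
        # Only the gateway that handled the PUT refreshes its own cache.
        self.cache[name] = (True, now)


def simulate(ttl=600):
    store = SharedStore()
    first, second = Gateway(store, ttl), Gateway(store, ttl)
    timeline = []
    # `s3cmd mb` step 1: GET ?location lands on the first gateway (bucket missing).
    timeline.append(("GET ?location on first", first.bucket_exists("my-bucket", now=0)))
    # `s3cmd mb` step 2: PUT lands on the second gateway.
    second.create_bucket("my-bucket", now=1)
    timeline.append(("lookup on first, t=2", first.bucket_exists("my-bucket", now=2)))
    timeline.append(("lookup on second, t=2", second.bucket_exists("my-bucket", now=2)))
    timeline.append(("lookup on first, after TTL", first.bucket_exists("my-bucket", now=ttl + 1)))
    return timeline


for step, visible in simulate():
    print(f"{step}: {visible}")
```

In this model the first gateway keeps answering from its cached miss until the entry ages out, which would match the behaviour described above if RadosGW does cache negative lookups per instance.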

The issue can also be reproduced explicitly by targeting two separate RadosGW instances directly (bypassing HAProxy), for instance with minio-client:

```
$ mc stat first-rgw/test
mc: <ERROR> Unable to stat `first-rgw/test`. Bucket `test` does not exist.

$ mc mb second-rgw/test
Bucket created successfully `second-rgw/test`.

$ mc stat first-rgw/test
mc: <ERROR> Unable to stat `first-rgw/test`. Bucket `test` does not exist.

$ mc stat second-rgw/test
Name : test/
Size : 0 B
Type : folder
Metadata :
Versioning: Un-versioned
Location: default
Policy: none
```
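The manual check above can be scripted to watch for the moment the gateways converge. The following is a minimal sketch: `probe_endpoints` and `gateways_disagree` are hypothetical helper names, and the `bucket_exists` callable is supplied by the caller (for example, a wrapper around an S3 client's bucket HEAD request against each instance). The `fake_probe` function merely replays the transcript above so that the example runs without a cluster.

```python
from typing import Callable, Dict, Iterable


def probe_endpoints(
    endpoints: Iterable[str],
    bucket: str,
    bucket_exists: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Ask every gateway directly whether it can see the bucket."""
    return {endpoint: bucket_exists(endpoint, bucket) for endpoint in endpoints}


def gateways_disagree(results: Dict[str, bool]) -> bool:
    """True when at least one gateway sees the bucket and another does not."""
    return len(set(results.values())) > 1


# Stand-in probe replaying the transcript above: only second-rgw sees `test`.
def fake_probe(endpoint: str, bucket: str) -> bool:
    return endpoint == "second-rgw"


results = probe_endpoints(["first-rgw", "second-rgw"], "test", fake_probe)
print(results, gateways_disagree(results))
```

Running such a probe in a loop right after bucket creation would give a rough measurement of how long the instances stay inconsistent.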

While it may seem that this issue can be solved by using sticky sessions in HAProxy, the problem remains if two separate clients try to create a bucket with the same name within a few minutes, or if one client creates a bucket that is then used by other clients. It therefore seems like something that should be fixed in RadosGW itself rather than treated as a configuration or deployment issue.