Bug #47655

AWS put-bucket-lifecycle command fails on the latest minor Octopus release

Added by Niko Smeds 29 days ago. Updated 18 days ago.

Status: Triaged
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Pull request ID:
Crash signature:

Description

We are rebuilding some servers, and reinstalling Ceph using ceph-ansible also updated a RADOS gateway to Octopus version 15.2.5. The other gateways are running v15.2.4 and v15.2.2.

Since that update, requests to `put-bucket-lifecycle` which are received by the v15.2.5 backend fail:

```
$ cat lifecycle.json
{
    "Rules": [
        {
            "Expiration": {
                "Days": 7
            },
            "Prefix": "",
            "Status": "Enabled"
        }
    ]
}
$ aws s3api --endpoint-url=https://REDACTED put-bucket-lifecycle --bucket bucketname --lifecycle-configuration file://lifecycle.json

An error occurred (InvalidArgument) when calling the PutBucketLifecycle operation: Unknown
```

In the RGW logs we see

```
root@REDACTED:/var/log/ceph# tail -f ceph-rgw-REDACTED.rgw0.log
2020-09-25T22:26:04.692+0200 7f7e767ec700 1 ====== starting new request req=0x7f7f7469f680 =====
2020-09-25T22:26:04.704+0200 7f7e767ec700 0 RGWLC::RGWPutLC() failed to set entry on lc.6, ret=-22
2020-09-25T22:26:04.708+0200 7f7e767ec700 1 ====== req done req=0x7f7f7469f680 op status=-22 http_status=400 latency=0.016000314s ======
2020-09-25T22:26:04.708+0200 7f7e767ec700 1 beast: 0x7f7f7469f680: 10.206.10.2 - - [2020-09-25T22:26:04.708887+0200] "PUT /bucketname?lifecycle HTTP/1.1" 400 240 - "aws-cli/2.0.38 Python/3.8.5 Darwin/19.2.0 source/x86_64 command/s3api.put-bucket-lifecycle" -
```
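As a reading aid for the log above: the negative `ret` and `op status` values look like negated Linux errno codes, and errno 22 is `EINVAL`, which would fit the `InvalidArgument` / HTTP 400 the client sees. A minimal Python sketch, assuming the "req done" line format shown in the log; the function name `decode_req_done` is hypothetical and not part of Ceph:

```python
import errno
import re

# Matches the "req done" summary line as it appears in the log excerpt above.
REQ_DONE = re.compile(r"op status=(-?\d+) http_status=(\d+)")


def decode_req_done(line):
    """Return (op_status, http_status, errno_name), or None if the line does not match.

    Assumption: a negative op status is a negated errno value.
    """
    match = REQ_DONE.search(line)
    if match is None:
        return None
    op_status = int(match.group(1))
    http_status = int(match.group(2))
    if op_status < 0:
        name = errno.errorcode.get(-op_status, "UNKNOWN")
    else:
        name = "OK"
    return op_status, http_status, name


if __name__ == "__main__":
    sample = (
        "2020-09-25T22:26:04.708+0200 7f7e767ec700 1 ====== req done "
        "req=0x7f7f7469f680 op status=-22 http_status=400 "
        "latency=0.016000314s ======"
    )
    print(decode_req_done(sample))
```

This only names the error code; it does not say which layer (RGW or an OSD) produced it.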

The same request succeeds on the older versions.

This is causing issues with some of our Ansible playbooks which use the `s3_lifecycle` module.

I reviewed https://docs.ceph.com/en/latest/releases/octopus/ and found two changes which touch the LC (lifecycle) codebase:

- https://github.com/ceph/ceph/pull/36085
- https://github.com/ceph/ceph/pull/36018

I have minimal C coding experience - could either of these be responsible?

History

#1 Updated by Matt Benjamin 29 days ago

Thanks, Niko.

Neither of those commits would seem able to cause this. Will try your lifecycle policy and update.

Matt

#2 Updated by Matt Benjamin 29 days ago

No, but could you try using an alternate lifecycle document?

e.g., try:


```
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Rule>
        <ID>delete-7-days</ID>
        <Filter>
            <Prefix></Prefix>
        </Filter>
        <Status>Enabled</Status>
        <Expiration>
            <Days>7</Days>
        </Expiration>
    </Rule>
</LifecycleConfiguration>
```

We accept a <Prefix> element at the top level, as well as in a filter. I think the problem might be that you didn't send an <ID>.

Matt

#3 Updated by Niko Smeds 26 days ago

Okay - so I do have an update: while both client and server report errors, the lifecycle policies are still being updated.

i.e. if I specify the ID (or change the expiration days) in the JSON file, the change is stored by Ceph even though the command returns an error.

Matt, I tried your example XML but might be doing something wrong:

```
$ aws s3api --endpoint-url=https://REDACTED put-bucket-lifecycle --bucket bucketname --lifecycle-configuration file://lifecycle-upstream.xml

Error parsing parameter '--lifecycle-configuration': Expected: '=', received: '<' for input:
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
```
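The parse error above suggests the AWS CLI expects JSON (or its shorthand syntax) for `--lifecycle-configuration`, not the raw S3 XML body. A minimal Python sketch of translating the suggested XML into an equivalent JSON document; the function name `lifecycle_xml_to_dict` is hypothetical, and it only handles the elements that appear in the example (`ID`, `Filter/Prefix`, `Status`, `Expiration/Days`):

```python
import json
import xml.etree.ElementTree as ET

# Namespace taken from the example document in this thread.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

SAMPLE_XML = """\
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>delete-7-days</ID>
    <Filter><Prefix></Prefix></Filter>
    <Status>Enabled</Status>
    <Expiration><Days>7</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
"""


def lifecycle_xml_to_dict(xml_text):
    """Convert a simple S3 lifecycle XML document into a JSON-style dict."""
    root = ET.fromstring(xml_text)
    rules = []
    for rule_el in root.findall(f"{S3_NS}Rule"):
        rule = {
            "ID": rule_el.findtext(f"{S3_NS}ID", default=""),
            "Status": rule_el.findtext(f"{S3_NS}Status", default=""),
        }
        filter_el = rule_el.find(f"{S3_NS}Filter")
        if filter_el is not None:
            # An empty <Prefix></Prefix> becomes the empty string.
            prefix = filter_el.findtext(f"{S3_NS}Prefix", default="")
            rule["Filter"] = {"Prefix": prefix or ""}
        days = rule_el.findtext(f"{S3_NS}Expiration/{S3_NS}Days")
        if days is not None:
            rule["Expiration"] = {"Days": int(days)}
        rules.append(rule)
    return {"Rules": rules}


if __name__ == "__main__":
    print(json.dumps(lifecycle_xml_to_dict(SAMPLE_XML), indent=4))
```

As far as I know, the `Filter` form belongs to the newer `put-bucket-lifecycle-configuration` operation, while the older `put-bucket-lifecycle` used in this report takes a top-level `Prefix`, so the output would need to be passed to the matching command.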

We also updated all three Ceph RADOS gateways to the same latest version:

```
$ radosgw-admin --version
ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)
```

Oddly enough, we now experience the issue with two of the three.

Still uncertain if this is an issue on our side or an issue with Ceph.

#4 Updated by Niko Smeds 26 days ago

I don't believe the issue is related to the format of the policy file. While testing, I also ran `delete-bucket-lifecycle` multiple times to remove policies from the test bucket and got the same error.

#5 Updated by lei cao 25 days ago

Could you raise the RGW log level to 20 and try again? That would help locate the problem.

#6 Updated by Casey Bodley 24 days ago

Have you upgraded the osds to match? I would expect these errors to go away when all rgws and osds are running the latest octopus. Sorry for the inconvenience!

#7 Updated by Matt Benjamin 23 days ago

  • Status changed from New to Triaged

#8 Updated by Niko Smeds 18 days ago

Casey Bodley wrote:

Have you upgraded the osds to match? I would expect these errors to go away when all rgws and osds are running the latest octopus. Sorry for the inconvenience!

Sorry for the slow reply - I'm actually blocked by https://github.com/ceph/ceph-ansible/issues/5916 on updating the OSDs.

Right now the OSDs are a mixed bag.

```
$ sudo ceph tell osd.* version | grep version | awk '{print $2}' | sort | uniq -c
12 "15.2.2",
6 "15.2.5",
```

After resolving the above issue and updating all OSDs to 15.2.5 I'll try again.
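The version tally above can also be scripted, for example to check from a playbook whether the upgrade has finished. A minimal Python sketch, assuming each OSD's reply contains a line of the form `"version": "15.2.5",` (inferred from the `awk` output above, not from a captured reply); the function names are hypothetical:

```python
import re
from collections import Counter

# Assumed shape of one line in each OSD's reply: "version": "15.2.5",
VERSION_RE = re.compile(r'"version":\s*"([^"]+)"')


def count_osd_versions(text):
    """Tally the version strings found in `ceph tell osd.* version` output."""
    return Counter(VERSION_RE.findall(text))


def all_same_version(text):
    """True when at least one version was found and every daemon reports the same one."""
    return len(count_osd_versions(text)) == 1


if __name__ == "__main__":
    # Hypothetical sample; real output may carry more fields per OSD.
    sample = (
        'osd.0: {\n    "version": "15.2.2",\n    "release": "octopus"\n}\n'
        'osd.1: {\n    "version": "15.2.5",\n    "release": "octopus"\n}\n'
    )
    print(count_osd_versions(sample))
    print(all_same_version(sample))
```

The `ceph versions` command reports a similar per-daemon-type summary directly and may be simpler when it is available.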
