Documentation #46992
rgw: lifecycle processing lost on failover and lack of documentation
Description
Hello,
We are having a few problems with the S3 lifecycle policy. I have two issues here:
- Lifecycle processing is lost following the multi-site failover process.
- The docs for lifecycle processing are very limited.
Our setup is running a multi-site configuration of two paired clusters, deployed using stretch-ansible on 14.2.9. After deployment and installation of the clusters, the master cluster does lifecycle processing correctly - "radosgw-admin lc list" returns all the buckets with lifecycle config and the status of their processing.
- The secondary site doesn't do any lifecycle processing - "radosgw-admin lc list" returns empty. This is fine as far as I'm aware, until we need to do a fail-over.
- I then simulate the master site being destroyed by deleting the master cluster VMs from my openstack host.
- I promote the secondary site to master following the instructions here: https://docs.ceph.com/docs/master/radosgw/multisite/. After promotion, this site isn't doing any lifecycle processing - "radosgw-admin lc list" returns empty.
- I spin up a new replacement cluster and pair it with the newly promoted master site. Neither site is doing any lifecycle processing - "radosgw-admin lc list" returns empty on both clusters.
So in the course of losing my master site, failing over to the secondary site, and re-pairing a new cluster to regain multi-site redundancy, I've gone from lifecycle processing running nightly to not running at all, with no warnings/alarms raised. We only noticed this manually, by observing lots of versioned objects remaining on our cluster. I've found that `radosgw-admin lc reshard fix` will "remind" the promoted cluster that it needs to do lifecycle processing, although I found no mention of having to use it for this in the docs; the docs state a different purpose for that command, and it only seems relevant on earlier Ceph versions.
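For reference, a minimal sketch of the check and workaround described above (the exact output format is an assumption; run these on the promoted site):

```shell
# Check whether any buckets are scheduled for lifecycle processing.
radosgw-admin lc list

# If this returns an empty list even though buckets have lifecycle
# configurations attached, rebuild the lifecycle shard entries:
radosgw-admin lc reshard fix

# Then confirm the buckets with lifecycle config reappear:
radosgw-admin lc list
```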
I was expecting one of the following:
- The promote-master process to trigger lifecycle processing on the cluster being promoted, if it isn't running already.
- The secondary cluster to always be doing lifecycle processing. It seems both clusters doing lifecycle processing is a fine state to be in?
- The docs to state that the `radosgw-admin lc reshard fix` command should be used as part of the promote-master process if using S3 lifecycle policy.
Some advice on whether the above bullets are correct or totally wrong would be much appreciated.
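To make the third bullet concrete, a sketch of the promote-master sequence from the linked multisite docs with the `lc reshard fix` workaround appended; the zone name `us-west` and the systemd unit name are placeholders for our setup:

```shell
# Promote the surviving secondary zone to master (per the multisite docs).
radosgw-admin zone modify --rgw-zone=us-west --master --default
radosgw-admin period update --commit

# Restart the RGW daemon(s) so they pick up the new period.
systemctl restart ceph-radosgw@rgw.`hostname -s`

# Workaround from this report: remind the promoted zone that it needs
# to do lifecycle processing, then verify.
radosgw-admin lc reshard fix
radosgw-admin lc list
```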
Issue 2:
The docs for lifecycle processing are almost non-existent: there is mention of the commands "lc list", "lc process", and "lc reshard fix", but not much more. I found the config options for lifecycle processing by grepping the live RGW config and scouring email trails on pipermail. There is an old pull request updating the docs here: https://github.com/ceph/ceph/pull/13990/commits/02dae7bfeecfb2350b9b19c014fe6dc408d87bac that seems to describe some of these nicely, but it appears it was never merged.
- The frequency of the lifecycle processing: is it daily, or does it run whenever there is some other indication it needs to be done?
- How does the processing scheduling behave given the "rgw_lifecycle_work_time" config parameter, does it always start at the beginning of this window? What if it doesn't complete before the cutoff?
- Expanding on the previous - is it possible to outpace the lifecycle processing through heavy use? Such that lifecycle processing will never complete?
- The performance implications of running this: is it expected to cause an impact? Someone on the mailing list hinted the performance impact/CPU usage could be significant. The default is set to run between 00:00 and 06:00; I expect this is intended to be a low-utilisation window?
- The implications (or lack thereof) of running lifecycle processing on two sites in a multi-site cluster setup? (this links in to Issue 1 above).
- How to continue doing lifecycle processing correctly upon multi-site failover (again links to Issue 1)
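For anyone hitting the same gap, a sketch of the lifecycle-related settings I found by grepping the live config; the admin socket path is an assumption and will vary by deployment:

```shell
# Inspect the lifecycle work window on a running RGW daemon.
# Default is "00:00-06:00"; processing is scheduled inside this window.
ceph daemon /var/run/ceph/ceph-client.rgw.*.asok config get rgw_lifecycle_work_time

# Lifecycle processing can also be triggered manually, regardless of
# the work window (useful for testing rules):
radosgw-admin lc process

# rgw_lc_debug_interval (seconds) compresses the lifecycle "day" so
# expiration rules can be tested without waiting real days:
ceph config set client.rgw rgw_lc_debug_interval 10
```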
Thanks,
Alex
Updated by Greg Farnum about 3 years ago
- Project changed from Ceph to rgw
- Category deleted (documentation)