Documentation #46992
rgw: lifecycle processing lost on failover and lack of documentation
Description
Hello,
We are having a few problems with the S3 lifecycle policy. I have two issues here:
- Lifecycle processing is lost following the multi-site failover process.
- The docs for lifecycle processing are very limited.
Our setup is running a multi-site configuration of two paired clusters, deployed using stretch-ansible on 14.2.9. After deployment and installation of the clusters, the master cluster does lifecycle processing correctly - "radosgw-admin lc list" returns all the buckets with lifecycle config and the status of their processing.
- The secondary site doesn't do any lifecycle processing - "radosgw-admin lc list" returns empty. This is fine as far as I'm aware, until we need to do a fail-over.
- I then simulate the master site being destroyed by deleting the master cluster VMs from my openstack host.
- I promote the secondary site to master following the instructions here: https://docs.ceph.com/docs/master/radosgw/multisite/. After promotion, this site isn't doing any lifecycle processing - "radosgw-admin lc list" returns empty.
- I spin up a new replacement cluster and pair it with the newly promoted master site. Neither site is doing any lifecycle processing - "radosgw-admin lc list" returns empty on both clusters.
So in the course of losing my master site, failing over to the secondary site, and re-pairing a new cluster to regain multi-site redundancy, I've gone from lifecycle processing running nightly to not running at all, with no warnings/alarms raised. We only noticed this manually, by observing lots of versioned objects remaining on our cluster. I've found that `radosgw-admin lc reshard fix` will "remind" the promoted cluster that it needs to do lifecycle processing, although I found no mention of having to use it for this in the docs; the docs state a different purpose for that command, and it only seems relevant on earlier Ceph versions.
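For reference, a minimal sketch of the check and workaround described above (the exact output format is an assumption; run these on the promoted site):

```shell
# Check whether any buckets are scheduled for lifecycle processing.
radosgw-admin lc list

# If this returns an empty list even though buckets have lifecycle
# configurations attached, rebuild the lifecycle shard entries:
radosgw-admin lc reshard fix

# Then confirm the buckets with lifecycle config reappear:
radosgw-admin lc list
```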
I was expecting one of the following:
- The promote-master process to trigger lifecycle processing on the cluster being promoted, if it isn't running already.
- The secondary cluster to always be doing lifecycle processing. It seems both clusters doing lifecycle processing is a fine state to be in?
- The docs to state that the `radosgw-admin lc reshard fix` command should be used as part of the promote-master process if using S3 lifecycle policy.
Some advice on whether the above bullets are correct or totally wrong would be much appreciated.
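To make the third bullet concrete, a sketch of the promote-master sequence from the linked multisite docs with the `lc reshard fix` workaround appended; the zone name `us-west` and the systemd unit name are placeholders for our setup:

```shell
# Promote the surviving secondary zone to master (per the multisite docs).
radosgw-admin zone modify --rgw-zone=us-west --master --default
radosgw-admin period update --commit

# Restart the RGW daemon(s) so they pick up the new period.
systemctl restart ceph-radosgw@rgw.`hostname -s`

# Workaround from this report: remind the promoted zone that it needs
# to do lifecycle processing, then verify.
radosgw-admin lc reshard fix
radosgw-admin lc list
```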
Issue 2:
The docs for lifecycle processing are almost non-existent: there is mention of the commands "lc list", "lc process", and "lc reshard fix", but not much more. I found the config options for lifecycle processing by grepping the live RGW config and scouring email trails on pipermail. There is an old pull request updating the docs here: https://github.com/ceph/ceph/pull/13990/commits/02dae7bfeecfb2350b9b19c014fe6dc408d87bac that seems to describe some of these nicely, but it appears it was never merged.
- The frequency of the lifecycle processing: is it daily, or does it run whenever there is some other indication it needs to be done?
- How does the processing scheduling behave given the "rgw_lifecycle_work_time" config parameter, does it always start at the beginning of this window? What if it doesn't complete before the cutoff?
- Expanding on the previous - is it possible to outpace the lifecycle processing through heavy use? Such that lifecycle processing will never complete?
- The performance implications of running this: is it expected to cause an impact? Someone on the mailing list hinted the performance impact/CPU usage could be significant. The default is set to run between 00:00 and 06:00; I expect this is intended to be a low-utilisation window?
- The implications (or lack thereof) of running lifecycle processing on two sites in a multi-site cluster setup? (this links in to Issue 1 above).
- How to continue doing lifecycle processing correctly upon multi-site failover (again links to Issue 1)
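For anyone hitting the same gap, a sketch of the lifecycle-related settings I found by grepping the live config; the admin socket path is an assumption and will vary by deployment:

```shell
# Inspect the lifecycle work window on a running RGW daemon.
# Default is "00:00-06:00"; processing is scheduled inside this window.
ceph daemon /var/run/ceph/ceph-client.rgw.*.asok config get rgw_lifecycle_work_time

# Lifecycle processing can also be triggered manually, regardless of
# the work window (useful for testing rules):
radosgw-admin lc process

# rgw_lc_debug_interval (seconds) compresses the lifecycle "day" so
# expiration rules can be tested without waiting real days:
ceph config set client.rgw rgw_lc_debug_interval 10
```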
Thanks,
Alex
Updated by Greg Farnum about 3 years ago
- Project changed from Ceph to rgw
- Category deleted (documentation)