Documentation #46992


RGW lifecycle processing lost on failover and lack of documentation

Added by Alex Kershaw over 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hello,

We're having a few problems with the S3 lifecycle policy. There are two issues here:
  1. Lifecycle processing is lost after following the multi-site failover process.
  2. The documentation for lifecycle processing is very limited.
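
For illustration, the sort of policy involved is a simple expiration rule like the one below (the endpoint and bucket name here are placeholders, not our real config):

    # attach a minimal lifecycle rule that expires objects after one day
    aws --endpoint-url http://rgw.example.test:8080 s3api put-bucket-lifecycle-configuration \
        --bucket my-bucket \
        --lifecycle-configuration '{
            "Rules": [
                {
                    "ID": "expire-after-1-day",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},
                    "Expiration": {"Days": 1}
                }
            ]
        }'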
Issue 1:
Our setup is a multi-site configuration of two paired clusters, deployed using stretch-ansible on 14.2.9. After deployment, the master cluster does lifecycle processing correctly - "radosgw-admin lc list" returns all the buckets with lifecycle config and the status of their processing.
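
For reference, healthy output from "radosgw-admin lc list" on the master looks something like this (the bucket name and instance id below are made up; I believe the possible status values include UNINITIAL, PROCESSING and COMPLETE):

    [
        {
            "bucket": ":my-bucket:f00dfeed-0000-0000-0000-000000000000.1234.1",
            "status": "COMPLETE"
        }
    ]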
  • The secondary site doesn't do any lifecycle processing - "radosgw-admin lc list" returns empty. As far as I'm aware this is fine, until we need to do a failover.
  • I then simulate the master site being destroyed by deleting the master cluster VMs from my OpenStack host.
  • I promote the secondary site to master following the instructions here: https://docs.ceph.com/docs/master/radosgw/multisite/ (the exact commands I ran are sketched after this list). After promotion, this site isn't doing any lifecycle processing - "radosgw-admin lc list" returns empty.
  • I spin up a new replacement cluster and pair it with the newly promoted master site. Neither site is doing any lifecycle processing - "radosgw-admin lc list" returns empty on both clusters.
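
For concreteness, the promotion steps I followed from that page were essentially the following (the zone name is a placeholder for ours):

    # on the surviving secondary site: make its zone the master/default zone
    radosgw-admin zone modify --rgw-zone=my-zone --master --default
    # commit the period so the change takes effect across the zonegroup
    radosgw-admin period update --commit
    # restart the gateways so they pick up the new period
    systemctl restart ceph-radosgw.target

    # I'd expect this to now show our buckets, but it returns empty
    radosgw-admin lc list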

So in the process of losing my master site, failing over to the secondary site, and re-pairing a new cluster to regain multi-site redundancy, I've gone from having my lifecycle processing carried out nightly to not at all, with no warnings/alarms raised. We only noticed this manually, by observing lots of versioned objects remaining on our cluster. I've found `radosgw-admin lc reshard fix` will "remind" the promoted cluster that it needs to do lifecycle processing, although I found no mention of having to use it this way in the docs: the docs state a different purpose for that command, and it only seems relevant on earlier Ceph versions.
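
Concretely, the workaround looks like this (run on the promoted master; my understanding of what it does internally is limited, so treat this as what worked for us rather than a documented procedure):

    # rebuild the lc entries so the buckets reappear in the schedule
    radosgw-admin lc reshard fix
    # afterwards this is populated again and nightly processing resumes
    radosgw-admin lc list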

I was expecting one of the following:
  • The promote-master process to kick the newly promoted cluster into lifecycle processing if it isn't doing it already.
  • The secondary cluster to always be doing lifecycle processing. It seems that both clusters doing lifecycle processing is a fine state to be in?
  • The docs to state that the `radosgw-admin lc reshard fix` command should be used as part of the promote-master process when using an S3 lifecycle policy.

Some advice on whether the above bullets are correct or totally wrong would be much appreciated.

Issue 2:
The docs for lifecycle processing are almost non-existent; there is mention of the commands "lc list", "lc process" and "lc reshard fix", but not much more. I found the config options for lifecycle processing by grepping the live RGW config and scouring email threads on pipermail. There is an old pull request updating the docs here: https://github.com/ceph/ceph/pull/13990/commits/02dae7bfeecfb2350b9b19c014fe6dc408d87bac that seems to nicely describe some of these options, but it appears it was never merged.
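
For anyone else hunting for these options, this is roughly how I pulled them out of a running gateway (the admin socket path is deployment-specific):

    # dump the running RGW configuration via the admin socket and
    # filter for lifecycle-related options
    ceph daemon /var/run/ceph/ceph-client.rgw.$(hostname -s).asok config show \
        | grep -i -e lifecycle -e rgw_lc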

I think the docs are missing answers to a few key questions for using this feature, including:
  • The frequency of the lifecycle processing: is it daily, or does it run whenever there is some other indication that it needs to?
  • How the scheduling behaves given the "rgw_lifecycle_work_time" config parameter: does processing always start at the beginning of this window? What if it doesn't complete before the cutoff? (The options I found are listed in the sketch after this list.)
  • Expanding on the previous point: is it possible to outpace lifecycle processing through heavy use, such that it never completes?
  • The performance implications of running this: is it expected to cause an impact? Someone on the mailing list hinted the performance/CPU impact could be significant. The default window is 00:00 - 06:00, which I expect is intended to be a low-utilisation period?
  • The implications (or lack thereof) of running lifecycle processing on both sites in a multi-site setup (this links to Issue 1 above).
  • How to continue lifecycle processing correctly after a multi-site failover (again, links to Issue 1).
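
For reference, these are the lifecycle-related options I turned up; the values shown are what I believe the defaults to be, and the comments are my best guesses at their meaning rather than documented behaviour:

    # in ceph.conf, e.g. under [client.rgw.<name>]
    rgw_enable_lc_threads = true            # whether this gateway runs lifecycle worker threads at all
    rgw_lifecycle_work_time = 00:00-06:00   # daily window in which lifecycle processing may run
    rgw_lc_max_objs = 32                    # number of shards the lifecycle work is split across
    rgw_lc_debug_interval = -1              # testing only: compresses a "day" into N seconds when >= 0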

Thanks,
Alex

Actions #1

Updated by Alex Kershaw over 3 years ago

stretch-ansible should be ceph-ansible

Actions #2

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to rgw
  • Category deleted (documentation)