Cache never becoming consistent after failed updates
#11 Updated by Matt Benjamin over 1 year ago
basically, in my judgment, yes. note that there were additional changes originally in this series that just optimized cached lookups--those have been removed
looking at this ticket now, would it help if we beefed up the description and maybe reproducer hints? @adamemerson, could you take a pass at that (being careful to sanitize downstream data)?
#12 Updated by Adam Emerson over 1 year ago
- Subject changed from Random 500 errors in Swift PutObject to Cache never becoming consistent after failed updates
The reported behavior shows the cache being out of date when updates to bucket metadata are attempted, leading to 500 errors. Given that this state /persists/ (that is, the cache never becomes correct until RGW is restarted) and that it is accompanied by Notify call failures, we believe that on severely loaded clusters the cache sometimes fails to update.
This series of patches corrects several problems. It retries bucket metadata update calls when they are raced.
It forcibly reloads bucket metadata when -ECANCELED from CLS version indicates the version we have is out of date.
It sets a bound on how long cached entries may live, to make sure that they will become consistent eventually.
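
To make the three fixes above concrete, here is a rough sketch of how they fit together. It is not the RGW code: VersionedStore, BoundedCache, update_with_retry, max_age and max_tries are all hypothetical names invented for illustration, and the in-memory store only imitates a version-checked write that returns -ECANCELED when the caller's version is stale.

```python
import errno
import time


class VersionedStore:
    """Hypothetical stand-in for the backing store: a write succeeds
    only if the caller's expected version matches the stored one."""

    def __init__(self):
        self.value, self.version = {}, 0

    def read(self):
        return dict(self.value), self.version

    def write(self, value, expected_version):
        if expected_version != self.version:
            return -errno.ECANCELED  # caller's version is out of date
        self.value, self.version = dict(value), self.version + 1
        return 0


class BoundedCache:
    """Hypothetical cache whose entry expires after max_age seconds,
    so a missed update cannot leave it stale forever."""

    def __init__(self, store, max_age, clock=time.monotonic):
        self.store, self.max_age, self.clock = store, max_age, clock
        self.entry = None  # (value, version, fetched_at)

    def get(self, force_reload=False):
        now = self.clock()
        expired = self.entry is None or now - self.entry[2] >= self.max_age
        if force_reload or expired:
            value, version = self.store.read()
            self.entry = (value, version, now)
        return self.entry[0], self.entry[1]


def update_with_retry(cache, mutate, max_tries=3):
    """Apply mutate() to the metadata; on -ECANCELED, bypass the cache,
    reload from the store and try again, up to max_tries attempts."""
    force = False
    ret = -errno.ECANCELED
    for _ in range(max_tries):
        value, version = cache.get(force_reload=force)
        new_value = dict(value)
        mutate(new_value)
        ret = cache.store.write(new_value, version)
        if ret != -errno.ECANCELED:
            break
        force = True  # our version was stale: do not trust the cache again
    if ret == 0:
        cache.get(force_reload=True)  # pick up the version we just wrote
    return ret
```

In this sketch, a caller holding a stale cached version gets -ECANCELED on the first attempt, reloads, and succeeds on the second. Independently of that path, any entry older than max_age is re-read on the next lookup, which is what bounds how long an inconsistency can last.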