Bug #22517
closed
- Status changed from Fix Under Review to Pending Backport
- Status changed from Pending Backport to Fix Under Review
- Status changed from Fix Under Review to Pending Backport
- Related to Bug #22560: segfault in ObjectCache::touch_lru() added
- Copied to Backport #22574: luminous: Random 500 errors in Swift PutObject added
- Copied to Backport #22575: jewel: Random 500 errors in Swift PutObject added
Luminous backport staged.
The changes seem too extensive for jewel. Is the jewel backport really necessary?
Ken Dreyer has it in his list of Things to Do For This Bug.
Basically, in my judgment, yes. Note that this series originally included additional changes that only optimized cached lookups; those have been removed.
Looking at this ticket now: would it help if we beefed up the description and maybe added reproducer hints? @adamemerson, could you take a pass at that (being careful to sanitize downstream data)?
- Subject changed from Random 500 errors in Swift PutObject to Cache never becoming consistent after failed updates
The reported behavior shows the cache being out of date when updates to bucket metadata are attempted, leading to 500 errors. Given that this state /persists/ (that is, the cache never becomes correct until RGW is restarted) and that it is accompanied by Notify call failures, we believe that on severely loaded clusters cache updates sometimes fail, leaving stale entries behind.
This series of patches corrects several problems. It retries bucket metadata update calls when they are raced.
It forcibly reloads bucket metadata when -ECANCELED from CLS version indicates the version we have is out of date.
It sets a bound on how long cached entries may live, to make sure that they will become consistent eventually.
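The three fixes above can be illustrated with a small, self-contained sketch. This is not the RGW code: `Backend`, `ExpiringCache`, `update_bucket`, `VersionConflict`, and the `max_age` and `max_retries` parameters are hypothetical names invented for illustration, and `VersionConflict` merely stands in for an -ECANCELED result from a versioned write. The sketch assumes a single versioned metadata record per bucket.

```python
import time


class VersionConflict(Exception):
    """Hypothetical stand-in for -ECANCELED from a versioned write."""


class Backend:
    """Toy authoritative store: one versioned metadata record per bucket."""

    def __init__(self):
        self.data = {}  # bucket -> (version, metadata)

    def read(self, bucket):
        return self.data.get(bucket, (0, {}))

    def write(self, bucket, expected_version, metadata):
        version, _ = self.read(bucket)
        if version != expected_version:
            # The caller's copy is out of date; refuse the write.
            raise VersionConflict(bucket)
        self.data[bucket] = (version + 1, dict(metadata))


class ExpiringCache:
    """Cache whose entries are refetched once they exceed max_age seconds."""

    def __init__(self, backend, max_age, clock=time.monotonic):
        self.backend = backend
        self.max_age = max_age
        self.clock = clock
        self.entries = {}  # bucket -> (version, metadata, fetched_at)

    def get(self, bucket, force_reload=False):
        entry = self.entries.get(bucket)
        expired = entry is None or self.clock() - entry[2] >= self.max_age
        if force_reload or expired:
            version, metadata = self.backend.read(bucket)
            entry = (version, dict(metadata), self.clock())
            self.entries[bucket] = entry
        # Return a copy so callers cannot mutate the cached metadata.
        return entry[0], dict(entry[1])


def update_bucket(cache, bucket, mutate, max_retries=5):
    """Apply mutate() to the bucket metadata, retrying raced updates."""
    force = False
    for _ in range(max_retries):
        version, metadata = cache.get(bucket, force_reload=force)
        mutate(metadata)
        try:
            cache.backend.write(bucket, version, metadata)
        except VersionConflict:
            # Our cached version is stale: bypass the cache and retry.
            force = True
            continue
        # Refresh the cache so it reflects the write we just made.
        cache.get(bucket, force_reload=True)
        return True
    return False
```

In this sketch a stale cache entry can no longer wedge updates permanently: a conflicting write forces a reload before the next attempt, and even a bucket that is never written again is refetched once its entry is older than `max_age`.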
- Related to Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED) added
- Status changed from Pending Backport to Resolved