Bug #22517

Cache never becoming consistent after failed updates

Added by Adam Emerson about 1 year ago. Updated 12 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date: 12/20/2017
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: luminous, jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

This seems to happen when redundant POST/PUT requests are issued against an existing container. The issue was found while uploading an object to an existing container with python-swiftclient.

https://github.com/ceph/ceph/pull/18954
https://github.com/ceph/ceph/pull/19581


Related issues

Related to rgw - Bug #22560: segfault in ObjectCache::touch_lru() Resolved 01/03/2018
Related to rgw - Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED) Resolved 09/26/2017
Copied to rgw - Backport #22574: luminous: Random 500 errors in Swift PutObject Resolved
Copied to rgw - Backport #22575: jewel: Random 500 errors in Swift PutObject Resolved

History

#2 Updated by Orit Wasserman about 1 year ago

  • Status changed from Need Review to Pending Backport

#3 Updated by Orit Wasserman about 1 year ago

  • Status changed from Pending Backport to Need Review

#4 Updated by Matt Benjamin about 1 year ago

  • Status changed from Need Review to Pending Backport

#6 Updated by Casey Bodley about 1 year ago

  • Related to Bug #22560: segfault in ObjectCache::touch_lru() added

#7 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22574: luminous: Random 500 errors in Swift PutObject added

#8 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22575: jewel: Random 500 errors in Swift PutObject added

#9 Updated by Nathan Cutler about 1 year ago

Luminous backport staged.

The changes seem too extensive for jewel. Is the jewel backport really necessary?

#10 Updated by Adam Emerson about 1 year ago

Ken Dreyer has it in his list of Things to Do For This Bug.

#11 Updated by Matt Benjamin about 1 year ago

basically, in my judgment, yes. note that there were additional changes originally in this series that just optimized cached lookups--those have been removed

looking at this ticket now, would it help if we beefed up the description and maybe reproducer hints? @adamemerson, could you take a pass at that (being careful to sanitize downstream data)?

#12 Updated by Adam Emerson about 1 year ago

  • Subject changed from Random 500 errors in Swift PutObject to Cache never becoming consistent after failed updates

The reported behavior shows the cache being out of date when updates to bucket metadata are attempted, leading to 500 errors. Given that this state /persists/ (that is, the cache never becomes correct until RGW is restarted) and that it is accompanied by Notify call failures, we believe that severely loaded clusters sometimes cause the cache to fail to update.

This series of patches corrects several problems. It retries bucket metadata update calls when they are raced.

It forcibly reloads bucket metadata when -ECANCELED from CLS version indicates that the version we have is out of date.

It sets a bound on how long cached entries may live, to make sure that they will become consistent eventually.
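
For readers who want a concrete picture of the three fixes described above, here is a minimal sketch in Python. It is not the RGW code (which is C++), and every name in it (VersionedStore, BucketInfoCache, update_bucket_attrs, max_age, max_retries) is hypothetical. It assumes a backing store that rejects a write with -ECANCELED when the caller's version is stale, and a cache that may have missed an invalidation.

```python
import errno
import time


class VersionedStore:
    """Hypothetical stand-in for the backing store: a write succeeds
    only if the caller presents the current version."""

    def __init__(self):
        self.version = 1
        self.attrs = {}

    def read(self):
        return self.version, dict(self.attrs)

    def write(self, expected_version, attrs):
        if expected_version != self.version:
            return -errno.ECANCELED  # caller's view is out of date
        self.attrs = dict(attrs)
        self.version += 1
        return 0


class BucketInfoCache:
    """Hypothetical single-entry cache with a bounded entry lifetime."""

    def __init__(self, store, max_age, clock=time.monotonic):
        self.store = store
        self.max_age = max_age
        self.clock = clock
        self.entry = None  # (version, attrs, cached_at)

    def get(self, force_reload=False):
        now = self.clock()
        expired = (self.entry is not None
                   and now - self.entry[2] >= self.max_age)
        if force_reload or self.entry is None or expired:
            # Bypass the cache and re-read from the store.
            version, attrs = self.store.read()
            self.entry = (version, attrs, now)
        return self.entry[0], dict(self.entry[1])


def update_bucket_attrs(cache, changes, max_retries=5):
    """Retry a raced update, forcing a reload after each -ECANCELED."""
    force = False
    for _ in range(max_retries):
        version, attrs = cache.get(force_reload=force)
        attrs.update(changes)
        ret = cache.store.write(version, attrs)
        if ret != -errno.ECANCELED:
            if ret == 0:
                cache.get(force_reload=True)  # pick up the new version
            return ret
        force = True  # our version was stale: reload before retrying
    return -errno.ECANCELED
```

In this sketch a stale entry costs at most one extra round trip on update, and even a reader that never writes sees fresh data once max_age has elapsed, which is the "eventually consistent" guarantee the last fix is after. The retry limit is an arbitrary illustrative value.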

#13 Updated by Nathan Cutler about 1 year ago

  • Related to Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED) added

#14 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved
