Bug #22517

closed

Cache never becoming consistent after failed updates

Added by Adam Emerson over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: luminous, jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This seems to happen with redundant POST/PUT requests on an existing container. The issue was found while uploading an object to an existing container with python-swiftclient.

https://github.com/ceph/ceph/pull/18954
https://github.com/ceph/ceph/pull/19581
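
The scenario in the description could be exercised with a loop along the following lines. This is only a sketch, not a confirmed reproducer: the function name, the metadata header, the object names, and the round count are all hypothetical, and `conn` is assumed to be a python-swiftclient `Connection` (or anything exposing the same `put_container`, `post_container`, and `put_object` methods). Whether the bug actually triggers likely depends on cluster load.

```python
def repeat_container_updates(conn, container, rounds=100):
    """Send redundant PUT/POST requests to one container, uploading an
    object each round. Returns a list of (round, http_status) failures.

    `conn` is assumed to behave like swiftclient.client.Connection.
    """
    failures = []
    conn.put_container(container)  # make sure the container exists
    for i in range(rounds):
        try:
            # Redundant PUT on a container that already exists.
            conn.put_container(container)
            # POST that rewrites container metadata (hypothetical header).
            conn.post_container(
                container, headers={"X-Container-Meta-Round": str(i)}
            )
            # Upload an object into the existing container.
            conn.put_object(container, "obj-%d" % i, contents=b"payload")
        except Exception as exc:
            # swiftclient's ClientException carries an http_status
            # attribute; other exception types are recorded as None.
            failures.append((i, getattr(exc, "http_status", None)))
    return failures
```

Any `(round, 500)` entries in the result would correspond to the intermittent errors reported here.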


Related issues 4 (0 open, 4 closed)

Related to rgw - Bug #22560: segfault in ObjectCache::touch_lru() (Resolved, Casey Bodley, 01/03/2018)
Related to rgw - Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED) (Resolved, J. Eric Ivancich, 09/26/2017)
Copied to rgw - Backport #22574: luminous: Random 500 errors in Swift PutObject (Resolved, Adam Emerson)
Copied to rgw - Backport #22575: jewel: Random 500 errors in Swift PutObject (Resolved, Matt Benjamin)

Actions #2

Updated by Orit Wasserman over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #3

Updated by Orit Wasserman over 6 years ago

  • Status changed from Pending Backport to Fix Under Review
Actions #4

Updated by Matt Benjamin over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #6

Updated by Casey Bodley over 6 years ago

  • Related to Bug #22560: segfault in ObjectCache::touch_lru() added
Actions #7

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #22574: luminous: Random 500 errors in Swift PutObject added
Actions #8

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #22575: jewel: Random 500 errors in Swift PutObject added
Actions #9

Updated by Nathan Cutler over 6 years ago

Luminous backport staged.

The changes seem too extensive for jewel. Is the jewel backport really necessary?

Actions #10

Updated by Adam Emerson over 6 years ago

Ken Dreyer has it in his list of Things to Do For This Bug.

Actions #11

Updated by Matt Benjamin over 6 years ago

basically, in my judgment, yes. note that there were additional changes originally in this series that just optimized cached lookups--those have been removed

looking at this ticket now, would it help if we beefed up the description and maybe reproducer hints? @adamemerson, could you take a pass at that (being careful to sanitize downstream data)?

Actions #12

Updated by Adam Emerson over 6 years ago

  • Subject changed from Random 500 errors in Swift PutObject to Cache never becoming consistent after failed updates

The reported behavior shows the cache being out of date when updates to bucket metadata are attempted, leading to 500 errors. Given that this state /persists/ (that is, the cache never becomes correct until RGW is restarted) and that it is accompanied by Notify call failures, we believe that heavy load on the cluster sometimes causes the cache to fail to update.

This series of patches corrects several problems. It retries bucket metadata update calls when they are raced.

It forcibly reloads bucket metadata when -ECANCELED from CLS version indicates the version we have is out of date.
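
As a rough illustration of the retry-and-reload behavior described above, here is a minimal Python sketch. It is not the RGW code: `VersionedStore`, `update_bucket_attrs`, the dict-based cache, and the retry limit are all hypothetical stand-ins, and the version check only imitates what CLS version does.

```python
import errno


class VersionedStore:
    """Stand-in for the backing store: a write succeeds only when the
    caller's version matches the stored one."""

    def __init__(self):
        self.version = 1
        self.attrs = {}

    def read(self):
        return self.version, dict(self.attrs)

    def write(self, expected_version, attrs):
        if expected_version != self.version:
            return -errno.ECANCELED  # caller's copy is out of date
        self.attrs = dict(attrs)
        self.version += 1
        return 0


def update_bucket_attrs(store, cache, key, value, max_retries=5):
    """Set one attribute, reloading from the store when the write is raced."""
    for _ in range(max_retries):
        version, attrs = cache["version"], dict(cache["attrs"])
        attrs[key] = value
        ret = store.write(version, attrs)
        if ret == 0:
            # Success: refresh the cached copy to the new version.
            cache["version"], cache["attrs"] = store.read()
            return 0
        if ret != -errno.ECANCELED:
            return ret  # unrelated failure, do not retry
        # Cached version was stale: bypass the cache, reload, and retry.
        cache["version"], cache["attrs"] = store.read()
    return -errno.ECANCELED
```

Without the reload step, every retry would resend the same stale version and fail the same way, which matches the persistent failures described in this ticket.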

It sets a bound on how long cached entries may live, to make sure that they will become consistent eventually.
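
The bound on entry lifetime can be pictured with a small expiring cache. This is a sketch under assumptions, not the ObjectCache implementation: the class name, the `max_age_seconds` parameter, and the injectable clock are invented for illustration.

```python
import time


class ExpiringCache:
    """Cache whose entries are discarded once they exceed a maximum age."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (value, time stored)

    def put(self, key, value):
        self._entries[key] = (value, self._clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._max_age:
            # Too old: drop it so the caller reloads from the backing store.
            del self._entries[key]
            return None
        return value
```

Even if an invalidation is lost, a stale entry in such a cache is served for at most `max_age_seconds` before a lookup misses and forces a reload.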

Actions #13

Updated by Nathan Cutler over 6 years ago

  • Related to Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED) added
Actions #14

Updated by Nathan Cutler about 6 years ago

  • Status changed from Pending Backport to Resolved