Bug #22517: Cache never becoming consistent after failed updates - rgw - Ceph

closed

Added by Adam Emerson over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: Adam Emerson
Source: Community (user)
Backport: luminous, jewel
Severity: 3 - minor

Description

This seems to happen with redundant POST/PUT requests on an existing container. The issue has been found during upload of an object on an existing container using python-swiftclient.

https://github.com/ceph/ceph/pull/18954
https://github.com/ceph/ceph/pull/19581

Related issues 4 (0 open — 4 closed)

Updated by Adam Emerson over 6 years ago

https://github.com/ceph/ceph/pull/19601

Updated by Orit Wasserman over 6 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Orit Wasserman over 6 years ago

Status changed from Pending Backport to Fix Under Review

Updated by Matt Benjamin over 6 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Casey Bodley over 6 years ago

The backport of PR https://github.com/ceph/ceph/pull/19581 needs to include the fix in https://github.com/ceph/ceph/pull/19768.

Updated by Casey Bodley over 6 years ago

Related to Bug #22560: segfault in ObjectCache::touch_lru() added

Updated by Nathan Cutler over 6 years ago

Copied to Backport #22574: luminous: Random 500 errors in Swift PutObject added

Updated by Nathan Cutler over 6 years ago

Copied to Backport #22575: jewel: Random 500 errors in Swift PutObject added

Updated by Nathan Cutler over 6 years ago

Luminous backport staged.

The changes seem too extensive for jewel. Is the jewel backport really necessary?

#10

Updated by Adam Emerson over 6 years ago

Ken Dreyer has it in his list of Things to Do For This Bug.

#11

Updated by Matt Benjamin over 6 years ago

basically, in my judgment, yes. note that there were additional changes originally in this series that just optimized cached lookups--those have been removed

looking at this ticket now, would it help if we beefed up the description and maybe reproducer hints? @adamemerson, could you take a pass at that (being careful to sanitize downstream data)?

#12

Updated by Adam Emerson over 6 years ago

Subject changed from Random 500 errors in Swift PutObject to Cache never becoming consistent after failed updates

The reported behavior shows the cache being out of date when updates to bucket metadata are attempted, leading to 500 errors. Given that this state persists (that is, the cache never becomes correct until RGW is restarted) and that it is accompanied by Notify call failures, we believe that on severely loaded clusters the cache sometimes fails to update.

This series of patches corrects several problems. It retries bucket metadata update calls that lose a race with a concurrent update.

It forcibly reloads bucket metadata when -ECANCELED from the CLS version check indicates that the version we have is out of date.
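
To make the retry-and-reload idea concrete, here is a minimal Python sketch. It is a toy model and not the RGW code: `ToyStore`, `update_with_retry`, the dict-based cache, and the `max_attempts` limit are all hypothetical names chosen for illustration. The only thing taken from this ticket is the pattern itself: a versioned write that fails with ECANCELED causes the cached copy to be dropped, the data to be re-read from the store, and the update to be retried.

```python
import errno


class ToyStore:
    """Hypothetical stand-in for the backing store: one value plus a version."""

    def __init__(self, value):
        self.value = value
        self.version = 1

    def read(self):
        return self.value, self.version

    def write(self, value, expected_version):
        # Conditional write: rejected if the caller's version is out of date.
        if expected_version != self.version:
            raise OSError(errno.ECANCELED, "version mismatch")
        self.value = value
        self.version += 1


def update_with_retry(store, cache, key, mutate, max_attempts=5):
    """Apply mutate() to the stored value, reloading on a version conflict."""
    for _ in range(max_attempts):
        if key not in cache:
            # Forced reload: go to the store instead of trusting the cache.
            cache[key] = store.read()
        value, version = cache[key]
        new_value = mutate(value)
        try:
            store.write(new_value, version)
        except OSError as e:
            if e.errno != errno.ECANCELED:
                raise
            # Our copy was stale: drop it and retry with fresh data.
            cache.pop(key, None)
            continue
        cache[key] = (new_value, version + 1)
        return new_value
    raise OSError(errno.ECANCELED, "gave up after repeated version conflicts")
```

In this sketch a stale cache entry costs one failed write and one extra read, after which the update goes through. Without the reload step, every later update would keep failing against the same stale version, which matches the persistent 500s described above.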

It sets a bound on how long cached entries may live, to make sure that they will become consistent eventually.
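
The bounded lifetime can be sketched the same way. Again this is an illustrative Python toy, not the RGW implementation; `ExpiringCache`, `max_age_seconds`, and the injectable `clock` parameter are assumptions made for the example. The point is only that an entry older than the bound is treated as a miss, so even if an invalidation is lost, the stale data cannot outlive the bound.

```python
import time


class ExpiringCache:
    """Hypothetical cache whose entries are ignored once they exceed max age."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, time the entry was stored)

    def put(self, key, value):
        self._entries[key] = (value, self._clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._max_age:
            # Too old to trust: evict and report a miss so the caller reloads.
            del self._entries[key]
            return None
        return value
```

A caller that gets `None` back would re-read from the backing store and `put` the fresh value, so the worst-case staleness is the configured maximum age rather than "until restart".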

#13

Updated by Nathan Cutler over 6 years ago

Related to Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED) added

#14

Updated by Nathan Cutler about 6 years ago

Status changed from Pending Backport to Resolved
