Bug #42883

open

rgw/multisite: inconsistency result in concurrency put/remove situations

Added by wanghao 王 over 4 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a multisite setup, let node1 be a radosgw instance in the master zone and node2 a radosgw instance in the slave zone.
I wrote a shell script using the aws cli that puts or removes the same object in sequence, sending each request to node1 or node2 at random.
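
The original script is not attached, so the following is only a minimal Python sketch of the kind of driver described above. The endpoint URLs, bucket, key and file path are placeholders (assumptions, not values from this report); the `aws s3api put-object` / `delete-object` subcommands and the `--endpoint-url` option are standard aws cli.

```python
import random
import subprocess

# Placeholder endpoints (assumed): one radosgw per zone.
ENDPOINTS = {
    "node1": "http://node1.example:8000",  # master zone
    "node2": "http://node2.example:8000",  # slave zone
}


def build_commands(bucket, key, body_path, count, seed=None):
    """Return `count` aws cli argv lists. Each one puts or removes the
    same key and targets a randomly chosen endpoint."""
    rng = random.Random(seed)
    commands = []
    for _ in range(count):
        node = rng.choice(sorted(ENDPOINTS))
        op = rng.choice(["put-object", "delete-object"])
        argv = ["aws", "s3api", op,
                "--bucket", bucket, "--key", key,
                "--endpoint-url", ENDPOINTS[node]]
        if op == "put-object":
            argv += ["--body", body_path]  # only a put carries a payload
        commands.append(argv)
    return commands


def run(commands):
    """Issue the requests one after another, in sequence."""
    for argv in commands:
        subprocess.run(argv, check=False)
```

Calling `run(build_commands("mybucket", "obj", "/tmp/payload", 100))` would send 100 requests; `build_commands` alone has no side effects and can be used to inspect the sequence first.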

Sometimes the result is the case shown in the attached picture: at the end, the master zone does not have the object, while the slave zone still has it.

From the logs, I found the cause. At timestamp1 (shown in the attached picture), node2 lists the bilog from node1 and pushes a "get obj" request onto the multisite sync stack. node2 then serves S3 requests for the same object: a put and, last, a remove. After that, multisite sync pops the stack, reads the object from node1, and stores it in the slave zone. Finally, node1 reads the bilog and applies the remove op. Note that when node2 stores the synced object it also adds a data log entry. node1 reads this entry and tries to fetch the object from node2, but node2 returns 304 because the mtime is the same: node2 just got this object from node1 and kept node1's mtime.
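
The sequence above can be replayed with a toy model. This is not RGW code: `Zone`, `conditional_get` and `sync_fetch` are hypothetical names, mtimes are plain integers, and the sketch assumes that node1 handles node2's data log entry while it still holds its own copy and that the remove it applies comes from node2's bilog.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    name: str
    objects: dict = field(default_factory=dict)  # key -> mtime

    def conditional_get(self, key, if_newer_than):
        """Return (status, mtime); 304 when our copy is not newer
        than the mtime the caller already holds."""
        mtime = self.objects.get(key)
        if mtime is None:
            return 404, None
        if if_newer_than is not None and mtime <= if_newer_than:
            return 304, None
        return 200, mtime


def sync_fetch(dest, src, key):
    """Toy sync step: dest pulls key from src and keeps the source mtime."""
    status, mtime = src.conditional_get(key, dest.objects.get(key))
    if status == 200:
        dest.objects[key] = mtime
    return status


def replay():
    node1, node2 = Zone("node1"), Zone("node2")
    node1.objects["obj"] = 100   # earlier put on node1
    # timestamp1: node2 queues a fetch of "obj" (not executed yet)
    node2.objects["obj"] = 101   # S3 put on node2
    del node2.objects["obj"]     # S3 remove on node2
    sync_fetch(node2, node1, "obj")           # queued fetch runs late
    status = sync_fetch(node1, node2, "obj")  # node1 follows the data log
    node1.objects.pop("obj", None)            # node1 applies the remove
    return status, node1.objects, node2.objects
```

In this model `replay()` returns status 304 with node1 empty and node2 still holding `obj` at mtime 100, which matches the divergence described in the report.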

In my opinion, to fix this bug we need to add an extra mtime_attr xattr to head_obj. On put/remove, compare the incoming value against this xattr and process the operation only when the incoming value is greater than the one we have.
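
A minimal sketch of that comparison, under stated assumptions: `Store` and `Head` are hypothetical names, mtimes are integers, and the sketch keeps `mtime_attr` as a tombstone after a remove so that later stale operations can still be compared. The report does not say where the value would live once the head object is gone, so the tombstone is an assumption of this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Head:
    mtime_attr: int          # time of the last applied put/remove
    data: Optional[bytes]    # None marks a removed object (assumed tombstone)


class Store:
    def __init__(self):
        self.heads = {}

    def _is_newer(self, key, op_mtime):
        head = self.heads.get(key)
        return head is None or op_mtime > head.mtime_attr

    def put(self, key, data, op_mtime):
        """Apply the put only if it is newer than what we have."""
        if not self._is_newer(key, op_mtime):
            return False
        self.heads[key] = Head(op_mtime, data)
        return True

    def remove(self, key, op_mtime):
        """Apply the remove only if it is newer than what we have."""
        if not self._is_newer(key, op_mtime):
            return False
        self.heads[key] = Head(op_mtime, None)
        return True

    def exists(self, key):
        head = self.heads.get(key)
        return head is not None and head.data is not None
```

With this guard, a put at time 101 followed by a remove at 102 leaves the object absent, and a late sync write carrying the older time 100 is rejected instead of resurrecting the object.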


Files

05a2d3657bb2aa1ac93616125.png (30 KB), wanghao 王, 11/19/2019 12:37 PM
Actions #1

Updated by wanghao 王 over 4 years ago

Actions #2

Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to rgw
Actions #3

Updated by Casey Bodley over 4 years ago

  • Assignee set to Casey Bodley