Bug #42883: rgw/multisite: inconsistency result in concurrency put/remove situations - rgw - Ceph

Bug #42883

open

rgw/multisite: inconsistency result in concurrency put/remove situations

Added by wanghao 王 over 4 years ago. Updated over 4 years ago.

Status:

New

Priority:

Normal

Assignee:

Casey Bodley

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

In a multisite setup, let node1 be a radosgw instance in the master zone and node2 a radosgw instance in the slave zone.
I wrote a shell script that uses the aws cli to put or remove the same object repeatedly, sending each request to node1 or node2 at random.

Sometimes the result is the one shown in the attached picture: in the end, the master zone does not have the object, while the slave zone still does.

From the logs, I found the cause. At timestamp1 (shown in the attached picture), node2 lists the bilog from node1 and pushes a fetch request for the object onto the multisite sync stack. node2 then serves S3 requests: a put of the object and, finally, a remove. After that, multisite sync pops the stack, reads the object from node1, and stores it in the slave zone. Finally, node1 reads the bilog and applies the remove. Note that when node2 stores the object it also adds a data log entry. node1 later reads this entry and tries to fetch the object from node2, but node2 returns 304 because the mtime is the same: node2 just got this object from node1 and set the same mtime.

In my opinion, to fix this bug we need to add an extra mtime_attr xattr to head_obj. On put/remove, compare against this xattr and only process the operation when its value is greater than the one we already have.
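
To make the proposal concrete, here is a rough toy model of the compare-before-apply idea. It is not RGW code: `Zone`, `Head`, and `apply` are hypothetical names, zones are plain dicts, and timestamps are bare floats. It also assumes a removed object leaves a tombstone that keeps its mtime_attr, which the description above does not specify but which the comparison seems to need once the head object is gone.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Head:
    """Toy stand-in for head_obj; not the real RGW structure."""
    mtime_attr: float        # timestamp of the last applied put/remove
    data: Optional[bytes]    # None marks a removed object (assumed tombstone)


@dataclass
class Zone:
    name: str
    heads: Dict[str, Head] = field(default_factory=dict)

    def apply(self, key: str, op_mtime: float, data: Optional[bytes]) -> bool:
        """Apply a put (data is bytes) or a remove (data is None).

        The op is processed only if its timestamp is strictly greater
        than the stored mtime_attr; otherwise it is ignored as stale.
        """
        head = self.heads.get(key)
        if head is not None and op_mtime <= head.mtime_attr:
            return False
        self.heads[key] = Head(op_mtime, data)
        return True

    def has(self, key: str) -> bool:
        head = self.heads.get(key)
        return head is not None and head.data is not None


# Same three operations, delivered in a different order to each zone.
master = Zone("node1")
slave = Zone("node2")

# Master sees them in timestamp order.
master.apply("obj", 1.0, b"v1")   # put on node1
master.apply("obj", 2.0, b"v2")   # put on node2, synced over
master.apply("obj", 3.0, None)    # remove on node2, synced over

# Slave handles its own put/remove first, then the delayed sync of the old put.
slave.apply("obj", 2.0, b"v2")
slave.apply("obj", 3.0, None)
delayed_applied = slave.apply("obj", 1.0, b"v1")  # stale, so it is skipped

print(master.has("obj"), slave.has("obj"))  # False False
```

In this model both zones end up without the object, because the delayed fetch of the older put is dropped on the slave. Operations with equal timestamps are simply ignored here; a real implementation would likely need a tie-break rule.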

Files

05a2d3657bb2aa1ac93616125.png (30 KB)

wanghao 王, 11/19/2019 12:37 PM