Bug #22804

closed

multisite Synchronization failed when read and write delete at the same time

Added by Amine Liu about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Casey Bodley
Target version:
% Done:
0%

Source:
Community (user)
Tags:
multisite sync failed
Backport:
jewel luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

<?xml version="1.0" encoding="UTF-8" ?>
<workload name="s3-sample" description="sample benchmark for s3">

  <storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://s3.z1.tt;path_style_access=true" />

  <workflow>

    <workstage name="init">  
      <work type="init" workers="1" config="cprefix=mix01;containers=r(1,2)" />
    </workstage>

    <workstage name="prepare">
      <work type="prepare" workers="1"  config="cprefix=mix01;containers=r(1,2);objects=r(1,10);sizes=c(100)KB" />
    </workstage>

    <workstage name="main">  
      <work name="main" workers="100" runtime="30">
        <operation type="read" ratio="30" config="cprefix=mix01;containers=u(1,2);objects=u(1,10)" />
        <operation type="write" ratio="40" config="cprefix=mix01;containers=u(1,2);objects=u(11,100);sizes=c(100)KB" />
        <operation type="delete" ratio="30" config="cprefix=mix01;containers=u(1,2);objects=u(11,100)" />
      </work>
    </workstage>

  </workflow>

</workload>

After the test, there are differences between the objects on the two sides. The logs show that these objects were `PUT` and `DELETE`d at the same time.

But when testing a mix of only PUT and GET, sync works fine.
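
To quantify the divergence after the run, the listing of one cosbench bucket can be compared between the two zones. This is only a sketch: master.cfg and slave.cfg are hypothetical s3cmd configs pointing at the master and slave endpoints, and the bucket name assumes cosbench's default <cprefix><n> container naming.

#!/bin/bash
# Compare one bucket's listing between the master and slave zones.
bucket="s3://mix011"   # one of the cosbench containers (cprefix=mix01)

s3cmd -c master.cfg ls "$bucket" | awk '{print $NF}' | sort > master.lst
s3cmd -c slave.cfg  ls "$bucket" | awk '{print $NF}' | sort > slave.lst

echo "master: $(wc -l < master.lst) objects, slave: $(wc -l < slave.lst) objects"
# Keys present on the master but missing on the slave:
comm -23 master.lst slave.lst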


Files

master-bilog.txt (316 KB) - Amine Liu, 02/08/2018 09:40 AM
slave-bilog.txt (144 KB) - Amine Liu, 02/08/2018 09:40 AM
master-index100k_objs_591-bilog.rar (1.83 KB) - Amine Liu, 04/16/2018 04:04 AM

Related issues 2 (0 open, 2 closed)

Copied to rgw - Backport #23690: luminous: multisite Synchronization failed when read and write delete at the same time (Resolved, Abhishek Lekshmanan)
Copied to rgw - Backport #23692: jewel: multisite Synchronization failed when read and write delete at the same time (Rejected)
Actions #1

Updated by John Spray about 6 years ago

  • Project changed from Ceph to rgw
  • Description updated (diff)
Actions #2

Updated by Amine Liu about 6 years ago

#!/bin/bash

usercfg="Ymliu.cfg"
bucket="s3://testmix"
file="s-mix011.txt"
# For each key: PUT it, wait, then fire a second PUT and a DELETE concurrently
# so that the write and the delete race each other.
for i in {1..100}; do
  {
    s3cmd -c "$usercfg" put "$file" "$bucket/$i" &
    sleep 5
    #s3cmd -c "$usercfg" rm "$bucket/$i"
    s3cmd -c "$usercfg" put "$file" "$bucket/$i" &
    s3cmd -c "$usercfg" rm "$bucket/$i" &
  } &
done

This script also reproduces the issue. On the master, `s3cmd ls` shows 99 objects while bucket stats reports `"num_objects": 96`; on the slave, `s3cmd ls` shows 89 objects and bucket stats reports `"num_objects": 89`.
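
For reference, a sketch of the checks behind these numbers (`slave.cfg` and the use of `jq` are assumptions; each `radosgw-admin` command has to run against its own zone's cluster):

# On the master zone:
s3cmd -c Ymliu.cfg ls s3://testmix | wc -l            # -> 99
radosgw-admin bucket stats --bucket=testmix \
  | jq '.usage."rgw.main".num_objects'                # -> 96

# On the slave zone (slave.cfg is a hypothetical config for the secondary endpoint):
s3cmd -c slave.cfg ls s3://testmix | wc -l            # -> 89
radosgw-admin bucket stats --bucket=testmix \
  | jq '.usage."rgw.main".num_objects'                # -> 89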

Actions #3

Updated by Amine Liu about 6 years ago

[root@sx-3f3r-ceph-s3-c1-03 my-cluster]# radosgw-admin bucket stats --bucket=testmix
{
"bucket": "testmix",
"zonegroup": "e1d5f39f-70f6-443e-98e8-3dc0b3b312f8",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "d49d824f-76c0-4d15-9219-ca7acf5c31b3.1514569.3",
"marker": "d49d824f-76c0-4d15-9219-ca7acf5c31b3.1514569.3",
"index_type": "Normal",
"owner": "Ymliu",
"ver": "0#163,1#74,2#265,3#246,4#161,5#164,6#179,7#244,8#166,9#220,10#179,11#180,12#134,13#180,14#194,15#103,16#78,17#154,18#170,19#214",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0",
"mtime": "2018-02-08 15:11:52.323452",
"max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#",
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 3
},
"rgw.main": {
"size": 4785792,
"size_actual": 5111808,
"size_utilized": 4785792,
"size_kb": 4674,
"size_kb_actual": 4992,
"size_kb_utilized": 4674,
"num_objects": 96
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

[root@sx-3f3r-ceph-s3-c1-03 my-cluster]# s3cmd -c Ymliu.cfg ls s3://testmix|wc -l
99
[root@sx-3f3r-ceph-s3-c1-03 my-cluster]#

Actions #4

Updated by Amine Liu about 6 years ago

[root@sx-3f3r-ceph-s3-c1-03 my-cluster]# radosgw-admin bucket check --fix --check-objects --bucket=testmix
[]
{
"object": "1",
"object": "10",
"object": "100",
"object": "11",
"object": "12",
"object": "13",
"object": "14",
"object": "15",
"object": "16",
"object": "17",
"object": "18",
"object": "19",
"object": "2",
"object": "20",
"object": "21",
"object": "22",
"object": "23",
"object": "24",
"object": "25",
"object": "26",
"object": "27",
"object": "28",
"object": "29",
"object": "3",
"object": "30",
"object": "31",
"object": "32",
"object": "33",
"object": "34",
"object": "35",
"object": "36",
"object": "37",
"object": "38",
"object": "39",
"object": "4",
"object": "40",
"object": "41",
"object": "42",
"object": "43",
"object": "44",
"object": "45",
"object": "46",
"object": "48",
"object": "49",
"object": "5",
"object": "50",
"object": "51",
"object": "52",
"object": "53",
"object": "54",
"object": "55",
"object": "56",
"object": "57",
"object": "58",
"object": "59",
"object": "6",
"object": "60",
"object": "61",
"object": "62",
"object": "63",
"object": "64",
"object": "65",
"object": "66",
"object": "67",
"object": "68",
"object": "69",
"object": "7",
"object": "70",
"object": "71",
"object": "72",
"object": "73",
"object": "74",
"object": "75",
"object": "76",
"object": "77",
"object": "78",
"object": "79",
"object": "8",
"object": "80",
"object": "81",
"object": "82",
"object": "83",
"object": "84",
"object": "85",
"object": "86",
"object": "87",
"object": "88",
"object": "89",
"object": "9",
"object": "90",
"object": "91",
"object": "92",
"object": "93",
"object": "94",
"object": "95",
"object": "96",
"object": "97",
"object": "98",
"object": "99"
}
{
"existing_header": {
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 3
},
"rgw.main": {
"size": 4785792,
"size_actual": 5111808,
"size_utilized": 4785792,
"size_kb": 4674,
"size_kb_actual": 4992,
"size_kb_utilized": 4674,
"num_objects": 96
}
}
},
"calculated_header": {
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 3
},
"rgw.main": {
"size": 4785792,
"size_actual": 5111808,
"size_utilized": 4785792,
"size_kb": 4674,
"size_kb_actual": 4992,
"size_kb_utilized": 4674,
"num_objects": 96
}
}
}
}

Updated by Amine Liu about 6 years ago

Another test; a dump of the bilog is attached:

On the master the ops (write, delete) are merged, but the slave misses some of them.
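
The attached dumps were presumably produced with something like the following sketch (the bucket name, object key, and --max-entries value are assumptions; run each command on a node in the corresponding zone):

# On the master zone:
radosgw-admin bilog list --bucket=testmix --max-entries=10000 > master-bilog.txt
# On the slave zone:
radosgw-admin bilog list --bucket=testmix --max-entries=10000 > slave-bilog.txt

# Entries recorded for a single key on each side, to spot the missing delete op:
obj="1"    # hypothetical key
grep -B3 -A6 "\"object\": \"$obj\"" master-bilog.txt
grep -B3 -A6 "\"object\": \"$obj\"" slave-bilog.txt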

Actions #7

Updated by Amine Liu about 6 years ago

Casey Bodley wrote:

A couple of PRs that are potentially related? https://github.com/ceph/ceph/pull/20396 https://github.com/ceph/ceph/pull/19895

yes, "rgw: fix index cancel op miss update header #20396" that is why index not equal by `ls`;

but "rgw: do not add cancel op in squash_map #19895" that does not seem to involve a OP merger, like miss `DELETE` on slave zone.

Actions #8

Updated by Yehuda Sadeh about 6 years ago

  • Subject changed from multipsite Synchronization failed when read and write delete at the same time to multisite Synchronization failed when read and write delete at the same time
  • Assignee set to Casey Bodley
  • Priority changed from Normal to High
Actions #9

Updated by Pengju Niu about 6 years ago

It is because the master tries to delete the first write of objA and that delete is canceled, but the bilog records the delete op after the second write. So the master RGW has objA while the slave RGW cannot find it. The PR is https://github.com/ceph/ceph/pull/20814.
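
A minimal single-key sketch of that interleaving (the config and source file come from the earlier script; the key name is hypothetical and the timing only sometimes hits the race):

#!/bin/bash
# First PUT of objA, then a second PUT raced against a DELETE of the same key.
# If the DELETE's index update is canceled but its bilog entry lands after the
# second PUT, the master keeps objA while the slave replays the DELETE last.
cfg="Ymliu.cfg"
key="s3://testmix/objA"
s3cmd -c "$cfg" put s-mix011.txt "$key"
s3cmd -c "$cfg" put s-mix011.txt "$key" &
s3cmd -c "$cfg" rm "$key" &
wait
s3cmd -c "$cfg" ls s3://testmix    # then compare with the same listing on the slave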

Actions #10

Updated by Amine Liu about 6 years ago

Pengju Niu wrote:

It is because the master tries to delete the first write of objA and that delete is canceled, but the bilog records the delete op after the second write. So the master RGW has objA while the slave RGW cannot find it. The PR is https://github.com/ceph/ceph/pull/20814.

Yes, your pull request solved my problem, thank you!

Actions #11

Updated by Yehuda Sadeh about 6 years ago

I commented on the PR.

Actions #12

Updated by Amine Liu about 6 years ago

Yehuda Sadeh wrote:

I commented on the PR.

Sorry, two days after this pull request, the bug reappeared.

Actions #14

Updated by Casey Bodley about 6 years ago

  • Status changed from New to Fix Under Review
  • Tags changed from multipsite sync failed to multisite sync failed
  • Backport set to jewel luminous
Actions #15

Updated by Amine Liu about 6 years ago

Casey Bodley wrote:

https://github.com/ceph/ceph/pull/21262

It seems to miss something, such as the error status code.

Actions #16

Updated by Casey Bodley about 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Abhishek Lekshmanan about 6 years ago

  • Copied to Backport #23690: luminous: multisite Synchronization failed when read and write delete at the same time added
Actions #19

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #23692: jewel: multisite Synchronization failed when read and write delete at the same time added
Actions #20

Updated by Amine Liu about 6 years ago

With cosbench running read/write/delete at the same time, the master's objects and index count were still higher than the slave's.

Actions #21

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved