Bug #22804

multisite Synchronization failed when read and write delete at the same time

Added by Tave liu over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
01/26/2018
Due date:
% Done:

0%

Source:
Community (user)
Tags:
multisite sync failed
Backport:
jewel luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:

Description

The issue reproduces with the following COSBench workload, which mixes read, write, and delete operations against two buckets:
<?xml version="1.0" encoding="UTF-8" ?>
<workload name="s3-sample" description="sample benchmark for s3">

  <storage type="s3" config="accesskey=test1;secretkey=test1;endpoint=http://s3.z1.tt;path_style_access=true" />

  <workflow>

    <workstage name="init">  
      <work type="init" workers="1" config="cprefix=mix01;containers=r(1,2)" />
    </workstage>

    <workstage name="prepare">
      <work type="prepare" workers="1"  config="cprefix=mix01;containers=r(1,2);objects=r(1,10);sizes=c(100)KB" />
    </workstage>

    <workstage name="main">  
      <work name="main" workers="100" runtime="30">
        <operation type="read" ratio="30" config="cprefix=mix01;containers=u(1,2);objects=u(1,10)" />
        <operation type="write" ratio="40" config="cprefix=mix01;containers=u(1,2);objects=u(11,100);sizes=c(100)KB" />
        <operation type="delete" ratio="30" config="cprefix=mix01;containers=u(1,2);objects=u(11,100)" />
      </work>
    </workstage>

  </workflow>

</workload>
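
For reference, a workload file like this is submitted through the COSBench CLI; the install path below is an assumption, not taken from the report:

cd /opt/cosbench                        # assumed COSBench install directory
sh cli.sh submit conf/s3-sample.xml     # queue the workload for execution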

After the test, the objects on the two sides differ. The logs show that the mismatched objects were `PUT` and `DELETE`d at the same time.

However, when testing only mixed PUT and GET, sync works fine.
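
A minimal sketch for comparing the two zones after a run (the config file names are illustrative; mix011 is the first bucket created by the workload above):

s3cmd -c master.cfg ls s3://mix011 | awk '{print $4}' | sort > master-objs.txt
s3cmd -c slave.cfg ls s3://mix011 | awk '{print $4}' | sort > slave-objs.txt
diff master-objs.txt slave-objs.txt    # objects present on only one side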

master-bilog.txt View (316 KB) Tave liu, 02/08/2018 09:40 AM

slave-bilog.txt View (144 KB) Tave liu, 02/08/2018 09:40 AM

master-index100k_objs_591-bilog.rar (1.83 KB) Tave liu, 04/16/2018 04:04 AM


Related issues

Copied to rgw - Backport #23690: luminous: multisite Synchronization failed when read and write delete at the same time Resolved
Copied to rgw - Backport #23692: jewel: multisite Synchronization failed when read and write delete at the same time Rejected

History

#1 Updated by John Spray over 1 year ago

  • Project changed from Ceph to rgw
  • Description updated (diff)

#2 Updated by Tave liu over 1 year ago

#!/bin/bash
# Repro: for each object key, race a second PUT against a DELETE of the same key.

usercfg="Ymliu.cfg"
bucket="s3://testmix"
file="s-mix011.txt"

for i in {1..100}; do
    {
        s3cmd -c "$usercfg" put "$file" "$bucket/$i" &    # first write
        sleep 5
        s3cmd -c "$usercfg" put "$file" "$bucket/$i" &    # second write, racing
        s3cmd -c "$usercfg" rm "$bucket/$i" &             # delete, racing
    } &
done
wait

This script also reproduces the issue: on the master, `s3cmd ls` shows 99 objects while bucket stats reports '"num_objects": 96'; on the slave, `s3cmd ls` shows 89 objects and bucket stats reports '"num_objects": 89'.

#3 Updated by Tave liu over 1 year ago

[root@sx-3f3r-ceph-s3-c1-03 my-cluster]# radosgw-admin bucket stats --bucket=testmix
{
    "bucket": "testmix",
    "zonegroup": "e1d5f39f-70f6-443e-98e8-3dc0b3b312f8",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "d49d824f-76c0-4d15-9219-ca7acf5c31b3.1514569.3",
    "marker": "d49d824f-76c0-4d15-9219-ca7acf5c31b3.1514569.3",
    "index_type": "Normal",
    "owner": "Ymliu",
    "ver": "0#163,1#74,2#265,3#246,4#161,5#164,6#179,7#244,8#166,9#220,10#179,11#180,12#134,13#180,14#194,15#103,16#78,17#154,18#170,19#214",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0",
    "mtime": "2018-02-08 15:11:52.323452",
    "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#",
    "usage": {
        "rgw.none": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 3
        },
        "rgw.main": {
            "size": 4785792,
            "size_actual": 5111808,
            "size_utilized": 4785792,
            "size_kb": 4674,
            "size_kb_actual": 4992,
            "size_kb_utilized": 4674,
            "num_objects": 96
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}

[root@sx-3f3r-ceph-s3-c1-03 my-cluster]# s3cmd -c Ymliu.cfg ls s3://testmix|wc -l
99
[root@sx-3f3r-ceph-s3-c1-03 my-cluster]#

#4 Updated by Tave liu over 1 year ago

[root@sx-3f3r-ceph-s3-c1-03 my-cluster]# radosgw-admin bucket check --fix --check-objects --bucket=testmix
[] {
"object": "1",
"object": "10",
"object": "100",
"object": "11",
"object": "12",
"object": "13",
"object": "14",
"object": "15",
"object": "16",
"object": "17",
"object": "18",
"object": "19",
"object": "2",
"object": "20",
"object": "21",
"object": "22",
"object": "23",
"object": "24",
"object": "25",
"object": "26",
"object": "27",
"object": "28",
"object": "29",
"object": "3",
"object": "30",
"object": "31",
"object": "32",
"object": "33",
"object": "34",
"object": "35",
"object": "36",
"object": "37",
"object": "38",
"object": "39",
"object": "4",
"object": "40",
"object": "41",
"object": "42",
"object": "43",
"object": "44",
"object": "45",
"object": "46",
"object": "48",
"object": "49",
"object": "5",
"object": "50",
"object": "51",
"object": "52",
"object": "53",
"object": "54",
"object": "55",
"object": "56",
"object": "57",
"object": "58",
"object": "59",
"object": "6",
"object": "60",
"object": "61",
"object": "62",
"object": "63",
"object": "64",
"object": "65",
"object": "66",
"object": "67",
"object": "68",
"object": "69",
"object": "7",
"object": "70",
"object": "71",
"object": "72",
"object": "73",
"object": "74",
"object": "75",
"object": "76",
"object": "77",
"object": "78",
"object": "79",
"object": "8",
"object": "80",
"object": "81",
"object": "82",
"object": "83",
"object": "84",
"object": "85",
"object": "86",
"object": "87",
"object": "88",
"object": "89",
"object": "9",
"object": "90",
"object": "91",
"object": "92",
"object": "93",
"object": "94",
"object": "95",
"object": "96",
"object": "97",
"object": "98",
"object": "99"
}
{
    "existing_header": {
        "usage": {
            "rgw.none": {
                "size": 0,
                "size_actual": 0,
                "size_utilized": 0,
                "size_kb": 0,
                "size_kb_actual": 0,
                "size_kb_utilized": 0,
                "num_objects": 3
            },
            "rgw.main": {
                "size": 4785792,
                "size_actual": 5111808,
                "size_utilized": 4785792,
                "size_kb": 4674,
                "size_kb_actual": 4992,
                "size_kb_utilized": 4674,
                "num_objects": 96
            }
        }
    },
    "calculated_header": {
        "usage": {
            "rgw.none": {
                "size": 0,
                "size_actual": 0,
                "size_utilized": 0,
                "size_kb": 0,
                "size_kb_actual": 0,
                "size_kb_utilized": 0,
                "num_objects": 3
            },
            "rgw.main": {
                "size": 4785792,
                "size_actual": 5111808,
                "size_utilized": 4785792,
                "size_kb": 4674,
                "size_kb_actual": 4992,
                "size_kb_utilized": 4674,
                "num_objects": 96
            }
        }
    }
}

#5 Updated by Tave liu over 1 year ago

Another test; a dump of the bilog is attached.

The master merges the ops (write, delete), but the slave misses some of them.
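
For reference, the per-bucket bilog on each zone can be dumped with radosgw-admin (presumably how the attached files were produced; the bucket name is from the earlier repro):

radosgw-admin bilog list --bucket=testmix > master-bilog.txt   # run on the master zone
radosgw-admin bilog list --bucket=testmix > slave-bilog.txt    # run on the slave zone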

#7 Updated by Tave liu over 1 year ago

Casey Bodley wrote:

A couple of PRs that are potentially related? https://github.com/ceph/ceph/pull/20396 https://github.com/ceph/ceph/pull/19895

Yes, "rgw: fix index cancel op miss update header" (#20396) explains why the index count does not match `ls`;

but "rgw: do not add cancel op in squash_map" (#19895) does not seem to address the op merging problem, such as the missed `DELETE` on the slave zone.

#8 Updated by Yehuda Sadeh over 1 year ago

  • Subject changed from multipsite Synchronization failed when read and write delete at the same time to multisite Synchronization failed when read and write delete at the same time
  • Assignee set to Casey Bodley
  • Priority changed from Normal to High

#9 Updated by Pengju Niu over 1 year ago

It is because the master tries to delete the first write of objA, and that delete is canceled; however, the bilog records the delete op after the second write. So the master RGW has objA while the slave RGW cannot find it. PR: https://github.com/ceph/ceph/pull/20814
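
A minimal sketch of the interleaving described above (the object name is illustrative; the bucket and config file are from the earlier repro script):

s3cmd -c Ymliu.cfg put file.txt s3://testmix/objA      # first write
s3cmd -c Ymliu.cfg put file.txt s3://testmix/objA &    # second write, racing
s3cmd -c Ymliu.cfg rm s3://testmix/objA &              # delete, racing
wait
# If the delete races with the second write, it is canceled on the master,
# but the bilog may still record it after that write; the slave then
# replays write -> delete and loses objA while the master keeps it.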

#10 Updated by Tave liu over 1 year ago

Pengju Niu wrote:

It is because the master tries to delete the first write of objA, and that delete is canceled; however, the bilog records the delete op after the second write. So the master RGW has objA while the slave RGW cannot find it. PR: https://github.com/ceph/ceph/pull/20814

Yes, your pull request solved my problem, thank you!

#11 Updated by Yehuda Sadeh over 1 year ago

I commented on the PR.

#12 Updated by Tave liu over 1 year ago

Yehuda Sadeh wrote:

I commented on the PR.

Sorry, two days after this pull request, the bug reappeared.

#14 Updated by Casey Bodley over 1 year ago

  • Status changed from New to Need Review
  • Tags changed from multipsite sync failed to multisite sync failed
  • Backport set to jewel luminous

#15 Updated by Tave liu over 1 year ago

Casey Bodley wrote:

https://github.com/ceph/ceph/pull/21262

It seems to miss something, such as the error status code.

#16 Updated by Casey Bodley over 1 year ago

  • Status changed from Need Review to Pending Backport

#18 Updated by Abhishek Lekshmanan over 1 year ago

  • Copied to Backport #23690: luminous: multisite Synchronization failed when read and write delete at the same time added

#19 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #23692: jewel: multisite Synchronization failed when read and write delete at the same time added

#20 Updated by Tave liu over 1 year ago

With COSBench running read/write/delete at the same time, the master's object and index counts were still higher than the slave's.

#21 Updated by Nathan Cutler 10 months ago

  • Status changed from Pending Backport to Resolved
