Bug #62774

open

datalog crash consistency: missing objects on the secondary site after the multisite replication is done

Added by Jane Zhu 8 months ago. Updated 2 months ago.

Status:
In Progress
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
multisite
Backport:
quincy reef
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Missing objects observed on the secondary site after the multisite replication is done.
Reproduced with the Ceph main branch (as of Aug. 29th, 2023).

Multisite clusters configuration:
  • 2 zones
  • each cluster has 3 rgw nodes, and 16 rgw instances per node
Client traffic:
  • cosbench write only, 15 users, 30 workers, 4 hrs
  • generated 1800 buckets, >10 million objects

Waited until replication was done and the sync status of both the primary and secondary sites showed all caught up.

An example of missing object:
bucket: prod-mixed-8m-thisisbcstestuser0156008
object: myobjects115758599

On primary:

$  sudo radosgw-admin -n client.rgw.`hostname`.dev.1 bucket list --max-entries=100000 --bucket=prod-mixed-8m-thisisbcstestuser0156008 | grep "\"name\"" | wc -l
1098

$ sudo radosgw-admin -n client.rgw.`hostname`.dev.1 object stat --bucket=prod-mixed-8m-thisisbcstestuser0156008 --object=myobjects115758599
{
    "name": "myobjects115758599",
    "size": 8000000,
...

On secondary:

$ sudo radosgw-admin -n client.rgw.`hostname`.dev.1 bucket list --max-entries=100000 --bucket=prod-mixed-8m-thisisbcstestuser0156008 | grep "\"name\"" | wc -l
1097

$ sudo radosgw-admin -n client.rgw.`hostname`.dev.1 object stat --bucket=prod-mixed-8m-thisisbcstestuser0156008 --object=myobjects115758599
ERROR: failed to stat object, returned error: (2) No such file or directory

This is a different issue from https://tracker.ceph.com/issues/62709, since running "radosgw-admin bucket list" doesn't fix the mismatch.

Investigating now. Will update with more findings.


Related issues (1 open, 0 closed)

Related to rgw - Bug #63235: multisite: dir suggest changes update bilog but not datalog (Fix Under Review, Shilpa MJ)

Actions #1

Updated by Casey Bodley 8 months ago

  • Tags set to multisite
Actions #2

Updated by Jane Zhu 7 months ago

  • Status changed from New to In Progress
  • Priority changed from Normal to High

Root cause analysis

The multisite replication mismatches I encountered in my tests arose in 3 scenarios:

1. At the end of a PUT request, an entry should be added to the corresponding datalog shard to indicate that the bucket shard has changes. These datalog updates are cached and squashed before being sent to RADOS if they fall within a window determined by the config option "rgw_data_log_window". Please see the code snippet: https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_datalog.cc#L689-L694.

With this mechanism, if an rgw instance crashes, the cached entries are lost. If a lost change happens to be the last one for a bucket index shard, the incremental sync process may never have a chance to realize there is a new change.
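To make the failure mode in scenario 1 concrete, here is a minimal toy model (not the actual Ceph code; class and field names are invented for illustration) of a time-window squashing cache: repeated changes to the same bucket shard within the window collapse into one pending in-memory entry, which only becomes durable on flush.

```python
class DatalogCache:
    """Toy model of rgw_data_log_window-style squashing (hypothetical)."""

    def __init__(self, window_secs=30):
        self.window = window_secs
        self.pending = {}    # shard -> deadline; in-memory only
        self.persisted = []  # stands in for the RADOS-backed datalog

    def add_entry(self, shard, now):
        # A later change within the window is squashed into the pending entry.
        if shard in self.pending and now < self.pending[shard]:
            return
        self.pending[shard] = now + self.window

    def flush(self):
        # Only at flush time do pending entries become durable.
        self.persisted.extend(self.pending)
        self.pending.clear()

cache = DatalogCache(window_secs=30)
cache.add_entry("shard.94", now=0.0)
cache.add_entry("shard.94", now=1.0)  # squashed into the pending entry
# If the rgw process crashes here, cache.pending is simply gone: the
# change was never recorded in the durable datalog, so incremental sync
# on the secondary never learns this bucket shard changed.
assert cache.persisted == []          # nothing durable before the flush
cache.flush()
assert cache.persisted == ["shard.94"]
```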

2. At the end of a PUT request, before adding an entry to the datalog, a bucket_complete_op is triggered to ask RADOS to add a bilog entry to the bucket index shard. This bucket_complete_op is currently executed asynchronously. Please see the code: https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_rados.cc#L6751-L6754.

However, add_entry() (which adds an entry to the datalog) is called without waiting for the bucket_complete_op to complete. Please see the code: https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_rados.cc#L9051-L9058. If data log incremental sync happens between the datalog entry being added and the bilog entry being added, the bilog entry will be ignored, and will never be able to be synced if this change happens to be the last one in this bucket index shard.
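The race in scenario 2 can be replayed as a single-threaded event trace; the function names below are illustrative stand-ins, not the actual Ceph entry points:

```python
events = []

def bucket_complete_op_async():
    # The bilog write is issued asynchronously; it is not durable yet.
    events.append("bilog_write_queued")

def datalog_add_entry():
    # add_entry() does not wait for the bilog write to complete.
    events.append("datalog_entry_written")

def incremental_sync():
    # Sync is driven by the datalog; it then lists the bilog shard.
    events.append("sync_saw_bilog=%s" % ("bilog_entry_written" in events))

def bilog_write_completes():
    events.append("bilog_entry_written")

bucket_complete_op_async()
datalog_add_entry()
incremental_sync()       # runs in the race window: bilog entry absent
bilog_write_completes()  # lands too late; the datalog entry was consumed

assert "sync_saw_bilog=False" in events
```

If this is the last change on the bucket index shard, no later datalog entry ever re-triggers a bilog listing, so the object stays missing on the secondary.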

3. This case is different from the above two. Here, the object has been uploaded to the primary and remains there, but the client may have gotten a request timeout. It looks to me like the timeout is caused by slow OSD responses while rgw is reading bucket.sync-source-hints/bucket.sync-target-hints. Please see the following logs.

2023-09-26T17:40:19.249-0400 7f96494c1700  1 ====== starting new request req=0x7f937ee1b790 =====
2023-09-26T17:40:19.249-0400 7f96494c1700  2 req 17939659518360353270 0.000000000s initializing for trans_id = tx00000f8f679489752f9f6-0065134fc3-488f-dev-zone-bcc-master
...
2023-09-26T17:40:19.530-0400 7f9a11159700  0 BITX: ENTERING cls_obj_complete_op: bucket-shard=BucketShard:{ bucket=:prod-mixed-8m-thisisbcstestuser0046009[d672bf31-6599-4cac-99ad-782c15758eb8.7647.30]), shard_id=
94, bucket_ojb=dev-zone-bcc-master.rgw.buckets.index:.dir.d672bf31-6599-4cac-99ad-782c15758eb8.7647.30.94} obj=prod-mixed-8m-thisisbcstestuser0046009:myobjects114062567 tag=d672bf31-6599-4cac-99ad-782c15758eb8.18
575.17939659518360353270 op=write, remove_objs=
...
2023-09-26T17:40:19.560-0400 7f99837bd700  0 BITX: EXITING cls_obj_complete_op: ret=0
2023-09-26T17:40:19.560-0400 7f99837bd700 10 req 17939659518360353270 0.310998738s s3:put_obj cache get: name=dev-zone-bcc-master.rgw.log++bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 : miss
2023-09-26T17:40:19.560-0400 7f99837bd700 20 req 17939659518360353270 0.310998738s s3:put_obj rados->read ofs=0 len=0
2023-09-26T17:40:19.560-0400 7f99837bd700  1 -- 10.74.2.81:0/3350378031 --> [v2:10.74.6.94:6808/599963301,v1:10.74.6.94:6809/599963301] -- osd_op(unknown.0.0:145873 3.18 3:19889297:::bucket.sync-source-hints.prod
-mixed-8m-thisisbcstestuser0046009:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e61) v8 -- 0x556c666ba380 con 0x556c5db8ec00
...
2023-09-26T17:40:20.158-0400 7f9bc6dee700  1 -- 10.74.2.81:0/3350378031 <== osd.4 v2:10.74.6.94:6808/599963301 23673 ==== osd_op_reply(145873 bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 [call,
read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 249+0+0 (crc 0 0 0) 0x556c710db680 con 0x556c5db8ec00
2023-09-26T17:40:20.159-0400 7f989d3ed700 20 req 17939659518360353270 0.909996331s s3:put_obj rados_obj.operate() r=-2 bl.length=0
2023-09-26T17:40:20.159-0400 7f989d3ed700 10 req 17939659518360353270 0.909996331s s3:put_obj cache put: name=dev-zone-bcc-master.rgw.log++bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 info.flags=0x0
2023-09-26T17:40:20.159-0400 7f989d3ed700 10 req 17939659518360353270 0.909996331s s3:put_obj adding dev-zone-bcc-master.rgw.log++bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 to cache LRU end
2023-09-26T17:40:20.159-0400 7f989d3ed700 10 req 17939659518360353270 0.909996331s s3:put_obj cache get: name=dev-zone-bcc-master.rgw.log++bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 : miss
2023-09-26T17:40:20.159-0400 7f989d3ed700 20 req 17939659518360353270 0.909996331s s3:put_obj rados->read ofs=0 len=0
2023-09-26T17:40:20.159-0400 7f989d3ed700  1 -- 10.74.2.81:0/3350378031 --> [v2:10.74.6.94:6808/599963301,v1:10.74.6.94:6809/599963301] -- osd_op(unknown.0.0:146173 3.18 3:1fec4601:::bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e61) v8 -- 0x556c762b8e00 con 0x556c5db8ec00
...
2023-09-26T17:40:20.425-0400 7f9bc6dee700  1 -- 10.74.2.81:0/3350378031 <== osd.4 v2:10.74.6.94:6808/599963301 23692 ==== osd_op_reply(146173 bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 [call,read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 249+0+0 (crc 0 0 0) 0x556c710db680 con 0x556c5db8ec00
...
2023-09-26T17:41:16.993-0400 7fc5991dc700 10 rgw async rados processor: cache get: name=dev-zone-bcc-master.rgw.log++bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 : miss
2023-09-26T17:41:16.993-0400 7fc5991dc700 20 rgw async rados processor: rados->read ofs=0 len=0
2023-09-26T17:41:16.993-0400 7fc5991dc700  1 -- 10.74.2.81:0/1779277127 --> [v2:10.74.6.94:6808/599963301,v1:10.74.6.94:6809/599963301] -- osd_op(unknown.0.0:55426 3.18 3:19889297:::bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e61) v8 -- 0x56203eb0d180 con 0x5620308eec00
2023-09-26T17:41:16.994-0400 7fc5ada05700  1 -- 10.74.2.81:0/1779277127 <== osd.4 v2:10.74.6.94:6808/599963301 10191 ==== osd_op_reply(55426 bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 [call,read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 249+0+0 (crc 0 0 0) 0x5620306d1d40 con 0x5620308eec00
2023-09-26T17:41:16.994-0400 7fc5991dc700 20 rgw async rados processor: rados_obj.operate() r=-2 bl.length=0
2023-09-26T17:41:16.994-0400 7fc5991dc700 10 rgw async rados processor: cache put: name=dev-zone-bcc-master.rgw.log++bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 info.flags=0x0
2023-09-26T17:41:16.994-0400 7fc5991dc700 10 rgw async rados processor: adding dev-zone-bcc-master.rgw.log++bucket.sync-source-hints.prod-mixed-8m-thisisbcstestuser0046009 to cache LRU end
2023-09-26T17:41:16.994-0400 7fc5991dc700 10 rgw async rados processor: cache get: name=dev-zone-bcc-master.rgw.log++bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 : miss
2023-09-26T17:41:16.994-0400 7fc5991dc700 20 rgw async rados processor: rados->read ofs=0 len=0
2023-09-26T17:41:16.994-0400 7fc5991dc700  1 -- 10.74.2.81:0/1779277127 --> [v2:10.74.6.94:6808/599963301,v1:10.74.6.94:6809/599963301] -- osd_op(unknown.0.0:55427 3.18 3:1fec4601:::bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e61) v8 -- 0x562041360700 con 0x5620308eec00
2023-09-26T17:41:16.995-0400 7fc5ada05700  1 -- 10.74.2.81:0/1779277127 <== osd.4 v2:10.74.6.94:6808/599963301 10192 ==== osd_op_reply(55423 bucket.sync-status.88795ff7-a4eb-4845-b669-e75b6a1282ab:prod-mixed-2k-thisisbcstestuser0112001:d672bf31-6599-4cac-99ad-782c15758eb8.7833.25:5 [call,call,setxattr (37)] v61'53948 uv53948 ondisk = 0) v8 ==== 369+0+0 (crc 0 0 0) 0x5620306d1d40 con 0x5620308eec00
2023-09-26T17:41:16.995-0400 7fc5ada05700  1 -- 10.74.2.81:0/1779277127 <== osd.4 v2:10.74.6.94:6808/599963301 10193 ==== osd_op_reply(55427 bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 [call,read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 249+0+0 (crc 0 0 0) 0x5620306d1d40 con 0x5620308eec00
2023-09-26T17:41:16.995-0400 7fc5991dc700 20 rgw async rados processor: rados_obj.operate() r=-2 bl.length=0
2023-09-26T17:41:16.995-0400 7fc5991dc700 10 rgw async rados processor: cache put: name=dev-zone-bcc-master.rgw.log++bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 info.flags=0x0
2023-09-26T17:41:16.995-0400 7fc5991dc700 10 rgw async rados processor: adding dev-zone-bcc-master.rgw.log++bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 to cache LRU end
2023-09-26T17:41:16.995-0400 7fc5991dc700 20 rgw async rados processor: reflect(): flow manager (bucket=prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30): adding source pipe: {s={b=prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30,z=d672bf31-6599-4cac-99ad-782c15758eb8,az=0},d={b=prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30,z=88795ff7-a4eb-4845-b669-e75b6a1282ab,az=0}}
2023-09-26T17:41:16.995-0400 7fc5991dc700 20 rgw async rados processor: reflect(): flow manager (bucket=prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30): adding dest pipe: {s={b=prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30,z=88795ff7-a4eb-4845-b669-e75b6a1282ab,az=0},d={b=prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30,z=d672bf31-6599-4cac-99ad-782c15758eb8,az=0}}
2023-09-26T17:41:16.995-0400 7fc5991dc700 20 rgw async rados processor: reflect(): flow manager (bucket=): adding source pipe: {s={b=*,z=d672bf31-6599-4cac-99ad-782c15758eb8,az=0},d={b=*,z=88795ff7-a4eb-4845-b669-e75b6a1282ab,az=0}}
2023-09-26T17:41:16.995-0400 7fc5991dc700 20 rgw async rados processor: reflect(): flow manager (bucket=): adding dest pipe: {s={b=*,z=88795ff7-a4eb-4845-b669-e75b6a1282ab,az=0},d={b=*,z=d672bf31-6599-4cac-99ad-782c15758eb8,az=0}}
2023-09-26T17:41:16.995-0400 7fc5991dc700 10 rgw async rados processor: chain_cache_entry: cache_locator=dev-zone-bcc-master.rgw.meta+root+.bucket.meta.prod-mixed-8m-thisisbcstestuser0046009:d672bf31-6599-4cac-99ad-782c15758eb8.7647.30
2023-09-26T17:41:16.995-0400 7fc5a11ec700 10 data sync zone:88795ff7 cache get: name=dev-zone-bcc-master.rgw.log++bucket.sync-target-hints.prod-mixed-8m-thisisbcstestuser0046009 : hit (negative entry)

====================================
I verified the analysis of scenarios 1 and 2 with some quick-and-dirty fixes. These two issues could previously be reproduced reliably with the cosbench workload described in the issue description, but are gone with the (temporary) fixes. Scenario 3 is derived purely from the logs and the source code.

Proposed fixes

As far as consistency is concerned, datalog and bilog entries should be treated like the data itself, and should be made as crash-proof as possible. The response shouldn't be sent to the client before these log entries are durably stored; otherwise, if some of the entries are lost for some reason, the secondary site won't be able to catch up with the primary unless a full sync is performed.

1. Bypass the datalog change cache so that entries are persisted as soon as the changes happen. The secondary can squash duplicate entries before doing the bucket sync.
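Under proposed fix 1, every change is written through immediately, so the durable datalog may contain duplicates for the same bucket shard; the secondary can deduplicate before bucket sync. A minimal sketch of that secondary-side squash (a hypothetical helper, not existing Ceph code):

```python
def squash_datalog(entries):
    """Collapse duplicate bucket-shard entries, keeping first-seen order.
    Hypothetical secondary-side dedup for proposed fix 1."""
    seen = set()
    out = []
    for shard in entries:
        if shard not in seen:
            seen.add(shard)
            out.append(shard)
    return out

# With write-through (crash-safe) datalog updates, duplicates appear:
log = ["shard.94", "shard.94", "shard.12", "shard.94"]
assert squash_datalog(log) == ["shard.94", "shard.12"]
```

The trade-off versus the current design is more RADOS writes on the primary in exchange for never losing a pending entry to a crash.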

2. Make the bucket_complete_op operation synchronous, so that the datalog entry isn't added before the bucket_complete_op returns. I tried adding the datalog entry in the completion handler of the bucket_complete_op, but encountered the same problem as in scenario 1, only in a different place, with a different in-memory queue.
I'm not sure about the performance impact of this proposed fix, though.
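The ordering that fix 2 aims for can be sketched with a future that the request path blocks on before touching the datalog (a toy model; the function names are illustrative, not the Ceph API):

```python
from concurrent.futures import ThreadPoolExecutor

log = []

def bucket_complete_op(shard):
    # Stand-in for the (now-synchronous) bilog write; 0 = success.
    log.append(("bilog", shard))
    return 0

def add_entry(shard):
    log.append(("datalog", shard))

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(bucket_complete_op, "shard.94")
    ret = fut.result()  # block until the bilog entry is durable
    if ret == 0:
        add_entry("shard.94")  # datalog entry strictly after the bilog one

# The bilog entry is always visible before the datalog entry, closing
# the race window from scenario 2.
assert log == [("bilog", "shard.94"), ("datalog", "shard.94")]
```

The cost is one extra round-trip of latency on the PUT path, which is the performance concern noted above.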

3. I was thinking about adding the datalog entry before bucket.sync-source-hints/bucket.sync-target-hints are read, but realized that it might cause issues in bucket sync when a bucket sync policy is involved.
What if we leave this as it is, since the client doesn't get a success response anyway? But then the multisite clusters are nonetheless left in an inconsistent state, which is not good.

Actions #3

Updated by Shilpa MJ 7 months ago

The crash that we fixed related to incremental sync was:
https://github.com/ceph/ceph/pull/53676

Actions #4

Updated by Casey Bodley 7 months ago

  • Assignee set to Jane Zhu
  • Backport set to pacific quincy reef
Actions #5

Updated by Casey Bodley 7 months ago

  • Subject changed from missing objects on the secondary site after the multisite replication is done to datalog crash consistency: missing objects on the secondary site after the multisite replication is done
  • Backport changed from pacific quincy reef to quincy reef
Actions #6

Updated by Casey Bodley 7 months ago

  • Related to Bug #63235: multisite: dir suggest changes update bilog but not datalog added
Actions #7

Updated by Shilpa MJ 6 months ago

A temporary datalog add_entry() was introduced into dir_suggest_changes() until we have an async version of it.
https://github.com/ceph/ceph/pull/54086

Actions #8

Updated by Jane Zhu 2 months ago

  • Assignee changed from Jane Zhu to Adam Emerson