Bug #63549

open

rgw-multisite: occasionally bucket full sync fails to sync objects

Added by Shilpa MJ 6 months ago. Updated 5 months ago.

Status:
New
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
multisite-backlog
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1. Configure multisite with three zones.
2. Stop the rgw service on one of the non-master zones.
3. Create a versioned bucket and objects on the primary. (May be reproducible with a non-versioned bucket as well; not tried.)
4. Allow them to sync to the other zone.
5. Start the rgw service on the zone where it was stopped.
6. Wait for sync status to show as caught up.
7. List the bucket: objects are missing.
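A rough sketch of the steps above using systemctl and the AWS CLI against the RGW endpoints; the service unit name, endpoints, and source-zone are placeholders, and the bucket/object names are taken from the logs in this report:

```shell
# Zone B (a non-master zone): stop its rgw service (unit name is a placeholder).
systemctl stop ceph-radosgw@rgw.<zoneB-instance>

# Against the primary zone endpoint: create a versioned bucket and objects.
aws --endpoint-url http://<primary-endpoint> s3api create-bucket --bucket new-bucket
aws --endpoint-url http://<primary-endpoint> s3api put-bucket-versioning \
    --bucket new-bucket --versioning-configuration Status=Enabled
for i in 1 2 3; do
  aws --endpoint-url http://<primary-endpoint> s3 cp ./obj$i s3://new-bucket/c1obj$i
done

# Let the surviving secondary catch up, then restart zone B's rgw.
systemctl start ceph-radosgw@rgw.<zoneB-instance>

# Wait for sync to report caught up, then list the bucket on zone B.
radosgw-admin sync status
aws --endpoint-url http://<zoneB-endpoint> s3 ls s3://new-bucket
```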

The logs on the zone in question suggest that a stat on the bucket index shards returns ENOENT:

2023-11-15T11:07:19.560-0500 7fbf246d86c0 1 -- 10.0.0.184:0/130016932 <== osd.0 v2:10.0.0.184:6808/1757553665 1757 ==== osd_op_reply(1779 .dir.b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1.2 [stat,call,call] v15'1 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 278+0+0 (crc 0 0 0) 0x55a3faffb200 con 0x55a3f9560000
2023-11-15T11:07:19.560-0500 7fbf13eb76c0 0 rgw async rados processor: ERROR: bucket shard callback failed. obj=c1obj3[QxWtdsgVZbZsuUrjxsccy3tqCsXpLaI]. ret=(2) No such file or directory
2023-11-15T11:07:19.560-0500 7fbf02e956c0 10 RGW-SYNC:data:sync:shard5:entry[new-bucket:b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1:20]:bucket_sync_sources[source=:new-bucket[b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1]):2:source_zone=b9832d17-c0bc-4b61-b22d-01e01903fc11]:bucket[new-bucket:b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1<-new-bucket:b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1:2]:full_sync[new-bucket:b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1:2]:entry[c1obj3]: failed, retcode=-2 ((2) No such file or directory)
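The retcode=-2 in these messages is a negated POSIX errno value. As a generic illustration (not Ceph code), it can be decoded like this:

```python
import errno
import os

# RGW/OSD ops report failures as negative errno values.
retcode = -2

# Map the negated errno back to its symbolic name and message.
name = errno.errorcode[-retcode]   # symbolic name, e.g. ENOENT
message = os.strerror(-retcode)    # human-readable description

print(f"retcode={retcode} -> {name}: {message}")
```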

The error appears to come from the UpdateIndex::prepare op's call to guard_reshard() during the write_meta() call.

Further, the logs suggest that the create op on these bucket shards succeeded much later:

2023-11-15T11:07:20.644-0500 7fbf246d86c0 1 -- 10.0.0.184:0/130016932 <== osd.0 v2:10.0.0.184:6808/1757553665 1858 ==== osd_op_reply(1880 .dir.b9832d17-c0bc-4b61-b22d-01e01903fc11.4417.1.2 [create,call] v15'61 uv61 ondisk = 0) v8 ==== 236+0+0 (crc 0 0 0) 0x55a3faffb200 con 0x55a3f9560000

Both the data and bucket sync status show caught up. A bucket sync init/run appears to work around the issue and syncs the missing objects.
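The workaround above can be sketched as the following radosgw-admin invocations, run on the zone that missed the objects; the bucket name comes from the logs in this report and the source-zone is a placeholder:

```shell
# Re-initialize bucket sync state, then run a full bucket sync:
radosgw-admin bucket sync init --bucket=new-bucket --source-zone=<source-zone>
radosgw-admin bucket sync run --bucket=new-bucket --source-zone=<source-zone>

# Check per-bucket sync state afterwards:
radosgw-admin bucket sync status --bucket=new-bucket
```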

Actions #1

Updated by Casey Bodley 6 months ago

  • Priority changed from Normal to High
Actions #2

Updated by Shilpa MJ 5 months ago

  • Tags set to multisite-backlog
Actions #3

Updated by Shilpa MJ 5 months ago

  • Assignee set to Shilpa MJ