Bug #64999

open

Slow RGW multisite sync due to "304 Not Modified" responses on primary zone

Added by Mohammad Saif about 1 month ago. Updated 2 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/quincy-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
We have two clusters (v18.2.1), used primarily for RGW and holding over 2 billion RGW objects.
They are in a multisite configuration with two zones, and we have around 2 Gbps of
dedicated (P2P) bandwidth for the multisite traffic. According to
"radosgw-admin sync status" on zone 2, all 128 shards are recovering, yet very little
data is transferred from the primary zone: link utilization is barely 100 Mbps out of
2 Gbps. Our objects are also quite small, averaging about 1 MB in size.
On further inspection, we noticed that the RGW access logs at the primary site mostly
show "304 Not Modified" responses for requests from the site-2 RGWs. Is this expected?
Here are some of the logs (information is redacted):

root@host-04:~# tail -f /var/log/haproxy-msync.log
Feb 12 05:06:51 host-04 haproxy971171: 10.1.85.14:33730 [12/Feb/2024:05:06:51.047]
https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---
56/55/1/0/0 0/0 "GET
/bucket1/object1.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7
HTTP/1.1"
Feb 12 05:06:51 host-04 haproxy971171: 10.1.85.14:59730 [12/Feb/2024:05:06:51.048]
https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/3/1/0 0/0 "GET
/bucket1/object91.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7
HTTP/1.1"

We also looked at our Grafana instance: out of roughly 1000 requests per second, 200 return
"200 OK" and 800 return "304 Not Modified". Sync threads run on only
2 RGW daemons per zone, and those daemons sit behind a load balancer. "radosgw-admin sync error
list" also shows around 20 errors, most of which are automatically recoverable.
As we understand it, does this mean that the RGW multisite sync logs in the log pool are yet to be
generated, or something of that sort? Please give us some insight and let us know how to resolve
this.

Thanks,
Mohammad Saif


Related issues 1 (0 open, 1 closed)

Has duplicate: rgw - Bug #65071: Slow RGW multisite sync due to "304 Not Modified" responses on primary zone (Duplicate)

Actions #1

Updated by Casey Bodley about 1 month ago

  • Has duplicate Bug #65071: Slow RGW multisite sync due to "304 Not Modified" responses on primary zone added
Actions #2

Updated by Shilpa MJ 28 days ago

304 Not Modified means that there has been no change in the object since it was last synced.
You mention that all shards are in the recovering state. Was a full data sync performed?
In that case, 304 is expected. If not, are there any other error codes in the logs?
Please also share the output of radosgw-admin sync status.
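
For reference, the data sync progress against a specific source zone can also be queried directly; a minimal sketch, where "dc" is the primary zone name in this deployment:

# data sync status against one source zone (full vs. incremental state per shard)
radosgw-admin data sync status --source-zone=dc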

Thanks,
Shilpa

Actions #3

Updated by Mohammad Saif 24 days ago

Hello Shilpa,
Thanks for your response.

At the moment, the full data has not yet been synced.
Currently the DC site holds 521 TiB, while only 183 TiB has been synced to DR. We are also noticing that only incremental data is being synced.
In the haproxy logs we are not seeing any status codes other than "304" and "200 OK".

We also added the following parameters to our configuration, as suggested by the community (reference attached), but we don't see any improvement.

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IL3TPYACCTKYZDS7AEAB4DYRTDSAXUPD/#IL3TPYACCTKYZDS7AEAB4DYRTDSAXUPD

client.rgw dev rgw_bucket_sync_spawn_window 40
client.rgw dev rgw_data_sync_spawn_window 40
client.rgw dev rgw_meta_sync_spawn_window 40
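
For what it's worth, the same options can be applied and verified cluster-wide with ceph config; a minimal sketch mirroring the values above:

# apply the sync spawn-window settings to all RGW daemons
ceph config set client.rgw rgw_data_sync_spawn_window 40
ceph config set client.rgw rgw_bucket_sync_spawn_window 40
ceph config set client.rgw rgw_meta_sync_spawn_window 40
# confirm the value the daemons will pick up
ceph config get client.rgw rgw_data_sync_spawn_window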

root@host-dr-01:~# radosgw-admin sync status
          realm 9d399b90-d6cc-3f21-6c52-9049c43f8b7f (realm)
      zonegroup 9d3ceb3d-0392-1dcd-437f-f9abg60c5971 (dc-dr)
           zone b8214ce3-b59e-1d9d-bca2-94ebb05a1a34 (dr)
   current time 2024-04-08T05:04:45Z
zonegroup features enabled: resharding
                   disabled: compress-encrypted
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: <-----------zone-id-dc--------------> (dc)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 95 shards
                        behind shards: [0,1,2,3,4,5,7,8,10,11,12,13,14,15,19,20,21,22,24,25,26,27,28,29,30,32,33,34,35,36,37,40,41,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,59,61,62,63,64,65,66,67,68,69,70,72,73,74,79,80,84,86,88,90,91,93,94,95,96,97,98,100,102,103,105,106,108,109,114,115,117,118,119,120,121,122,123,124,125,126,127]
                        oldest incremental change not applied: 2024-03-27T10:17:53.246068+0000 [62]
                        128 shards are recovering
                        recovering shards: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]
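
Since every shard reports as recovering, it may also help to check sync progress for an individual bucket; a minimal sketch, using the bucket name visible in the haproxy logs above (substitute any bucket of interest):

# per-bucket sync status ("bucket1" is taken from the access logs earlier in this thread)
radosgw-admin bucket sync status --bucket=bucket1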

Thanks,
Mohammad Saif

Actions #5

Updated by Mohammad Saif 10 days ago

Hi Shilpa,

We are eagerly awaiting your guidance on how to resolve this.
I appreciate your attention to this matter.

Regards,
Mohammad Saif

Actions #6

Updated by Mohammad Saif 8 days ago

Hi All,

I just wanted to quickly follow up on my previous query about "Slow RGW multisite sync
due to '304 Not Modified' responses on primary zone". I'd like to highlight
that I'm still facing the issue and urgently need your guidance to resolve it.

I appreciate your attention to this matter.

Thanks,
Mohammad Saif

Actions #7

Updated by Mohammad Saif 2 days ago

Hi All,

We are eagerly awaiting a resolution of this issue.
Any guidance or insight would be greatly appreciated.

Regards,
Mohammad Saif
