Bug #61620


High multisite replication latencies on Ceph Object Store with two or more gateways running with rgw_run_sync_thread = true

Added by Lucas Henry 12 months ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Summary

We have encountered a performance issue with the Multisite replication feature of Ceph Object Storage while using Rook clusters. Scaling the number of rados gateways to 2 or more significantly increases the replication latency, causing delays of 40 seconds or more for some files.

Context

We have been working with Rook clusters and investigating the performance of Ceph Object Storage Multisite replication. Initially, replication between two zones within a Ceph Realm showed excellent results, with 99% of files successfully replicated and latency consistently under 400 ms.

However, when we scaled the number of rados gateways to 2 or more using Rook, we observed a substantial degradation in performance: the p99 replication latency skyrocketed to 40 seconds or more, indicating a severe issue. The cause of the latency degradation seems to be the locking of the Ceph Object Storage bucket index logs.
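For anyone reproducing this, the overall sync lag and the state of the bucket index logs can be inspected from the secondary zone with the standard radosgw-admin commands (the bucket name below is a placeholder):

    # Overall metadata/data sync progress of this zone
    radosgw-admin sync status

    # Which bucket index log shards are behind for a given bucket
    radosgw-admin bucket sync status --bucket=my-bucket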

We were able to replicate this behavior on multiple Rook clusters.

Our clusters run Ceph 17.2.5; we initially noticed the performance problem on 16.2.7.

Our solution

As explained in this Rook issue (https://github.com/rook/rook/issues/12272), our solution has been to deploy one gateway dedicated to running the synchronization thread, while the other gateways only serve client traffic (we disabled their synchronization thread by setting 'rgw_run_sync_thread' to false). When the synchronization needs are high, we can only scale the single synchronization gateway vertically.
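Concretely, a minimal sketch of that configuration with the Ceph CLI, assuming two client-facing rgw instances named client.rgw.client-a and client.rgw.client-b (placeholder names):

    # The dedicated sync gateway keeps the default (rgw_run_sync_thread = true).
    # The client-facing gateways get the sync thread disabled; they may need a
    # restart to pick up the change.
    ceph config set client.rgw.client-a rgw_run_sync_thread false
    ceph config set client.rgw.client-b rgw_run_sync_thread false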

With this approach and with our hardware, we were able to synchronize between 500 and 1000 objects per second between two datacenters.

Some maintainers of the Rook project suggested that this issue warrants further investigation. Is the increase in latency expected when deploying multiple gateways with replication threads active?


Files

09_55_12.png (449 KB) Lucas Henry, 06/08/2023 01:18 PM
#1

Updated by Casey Bodley 12 months ago

Lucas Henry wrote:

The cause of the latency degradation seems to be the locking of the Ceph Object Storage bucket index logs.

can you please explain what led you to this hypothesis?

#2

Updated by Lucas Henry 12 months ago

Unfortunately, we have not verified this hypothesis. It is simply the cause that our team considers most likely, given our understanding of the replication mechanism and the fact that the problem appears as soon as there are 2 or more gateways.

It's possible that it's not the primary cause at all. We don't have enough insight into the internals to identify other possibilities.

Please note, however, that while our end-to-end test (creating a file on one zone and waiting for the metadata of the file to be accessible on the other zone) showed that the latency skyrocketed, the metric 'ceph_data_sync_from_zone_poll_latency_sum' increased only a little (from 20 ms to less than two seconds).
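As a sketch, such an end-to-end test can be scripted with the AWS CLI along these lines (endpoints, bucket, and payload are placeholders, not our exact tooling):

    # Upload to the primary zone, then poll the secondary zone until the
    # object's metadata becomes visible; the elapsed time approximates the
    # end-to-end replication latency after the upload completes.
    key="repl-test-$(date +%s)"
    aws --endpoint-url "$PRIMARY_ENDPOINT" s3api put-object \
        --bucket test-bucket --key "$key" --body ./payload
    start=$(date +%s%3N)
    until aws --endpoint-url "$SECONDARY_ENDPOINT" s3api head-object \
          --bucket test-bucket --key "$key" >/dev/null 2>&1; do
      sleep 0.1
    done
    echo "replicated in $(( $(date +%s%3N) - start )) ms"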

#3

Updated by Lucas Henry 10 months ago

We have some news about this issue. Following the recommendations of Rook's maintainers, we deployed two CephObjectStores: one dedicated to replication with a single replica, and one that serves user requests with many replicas. Previously, we deployed only one CephObjectStore (the one dedicated to replication). We discovered that Rook creates a zone endpoint for each CephObjectStore of the zone. With the two CephObjectStores, the performance of the synchronization dropped suddenly.

While we didn't notice the performance drop at first, we discovered that we could restore performance only by removing the CephObjectStore that serves user requests from the zone's endpoint list.
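For illustration, restricting the zone's endpoint list to the sync-dedicated gateway can be done at the radosgw-admin level roughly as follows (zone and host names are placeholders; Rook normally manages this list itself):

    # Point the zone's endpoint list at the sync-dedicated gateway only,
    # then commit the period so the change takes effect across the realm.
    radosgw-admin zone modify --rgw-zone=my-zone \
        --endpoints=http://sync-rgw.example.com:8080
    radosgw-admin period update --commit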

I hope this new information will help debug the cause of the performance issue when using multiple rgws for replication.

We tested:

1. One rgw with rgw_run_sync_thread = true and one zone endpoint pointing to this rgw -> performance OK
2. One rgw with rgw_run_sync_thread = true and one zone endpoint pointing to another rgw that does not run the sync thread -> performance NOK / no sync
3. One rgw with rgw_run_sync_thread = true and two zone endpoints, pointing to this rgw and to another rgw that does not run the sync thread -> performance NOK / no sync
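For anyone re-running these cases, the zone's current endpoint list can be verified with radosgw-admin (the zone name is a placeholder):

    # Shows the zone configuration, including its "endpoints" list
    radosgw-admin zone get --rgw-zone=my-zone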
