Bug #59732
improve rbd-mirror slow downs when latency is added
Description
When latency is introduced between peer sites, rbd-mirror slows down with inter-site interactions.
Let's consider copy_image in rbd-mirror to provide an example. With a continuous light workload running on primary side on an image (defined in comment below), the time to sync a mirror snap differs when latency is added. The time from when copy_image starts to handle_copy_image nearly doubles when added latency goes from 0ms to 100ms.
Timestamps for baseline latency and added latency:
0ms latency:
2023-05-11T17:37:01.527+0000 7f6657e7a6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 copy_image: remote_snap_id_start=238, remote_snap_id_end=240, local_snap_id_start=213, last_copied_object_number=0, snap_seqs={240=18446744073709551614}
2023-05-11T17:37:07.399+0000 7f6650e6c6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 handle_copy_image: r=0
100ms latency:
2023-05-11T17:39:02.389+0000 7f6657e7a6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 copy_image: remote_snap_id_start=242, remote_snap_id_end=244, local_snap_id_start=217, last_copied_object_number=0, snap_seqs={244=18446744073709551614}
2023-05-11T17:39:14.680+0000 7f6650e6c6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 handle_copy_image: r=0
Investigate further what is happening, what bottlenecks exist, and determine how to improve overall performance.
Files
Updated by Christopher Hoffman 12 months ago
- File m-namespace.sh m-namespace.sh added
- File netns.sh netns.sh added
Steps to reproduce:
1. Configured East/West sites with Open vSwitch (Example 2 in https://www.redhat.com/sysadmin/net-namespaces). The key part was using different network namespaces.
Step 1 can be done by running the provided netns.sh.
2. Started mstart on site-a on East and a separate mstart on site-b on west.
3. On east, created a single image with mirror based snapshotting
4. Setup 1m interval for image
Steps 2-4 can be done using "m-namespace.sh start".
5. On east, mapped rbd image to block device, formatted with xfs and mounted.
./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0    # block device that was mapped above
mount /dev/rbd0 /mnt/latency1
6. On east, ran continuous workload using fio for random read/writes R:40 IOPS, W:10 IOPS, 4K block size
cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio
[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1
7. Injected latency on the east (primary) network namespace using:
tc qdisc add dev east root netem delay 100ms
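For reference, the east/west separation can also be sketched without OVS, using plain network namespaces and a veth pair. The names and addresses below are illustrative assumptions, not the contents of the attached netns.sh (requires root):

```shell
# Two namespaces joined by a veth pair (illustrative names/addresses).
ip netns add east
ip netns add west
ip link add veth-east type veth peer name veth-west
ip link set veth-east netns east
ip link set veth-west netns west
ip -n east addr add 192.168.50.1/24 dev veth-east
ip -n west addr add 192.168.50.2/24 dev veth-west
ip -n east link set veth-east up
ip -n west link set veth-west up

# Step 7 equivalent: inject 100ms of delay on the east side.
ip netns exec east tc qdisc add dev veth-east root netem delay 100ms

# Verify: ping from east to west should report ~100ms RTT.
ip netns exec east ping -c 3 192.168.50.2
```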
Updated by Ilya Dryomov 12 months ago
Christopher Hoffman wrote:
When latency is introduced between peer sites, rbd-mirror slows down with inter-site interactions.
Let's consider copy_image in rbd-mirror to provide an example. With a continuous light workload running on primary side on an image (defined in comment below), the time to sync a mirror snap differs when latency is added. The time from when copy_image starts to handle_copy_image nearly doubles when added latency goes from 0ms to 100ms.
But this is exactly the expectation, right? Replayer::copy_image() is where the sync actually happens so it makes sense that it takes more time with inter-site latency injected.
Is the issue that it slows down too much? Can you frame this more precisely?
Updated by Christopher Hoffman 12 months ago
- Subject changed from "rbd-mirror slows when latency is added" to "improve rbd-mirror slow downs when latency is added"