Bug #59732

open

improve rbd-mirror slow downs when latency is added

Added by Christopher Hoffman 12 months ago. Updated 12 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When latency is introduced between peer sites, rbd-mirror operations that involve inter-site interactions slow down.

Take copy_image in rbd-mirror as an example. With a continuous light workload running against an image on the primary side (defined in the comment below), the time to sync a mirror snapshot grows when latency is added. The time from the start of copy_image to handle_copy_image roughly doubles (from about 5.9 s to about 12.3 s) when the added latency goes from 0ms to 100ms.

Timestamps for baseline latency and added latency:

0ms latency:
2023-05-11T17:37:01.527+0000 7f6657e7a6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 copy_image: remote_snap_id_start=238, remote_snap_id_end=240, local_snap_id_start=213, last_copied_object_number=0, snap_seqs={240=18446744073709551614}
2023-05-11T17:37:07.399+0000 7f6650e6c6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 handle_copy_image: r=0
100ms latency:
2023-05-11T17:39:02.389+0000 7f6657e7a6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 copy_image: remote_snap_id_start=242, remote_snap_id_end=244, local_snap_id_start=217, last_copied_object_number=0, snap_seqs={244=18446744073709551614}
2023-05-11T17:39:14.680+0000 7f6650e6c6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 handle_copy_image: r=0
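
As a quick cross-check, the elapsed time of each run can be computed from the timestamps above. This is only a minimal sketch: the names `runs` and `elapsed_seconds` are illustrative and not part of rbd-mirror, and the timestamps are copied by hand from the four log lines.

```python
from datetime import datetime

# Timestamp layout used by the log lines above, e.g. 2023-05-11T17:37:01.527+0000
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

# (copy_image, handle_copy_image) timestamps copied from the log lines above
runs = {
    "0ms": ("2023-05-11T17:37:01.527+0000", "2023-05-11T17:37:07.399+0000"),
    "100ms": ("2023-05-11T17:39:02.389+0000", "2023-05-11T17:39:14.680+0000"),
}


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


durations = {label: elapsed_seconds(s, e) for label, (s, e) in runs.items()}
for label, secs in durations.items():
    print(f"{label}: {secs:.3f} s")
print(f"ratio: {durations['100ms'] / durations['0ms']:.2f}x")
```

This gives about 5.872 s for the 0ms run and 12.291 s for the 100ms run, a ratio of roughly 2.09.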

Investigate further what is happening, what bottlenecks exist, and determine how to improve overall performance.
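
One way to start framing the slowdown is a back-of-the-envelope estimate of how many serialized inter-site round trips would account for the extra time. The sketch below is an assumption-laden estimate, not a measurement: it assumes the netem delay adds about 100 ms to every inter-site round trip and that all of the extra time comes from round trips that are not overlapped. The variable names are illustrative only.

```python
# Durations derived from the copy_image / handle_copy_image timestamps in this report
baseline_s = 5.872   # 0ms added latency
delayed_s = 12.291   # 100ms added latency

# Assumption: the netem delay adds roughly 100 ms to each inter-site round trip
added_rtt_s = 0.100

# Extra wall-clock time attributable to the added latency
extra_s = delayed_s - baseline_s

# Assumption: all extra time comes from round trips executed one after another
serialized_round_trips = extra_s / added_rtt_s

print(f"extra time: {extra_s:.3f} s")
print(f"equivalent serialized round trips: {serialized_round_trips:.0f}")
```

Under these assumptions the extra 6.4 s corresponds to roughly 64 round trips that were not overlapped with other work. If the real number of inter-site requests is much higher, most of them are already being pipelined. If it is close to this figure, serialization is a likely bottleneck.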


Files

m-namespace.sh (2.53 KB) m-namespace.sh Christopher Hoffman, 05/11/2023 05:55 PM
netns.sh (1.53 KB) netns.sh Christopher Hoffman, 05/11/2023 05:56 PM

Updated by Christopher Hoffman 12 months ago

Steps to reproduce:
1. Configured East/West sites with Open vSwitch, following Example 2 in https://www.redhat.com/sysadmin/net-namespaces. The key part was placing each site in a different network namespace.

Step 1 can be done by running the provided netns.sh.

2. Started mstart for site-a on East and a separate mstart for site-b on West.
3. On East, created a single image with snapshot-based mirroring enabled.
4. Set up a 1m snapshot interval for the image.

Steps 2-4 can be done by running "m-namespace.sh start".

5. On East, mapped the RBD image to a block device, formatted it with XFS, and mounted it.

./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0  # block device that was mapped above
mount /dev/rbd0 /mnt/latency1

6. On East, ran a continuous fio workload of random reads/writes (R: 40 IOPS, W: 10 IOPS, 4K block size):

# Write the fio job file onto the mounted RBD filesystem
cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio

[workload]
rw=randrw
# rate_iops is read,write: 40 read IOPS and 10 write IOPS
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1

7. Injected latency on the East (primary) network namespace using:

tc qdisc add dev east root netem delay 100ms

Actions #2

Updated by Ilya Dryomov 12 months ago

Christopher Hoffman wrote:

When latency is introduced between peer sites, rbd-mirror slows down with inter-site interactions.

Let's consider copy_image in rbd-mirror to provide an example. With a continuous light workload running on primary side on an image (defined in comment below), the time to sync a mirror snap differs when latency is added. The time from when copy_image starts to handle_copy_image nearly doubles when added latency goes from 0ms to 100ms.

But this is exactly the expectation, right? Replayer::copy_image() is where the sync actually happens so it makes sense that it takes more time with inter-site latency injected.

Is the issue that it slows down too much? Can you frame this more precisely?

Actions #3

Updated by Christopher Hoffman 12 months ago

  • Subject changed from rbd-mirror slows when latency is added to improve rbd-mirror slow downs when latency is added
Actions #4

Updated by Christopher Hoffman 12 months ago

  • Description updated (diff)