Bug #59732
improve rbd-mirror slow downs when latency is added
Description
When latency is introduced between peer sites, rbd-mirror slows down with inter-site interactions.
Let's consider copy_image in rbd-mirror to provide an example. With a continuous light workload running on primary side on an image (defined in comment below), the time to sync a mirror snap differs when latency is added. The time from when copy_image starts to handle_copy_image nearly doubles when added latency goes from 0ms to 100ms.
Timestamps for baseline latency and added latency:
0ms latency:
2023-05-11T17:37:01.527+0000 7f6657e7a6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 copy_image: remote_snap_id_start=238, remote_snap_id_end=240, local_snap_id_start=213, last_copied_object_number=0, snap_seqs={240=18446744073709551614}
2023-05-11T17:37:07.399+0000 7f6650e6c6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 handle_copy_image: r=0
100ms latency:
2023-05-11T17:39:02.389+0000 7f6657e7a6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 copy_image: remote_snap_id_start=242, remote_snap_id_end=244, local_snap_id_start=217, last_copied_object_number=0, snap_seqs={244=18446744073709551614}
2023-05-11T17:39:14.680+0000 7f6650e6c6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55f019c49f80 handle_copy_image: r=0
Investigate further what is happening, what bottlenecks exist, and determine how to improve overall performance.
Files
Updated by Christopher Hoffman 12 months ago
- File m-namespace.sh m-namespace.sh added
- File netns.sh netns.sh added
Steps to reproduce:
1. Configured East/West sites with Open vSwitch (Example 2 in https://www.redhat.com/sysadmin/net-namespaces). The key part was using different network namespaces.
Step 1 can be done by running the provided netns.sh.
2. Started mstart on site-a on East and a separate mstart on site-b on west.
3. On east, created a single image with mirror based snapshotting
4. Setup 1m interval for image
Steps 2-4 can be done using "m-namespace.sh start".
5. On east, mapped rbd image to block device, formatted with xfs and mounted.
./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0    # block device that was mapped above
mount /dev/rbd0 /mnt/latency1
6. On east, ran continuous workload using fio for random read/writes R:40 IOPS, W:10 IOPS, 4K block size
cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio
[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1
7. Injected latency on the east (primary) network namespace using:
tc qdisc add dev east root netem delay 100ms
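For reference, the east/west separation can also be sketched without OVS, using plain network namespaces and a veth pair. The names and addresses below are illustrative assumptions, not the contents of the attached netns.sh (requires root):

```shell
# Two namespaces joined by a veth pair (illustrative names/addresses).
ip netns add east
ip netns add west
ip link add veth-east type veth peer name veth-west
ip link set veth-east netns east
ip link set veth-west netns west
ip -n east addr add 192.168.50.1/24 dev veth-east
ip -n west addr add 192.168.50.2/24 dev veth-west
ip -n east link set veth-east up
ip -n west link set veth-west up

# Step 7 equivalent: inject 100ms of delay on the east side.
ip netns exec east tc qdisc add dev veth-east root netem delay 100ms

# Verify: ping from east to west should report ~100ms RTT.
ip netns exec east ping -c 3 192.168.50.2
```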
Updated by Ilya Dryomov 12 months ago
Christopher Hoffman wrote:
When latency is introduced between peer sites, rbd-mirror slows down with inter-site interactions.
Let's consider copy_image in rbd-mirror to provide an example. With a continuous light workload running on primary side on an image (defined in comment below), the time to sync a mirror snap differs when latency is added. The time from when copy_image starts to handle_copy_image nearly doubles when added latency goes from 0ms to 100ms.
But this is exactly the expectation, right? Replayer::copy_image() is where the sync actually happens so it makes sense that it takes more time with inter-site latency injected.
Is the issue that it slows down too much? Can you frame this more precisely?
Updated by Christopher Hoffman 12 months ago
- Subject changed from "rbd-mirror slows when latency is added" to "improve rbd-mirror slow downs when latency is added"