Bug #36652: [rbd-mirror] replay performance issue

Added by Sameh Ghane over 5 years ago. Updated over 5 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

[New to the Ceph community, please pardon any faux pas.]

I have 2 Ceph clusters separated by a 40 ms RTT.

Two rbd-mirror instances are running, each close to one of the clusters.

rados bench run from the rbd-mirror instances shows 300 MB/s in the worst-case scenario (remote writes).
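
For reference, the benchmark invocation was along these lines (the pool name and concurrency shown here are illustrative, not the exact values used):

[root@systasks001 ~]# rados -p rbd bench 30 write -t 16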

The remote cluster (from the rbd-mirror's perspective) is running 12.2.7.
The local cluster is running 12.2.4.
rbd-mirror is running 12.2.7.

When I map and mount a 10 GB rbd(-nbd) image and run this command to fill it:
[root@systasks001 mnt]# dd if=/dev/zero of=TEST bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 32.512 s, 161 MB/s
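
For completeness, the image was mapped and mounted roughly as follows (pool name, image name, and filesystem choice shown here are illustrative):

[root@systasks001 ~]# rbd create mypool/test --size 10G --image-feature exclusive-lock,journaling
[root@systasks001 ~]# rbd-nbd map mypool/test
/dev/nbd0
[root@systasks001 ~]# mkfs.xfs /dev/nbd0
[root@systasks001 ~]# mount /dev/nbd0 /mnt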

This translates to 300k+ entries to replay:

replaying, master_position=[object_number=3077, tag_tid=11, entry_tid=648217], mirror_position=[object_number=2778, tag_tid=11, entry_tid=305374], entries_behind_master=342843
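
(For reference, this status line is the description field reported by rbd mirror image status <pool>/<image>; the pool and image names are omitted here.)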

rbd-mirror reads from the remote cluster and writes to the local cluster (the problem is the same when it reads locally and replays remotely).

It replays entries at a rate of roughly 400 per second, which is about 25 times slower than the initial dd write: at ~400 entries/s, draining the ~342k-entry backlog takes roughly 850 seconds, versus ~33 seconds for the dd itself.

Attached is a tcpdump pcap in which you can observe a chunk of the data being replayed by rbd-mirror in two alternating phases:
  • reads from the remote OSDs (during which no data is sent to the local OSDs);
  • writes to the local OSDs (during which no data is received from the remote OSDs).

I marked this issue as major because, beyond a certain sustained write rate, the journal backlog grows faster than it can be replayed and mirroring effectively breaks.

Cheers,


Files

ceph.rbd-mirror.pcap (52 KB) Sameh Ghane, 10/30/2018 08:41 PM
#1

Updated by Mykola Golub over 5 years ago

It should be much faster in your case if the image journal is created with a large (1 MiB) rbd_journal_max_payload_bytes, and if rbd_mirror_journal_max_fetch_bytes is set to a large value (1 MiB or more) on the rbd-mirror side. A large rbd_journal_max_payload_bytes also improves performance when writing to the journal with large request sizes.

Could you try this?
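
A minimal sketch of what this could look like in ceph.conf (placing both options in a [client] section, and which host each belongs on, are assumptions on my part):

# ceph.conf on the client that creates and writes the image journal
[client]
rbd_journal_max_payload_bytes = 1048576        # 1 MiB per journal entry payload

# ceph.conf on the host running rbd-mirror
[client]
rbd_mirror_journal_max_fetch_bytes = 1048576   # fetch up to 1 MiB per journal read

Since rbd_journal_max_payload_bytes applies when the image journal is created, an existing journal would presumably need to be recreated, e.g. by disabling and re-enabling the journaling feature on the image.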

The rationale for such low defaults is to limit rbd-mirror memory usage when mirroring a pool with many images, and that a typical rbd workload consists of small requests, for which these parameters are not very useful.

#2

Updated by Mykola Golub over 5 years ago

  • Status changed from New to Need More Info
