Bug #58058: CephFS Snapshot Mirroring slow due to repeating attribute sync - CephFS - Ceph

Actions

Copy link

Bug #58058

open

CephFS Snapshot Mirroring slow due to repeating attribute sync

Added by Mathias Kuhring over 1 year ago. Updated 7 months ago.

Status:

Triaged

Priority:

Normal

Assignee:

Venky Shankar

Category:

Administration/Usability

Target version:

% Done:

Source:

Tags:

Backport:

pacific,quincy

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

Ceph - v17.2.1

ceph-qa-suite:

Component(FS):

cephfs-mirror

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

We might have found a major performance bug in the cephfs snapshot mirroring.
We already reported it to the mailing list but with now answers, yet.
So we will mainly reuse the details from here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/R7NRB5PNGHL4YJ7EZ5TYBLLJFTU7I2MK/

We are using ceph version 17.2.1
(ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).
We are using cephadm since version 15 octopus.

We mirror several CephFS directories from our main cluster our to a second mirror cluster.
In particular with bigger directories (over 900 TB and 186 M of files), we noticed that mirroring is very slow.
On the mirror, most of the time we only observe a write speed of 0 to 10 MB/s in the client IO.
The target peer directory often doesn't show increase in size during synchronization
(when we check with: getfattr -n ceph.dir.rbytes).

The status of the syncs is always fine, i.e. syncing and not failing, e.g.:

0|0[root@osd-1 /var/run/ceph/55633ec3-6c0c-4a02-990c-0f87e0f7a01f]# ceph --admin-daemon ceph-client.cephfs-mirror.osd-1.ydsqsw.7.94552861013544.asok fs mirror peer status cephfs@1 c66afb80-593f-4c42-a120-dd3b6fca26bc
{
     "/irods/sodar": {
         "state": "syncing",
         "current_sycning_snap": {
             "id": 7552,
             "name": "scheduled-2022-08-22-13_00_00" 
         },
         "last_synced_snap": {
             "id": 7548,
             "name": "scheduled-2022-08-22-12_00_00",
             "sync_duration": 37828.164744490001,
             "sync_time_stamp": "13240678.542916s" 
         },
         "snaps_synced": 1,
         "snaps_deleted": 11,
         "snaps_renamed": 0
     }
}

The cluster nodes (6 per main and mirror cluster, each) are connected with Dual 40G NICs to the switches.
Connection between switches are 2x 100G.
Simple write operations from other clients to the mirror cephfs result in writes of e.g. 300 to 400 MB/s.
So network doesn't seem to be the issue, here.

We started to dig into debug logs of the cephfs-mirror daemon / docker container.
We set the debug level to 20. Otherwise there are no messages at all (so no errors).

We observed a lot of messages with need_data_sync=0, need_attr_sync=1.
Leading us to the assumption, that instead of actual data a lot of attributes are synced.

We started looking at specific examples in the logs and tried to make sense from the source code which steps are happening.
Most of the messages are coming from cephfs::mirror::PeerReplayer
https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc

We figured, the do_synchronize function checks if data (need_data_sync) or attributes (need_attr_sync) should be synchronized using should_sync_entry.
And if necessary performs the sync using remote_file_op.
should_sync_entry reports different ctimes for our examples, e.g.:

local cur statx: mode=33152, uid=996, gid=993, size=154701172, ctime=2022-01-28T12:54:21.176004+0000, ...
local prev statx: mode=33152, uid=996, gid=993, size=154701172, ctime=2022-08-22T11:03:18.578380+0000, ...

Based on these different ctimes, should_sync_entry decides then that attributes need to be synced:
*need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)
https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc#L911

We assume cur statx/cstx refers to the file in the snapshot currently mirrored.
But what exactly is prev statx/pstx?
Is it the peer path or the last snapshot on the mirror peer?

We can confirm that ctimes are different on the main cluster and the mirror.
On the main cluster, the ctimes are consistent in every snapshot (since the files didn't change).
On the the mirror, the ctimes increase with every snapshot towards more current dates.

Given that the CephFS Mirror daemon writes the data to the mirror as a CephFS client,
it seems to make sense that data on the mirror has different / more recent ctimes (from writing).
Also, when the mirror daemon is syncing the attributes to the mirror, wouldn't this trigger an new/current ctime as well?

So our assumption is, syncing an old ctime will actually result in a new ctime.
And thus trigger the sync of attributes over and over (at least with every snapshot synced).
So is ctime the proper parameter to test if attributes need to be synced? Or shouldn't it rather be excluded?
So is this check the right thing to do: *need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)

Is it reasonable to assume that these attribute syncs are responsible for our slow mirroring?
Or is there anything else we should look out for?

And are there actually commands or logs showing us the speed of the mirroring?
We only now about sync_duration and sync_time_stamp (as in the status above).
But then, how can we actually determine the size of a snapshot or the difference between snapshots?
So one can make speed calculations for the latest sync.

What is the general experience with mirroring performance?
In particular with bigger cephfs directories towards peta bytes.

Also, minor typo in the admin daemon output: "current_sycning_snap"

Actions

Copy link

Updated by Venky Shankar over 1 year ago

Category set to Administration/Usability
Status changed from New to Triaged
Assignee set to Venky Shankar
Target version set to v18.0.0
Backport set to pacific,quincy

Actions

Copy link

Updated by Venky Shankar over 1 year ago

Mathias Kuhring wrote:

We might have found a major performance bug in the cephfs snapshot mirroring.
We already reported it to the mailing list but with now answers, yet.
So we will mainly reuse the details from here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/R7NRB5PNGHL4YJ7EZ5TYBLLJFTU7I2MK/

We are using ceph version 17.2.1
(ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).
We are using cephadm since version 15 octopus.

We mirror several CephFS directories from our main cluster our to a second mirror cluster.
In particular with bigger directories (over 900 TB and 186 M of files), we noticed that mirroring is very slow.
On the mirror, most of the time we only observe a write speed of 0 to 10 MB/s in the client IO.
The target peer directory often doesn't show increase in size during synchronization
(when we check with: getfattr -n ceph.dir.rbytes).

The status of the syncs is always fine, i.e. syncing and not failing, e.g.:

[...]

The cluster nodes (6 per main and mirror cluster, each) are connected with Dual 40G NICs to the switches.
Connection between switches are 2x 100G.
Simple write operations from other clients to the mirror cephfs result in writes of e.g. 300 to 400 MB/s.
So network doesn't seem to be the issue, here.

We started to dig into debug logs of the cephfs-mirror daemon / docker container.
We set the debug level to 20. Otherwise there are no messages at all (so no errors).

We observed a lot of messages with need_data_sync=0, need_attr_sync=1.
Leading us to the assumption, that instead of actual data a lot of attributes are synced.

We started looking at specific examples in the logs and tried to make sense from the source code which steps are happening.
Most of the messages are coming from cephfs::mirror::PeerReplayer
https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc

We figured, the do_synchronize function checks if data (need_data_sync) or attributes (need_attr_sync) should be synchronized using should_sync_entry.
And if necessary performs the sync using remote_file_op.
should_sync_entry reports different ctimes for our examples, e.g.:
[...]

Based on these different ctimes, should_sync_entry decides then that attributes need to be synced:
*need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)
https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc#L911

We assume cur statx/cstx refers to the file in the snapshot currently mirrored.
But what exactly is prev statx/pstx?

So, the mirror daemon can optimize the directory comparison in the following case: Say you have 3 snapshots in the primary site - snap_a, snap_b and snap_c. Assume that the secondary site is empty. Synchronizing snap_a requires a bulk transfer. Now when snap_b is to be synchronized, the mirror daemon would try to infer the delta between the two snapshots by walking the directories of snapshots (one in primary and the other in secondary). The optimization here is to use the "local" (primary) snapshot (snap_a) for attribute comparison rather then the secondary copy (which might be over a WAN, slow link). Therefore, snap_b is compared against "local" snap_a, snap_c against "local" snap_b.

Is it the peer path or the last snapshot on the mirror peer?

Last transferred snapshot, but the one on the primary site. Note that the mirror daemon ensures that it can rely this optimization for correctness, else it may lead to incorrect data getting synchronized.

We can confirm that ctimes are different on the main cluster and the mirror.

They will be different since the mirror daemon transfers file to the secondary cluster and then takes a snapshot.

On the main cluster, the ctimes are consistent in every snapshot (since the files didn't change).

If they are same for a file in two snapshots, the mirror daemon should not sync the attributes.

On the the mirror, the ctimes increase with every snapshot towards more current dates.

Given that the CephFS Mirror daemon writes the data to the mirror as a CephFS client,
it seems to make sense that data on the mirror has different / more recent ctimes (from writing).
Also, when the mirror daemon is syncing the attributes to the mirror, wouldn't this trigger an new/current ctime as well?

Right. The mirror daemon synchronizes {a,m}time, however, ctime moves forward.

So our assumption is, syncing an old ctime will actually result in a new ctime.
And thus trigger the sync of attributes over and over (at least with every snapshot synced).

The mirror daemon should be comparing ctime for a file between two snapshots in the primary cluster. If you aren't seeing that, then its probably a bug that's causing unnecessary attr sync.

So is ctime the proper parameter to test if attributes need to be synced? Or shouldn't it rather be excluded?
So is this check the right thing to do: *need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)

Is it reasonable to assume that these attribute syncs are responsible for our slow mirroring?
Or is there anything else we should look out for?

Could you share (upload) the debug mirror logs. I will have a look

And are there actually commands or logs showing us the speed of the mirroring?
We only now about sync_duration and sync_time_stamp (as in the status above).
But then, how can we actually determine the size of a snapshot or the difference between snapshots?
So one can make speed calculations for the latest sync.

Those aren't available anywhere at the moment. There were some considerations to use rbytes of snapshots (from primary), however, recursive stats (ones starting with "r", rbytes, rsize, etc..) are currently a bit buggy for snapshots. So, we decided not to rely on those. Unfortunately, we do not have a good way to provide this bit of information.

What is the general experience with mirroring performance?
In particular with bigger cephfs directories towards peta bytes.

The optimizations are a bit limited right now, especially when it comes to walking the directory tree and synchronizing file data. We have changes (under review) that adds functionality that can be used to speed up mirror daemon, but those will take a bit of time to be baked in.