Feature #63945: cephfs_mirror: add perf counters (w/ label) support - CephFS - Ceph

Feature #63945

open

cephfs_mirror: add perf counters (w/ label) support

Added by Jos Collin 4 months ago. Updated about 2 months ago.

Status:

Pending Backport

Priority:

Normal

Assignee:

Venky Shankar

Category:

Administration/Usability

Target version:

Ceph - v19.0.0

% Done:

Source:

Community (dev)

Tags:

backport_processed

Backport:

reef,squid

Reviewed:

Affected Versions:

Component(FS):

cephfs-mirror

Labels (FS):

Pull request ID:

55471

Description

https://jsw.ibm.com/browse/ISCE-49:

Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. These metrics would report the progress of cephfs_mirror syncing and thus provide the monitoring capability.

Metrics should enable monitoring logic to generate the following alerts:

Secondary cluster disconnected
Replication started/ended
Resync started/ended
Promotion/Demotion event (= failover or failback initiated)
Snapshot transfer failed or interrupted.
Failed to complete the snapshot transfer before the next scheduled transfer.
Replication status Monitoring: Alerts on policy non-compliance.

This would provide information such as replication status, replication schedules, and resync status for volumes belonging to a storage class, namespace, or label, via the OCP/ODF dashboards.
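
As a rough illustration of how monitoring logic could turn such metrics into two of the alerts listed above (a failed transfer, and a transfer that did not finish before the next scheduled one), here is a minimal Python sketch. The `DirSyncSample` fields, the `schedule_interval_s` parameter, and the alert strings are hypothetical names chosen for illustration; they are not part of cephfs-mirror or of any dashboard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DirSyncSample:
    """One scrape of hypothetical per-directory mirror counters."""
    failed_syncs: int            # cumulative number of failed snapshot syncs
    last_sync_duration_s: float  # time taken to sync the last snapshot


def evaluate_alerts(prev: DirSyncSample, cur: DirSyncSample,
                    schedule_interval_s: float) -> List[str]:
    """Compare two consecutive scrapes and return the alerts that fire."""
    alerts = []
    # The failure counter only grows, so any increase means a new failure.
    if cur.failed_syncs > prev.failed_syncs:
        alerts.append("snapshot transfer failed or interrupted")
    # A sync that takes longer than the schedule interval cannot have
    # finished before the next scheduled transfer was due.
    if cur.last_sync_duration_s > schedule_interval_s:
        alerts.append("snapshot transfer overran the next scheduled transfer")
    return alerts
```

The comparison works on cumulative counters, so the alerting side needs no state beyond the previous scrape.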

Related issues 3 (3 open — 0 closed)

Updated by Jos Collin 4 months ago

Description updated

Updated by Venky Shankar 4 months ago

Assignee set to Jos Collin
Target version set to v19.0.0

Updated by Jos Collin 4 months ago

Status changed from New to In Progress

Following a discussion, Juan shared the perf counters/metrics design guidelines doc [1], which we should follow for the cephfs_mirror metrics implementation. Adding a PerfCountersBuilder, as is done in MDSRank::create_logger, would be a good start. According to Juan, metrics from multiple mirror daemons would be taken care of by the PerfCountersBuilder if we give the daemon name as a label.

[1] https://ibm-my.sharepoint.com/:w:/p/jolmomar/EX3Jw4YLEoFGpqmDWZzYwxsBdssx4u1R3SDx9KcO2oimOg?e=IWnbMW
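
To make the idea of using the daemon name as a label concrete, here is a toy Python model of labeled counters. It is not Ceph's PerfCountersBuilder API and makes no claim about how that class works internally; the `LabeledCounters` class, the counter name, and the daemon names are all hypothetical. It only shows why a label keeps the values of several mirror daemons apart while still allowing a cluster-wide total.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class LabeledCounters:
    """Toy registry: one integer value per (counter name, label set)."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, LabelSet], int] = {}

    @staticmethod
    def _key(name: str, labels: Dict[str, str]) -> Tuple[str, LabelSet]:
        # Sort the labels so that insertion order does not change the key.
        return name, tuple(sorted(labels.items()))

    def inc(self, name: str, labels: Dict[str, str], amount: int = 1) -> None:
        key = self._key(name, labels)
        self._values[key] = self._values.get(key, 0) + amount

    def get(self, name: str, labels: Dict[str, str]) -> int:
        return self._values.get(self._key(name, labels), 0)

    def total(self, name: str) -> int:
        # Sum the counter across every label set (here: every daemon).
        return sum(v for (n, _), v in self._values.items() if n == name)


counters = LabeledCounters()
counters.inc("snapshots_synced", {"daemon": "mirror.a"})
counters.inc("snapshots_synced", {"daemon": "mirror.b"}, 2)
```

With the daemon name as a label, `counters.get("snapshots_synced", {"daemon": "mirror.a"})` and the same call for `mirror.b` return separate values, and `counters.total("snapshots_synced")` aggregates them.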

Updated by Venky Shankar 4 months ago

Apart from the perf counters, there is also value in adding labeled perf counters. Let's keep that in mind when implementing this.

Also, as far as the metrics are concerned, a while back I was involved in a similar effort with Paul Cuzner, and there is a doc (somewhere, which I'll dig up) that details useful metrics for the mirror daemon.

Updated by Venky Shankar 4 months ago

Subject changed from cephfs_mirror: generate Geo-Replication metrics to cephfs_mirror: add perf counters (w/ label) support

Updated by Venky Shankar 4 months ago

Jos Collin wrote:

https://jsw.ibm.com/browse/ISCE-49:

Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. These metrics would report the progress of cephfs_mirror syncing and thus provide the monitoring capability.

Metrics should enable monitoring logic to generate the following alerts:

Secondary cluster disconnected

Replication started/ended

Resync started/ended

Promotion/Demotion event (= failover or failback initiated)

cephfs-mirror does not support promotion/demotion semantics, so this can be left out. This functionality is taken care of by the "upper" layer, and therefore any metrics related to promotion/demotion (failover/failback) should be provided by that layer.

Snapshot transfer failed or interrupted.

Failed to complete the snapshot transfer before the next scheduled transfer.

Replication status Monitoring: Alerts on policy non-compliance.

This would provide information such as replication status, replication schedules, and resync status for volumes belonging to a storage class, namespace, or label, via the OCP/ODF dashboards.

Updated by Jos Collin 4 months ago

The following are the metrics to be considered for cephfs_mirroring:

cephfs_mirror_snapshot_snapshots - Number of snapshots synced
cephfs_mirror_snapshot_sync_time - Average sync time
cephfs_mirror_snapshot_sync_bytes - Total bytes synced
// per-image only counters:
cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot

These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.

Updated by Venky Shankar 3 months ago

Jos Collin wrote:

The following are the metrics to be considered for cephfs_mirroring:

cephfs_mirror_snapshot_snapshots - Number of snapshots synced

cephfs_mirror_snapshot_sync_time - Average sync time

cephfs_mirror_snapshot_sync_bytes - Total bytes synced

Tracking failures is as important as (perhaps more important than!) tracking the success metrics. So, we should add:

- cephfs_mirror_failed_snapshot_syncs

// per-image only counters:

Are these per-directory, then, from the CephFS point of view?

cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot

cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot

cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot

cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot

These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.
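
For illustration only, the counters discussed above, including the suggested failure counter, could be tracked per mirrored directory along these lines. This is a Python sketch under assumptions, not the cephfs-mirror implementation: the `MirrorDirStats` class, its field names, and the `record_*` methods are hypothetical, and treating the counters as per-directory is itself still an open question in this thread.

```python
from dataclasses import dataclass


@dataclass
class MirrorDirStats:
    """Hypothetical per-directory counters for snapshot mirroring."""
    snapshots_synced: int = 0        # number of snapshots synced
    failed_snapshot_syncs: int = 0   # number of snapshot syncs that failed
    sync_bytes: int = 0              # total bytes synced
    total_sync_time_s: float = 0.0   # accumulated time, used for the average
    last_sync_time_s: float = 0.0    # time taken to sync the last snapshot
    last_sync_bytes: int = 0         # bytes synced for the last snapshot

    def record_success(self, duration_s: float, nbytes: int) -> None:
        self.snapshots_synced += 1
        self.sync_bytes += nbytes
        self.total_sync_time_s += duration_s
        self.last_sync_time_s = duration_s
        self.last_sync_bytes = nbytes

    def record_failure(self) -> None:
        # Failures are counted separately and do not affect the averages.
        self.failed_snapshot_syncs += 1

    @property
    def avg_sync_time_s(self) -> float:
        # Average over successful syncs; 0.0 before the first success.
        if self.snapshots_synced == 0:
            return 0.0
        return self.total_sync_time_s / self.snapshots_synced
```

Keeping the accumulated sync time rather than a stored average lets the average be derived at read time, which avoids drift from repeated incremental updates.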
