Feature #63945

open

cephfs_mirror: add perf counters (w/ label) support

Added by Jos Collin 4 months ago. Updated about 2 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: Administration/Usability
Target version:
% Done: 0%
Source: Community (dev)
Tags: backport_processed
Backport: reef,squid
Reviewed:
Affected Versions:
Component(FS): cephfs-mirror
Labels (FS):
Pull request ID:

Description

https://jsw.ibm.com/browse/ISCE-49:

Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. These metrics would expose the progress of cephfs_mirror syncing and thus provide the monitoring capability.

Metrics should enable monitoring logic to generate the following alerts:

  • Secondary cluster disconnected
  • Replication started/ended
  • Resync started/ended
  • Promotion/Demotion event (= failover or failback initiated)
  • Snapshot transfer failed or interrupted.
  • Failed to complete the snapshot transfer before the next scheduled transfer.
  • Replication status Monitoring: Alerts on policy non-compliance.

This would provide information like replication status, replication schedules, resync status - per volumes belonging to a storage class, namespace, label via OCP/ODF dashboards.
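
As a rough illustration of how monitoring logic might derive some of the alerts listed above from per-directory metrics, here is a minimal Python sketch. Every name in it (`MirrorStatus`, `peer_connected`, `last_sync_failed`, `last_sync_duration_s`, `schedule_interval_s`, and the alert names) is hypothetical and chosen for illustration only; none of them are counters or alerts exposed by cephfs-mirror.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MirrorStatus:
    """Hypothetical status sample for one mirrored directory."""
    peer_connected: bool          # is the secondary cluster reachable?
    last_sync_failed: bool        # did the last snapshot transfer fail?
    last_sync_duration_s: float   # time taken by the last sync, in seconds
    schedule_interval_s: float    # time between scheduled transfers, in seconds


def evaluate_alerts(status: MirrorStatus) -> List[str]:
    """Return the names of the alerts that should fire for one sample."""
    alerts = []
    if not status.peer_connected:
        alerts.append("SecondaryClusterDisconnected")
    if status.last_sync_failed:
        alerts.append("SnapshotTransferFailed")
    # A sync that takes longer than the schedule interval cannot finish
    # before the next scheduled transfer is due.
    if status.last_sync_duration_s > status.schedule_interval_s:
        alerts.append("SnapshotTransferOverran")
    return alerts
```

In practice, rules like these would more likely live in the dashboard's alerting layer than in the mirror daemon itself; the sketch only shows which inputs each alert needs.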


Related issues 3 (3 open, 0 closed)

Related to CephFS - Feature #64387: mds: add per-client perf counters (w/ label) support (Pending Backport, Venky Shankar)
Copied to CephFS - Backport #64485: reef: cephfs_mirror: add perf counters (w/ label) support (In Progress, Venky Shankar)
Copied to CephFS - Backport #64779: squid: cephfs_mirror: add perf counters (w/ label) support (In Progress, Venky Shankar)

Actions #1

Updated by Jos Collin 4 months ago

  • Description updated (diff)
Actions #2

Updated by Venky Shankar 4 months ago

  • Assignee set to Jos Collin
  • Target version set to v19.0.0
Actions #3

Updated by Jos Collin 4 months ago

  • Status changed from New to In Progress

Following a discussion, Juan shared the perf counters/metrics design guidelines doc [1], which we should follow for the cephfs_mirror metrics implementation. Adding a PerfCountersBuilder, as is done in MDSRank::create_logger, would be a good start. According to Juan, metrics from multiple mirror daemons would be taken care of by the PerfCountersBuilder if we give the daemon name as a label.

[1] https://ibm-my.sharepoint.com/:w:/p/jolmomar/EX3Jw4YLEoFGpqmDWZzYwxsBdssx4u1R3SDx9KcO2oimOg?e=IWnbMW
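
To illustrate the idea in the comment above that a daemon-name label keeps metrics from multiple mirror daemons apart, here is a minimal Python sketch. It is a toy model only and not Ceph's PerfCountersBuilder API; the class `LabeledCounters`, the metric name `snaps_synced`, and the daemon names are all hypothetical.

```python
from collections import defaultdict


class LabeledCounters:
    """Toy registry holding one counter value per (metric name, label set)."""

    def __init__(self):
        self._values = defaultdict(int)

    @staticmethod
    def _key(name, labels):
        # Sort label items so that label order does not create distinct keys.
        return (name, tuple(sorted(labels.items())))

    def inc(self, name, labels, amount=1):
        self._values[self._key(name, labels)] += amount

    def get(self, name, labels):
        # Unknown (name, labels) pairs read as 0 without being created.
        return self._values.get(self._key(name, labels), 0)


counters = LabeledCounters()
counters.inc("snaps_synced", {"daemon": "mirror.a"})
counters.inc("snaps_synced", {"daemon": "mirror.a"})
counters.inc("snaps_synced", {"daemon": "mirror.b"})
```

Because the label set is part of the key, both daemons can report the same metric name without overwriting each other's values.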

Actions #4

Updated by Venky Shankar 4 months ago

Apart from the perf counters, there is also value in adding labeled perf counters. Let's keep that in mind when implementing this.

Also, as far as the metrics are concerned, a while back I was involved in a similar effort with Paul Cuzner, and there is a doc (somewhere, which I'll dig up) that details useful metrics for the mirror daemon.

Actions #5

Updated by Venky Shankar 4 months ago

  • Subject changed from cephfs_mirror: generate Geo-Replication metrics to cephfs_mirror: add perf counters (w/ label) support
Actions #6

Updated by Venky Shankar 4 months ago

Jos Collin wrote:

https://jsw.ibm.com/browse/ISCE-49:

Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboard and elsewhere. This would generate the progress of cephfs_mirror syncing and thus provide the monitoring capability.

Metrics should enable monitoring logic to generate the following alerts:

  • Secondary cluster disconnected
  • Replication started/ended
  • Resync started/ended
  • Promotion/Demotion event (= failover or failback initiated)

cephfs-mirror does not support promotion/demotion semantics, so this can be left out. This functionality is taken care of by the "upper" layer, and therefore any metrics related to promotion/demotion (failover/failback) should be provided by that layer.

  • Snapshot transfer failed or interrupted.
  • Failed to complete the snapshot transfer before the next scheduled transfer.
  • Replication status Monitoring: Alerts on policy non-compliance.

This would provide information like replication status, replication schedules, resync status - per volumes belonging to a storage class, namespace, label via OCP/ODF dashboards.

Actions #7

Updated by Jos Collin 4 months ago

The following are the metrics to be considered for cephfs_mirroring:

  • cephfs_mirror_snapshot_snapshots - Number of snapshots synced
  • cephfs_mirror_snapshot_sync_time - Average sync time
  • cephfs_mirror_snapshot_sync_bytes - Total bytes synced
    // per-image only counters:
  • cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
  • cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
  • cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
  • cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot

These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.
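
The relationship between the counters listed above (totals, an average derived from them, and "last sync" values) can be sketched in Python as follows. This is a minimal illustration under assumed semantics, not the cephfs-mirror implementation; the class `SnapshotSyncStats` and its attribute names are hypothetical.

```python
class SnapshotSyncStats:
    """Hypothetical accumulator modelled on the counters proposed above."""

    def __init__(self):
        self.snapshots = 0           # number of snapshots synced
        self.sync_bytes = 0          # total bytes synced
        self.total_sync_time = 0.0   # sum of all sync durations, in seconds
        self.last_sync_time = 0.0    # time taken to sync the last snapshot
        self.last_sync_bytes = 0     # bytes synced for the last snapshot

    def record_sync(self, duration_s, nbytes):
        """Update totals and 'last sync' values after one successful sync."""
        self.snapshots += 1
        self.sync_bytes += nbytes
        self.total_sync_time += duration_s
        self.last_sync_time = duration_s
        self.last_sync_bytes = nbytes

    @property
    def avg_sync_time(self):
        """Average sync time, derived from the running totals."""
        if self.snapshots == 0:
            return 0.0
        return self.total_sync_time / self.snapshots


stats = SnapshotSyncStats()
stats.record_sync(10.0, 100)
stats.record_sync(20.0, 300)
```

Keeping a running sum and a count, rather than storing the average directly, lets a consumer recompute the average over any window from two monotonically increasing values.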

Actions #8

Updated by Venky Shankar 3 months ago

Jos Collin wrote:

The following are the metrics to be considered for cephfs_mirroring:

  • cephfs_mirror_snapshot_snapshots - Number of snapshots synced
  • cephfs_mirror_snapshot_sync_time - Average sync time
  • cephfs_mirror_snapshot_sync_bytes - Total bytes synced

Tracking failures is as important as (perhaps more important than!) tracking the success metrics. So, we should add:

  • cephfs_mirror_failed_snapshot_syncs

// per-image only counters:

Are these per-directory counters, then, from the CephFS point of view?

  • cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
  • cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
  • cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
  • cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot

These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.

Actions #9

Updated by Jos Collin 3 months ago

  • Pull request ID set to 55420
Actions #10

Updated by Venky Shankar 3 months ago

  • Pull request ID changed from 55420 to 55471
Actions #11

Updated by Venky Shankar 3 months ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to reef
Actions #12

Updated by Venky Shankar 3 months ago

  • Related to Feature #64387: mds: add per-client perf counters (w/ label) support added
Actions #13

Updated by Venky Shankar 2 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #14

Updated by Backport Bot 2 months ago

  • Copied to Backport #64485: reef: cephfs_mirror: add perf counters (w/ label) support added
Actions #15

Updated by Backport Bot 2 months ago

  • Tags set to backport_processed
Actions #16

Updated by Venky Shankar 2 months ago

  • Assignee changed from Jos Collin to Venky Shankar
Actions #17

Updated by Venky Shankar about 2 months ago

  • Tags deleted (backport_processed)
  • Backport changed from reef to reef,squid
Actions #18

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64779: squid: cephfs_mirror: add perf counters (w/ label) support added
Actions #19

Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed