Feature #63945
open
cephfs_mirror: add perf counters (w/ label) support
Added by Jos Collin 4 months ago.
Updated 2 months ago.
Category:
Administration/Usability
Component(FS):
cephfs-mirror
Description
https://jsw.ibm.com/browse/ISCE-49:
Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. These metrics would expose the progress of cephfs_mirror syncing and thus provide the monitoring capability.
Metrics should enable monitoring logic to generate the following alerts:
- Secondary cluster disconnected
- Replication started/ended
- Resync started/ended
- Promotion/Demotion event (= failover or failback initiated)
- Snapshot transfer failed or interrupted.
- Failed to complete the snapshot transfer before the next scheduled transfer.
- Replication status Monitoring: Alerts on policy non-compliance.
This would provide information such as replication status, replication schedules, and resync status, per volume belonging to a storage class, namespace, or label, via OCP/ODF dashboards.
Related issues
3 (3 open, 0 closed)
- Description updated (diff)
- Assignee set to Jos Collin
- Target version set to v19.0.0
- Status changed from New to In Progress
Based on the discussion, Juan shared the perf counters/metrics design guidelines doc [1], which we should follow for the cephfs_mirror metrics implementation. Adding a PerfCountersBuilder, as is done in MDSRank::create_logger, would be a good start. Per Juan, metrics from multiple mirror daemons would be taken care of by the PerfCountersBuilder if we give the daemon name as a label.
[1] https://ibm-my.sharepoint.com/:w:/p/jolmomar/EX3Jw4YLEoFGpqmDWZzYwxsBdssx4u1R3SDx9KcO2oimOg?e=IWnbMW
Apart from the perf counters, there is also value in adding labeled perf counters. Let's keep that in mind when implementing this.
Also, as far as the metrics are concerned, a while back I was involved in a similar effort with Paul Cuzner, and there is a doc (somewhere, which I'll dig up) that details useful metrics for the mirror daemon.
- Subject changed from cephfs_mirror: generate Geo-Replication metrics to cephfs_mirror: add perf counters (w/ label) support
Jos Collin wrote:
https://jsw.ibm.com/browse/ISCE-49:
Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. These metrics would expose the progress of cephfs_mirror syncing and thus provide the monitoring capability.
Metrics should enable monitoring logic to generate the following alerts:
- Secondary cluster disconnected
- Replication started/ended
- Resync started/ended
- Promotion/Demotion event (= failover or failback initiated)
cephfs-mirror does not support promotion/demotion semantics, so this can be left out. This functionality is taken care of by the "upper" layer, and therefore any metrics related to promotion/demotion (failover/failback) should be provided by that layer.
- Snapshot transfer failed or interrupted.
- Failed to complete the snapshot transfer before the next scheduled transfer.
- Replication status Monitoring: Alerts on policy non-compliance.
This would provide information such as replication status, replication schedules, and resync status, per volume belonging to a storage class, namespace, or label, via OCP/ODF dashboards.
The following are the metrics to be considered for cephfs_mirroring:
- cephfs_mirror_snapshot_snapshots - Number of snapshots synced
- cephfs_mirror_snapshot_sync_time - Average sync time
- cephfs_mirror_snapshot_sync_bytes - Total bytes synced
// per-image only counters:
- cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
- cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
- cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
- cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot
These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.
Jos Collin wrote:
The following are the metrics to be considered for cephfs_mirroring:
- cephfs_mirror_snapshot_snapshots - Number of snapshots synced
- cephfs_mirror_snapshot_sync_time - Average sync time
- cephfs_mirror_snapshot_sync_bytes - Total bytes synced
Tracking failures is as important (perhaps more so!) as tracking the success metrics. So, we should add:
- cephfs_mirror_failed_snapshot_syncs
// per-image only counters:
Are these per directory then from cephfs pov?
- cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
- cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
- cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
- cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot
These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.
- Pull request ID set to 55420
- Pull request ID changed from 55420 to 55471
- Status changed from In Progress to Fix Under Review
- Backport set to reef
- Related to Feature #64387: mds: add per-client perf counters (w/ label) support added
- Status changed from Fix Under Review to Pending Backport
- Copied to Backport #64485: reef: cephfs_mirror: add perf counters (w/ label) support added
- Tags set to backport_processed
- Assignee changed from Jos Collin to Venky Shankar
- Tags deleted (backport_processed)
- Backport changed from reef to reef,squid
- Copied to Backport #64779: squid: cephfs_mirror: add perf counters (w/ label) support added
- Tags set to backport_processed