Feature #63945
cephfs_mirror: add perf counters (w/ label) support
0%
Description
https://jsw.ibm.com/browse/ISCE-49:
Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. The metrics would report the progress of cephfs_mirror syncing and thus provide the monitoring capability.
Metrics should enable monitoring logic to generate the following alerts:
- Secondary cluster disconnected
- Replication started/ended
- Resync started/ended
- Promotion/Demotion event (= failover or failback initiated)
- Snapshot transfer failed or interrupted.
- Failed to complete the snapshot transfer before the next scheduled transfer.
- Replication status Monitoring: Alerts on policy non-compliance.
This would provide information such as replication status, replication schedules, and resync status on a per-volume basis, for volumes belonging to a storage class, namespace, or label, via the OCP/ODF dashboards.
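As a rough illustration of how monitoring logic might derive some of the alerts listed above from such metrics, here is a minimal sketch. The `SyncSample` fields, the alert names, and the `schedule_interval_s` parameter are all hypothetical and are not part of cephfs_mirror or any dashboard API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyncSample:
    """Hypothetical per-volume sample scraped from a mirror daemon."""
    last_sync_duration_s: float  # time taken to sync the last snapshot
    failed_syncs: int            # cumulative count of failed snapshot syncs
    peer_connected: bool         # whether the secondary cluster is reachable


def evaluate_alerts(prev: SyncSample, cur: SyncSample,
                    schedule_interval_s: float) -> List[str]:
    """Compare two consecutive samples and return the alerts that fire."""
    alerts = []
    if not cur.peer_connected:
        alerts.append("secondary_cluster_disconnected")
    # A growing failure counter means at least one transfer failed
    # between the two samples.
    if cur.failed_syncs > prev.failed_syncs:
        alerts.append("snapshot_transfer_failed")
    # The last transfer took longer than the gap between scheduled transfers.
    if cur.last_sync_duration_s > schedule_interval_s:
        alerts.append("snapshot_transfer_overran_schedule")
    return alerts
```

Comparing consecutive samples rather than absolute values is one possible design: it lets a cumulative failure counter drive an event-style alert.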
Updated by Venky Shankar 4 months ago
- Assignee set to Jos Collin
- Target version set to v19.0.0
Updated by Jos Collin 4 months ago
- Status changed from New to In Progress
Following a discussion with Juan, he shared the perf counters/metrics design guidelines doc [1], which we should follow for the cephfs_mirror metrics implementation. Adding a PerfCountersBuilder, as is done in MDSRank::create_logger, would be a good start. According to Juan, metrics from multiple mirror daemons would be taken care of by the PerfCountersBuilder if we give the daemon name as a label.
[1] https://ibm-my.sharepoint.com/:w:/p/jolmomar/EX3Jw4YLEoFGpqmDWZzYwxsBdssx4u1R3SDx9KcO2oimOg?e=IWnbMW
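To illustrate the idea of using the daemon name as a label, here is a minimal sketch of a labeled counter store. It is a standalone model only: the `LabeledCounters` class and the counter and label names are hypothetical, and this is not the Ceph PerfCountersBuilder API.

```python
from collections import defaultdict
from typing import Dict, Tuple

# A label set is stored as a sorted tuple of (key, value) pairs so that
# the same labels given in a different order map to the same series.
LabelSet = Tuple[Tuple[str, str], ...]


class LabeledCounters:
    """Toy model: one counter value per (metric name, label set)."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, LabelSet], int] = defaultdict(int)

    def inc(self, name: str, amount: int = 1, **labels: str) -> None:
        key = (name, tuple(sorted(labels.items())))
        self._values[key] += amount

    def get(self, name: str, **labels: str) -> int:
        key = (name, tuple(sorted(labels.items())))
        return self._values.get(key, 0)


counters = LabeledCounters()
# Two mirror daemons report the same metric; the label keeps them apart.
counters.inc("snaps_synced", daemon="mirror.a")
counters.inc("snaps_synced", daemon="mirror.a")
counters.inc("snaps_synced", amount=5, daemon="mirror.b")
```

With this layout, a consumer can aggregate across daemons or inspect a single daemon without the metric names themselves having to encode the daemon identity.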
Updated by Venky Shankar 4 months ago
Apart from the perf counters, there is also value in adding labeled perf counters. Let's keep that in mind when implementing this.
Also, as far as the metrics are concerned, a while back I was involved in a similar effort with Paul Cuzner, and there is a doc (somewhere, which I'll dig up) that details useful metrics for the mirror daemon.
Updated by Venky Shankar 4 months ago
- Subject changed from cephfs_mirror: generate Geo-Replication metrics to cephfs_mirror: add perf counters (w/ label) support
Updated by Venky Shankar 4 months ago
Jos Collin wrote:
https://jsw.ibm.com/browse/ISCE-49:
Introduce metrics that will be consumed by the OCP/ODF Dashboard to provide monitoring of Geo Replication in the OCP and ACM dashboards and elsewhere. The metrics would report the progress of cephfs_mirror syncing and thus provide the monitoring capability.
Metrics should enable monitoring logic to generate the following alerts:
- Secondary cluster disconnected
- Replication started/ended
- Resync started/ended
- Promotion/Demotion event (= failover or failback initiated)
cephfs-mirror does not support promotion/demotion semantics, so this can be left out. This functionality is taken care of by the "upper" layer, and therefore any metrics related to promotion/demotion (failover/failback) should be provided by that layer.
- Snapshot transfer failed or interrupted.
- Failed to complete the snapshot transfer before the next scheduled transfer.
- Replication status Monitoring: Alerts on policy non-compliance.
This would provide information such as replication status, replication schedules, and resync status on a per-volume basis, for volumes belonging to a storage class, namespace, or label, via the OCP/ODF dashboards.
Updated by Jos Collin 4 months ago
The following are the metrics to be considered for cephfs_mirroring:
- cephfs_mirror_snapshot_snapshots - Number of snapshots synced
- cephfs_mirror_snapshot_sync_time - Average sync time
- cephfs_mirror_snapshot_sync_bytes - Total bytes synced
// per-image only counters:
- cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
- cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
- cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
- cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot
These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.
Updated by Venky Shankar 3 months ago
Jos Collin wrote:
The following are the metrics to be considered for cephfs_mirroring:
- cephfs_mirror_snapshot_snapshots - Number of snapshots synced
- cephfs_mirror_snapshot_sync_time - Average sync time
- cephfs_mirror_snapshot_sync_bytes - Total bytes synced
Tracking failures is as important (perhaps more!) as tracking the success metrics. So, we should add:
- cephfs_mirror_failed_snapshot_syncs
// per-image only counters:
Are these per directory then from cephfs pov?
- cephfs_mirror_snapshot_remote_timestamp - Timestamp of the remote snapshot
- cephfs_mirror_snapshot_local_timestamp - Timestamp of the local snapshot
- cephfs_mirror_snapshot_last_sync_time - Time taken to sync the last snapshot
- cephfs_mirror_snapshot_last_sync_bytes - Bytes synced for the last snapshot
These are the metrics observed from rbd-mirroring. Will add more metrics in cephfs_mirroring if needed.
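To make the relationship between these counters concrete, here is a minimal sketch of how they might be updated on each snapshot sync. The `SnapshotSyncStats` class and its method names are hypothetical; the field names only loosely mirror the proposed metric names, and none of this reflects the actual cephfs_mirror implementation.

```python
from dataclasses import dataclass


@dataclass
class SnapshotSyncStats:
    """Toy per-directory accumulator for the proposed metrics."""
    snapshots: int = 0               # number of snapshots synced
    failed_snapshot_syncs: int = 0   # number of failed snapshot syncs
    sync_bytes: int = 0              # total bytes synced
    total_sync_time_s: float = 0.0   # sum of sync durations, for the average
    last_sync_time_s: float = 0.0    # time taken to sync the last snapshot
    last_sync_bytes: int = 0         # bytes synced for the last snapshot

    def record_success(self, duration_s: float, nbytes: int) -> None:
        self.snapshots += 1
        self.sync_bytes += nbytes
        self.total_sync_time_s += duration_s
        self.last_sync_time_s = duration_s
        self.last_sync_bytes = nbytes

    def record_failure(self) -> None:
        self.failed_snapshot_syncs += 1

    @property
    def avg_sync_time_s(self) -> float:
        # Average over successful syncs only; 0.0 before the first sync.
        if self.snapshots == 0:
            return 0.0
        return self.total_sync_time_s / self.snapshots


stats = SnapshotSyncStats()
stats.record_success(duration_s=2.0, nbytes=100)
stats.record_success(duration_s=4.0, nbytes=300)
stats.record_failure()
```

Keeping a running sum and count, rather than a stored average, is one way to let the consumer compute averages over any window it chooses.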
Updated by Venky Shankar 3 months ago
- Pull request ID changed from 55420 to 55471
Updated by Venky Shankar 3 months ago
- Status changed from In Progress to Fix Under Review
- Backport set to reef
Updated by Venky Shankar 3 months ago
- Related to Feature #64387: mds: add per-client perf counters (w/ label) support added
Updated by Venky Shankar 3 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 3 months ago
- Copied to Backport #64485: reef: cephfs_mirror: add perf counters (w/ label) support added
Updated by Venky Shankar 3 months ago
- Assignee changed from Jos Collin to Venky Shankar
Updated by Venky Shankar about 2 months ago
- Tags deleted (backport_processed)
- Backport changed from reef to reef,squid
Updated by Backport Bot about 2 months ago
- Copied to Backport #64779: squid: cephfs_mirror: add perf counters (w/ label) support added