Feature #52903
mgr/prometheus: Add RADOSGW / rgw sync state and related metrics to prometheus output
Status: open
Description
The mgr/prometheus module does not provide any state information on the radosgw (multisite) sync.
While metrics with the prefixes ceph_data_sync_ and ceph_rgw_ are already available, those only cover the performance metrics of rgw and sync counters such as ceph_data_sync_from_zone_poll_errors.
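To check which of these metrics a given exporter already exposes, the scrape output can be filtered by prefix. The following is a minimal sketch that works on the Prometheus text exposition format; the sample text, its label names and its values are made up for illustration and are not taken from a real cluster.

```python
def metrics_with_prefix(exposition_text, prefixes):
    """Return the sorted metric names that start with any of the given prefixes."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or at whitespace.
        name = line.split("{", 1)[0].split()[0]
        if name.startswith(tuple(prefixes)):
            names.add(name)
    return sorted(names)


# Made-up sample in the text exposition format (illustrative values only).
SAMPLE = """\
# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 0.0
ceph_data_sync_from_zone_poll_errors{ceph_daemon="rgw.a"} 3.0
ceph_rgw_req{ceph_daemon="rgw.a"} 1200.0
"""

if __name__ == "__main__":
    for name in metrics_with_prefix(SAMPLE, ["ceph_data_sync_", "ceph_rgw_"]):
        print(name)
```

None of the names found this way describe whether a zone is caught up or behind, which is the gap this request is about.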
The actual state of the sync between sites can only be seen via radosgw-admin:
radosgw-admin sync status
          realm ce992742-88e7-4613-990a-95dad46fd14a (myrealm)
      zonegroup fd9a77a5-f6c6-4144-82a1-307111569c04 (myzonegroup)
           zone 3084933d-a1bb-43ef-870e-8e8f14dab718 (myzone)
  metadata sync no sync (zone is master)
      data sync source: 6c39f3b7-8234-4d7b-89e5-15ba573ed033 (theotherzone)
                        syncing
                        full sync: 81/128 shards
                        full sync: 6 buckets to sync
                        incremental sync: 47/128 shards
                        data is behind on 81 shards
                        behind shards: [0,1,3,4,5,6,8,10,11,12,14,17,20,22,23,24,25,26,27,29,30,31,32,33,34,35,36,37,42,43,44,47,48,49,56,57,58,60,61,62,63,64,66,67,69,70,71,73,74,76,78,80,81,82,85,86,88,90,91,96,97,99,103,105,106,108,109,111,112,113,114,115,116,117,118,119,120,122,123,124,125]

radosgw-admin sync status
          realm ce992742-88e7-4613-990a-95dad46fd14a (myrealm)
      zonegroup fd9a77a5-f6c6-4144-82a1-307111569c04 (myzonegroup)
           zone 3084933d-a1bb-43ef-870e-8e8f14dab718 (myzone)
  metadata sync no sync (zone is master)
      data sync source: 6c39f3b7-8234-4d7b-89e5-15ba573ed033 (theotherzone)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
I'd love to have metrics on the general sync state ("caught up with source", "behind", ...) for both metadata and data, as well as on the state of the individual shards and the number of buckets being synced.
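As a rough illustration of the kind of numbers meant here, the text output of radosgw-admin sync status shown above could be scraped as follows. This is only a sketch of a workaround, not a proposal for how the mgr module should obtain the data. It assumes output shaped like the two samples above (master zone, a single data sync source), and the metric prefix ceph_rgw_sync_ and all metric names are hypothetical.

```python
import re


def parse_sync_status(text):
    """Extract sync counters from `radosgw-admin sync status` text output.

    Assumes a master zone with one data sync source, as in the samples above;
    with several sources only the first match of each pattern is used.
    """
    def first_int(pattern, default=0):
        match = re.search(pattern, text)
        return int(match.group(1)) if match else default

    behind = re.search(r"behind shards: \[([\d,\s]*)\]", text)
    behind_ids = (
        [int(s) for s in behind.group(1).split(",") if s.strip()] if behind else []
    )
    return {
        "full_sync_shards": first_int(r"full sync: (\d+)/\d+ shards"),
        "incremental_sync_shards": first_int(r"incremental sync: (\d+)/\d+ shards"),
        "total_shards": first_int(r"incremental sync: \d+/(\d+) shards"),
        "buckets_to_sync": first_int(r"full sync: (\d+) buckets to sync"),
        "behind_shards": first_int(r"data is behind on (\d+) shards"),
        "caught_up": int("data is caught up with source" in text),
        "behind_shard_ids": behind_ids,
    }


def to_prometheus_lines(status, prefix="ceph_rgw_sync_"):
    """Render the integer fields as exposition lines (hypothetical metric names)."""
    return [f"{prefix}{key} {value}"
            for key, value in status.items() if isinstance(value, int)]


# The second sample from above, flattened to one string.
CAUGHT_UP = (
    "metadata sync no sync (zone is master) "
    "data sync source: 6c39f3b7-8234-4d7b-89e5-15ba573ed033 (theotherzone) "
    "syncing full sync: 0/128 shards incremental sync: 128/128 shards "
    "data is caught up with source"
)

if __name__ == "__main__":
    print("\n".join(to_prometheus_lines(parse_sync_status(CAUGHT_UP))))
```

Parsing human-readable output like this is fragile, which is part of why having the state exposed as proper metrics would be preferable.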
This request is loosely related to https://tracker.ceph.com/issues/52638, but I believe the rgw sync state data is not even available via the mgr (yet).
Updated by Christian Rohmann over 2 years ago
This seems to also be related to https://tracker.ceph.com/issues/39369 which also suffers from unavailable data on the detailed sync status of RGW multisite.
Updated by Christian Rohmann over 2 years ago
Christian Rohmann wrote:
This seems to also be related to https://tracker.ceph.com/issues/39369 which also suffers from unavailable data on the detailed sync status of RGW multisite.
The discussion there about the data already available via the admin API seems very valuable for identifying which metrics would be required to properly judge the sync state of two zones, such as per-shard state, replication lag, ...