Benoît Knecht wrote:
Hmm, now that I think about it, I don't think it's the right approach. If I understand https://github.com/ceph/ceph/pull/40220/commits/afc33758e076761b8d4ec004c8f9c49b80a48770 correctly, the idea is to be able to run several `radosgw` processes with the same `--id` (and therefore the same credentials) on the same machine. As a result, they will each have their own perf counters, but with my proposed fix we wouldn't get the aggregate value; we would just overwrite the counters, or even break things due to duplicate keys in the JSON document.
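To make the overwrite concern concrete, here is a toy Python model. It is not the actual exporter code; it only assumes that counters end up in a mapping keyed by daemon name. The daemon name, client IDs, and counter values are made up for illustration.

```python
# Toy model: two radosgw processes started with the same --id on one host.
# All names and numbers below are hypothetical.
reports = [
    {"daemon": "rgw.my-hostname.rgw0", "client_id": "4151", "req": 10.0},
    {"daemon": "rgw.my-hostname.rgw0", "client_id": "4152", "req": 5.0},
]

# Keyed by daemon name only: the second report replaces the first one.
by_daemon = {}
for report in reports:
    by_daemon[report["daemon"]] = report["req"]

# Keyed by client_id: each process keeps its own counter.
by_client = {}
for report in reports:
    by_client[report["client_id"]] = report["req"]

print(by_daemon)
print(by_client)
```

With the name-only key, one of the two counters is silently lost, so neither the per-process values nor their sum can be recovered afterwards.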
I think the correct approach would be to replace the `ceph_daemon` label on the `ceph_rgw_*` metrics with something like `client_id`, which would be the numerical ID that is currently part of `ceph_daemon` on Pacific, and then have `ceph_rgw_metadata` do the mapping between `client_id` and `ceph_daemon`.
In order to get the same metrics and labels as on Octopus, one would do
```
sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
```
which is almost the same solution as Roland suggested, except it would also work if several `radosgw` instances are running on the same host but with different names, e.g. `my-hostname.rgw0`, `my-hostname.rgw1`, etc.
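For anyone who wants to see what that query would compute, here is a rough Python sketch of the `on(client_id) group_left(ceph_daemon)` join followed by `sum by(ceph_daemon)`. It assumes the proposed `client_id` label exists on both series and that `ceph_rgw_metadata` always has the value 1; the helper names, client IDs, daemon names, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples as (labels, value) pairs, one per radosgw process.
ceph_rgw_req = [
    ({"client_id": "4151"}, 10.0),
    ({"client_id": "4152"}, 5.0),
    ({"client_id": "4160"}, 7.0),
]

# Hypothetical metadata series (value 1) mapping client_id to ceph_daemon.
# Two processes share the daemon name rgw.my-hostname.rgw0.
ceph_rgw_metadata = [
    ({"client_id": "4151", "ceph_daemon": "rgw.my-hostname.rgw0"}, 1.0),
    ({"client_id": "4152", "ceph_daemon": "rgw.my-hostname.rgw0"}, 1.0),
    ({"client_id": "4160", "ceph_daemon": "rgw.my-hostname.rgw1"}, 1.0),
]


def join_group_left(left, right, on, copy):
    """Multiply left by right, matching on `on` and copying `copy` labels."""
    index = {}
    for labels, value in right:
        key = tuple(labels[name] for name in on)
        if key in index:
            raise ValueError("duplicate series on the 'one' side of the join")
        index[key] = (labels, value)
    result = []
    for labels, value in left:
        key = tuple(labels[name] for name in on)
        if key not in index:
            continue  # samples without a match are dropped
        right_labels, right_value = index[key]
        merged = dict(labels)
        for name in copy:
            merged[name] = right_labels[name]
        result.append((merged, value * right_value))
    return result


def sum_by(samples, by):
    """Sum sample values grouped by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels[name] for name in by)] += value
    return dict(totals)


joined = join_group_left(ceph_rgw_req, ceph_rgw_metadata,
                         on=["client_id"], copy=["ceph_daemon"])
per_daemon = sum_by(joined, by=["ceph_daemon"])
print(per_daemon)
```

The two processes that share a daemon name are summed into a single series, while the `rgw1` instance on the same host stays separate.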
Does that make sense? If so, I'll see if I can modify my PR to implement this without getting too messy.
Wouldn't that require `ceph_rgw_req` and the other metrics to carry the `client_id` label? And if so, wouldn't it be easier to simply restore the previous behavior and append the `client_id` to the value of the `ceph_daemon` label?
That way the `ceph_daemon` label would be unique (again), even across several RGW instances on the same host. Assuming `client_id` is a six-letter ID, it might look like this:
ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy"} 0.0
That is actually what it looks like on a development/test environment for Octopus that I have set up, so I'm not sure the change that replaced `ceph_daemon` with an instance ID was necessary. If it was, a more persistent and unique ID (across instances on the same host) could simply be appended to the previous form of the `ceph_daemon` label for RGW, provided such an ID exists:
ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy.123"} 0.0
Roland's idea would then still work, and `label_replace` could still be used to obtain the name of the host. Personally, I'd prefer to use the metadata label, but that's not so important for solving this problem. Either one will work.
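As a rough illustration of the `label_replace` route, here is a small Python model of what that function does to a single label set. It assumes the daemon label has the form `rgw.<realm>.<zone>.<host>.<suffix>`, which is only my reading of the examples above; the `hostname` target label and the regex are hypothetical. Note that PromQL writes the capture group as `$1`, while Python's `re` module uses `\1`.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Model of PromQL label_replace for one label set.

    The regex must match the whole source value; if it does not,
    the labels are returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)
    updated = dict(labels)
    updated[dst_label] = match.expand(replacement)
    return updated


# Assumed layout: rgw.<realm>.<zone>.<host>.<suffix>; the host is group 1.
HOST_REGEX = r"rgw\.[^.]+\.[^.]+\.([^.]+)\..*"

sample = {"ceph_daemon": "rgw.default.default.node1.tzauqy.123"}
with_host = label_replace(sample, "hostname", r"\1", "ceph_daemon", HOST_REGEX)
print(with_host["hostname"])
```

This kind of pattern is fragile: it stops working if the host name itself contains dots, and it does not match a label without a suffix such as `rgw.default.default.node1`, which is part of why I'd lean towards the metadata label.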
Your proposed solution would most likely work as well; it just looks like it would require changes to more metrics. In addition, the value of the `ceph_daemon` label would need to change to something more persistent than the instance ID anyway for this issue to be fixed:
ceph_rgw_req{ceph_daemon="rgw.default.default.node1", client_id="tzauqy"} 0.0
or possibly
ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy", client_id="123"} 0.0
depending on the behavior of the six-letter ID (tzauqy), which I am not at all certain about.