Bug #51120

closed

RGW daemon names keep changing after every restart since update to pacific

Added by Roland Sommer almost 3 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Monitoring
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
43707
Crash signature (v1):
Crash signature (v2):

Description

After updating a Ceph cluster from Octopus to Pacific (16.2.4), the daemon names of the RGW instances started to change after every daemon restart, breaking, for example, the Prometheus metrics because the ceph_daemon label keeps changing. Before the update, daemon names (and Prometheus labels) were stable, like rgw.cephrgw01, rgw.cephrgw02, etc.; after the update, the names became purely numeric IDs like rgw.253785251. Excerpt from the RGW daemon log (restarted a few times just for demonstration purposes):

2021-06-08T06:25:08.208+0000 7f4ae77c0240  1 mgrc service_daemon_register rgw.253785251
2021-06-08T07:03:37.417+0000 7f884a557240  1 mgrc service_daemon_register rgw.253789201
2021-06-08T07:03:40.953+0000 7f81c09dc240  1 mgrc service_daemon_register rgw.253789226
2021-06-08T07:03:50.585+0000 7f16b76dc240  1 mgrc service_daemon_register rgw.253789261
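
To illustrate the effect on the metrics side (the series below are just examples based on the names above, not actual scrape output), the same gateway now shows up under a new ceph_daemon label after each restart:

```
# Octopus: one stable series per gateway
ceph_rgw_req{ceph_daemon="rgw.cephrgw01"} 0.0

# Pacific: the same gateway, after two restarts
ceph_rgw_req{ceph_daemon="rgw.253785251"} 0.0
ceph_rgw_req{ceph_daemon="rgw.253789201"} 0.0
```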

Did I miss any important change or is this a regression?


Related issues 1 (0 open, 1 closed)

Copied to Dashboard - Backport #54118: pacific: RGW daemon names keep changing after every restart since update to pacific (Resolved)
#1

Updated by Roland Sommer almost 3 years ago

Additional information: the Ceph dashboard still says the ID is cephrgw01 etc., and uses this ID when trying to access the Grafana detail dashboards, resulting in empty graphs because the actual labels do not match.

#2

Updated by Roland Sommer almost 3 years ago

OK, that seems to come from commit afc33758e076761b8d4ec004c8f9c49b80a48770. Instead of the daemon's name, rados.get_instance_id() is now used to register the service. So the Prometheus labels now change after every restart, making it impossible to track one specific instance in a traditional, non-container setup.

The query used in the Grafana dashboards,

label_replace(rate(ceph_rgw_req[30s]), "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")

results in rgw_host being that instance ID, too.
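
For example (the label sets below are illustrative, taken from the log excerpt in the description), the regex now just copies the numeric instance ID into rgw_host:

```
label_replace(rate(ceph_rgw_req[30s]), "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")
# Octopus: {ceph_daemon="rgw.cephrgw01", rgw_host="cephrgw01"}
# Pacific: {ceph_daemon="rgw.253785251", rgw_host="253785251"}
```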

#3

Updated by Roland Sommer almost 3 years ago

If one would like to get back the per-RGW-host metrics:

sum by(hostname) (rate(ceph_rgw_req[30s]) * on(ceph_daemon) group_left(hostname) ceph_rgw_metadata)

The legend has to be changed to {{ hostname }}.
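
A rough sketch of how the join works (series and values are illustrative, not real output): ceph_rgw_metadata carries both the unstable ceph_daemon and the stable hostname, so multiplying on ceph_daemon re-attaches the hostname to the counter rate.

```
# Illustrative input series:
#   rate(ceph_rgw_req[30s])  {ceph_daemon="rgw.253785251"}                       => 42
#   ceph_rgw_metadata        {ceph_daemon="rgw.253785251", hostname="cephrgw01"} => 1

sum by(hostname) (
  rate(ceph_rgw_req[30s])
    * on(ceph_daemon) group_left(hostname)
  ceph_rgw_metadata
)
# => {hostname="cephrgw01"} 42
```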

#4

Updated by Neha Ojha almost 3 years ago

  • Project changed from Ceph to rgw
#5

Updated by Casey Bodley almost 3 years ago

  • Project changed from rgw to Orchestrator
#6

Updated by Benoît Knecht over 2 years ago

I submitted https://github.com/ceph/ceph/pull/43707 to fix this issue.

#7

Updated by Benoît Knecht over 2 years ago

Hmm, now that I think about it, I don't think it's the right approach. If I understand https://github.com/ceph/ceph/pull/40220/commits/afc33758e076761b8d4ec004c8f9c49b80a48770 correctly, the idea is to be able to run several `radosgw` processes with the same `--id` (and therefore the same credentials) on the same machine. As a result, they will each have their own perf counters, but with my proposed fix, we wouldn't get the aggregate value, we would just overwrite the counters, or even just break things due to duplicate keys in the JSON document.

I think the correct approach would be to replace the `ceph_daemon` label on the `ceph_rgw_*` metrics with something like `client_id`, which would be the numerical ID that is currently part of `ceph_daemon` on Pacific, and then have `ceph_rgw_metadata` do the mapping between `client_id` and `ceph_daemon`.

In order to get the same metrics and labels as on Octopus, one would do

```
sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
```

which is almost the same solution as Roland suggested, except it would also work if several `radosgw` instances are running on the same host but with different names, e.g. `my-hostname.rgw0`, `my-hostname.rgw1`, etc.
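
To make that concrete, the exported series could then look roughly like this (label values are purely illustrative):

```
# Hypothetical series under this proposal:
ceph_rgw_req{client_id="253785251"} 0.0
ceph_rgw_metadata{client_id="253785251", ceph_daemon="rgw.cephrgw01", hostname="cephrgw01"} 1.0

# The join above would then restore the stable, Octopus-style label:
# sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
# => {ceph_daemon="rgw.cephrgw01"} 0.0
```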

Does that make sense? If so, I'll see if I can modify my PR to implement this without getting too messy.

#8

Updated by Patrick Seidensal over 2 years ago

Benoît Knecht wrote:

> Hmm, now that I think about it, I don't think it's the right approach. If I understand https://github.com/ceph/ceph/pull/40220/commits/afc33758e076761b8d4ec004c8f9c49b80a48770 correctly, the idea is to be able to run several `radosgw` processes with the same `--id` (and therefore the same credentials) on the same machine. As a result, they will each have their own perf counters, but with my proposed fix, we wouldn't get the aggregate value, we would just overwrite the counters, or even just break things due to duplicate keys in the JSON document.
>
> I think the correct approach would be to replace the `ceph_daemon` label on the `ceph_rgw_*` metrics with something like `client_id`, which would be the numerical ID that is currently part of `ceph_daemon` on Pacific, and then have `ceph_rgw_metadata` do the mapping between `client_id` and `ceph_daemon`.
>
> In order to get the same metrics and labels as on Octopus, one would do
>
> ```
> sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
> ```
>
> which is almost the same solution as Roland suggested, except it would also work if several `radosgw` instances are running on the same host but with different names, e.g. `my-hostname.rgw0`, `my-hostname.rgw1`, etc.
>
> Does that make sense? If so, I'll see if I can modify my PR to implement this without getting too messy.

Wouldn't that require `ceph_rgw_req` and the other metrics to have the `client_id` label? And if so, wouldn't it possibly be easier to more or less restore the previous behavior and simply append the `client_id` to the value of the `ceph_daemon` label?

I mean, that way the ceph_daemon label would be unique (again), even across several RGW instances on the same host. Assuming `client_id` is a six-letter ID, it might look like this:

ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy"} 0.0

Which is actually how it looks on a development/test environment for Octopus that I have set up, so I'm not sure the change that replaced `ceph_daemon` with an instance ID was necessary. But if it was, a more persistent ID that is unique across instances on the same host could simply be appended to the previous form of the ceph_daemon label for RGW (provided such an ID exists).

ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy.123"} 0.0

Roland's idea would then also just work. Personally, I'd prefer to use the metadata label, but for solving this problem that's not so important; either one will work, and label_replace could still be used to obtain the name of the host.
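
For example, assuming the appended format from above with the hostname as the fourth dot-separated component (an assumption about the naming scheme, not something I have verified), something along these lines should still extract the host:

```
label_replace(ceph_rgw_req, "rgw_host", "$1", "ceph_daemon", "rgw\\.[^.]+\\.[^.]+\\.([^.]+)\\..*")
# e.g. ceph_daemon="rgw.default.default.node1.tzauqy.123" => rgw_host="node1"
```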

But your proposed solution would most likely work as well; it just looks like it would require changes to more metrics. In addition, the value of the ceph_daemon label would need to be changed to something more persistent than the instance ID anyway (for this issue to be fixed), for example:

ceph_rgw_req{ceph_daemon="rgw.default.default.node1", client_id="tzauqy"} 0.0

or possibly

ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy", client_id="123"} 0.0

depending on the behavior of the six-letter ID (tzauqy), which I am absolutely not certain about.

#9

Updated by Sebastian Wagner over 2 years ago

  • Pull request ID set to 43707
#10

Updated by Ernesto Puerta about 2 years ago

  • Status changed from New to Pending Backport
  • Backport set to pacific
#11

Updated by Ernesto Puerta about 2 years ago

  • Project changed from Orchestrator to Dashboard
  • Category set to Monitoring
#12

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54118: pacific: RGW daemon names keep changing after every restart since update to pacific added
#13

Updated by Ernesto Puerta about 2 years ago

  • Status changed from Pending Backport to Resolved