Bug #51120

closed

RGW daemon names keep changing after every restart since update to pacific

Added by Roland Sommer almost 3 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Monitoring
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
43707
Crash signature (v1):
Crash signature (v2):

Description

After updating a Ceph cluster from Octopus to Pacific (16.2.4), the daemon names of the RGW instances started to change after every daemon restart, breaking, for example, the Prometheus metrics because the ceph_daemon label keeps changing. Before the update, daemon names (and Prometheus labels) were stable, like rgw.cephrgw01, rgw.cephrgw02, etc.; after the update, the names became purely numeric IDs like rgw.253785251. Excerpt from the RGW daemon log (restarted a few times just for demonstration purposes):

2021-06-08T06:25:08.208+0000 7f4ae77c0240  1 mgrc service_daemon_register rgw.253785251
2021-06-08T07:03:37.417+0000 7f884a557240  1 mgrc service_daemon_register rgw.253789201
2021-06-08T07:03:40.953+0000 7f81c09dc240  1 mgrc service_daemon_register rgw.253789226
2021-06-08T07:03:50.585+0000 7f16b76dc240  1 mgrc service_daemon_register rgw.253789261
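
To illustrate the effect on the metrics side (the series below are just examples based on the names above, not actual scrape output), the same gateway now shows up under a new ceph_daemon label after each restart:

```
# Octopus: one stable series per gateway
ceph_rgw_req{ceph_daemon="rgw.cephrgw01"} 0.0

# Pacific: the same gateway, after two restarts
ceph_rgw_req{ceph_daemon="rgw.253785251"} 0.0
ceph_rgw_req{ceph_daemon="rgw.253789201"} 0.0
```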

Did I miss any important change or is this a regression?


Related issues 1 (0 open, 1 closed)

Copied to Dashboard - Backport #54118: pacific: RGW daemon names keep changing after every restart since update to pacific (Resolved)
#1

Updated by Roland Sommer almost 3 years ago

Additional information: the Ceph dashboard still says the ID is cephrgw01 etc., and uses this ID when trying to access the Grafana detail dashboards, resulting in empty graphs because the actual labels do not match.

#2

Updated by Roland Sommer almost 3 years ago

OK, that seems to come from commit afc33758e076761b8d4ec004c8f9c49b80a48770. Instead of the daemon's name, rados.get_instance_id() is now used to register the service. So the Prometheus labels now change after every restart, making it impossible to track one specific instance in a traditional, non-container setup.

The query used in the Grafana dashboards,

label_replace(rate(ceph_rgw_req[30s]), "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")

results in rgw_host being that instance ID, too.
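
For example (the label sets below are illustrative, taken from the log excerpt in the description), the regex now just copies the numeric instance ID into rgw_host:

```
label_replace(rate(ceph_rgw_req[30s]), "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")
# Octopus: {ceph_daemon="rgw.cephrgw01", rgw_host="cephrgw01"}
# Pacific: {ceph_daemon="rgw.253785251", rgw_host="253785251"}
```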

#3

Updated by Roland Sommer almost 3 years ago

If one would like to get back the per-RGW-host metrics:

sum by(hostname) (rate(ceph_rgw_req[30s]) * on(ceph_daemon) group_left(hostname) ceph_rgw_metadata)

The legend has to be changed to {{ hostname }}.
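
A rough sketch of how the join works (series and values are illustrative, not real output): ceph_rgw_metadata carries both the unstable ceph_daemon and the stable hostname, so multiplying on ceph_daemon re-attaches the hostname to the counter rate.

```
# Illustrative input series:
#   rate(ceph_rgw_req[30s])  {ceph_daemon="rgw.253785251"}                       => 42
#   ceph_rgw_metadata        {ceph_daemon="rgw.253785251", hostname="cephrgw01"} => 1

sum by(hostname) (
  rate(ceph_rgw_req[30s])
    * on(ceph_daemon) group_left(hostname)
  ceph_rgw_metadata
)
# => {hostname="cephrgw01"} 42
```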

#4

Updated by Neha Ojha almost 3 years ago

  • Project changed from Ceph to rgw
#5

Updated by Casey Bodley almost 3 years ago

  • Project changed from rgw to Orchestrator
#6

Updated by Benoît Knecht over 2 years ago

I submitted https://github.com/ceph/ceph/pull/43707 to fix this issue.

#7

Updated by Benoît Knecht over 2 years ago

Hmm, now that I think about it, I don't think it's the right approach. If I understand https://github.com/ceph/ceph/pull/40220/commits/afc33758e076761b8d4ec004c8f9c49b80a48770 correctly, the idea is to be able to run several `radosgw` processes with the same `--id` (and therefore the same credentials) on the same machine. As a result, they will each have their own perf counters, but with my proposed fix, we wouldn't get the aggregate value, we would just overwrite the counters, or even just break things due to duplicate keys in the JSON document.

I think the correct approach would be to replace the `ceph_daemon` label on the `ceph_rgw_*` metrics with something like `client_id`, which would be the numerical ID that is currently part of `ceph_daemon` on Pacific, and then have `ceph_rgw_metadata` do the mapping between `client_id` and `ceph_daemon`.

In order to get the same metrics and labels as on Octopus, one would do

```
sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
```

which is almost the same solution as Roland suggested, except it would also work if several `radosgw` instances are running on the same host but with different names, e.g. `my-hostname.rgw0`, `my-hostname.rgw1`, etc.
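
To make that concrete, the exported series could then look roughly like this (label values are purely illustrative):

```
# Hypothetical series under this proposal:
ceph_rgw_req{client_id="253785251"} 0.0
ceph_rgw_metadata{client_id="253785251", ceph_daemon="rgw.cephrgw01", hostname="cephrgw01"} 1.0

# The join above would then restore the stable, Octopus-style label:
# sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
# => {ceph_daemon="rgw.cephrgw01"} 0.0
```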

Does that make sense? If so, I'll see if I can modify my PR to implement this without getting too messy.

#8

Updated by Patrick Seidensal over 2 years ago

Benoît Knecht wrote:

> Hmm, now that I think about it, I don't think it's the right approach. If I understand https://github.com/ceph/ceph/pull/40220/commits/afc33758e076761b8d4ec004c8f9c49b80a48770 correctly, the idea is to be able to run several `radosgw` processes with the same `--id` (and therefore the same credentials) on the same machine. As a result, they will each have their own perf counters, but with my proposed fix, we wouldn't get the aggregate value, we would just overwrite the counters, or even just break things due to duplicate keys in the JSON document.
>
> I think the correct approach would be to replace the `ceph_daemon` label on the `ceph_rgw_*` metrics with something like `client_id`, which would be the numerical ID that is currently part of `ceph_daemon` on Pacific, and then have `ceph_rgw_metadata` do the mapping between `client_id` and `ceph_daemon`.
>
> In order to get the same metrics and labels as on Octopus, one would do
>
> ```
> sum by(ceph_daemon) (ceph_rgw_req * on(client_id) group_left(ceph_daemon) ceph_rgw_metadata)
> ```
>
> which is almost the same solution as Roland suggested, except it would also work if several `radosgw` instances are running on the same host but with different names, e.g. `my-hostname.rgw0`, `my-hostname.rgw1`, etc.
>
> Does that make sense? If so, I'll see if I can modify my PR to implement this without getting too messy.

Wouldn't that require `ceph_rgw_req` and the other metrics to have the `client_id` label? And if so, wouldn't it possibly be easier to more or less restore the previous behavior and simply append the `client_id` to the value of the `ceph_daemon` label?

I mean, that way the ceph_daemon label would be unique (again), even across several RGW instances on the same host. Assuming `client_id` is a six-letter ID, it might look like this:

ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy"} 0.0

Which is actually how it looks on a development/test environment for Octopus that I have set up, so I'm not sure the change that replaced `ceph_daemon` with an instance ID was necessary. But if it was, a more persistent ID that is unique across instances on the same host could simply be appended to the previous form of the ceph_daemon label for RGW (provided such an ID exists).

ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy.123"} 0.0

Roland's idea would then also just work. Personally, I'd prefer to use the metadata label, but for solving this problem that's not so important; either one will work, and label_replace could still be used to obtain the name of the host.
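
For example, assuming the appended format from above with the hostname as the fourth dot-separated component (an assumption about the naming scheme, not something I have verified), something along these lines should still extract the host:

```
label_replace(ceph_rgw_req, "rgw_host", "$1", "ceph_daemon", "rgw\\.[^.]+\\.[^.]+\\.([^.]+)\\..*")
# e.g. ceph_daemon="rgw.default.default.node1.tzauqy.123" => rgw_host="node1"
```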

But your proposed solution would most likely work as well; it just looks like it would require changes to more metrics. In addition, the value of the ceph_daemon label would need to be changed to something more persistent than the instance ID anyway (for this issue to be fixed), for example:

ceph_rgw_req{ceph_daemon="rgw.default.default.node1", client_id="tzauqy"} 0.0

or possibly

ceph_rgw_req{ceph_daemon="rgw.default.default.node1.tzauqy", client_id="123"} 0.0

depending on the behavior of the six-letter ID (tzauqy), which I am absolutely not certain about.

#9

Updated by Sebastian Wagner over 2 years ago

  • Pull request ID set to 43707
#10

Updated by Ernesto Puerta about 2 years ago

  • Status changed from New to Pending Backport
  • Backport set to pacific
#11

Updated by Ernesto Puerta about 2 years ago

  • Project changed from Orchestrator to Dashboard
  • Category set to Monitoring
#12

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54118: pacific: RGW daemon names keep changing after every restart since update to pacific added
#13

Updated by Ernesto Puerta about 2 years ago

  • Status changed from Pending Backport to Resolved