Bug #52974
closed
mgr/monitoring: regression in OSD host details/overview Grafana dashboard
Description
OSD host details and overview do not show any data anymore
This is due to a regression introduced in https://github.com/ceph/ceph/pull/41221, which was meant to fix a matching problem that occurs when multiple OSDs use a single device (NVMe).
Environment
- Prometheus
- node_disk_writes_completed_total metric
- ceph_disk_occupation metric
- possibly other metrics as well, which are not all required to reproduce the issue
How reproducible
Steps:
- Use the following two metrics to perform a query
ceph_disk_occupation{ceph_daemon="osd.99",db_device="/dev/dm-7",device="/dev/dm-1",instance="foo.ceph",job="ceph-mgr"} 1.0
node_disk_writes_completed_total{device="dm-1",instance="foo.ceph:9100",job="node-exporter"} 93809050.0
- Perform query (does not yield a result)
Note that I removed the filtering for a particular instance from the queries.
label_replace(
  (
    irate(node_disk_writes_completed{}[5m])
    or
    irate(node_disk_writes_completed_total{}[5m])
  ),
  "instance", "$1", "instance", "([^:.]*).*"
)
* on(instance, device, ceph_daemon) group_left
label_replace(
  label_replace(ceph_disk_occupation, "device", "$1", "device", "/dev/(.*)"),
  "instance", "$1", "instance", "([^:.]*).*"
)
- Perform adapted query (yields result)
label_replace(
  (
    irate(node_disk_writes_completed{}[5m])
    or
    irate(node_disk_writes_completed_total{}[5m])
  ),
  "instance", "$1", "instance", "([^:.]*).*"
)
* on(instance, device) group_left(ceph_daemon)
label_replace(
  label_replace(ceph_disk_occupation, "device", "$1", "device", "/dev/(.*)"),
  "instance", "$1", "instance", "([^:.]*).*"
)
Returns
{ceph_daemon="osd.99", device="dm-1", instance="foo", job="node-exporter"}
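Both queries first normalize the `instance` and `device` labels so that the two metrics can be compared. The following Python sketch illustrates what the two regular expressions do to the sample label sets from the steps above. It is a simplified model, not Prometheus code: the helper name `label_replace` merely mirrors the PromQL function, and it assumes only that PromQL regular expressions are fully anchored and that `$1` refers to the first capture group.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Simplified model of PromQL label_replace for a single label set."""
    value = labels.get(src, "")
    # PromQL regular expressions are fully anchored, hence fullmatch.
    match = re.fullmatch(regex, value)
    if match is None:
        # No match: the label set is returned unchanged.
        return dict(labels)
    out = dict(labels)
    # Translate "$1"-style references into Python's "\g<1>" syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    out[dst] = match.expand(template)
    return out


# Label sets copied from the two sample metrics in the reproduction steps.
node = {"device": "dm-1", "instance": "foo.ceph:9100", "job": "node-exporter"}
occupation = {
    "ceph_daemon": "osd.99",
    "db_device": "/dev/dm-7",
    "device": "/dev/dm-1",
    "instance": "foo.ceph",
    "job": "ceph-mgr",
}

node_norm = label_replace(node, "instance", "$1", "instance", "([^:.]*).*")
occ_norm = label_replace(occupation, "device", "$1", "device", "/dev/(.*)")
occ_norm = label_replace(occ_norm, "instance", "$1", "instance", "([^:.]*).*")

print(node_norm["instance"], node_norm["device"])
print(occ_norm["instance"], occ_norm["device"])
```

After normalization both label sets agree on `instance` and `device`, so those two labels are sufficient to pair the series.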
Additional info
According to the Prometheus documentation, the labels listed in parentheses directly after `on` are the ones used to match series from the two sides of the operator. A previous PR changed the query to additionally match on the `ceph_daemon` label, which could only work if both sides carried that label. The `node_disk_writes_completed_total` sample above does not have it, and `ceph_disk_occupation` is only a metadata metric that is seemingly used to carry that label from one side to the other, so I cannot explain how the changed query was expected to return any result.
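The matching behaviour described here can be sketched in Python. This is a simplified model of a many-to-one `* on(...) group_left(...)` operation, not Prometheus's implementation. The function names `match_key` and `join` and the rate value `12.5` are hypothetical, and the sketch assumes that a label missing from a series compares as the empty string and that each match key occurs at most once on the right-hand side.

```python
def match_key(labels, on):
    # A label that is absent from a series is treated as the empty string.
    return tuple(labels.get(name, "") for name in on)


def join(left, right, on, group_left=()):
    """Multiply left samples by the right sample that shares the `on` labels."""
    # Assumes each key is unique on the right-hand ("one") side.
    right_by_key = {match_key(labels, on): (labels, value) for labels, value in right}
    result = []
    for l_labels, l_value in left:
        hit = right_by_key.get(match_key(l_labels, on))
        if hit is None:
            continue  # no partner series: the sample is dropped
        r_labels, r_value = hit
        out = dict(l_labels)
        # group_left(...) copies the listed labels from the right-hand side.
        for name in group_left:
            if name in r_labels:
                out[name] = r_labels[name]
        result.append((out, l_value * r_value))
    return result


# Normalized label sets from the reproduction steps; 12.5 is a made-up rate.
left = [({"device": "dm-1", "instance": "foo", "job": "node-exporter"}, 12.5)]
right = [(
    {"ceph_daemon": "osd.99", "db_device": "/dev/dm-7", "device": "dm-1",
     "instance": "foo", "job": "ceph-mgr"},
    1.0,
)]

broken = join(left, right, on=("instance", "device", "ceph_daemon"))
fixed = join(left, right, on=("instance", "device"), group_left=("ceph_daemon",))
print(broken)
print(fixed)
```

In this model the regressed query finds no partner because the left-hand key contains an empty `ceph_daemon` while the right-hand key contains `osd.99`. The adapted query matches on `instance` and `device` only and copies `ceph_daemon` across.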
Updated by Patrick Seidensal over 2 years ago
- Status changed from New to In Progress
Updated by Patrick Seidensal over 2 years ago
- Subject changed from mgr/monitoring: regression in OSD host details/overview to mgr/monitoring: regression in OSD host details/overview Grafana dashboard
Updated by Patrick Seidensal over 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 43685
Updated by Patrick Seidensal over 2 years ago
- Backport set to nautilus octopus pacific
Updated by Ernesto Puerta over 2 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from nautilus octopus pacific to octopus pacific
Updated by Backport Bot over 2 years ago
- Copied to Backport #53882: pacific: mgr/monitoring: regression in OSD host details/overview Grafana dashboard added
Updated by Backport Bot over 2 years ago
- Copied to Backport #53883: octopus: mgr/monitoring: regression in OSD host details/overview Grafana dashboard added
Updated by Patrick Seidensal about 2 years ago
- Backport changed from octopus pacific to octopus pacific nautilus
Updated by Backport Bot about 2 years ago
- Copied to Backport #53990: nautilus: mgr/monitoring: regression in OSD host details/overview Grafana dashboard added
Updated by Konstantin Shalygin about 2 years ago
- Backport changed from octopus pacific nautilus to octopus pacific
Updated by Ernesto Puerta about 2 years ago
- Backport changed from octopus pacific to nautilus octopus pacific
Updated by Konstantin Shalygin over 1 year ago
- Status changed from Pending Backport to Resolved