Bug #52974: mgr/monitoring: regression in OSD host details/overview Grafana dashboard - Dashboard - Ceph

Bug #52974

closed

mgr/monitoring: regression in OSD host details/overview Grafana dashboard

Added by Patrick Seidensal over 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: Patrick Seidensal
Category:
Target version:
% Done:
Source:
Tags: backport_processed
Backport: nautilus octopus pacific
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 43685
Crash signature (v1):
Crash signature (v2):

Description

OSD host details and overview do not show any data anymore

This is due to a regression introduced in https://github.com/ceph/ceph/pull/41221, which was meant to fix a matching problem that occurs when multiple OSDs use a single device (NVMe).

Environment

Prometheus
- node_disk_writes_completed_total metric
- ceph_disk_occupation metric
- possibly other metrics as well, which are not all required to reproduce the issue

How reproducible

Steps:

Use the following two metrics to perform a query

ceph_disk_occupation{ceph_daemon="osd.99",db_device="/dev/dm-7",device="/dev/dm-1",instance="foo.ceph",job="ceph-mgr"} 1.0
node_disk_writes_completed_total{device="dm-1",instance="foo.ceph:9100",job="node-exporter"} 93809050.0

Perform query (does not yield a result)

Note that I removed the filtering for a particular instance from the queries.

label_replace(
  (
    irate(node_disk_writes_completed{}[5m]) or
    irate(node_disk_writes_completed_total{}[5m])
  ),
  "instance",
  "$1",
  "instance",
  "([^:.]*).*"
)
* on(instance, device, ceph_daemon) group_left
  label_replace(
    label_replace(
      ceph_disk_occupation,
      "device",
      "$1",
      "device",
      "/dev/(.*)"
    ),
    "instance",
    "$1",
    "instance",
    "([^:.]*).*"
  )

Perform adapted query (yields result)

label_replace(
  (
    irate(node_disk_writes_completed{}[5m]) or
    irate(node_disk_writes_completed_total{}[5m])
  ),
  "instance",
  "$1",
  "instance",
  "([^:.]*).*"
)
* on(instance, device) group_left(ceph_daemon)
  label_replace(
    label_replace(
      ceph_disk_occupation,
      "device",
      "$1",
      "device",
      "/dev/(.*)"
    ),
    "instance",
    "$1",
    "instance",
    "([^:.]*).*"
  )

Returns

{ceph_daemon="osd.99", device="dm-1", instance="foo", job="node-exporter"}
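The label normalisation that both queries rely on can be sketched in Python. This is only an illustration, not Prometheus code: the `label_replace` helper below is a hypothetical re-implementation that assumes a fully anchored regex and `$N` group references, and it ignores sample values entirely. It shows how the two sample series from the steps above end up with comparable `instance` and `device` labels.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Hypothetical stand-in for PromQL's label_replace on one label set.

    Assumes the regex must match the whole source value; if it does not
    match, the label set is returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    # Expand "$1", "$2", ... using the captured groups.
    value = re.sub(
        r"\$(\d+)",
        lambda m: match.group(int(m.group(1))) or "",
        replacement,
    )
    updated = dict(labels)
    updated[dst] = value
    return updated


# Label sets copied from the two sample metrics in the reproduction steps.
node = {"device": "dm-1", "instance": "foo.ceph:9100", "job": "node-exporter"}
occupation = {
    "ceph_daemon": "osd.99",
    "db_device": "/dev/dm-7",
    "device": "/dev/dm-1",
    "instance": "foo.ceph",
    "job": "ceph-mgr",
}

# Left side: strip domain and port from the instance label.
node_norm = label_replace(node, "instance", "$1", "instance", "([^:.]*).*")

# Right side: strip the "/dev/" prefix, then normalise the instance label.
occupation_norm = label_replace(
    label_replace(occupation, "device", "$1", "device", "/dev/(.*)"),
    "instance", "$1", "instance", "([^:.]*).*",
)

print(node_norm["instance"], node_norm["device"])              # foo dm-1
print(occupation_norm["instance"], occupation_norm["device"])  # foo dm-1
```

After normalisation both sides agree on `instance="foo"` and `device="dm-1"`, so these two labels are sufficient to pair the series.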

Additional info¶

According to the Prometheus documentation, the labels listed in parentheses directly after `on` are the ones used to match the two sides of the join. A previous PR changed the query to additionally match on the `ceph_daemon` label, which could only work if both sides had that label. However, `ceph_disk_occupation` is a metadata metric that is seemingly used only to carry that label from one side to the other, so I cannot explain how the changed query could have returned data.
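The matching behaviour described above can be illustrated with a small Python model. This is a simplified sketch and not how Prometheus is implemented: `match_on` is a hypothetical helper that models only label matching for a many-to-one join, treats a missing label as an empty value, and ignores sample values and error cases such as duplicate series on the "one" side. The label sets are the sample series from this report after the `label_replace` normalisation.

```python
def match_on(left, right, on, group_left=()):
    """Pair each left series with the right series that shares its `on` labels.

    Labels named in `group_left` are copied from the right series into the
    result, mimicking `group_left(<labels>)`.
    """
    def signature(labels):
        # A label that is absent compares as the empty string.
        return tuple(labels.get(name, "") for name in on)

    right_by_signature = {signature(r): r for r in right}
    results = []
    for series in left:
        partner = right_by_signature.get(signature(series))
        if partner is None:
            continue  # no matching series on the right: no result
        merged = dict(series)
        for name in group_left:
            if name in partner:
                merged[name] = partner[name]
        results.append(merged)
    return results


# Normalised label sets of the two sample series.
node = [{"device": "dm-1", "instance": "foo", "job": "node-exporter"}]
occupation = [{
    "ceph_daemon": "osd.99",
    "db_device": "/dev/dm-7",
    "device": "dm-1",
    "instance": "foo",
    "job": "ceph-mgr",
}]

# Regressed query: ceph_daemon is part of the match, but only one side has it.
broken = match_on(node, occupation, on=("instance", "device", "ceph_daemon"))

# Adapted query: match on instance and device, copy ceph_daemon across.
fixed = match_on(
    node, occupation, on=("instance", "device"), group_left=("ceph_daemon",)
)

print(broken)  # []
print(fixed)
```

In this model the regressed form finds no partner because the node exporter series has no `ceph_daemon` label, while the adapted form produces the label set shown under "Returns".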

Related issues 3 (0 open — 3 closed)

Updated by Patrick Seidensal over 2 years ago

Status changed from New to In Progress

Updated by Patrick Seidensal over 2 years ago

Subject changed from mgr/monitoring: regression in OSD host details/overview to mgr/monitoring: regression in OSD host details/overview Grafana dashboard
