Bug #52974

closed

mgr/monitoring: regression in OSD host details/overview Grafana dashboard

Added by Patrick Seidensal over 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
nautilus octopus pacific
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The OSD host details and OSD host overview dashboards no longer show any data.

This is due to a regression introduced in https://github.com/ceph/ceph/pull/41221, which was meant to fix a matching problem that occurs when multiple OSDs share a single device (NVMe).

Environment

  • Prometheus
    • node_disk_writes_completed_total metric
    • ceph_disk_occupation metric
    • possibly other metrics as well, which are not all required to reproduce the issue

How reproducible

Steps:

  1. Use the following two metrics to perform a query
ceph_disk_occupation{ceph_daemon="osd.99",db_device="/dev/dm-7",device="/dev/dm-1",instance="foo.ceph",job="ceph-mgr"} 1.0
node_disk_writes_completed_total{device="dm-1",instance="foo.ceph:9100",job="node-exporter"} 93809050.0
  2. Perform query (does not yield a result)

Note that I removed the filtering for a particular instance from the queries.

label_replace(
  (
    irate(node_disk_writes_completed{}[5m]) or
    irate(node_disk_writes_completed_total{}[5m])
  ),
  "instance",
  "$1",
  "instance",
  "([^:.]*).*"
)
* on(instance, device, ceph_daemon) group_left
label_replace(
  label_replace(
    ceph_disk_occupation,
    "device",
    "$1",
    "device",
    "/dev/(.*)"
  ),
  "instance",
  "$1",
  "instance",
  "([^:.]*).*"
)
  3. Perform adapted query (yields result)
label_replace(
  (
    irate(node_disk_writes_completed{}[5m]) or
    irate(node_disk_writes_completed_total{}[5m])
  ),
  "instance",
  "$1",
  "instance",
  "([^:.]*).*"
)
* on(instance, device) group_left(ceph_daemon)
label_replace(
  label_replace(
    ceph_disk_occupation,
    "device",
    "$1",
    "device",
    "/dev/(.*)"
  ),
  "instance",
  "$1",
  "instance",
  "([^:.]*).*"
)

Returns

{ceph_daemon="osd.99", device="dm-1", instance="foo", job="node-exporter"}
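
To make the label rewriting in the queries above easier to follow, here is a minimal Python sketch of what the `label_replace` calls do to the two sample series. It is a simplified model, not Prometheus code: the function only mimics the fully anchored regex match and `$1`-style substitution for a single label set, and the variable names are chosen here for illustration.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Simplified model of PromQL label_replace() for a single label set.

    The regex must match the whole source value. If it does not match,
    the label set is returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    # Translate "$1"-style references into Python's "\g<1>" syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    out = dict(labels)
    out[dst] = match.expand(template)
    return out


# The two sample series from the reproduction steps (values omitted).
node = {"device": "dm-1", "instance": "foo.ceph:9100", "job": "node-exporter"}
occupation = {
    "ceph_daemon": "osd.99",
    "db_device": "/dev/dm-7",
    "device": "/dev/dm-1",
    "instance": "foo.ceph",
    "job": "ceph-mgr",
}

# Left-hand side: only the instance label is shortened.
node_norm = label_replace(node, "instance", "$1", "instance", "([^:.]*).*")

# Right-hand side: strip "/dev/" from device, then shorten instance.
occ_norm = label_replace(
    label_replace(occupation, "device", "$1", "device", "/dev/(.*)"),
    "instance", "$1", "instance", "([^:.]*).*",
)

print(node_norm)
print(occ_norm)
```

After the rewriting, both series carry `instance="foo"` and `device="dm-1"`, so these two labels are usable as a join key; `ceph_daemon` still exists only on the `ceph_disk_occupation` side.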

Additional info

According to the Prometheus documentation, the labels listed in parentheses directly after `on` are the ones used to match series from the two sides. A previous PR changed the query to additionally match on the `ceph_daemon` label, which could only have produced results if both sides carried that label. However, `ceph_disk_occupation` is a metadata metric that is seemingly used only to carry `ceph_daemon` over to the other side, and the node-exporter sample above does not have that label, so I cannot explain how the changed query could have worked.
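
The matching behaviour described above can be modelled with a short Python sketch. This is an illustration under assumptions, not Prometheus's implementation: `join_group_left` is a hypothetical helper that pairs label sets on the `on(...)` labels, treats an absent label as an empty value, and copies the `group_left(...)` labels from the "one" side. The inputs are the two sample series after the `label_replace` normalisation.

```python
def match_key(labels, on):
    # In this model an absent label behaves like an empty value.
    return tuple(labels.get(name, "") for name in on)


def join_group_left(many, one, on, extra=()):
    """Simplified model of `many * on(...) group_left(extra) one`.

    Works on label sets only (sample values are ignored) and returns the
    label sets of the result series.
    """
    index = {}
    for labels in one:
        key = match_key(labels, on)
        if key in index:
            raise ValueError("duplicate series on the 'one' side")
        index[key] = labels

    result = []
    for labels in many:
        partner = index.get(match_key(labels, on))
        if partner is None:
            continue  # no partner with the same key: series is dropped
        merged = dict(labels)
        for name in extra:
            if name in partner:
                merged[name] = partner[name]
        result.append(merged)
    return result


# Sample series after normalisation of instance and device.
node_side = [{"device": "dm-1", "instance": "foo", "job": "node-exporter"}]
occupation_side = [{
    "ceph_daemon": "osd.99",
    "db_device": "/dev/dm-7",
    "device": "dm-1",
    "instance": "foo",
    "job": "ceph-mgr",
}]

# Query from step 2: ceph_daemon is part of the match key.
regressed = join_group_left(
    node_side, occupation_side, on=("instance", "device", "ceph_daemon")
)

# Query from step 3: ceph_daemon is only copied over.
adapted = join_group_left(
    node_side, occupation_side, on=("instance", "device"), extra=("ceph_daemon",)
)

print(regressed)
print(adapted)
```

In this model the first join is empty because the node-exporter key has an empty `ceph_daemon` while the occupation key has `osd.99`. The second join yields the same label set as the result reported for the adapted query.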


Related issues 3 (0 open, 3 closed)

Copied to Dashboard - Backport #53882: pacific: mgr/monitoring: regression in OSD host details/overview Grafana dashboard (Resolved, Patrick Seidensal)
Copied to Dashboard - Backport #53883: octopus: mgr/monitoring: regression in OSD host details/overview Grafana dashboard (Resolved, Patrick Seidensal)
Copied to Dashboard - Backport #53990: nautilus: mgr/monitoring: regression in OSD host details/overview Grafana dashboard (Rejected, Patrick Seidensal)
Actions #1

Updated by Patrick Seidensal over 2 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Patrick Seidensal over 2 years ago

  • Subject changed from mgr/monitoring: regression in OSD host details/overview to mgr/monitoring: regression in OSD host details/overview Grafana dashboard
Actions #3

Updated by Patrick Seidensal over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 43685
Actions #4

Updated by Patrick Seidensal over 2 years ago

  • Backport set to nautilus octopus pacific
Actions #5

Updated by Ernesto Puerta over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from nautilus octopus pacific to octopus pacific
Actions #6

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53882: pacific: mgr/monitoring: regression in OSD host details/overview Grafana dashboard added
Actions #7

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53883: octopus: mgr/monitoring: regression in OSD host details/overview Grafana dashboard added
Actions #8

Updated by Patrick Seidensal about 2 years ago

  • Backport changed from octopus pacific to octopus pacific nautilus
Actions #9

Updated by Backport Bot about 2 years ago

  • Copied to Backport #53990: nautilus: mgr/monitoring: regression in OSD host details/overview Grafana dashboard added
Actions #10

Updated by Konstantin Shalygin about 2 years ago

  • Backport changed from octopus pacific nautilus to octopus pacific
Actions #11

Updated by Ernesto Puerta about 2 years ago

  • Backport changed from octopus pacific to nautilus octopus pacific
Actions #12

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #13

Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved