Cleanup #62911

Label used for Ceph health alert POOL_NEARFULL is in discordance with documentation and Ceph alerts

Added by Laura Flores 8 months ago. Updated 5 months ago.

Status: Fix Under Review
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Ref: https://bugzilla.redhat.com/show_bug.cgi?id=2238396

Description of problem:
The label reported by the "ceph health detail" command when a POOL_NEAR_FULL warning is raised is inconsistent with what we have documented, with the label used in the Prometheus alert, and with the syntax rules used for other alerts in the "ceph health detail" output.

The definition of the alert is in:
https://github.com/ceph/ceph/blob/a45ade8c157bae3b18d6cd0162c8073a3716a653/src/osd/OSDMap.cc#L7152

- It does not follow the syntax rules that the other alerts in the same file follow.
- It does not match what the documentation says:
https://docs.ceph.com/en/quincy/rados/operations/health-checks/#pool-near-full
- It does not match the label used in the Prometheus alerts:
https://github.com/ceph/ceph/blob/55e13ffde7e8bf1622cad8b16695acf0116fafc1/monitoring/ceph-mixin/prometheus_alerts.libsonnet#L618

How reproducible:
1. Fill up a pool.
2. Get the result of the "ceph health detail" command.

Example:
Alert triggered in the Sysdig UI:
name='ceph_health_detail', _sysdig_custom_metric='true', _sysdig_datasource='agent', agent_id='457203', agent_tag_cluster='mzoned18', agent_tag_cluster_type='acadia', host_hostname='dal1-qz2-sr2-rk036-s47', host_mac='3c:ec:ef:fc:68:d2', instance='localhost:9283', job='acadia-node-exporter', kube_cluster_name='mzoned18', name='POOL_NEARFULL', severity='HEALTH_WARN'
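
To check the reported label programmatically, a label line like the one above can be parsed and its name inspected. This is only a minimal sketch: the helper parse_labels is hypothetical, and the sample string is a shortened copy of the example above.

```python
import re

def parse_labels(line: str) -> dict:
    """Parse a "key='value', key='value'" label string into a dict.

    Hypothetical helper for illustration. If a key appears more than once
    (as 'name' does in the example), the last occurrence wins.
    """
    return dict(re.findall(r"(\w+)='([^']*)'", line))

# Shortened copy of the label line from the example above.
sample = (
    "name='ceph_health_detail', instance='localhost:9283', "
    "name='POOL_NEARFULL', severity='HEALTH_WARN'"
)

labels = parse_labels(sample)
print(labels["name"])      # health check code carried by the metric
print(labels["severity"])
```

Because the metric name and the health check code both use the key name, the sketch relies on the health check code appearing last in the line.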

Actual results:
The alert is returned with name='POOL_NEARFULL', whereas the documentation and the Prometheus alert use name='POOL_NEAR_FULL'.

Expected results:
name='POOL_NEAR_FULL' should be returned.

Additional info:
This is a cosmetic change, but the mismatch confuses users about which results to expect.

Actions #1

Updated by Laura Flores 8 months ago

  • Tags set to low-hanging-fruit
Actions #2

Updated by Prashant D 8 months ago

The prometheus alert should be changed to POOL_NEARFULL instead of changing ceph warning to POOL_NEAR_FULL

        {
          alert: 'CephPoolNearFull',
          'for': '5m',
          expr: 'ceph_health_detail{name="POOL_NEAR_FULL"} > 0',  // <-- should be POOL_NEARFULL
          labels: { severity: 'warning', type: 'ceph_default' },
          annotations: {
            summary: 'One or more Ceph pools are nearly full%(cluster)s' % $.MultiClusterSummary(),
            description: "A pool has exceeded the warning (percent full) threshold, or OSDs supporting the pool have reached the NEARFULL threshold. Writes may continue, but you are at risk of the pool going read-only if more capacity isn't made available. Determine the affected pool with 'ceph df detail', looking at QUOTA BYTES and STORED. Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>). Also ensure that the balancer is active.",
          },
        },
      ],
    },
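
The mismatch matters because a Prometheus `=` label matcher is an exact string comparison, so an expression matching on POOL_NEAR_FULL cannot select a series labelled POOL_NEARFULL. The Python sketch below illustrates this. The names EMITTED_CODE, matcher_name and would_match are hypothetical, and the regex only handles the simple name="..." form used in the rule above, not PromQL in general.

```python
import re

# Health check code reported by the cluster in this issue.
EMITTED_CODE = "POOL_NEARFULL"

def matcher_name(expr: str):
    """Return the value of the name="..." equality matcher, or None if absent."""
    m = re.search(r'name="([^"]*)"', expr)
    return m.group(1) if m else None

def would_match(expr: str, code: str = EMITTED_CODE) -> bool:
    """True if the expression's name matcher equals the emitted code exactly."""
    return matcher_name(expr) == code

current = 'ceph_health_detail{name="POOL_NEAR_FULL"} > 0'
proposed = 'ceph_health_detail{name="POOL_NEARFULL"} > 0'

print(would_match(current))   # expression as currently written in the rule
print(would_match(proposed))  # expression with the suggested label
```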
Actions #3

Updated by Laura Flores 8 months ago

Claiming this issue for the time being for Grace Hopper Open Source Day!

Actions #4

Updated by Laura Flores 8 months ago

Prashant D wrote:

The prometheus alert should be changed to POOL_NEARFULL instead of changing ceph warning to POOL_NEAR_FULL

[...]

Agreed

Actions #5

Updated by Laura Flores 7 months ago

  • Status changed from New to Fix Under Review
Actions #6

Updated by Radoslaw Zarzynski 5 months ago

  • Pull request ID set to 53609