Cleanup #62911: Label used for Ceph health alert POOL_NEARFULL is in discordance with documentation and Ceph alerts - RADOS - Ceph

open

Added by Laura Flores 8 months ago. Updated 5 months ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID: 53609

Description

Ref: https://bugzilla.redhat.com/show_bug.cgi?id=2238396

Description of problem:
The label reported by the "ceph health detail" command when a POOL_NEAR_FULL warning is raised does not match what we have documented, the label used in the Prometheus alert, or the naming rules used for other alerts in the "ceph health detail" output.

The definition of the alert is in:
https://github.com/ceph/ceph/blob/a45ade8c157bae3b18d6cd0162c8073a3716a653/src/osd/OSDMap.cc#L7152

- It does not follow the naming rules that other alerts in the same file follow.
- It does not match what the documentation says:
https://docs.ceph.com/en/quincy/rados/operations/health-checks/#pool-near-full
- It does not match the label used in the Prometheus alerts:
https://github.com/ceph/ceph/blob/55e13ffde7e8bf1622cad8b16695acf0116fafc1/monitoring/ceph-mixin/prometheus_alerts.libsonnet#L618
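
The mismatch matters in practice because a Prometheus equality matcher compares label values as exact strings, so an alert expression filtering on one spelling does not select a series emitted with the other. A minimal sketch in Python, assuming a hypothetical helper `alert_matches` and a simplified regex parse of the `name="..."` matcher (this is an illustration, not Ceph or Prometheus code):

```python
import re

# Hypothetical helper for illustration; not part of Ceph or Prometheus.
NAME_MATCHER = re.compile(r'name="([^"]+)"')


def alert_matches(expr: str, emitted_name: str) -> bool:
    """Return True if the name="..." equality matcher in expr equals emitted_name."""
    m = NAME_MATCHER.search(expr)
    if m is None:
        return False
    return m.group(1) == emitted_name


# Expression as it appears in prometheus_alerts.libsonnet.
alert_expr = 'ceph_health_detail{name="POOL_NEAR_FULL"} > 0'
# Label value seen in the ceph_health_detail metric.
emitted = "POOL_NEARFULL"

print(alert_matches(alert_expr, emitted))  # False: the two spellings differ
```

Either side can be changed to make the strings agree; the sketch only shows that, as written, they do not.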

How reproducible:
1. Fill up a pool.
2. Get the output of the "ceph health detail" command.

Example (alert triggered in the Sysdig UI):
name='ceph_health_detail', _sysdig_custom_metric='true', _sysdig_datasource='agent', agent_id='457203', agent_tag_cluster='mzoned18', agent_tag_cluster_type='acadia', host_hostname='dal1-qz2-sr2-rk036-s47', host_mac='3c:ec:ef:fc:68:d2', instance='localhost:9283', job='acadia-node-exporter', kube_cluster_name='mzoned18', name='POOL_NEARFULL', severity='HEALTH_WARN'
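
To pull the health check label out of an alert line like the one above, the key='value' pairs can be parsed directly. A minimal sketch in Python, using a shortened copy of the line and assuming the pairs always follow the `key='value'` shape shown here; note that the key `name` appears twice (metric name and health check name), so the pairs are kept as a list rather than a dict:

```python
import re

# Shortened copy of the alert labels above (only a few of the key='value' pairs).
labels_text = (
    "name='ceph_health_detail', instance='localhost:9283', "
    "name='POOL_NEARFULL', severity='HEALTH_WARN'"
)

# The key "name" appears twice, so keep a list of pairs instead of a dict,
# which would silently drop the first occurrence.
pairs = re.findall(r"(\w+)='([^']*)'", labels_text)
names = [value for key, value in pairs if key == "name"]

print(names)  # ['ceph_health_detail', 'POOL_NEARFULL']
```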

Actual results:
The label returned is name='POOL_NEARFULL'; it should be name='POOL_NEAR_FULL'.

Expected results:
The label name='POOL_NEAR_FULL' is returned.

Additional info:
This is a cosmetic change, but the mismatch confuses users about which results to expect.

Updated by Laura Flores 8 months ago

Tags set to low-hanging-fruit

Updated by Prashant D 8 months ago

The Prometheus alert should be changed to POOL_NEARFULL, instead of changing the Ceph warning to POOL_NEAR_FULL:

        {
          alert: 'CephPoolNearFull',
          'for': '5m',
          expr: 'ceph_health_detail{name="POOL_NEAR_FULL"} > 0',  // <-- should be POOL_NEARFULL
          labels: { severity: 'warning', type: 'ceph_default' },
          annotations: {
            summary: 'One or more Ceph pools are nearly full%(cluster)s' % $.MultiClusterSummary(),
            description: "A pool has exceeded the warning (percent full) threshold, or OSDs supporting the pool have reached the NEARFULL threshold. Writes may continue, but you are at risk of the pool going read-only if more capacity isn't made available. Determine the affected pool with 'ceph df detail', looking at QUOTA BYTES and STORED. Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>). Also ensure that the balancer is active.",
          },
        },
      ],
    },
