Cleanup #62911
Label used for Ceph health alert POOL_NEARFULL is inconsistent with documentation and Ceph alerts
Description
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=2238396
Description of problem:
The label reported by the "ceph health detail" command when a POOL_NEAR_FULL warning is raised is inconsistent with what we have documented, with the label used in the Prometheus alert, and with the syntax rules used for other alerts in the "ceph health detail" command.
The definition of the alert is in:
https://github.com/ceph/ceph/blob/a45ade8c157bae3b18d6cd0162c8073a3716a653/src/osd/OSDMap.cc#L7152
- It does not follow the syntax rules that other alerts in the same file follow.
- It does not match what the documentation says:
https://docs.ceph.com/en/quincy/rados/operations/health-checks/#pool-near-full
- It does not match the label used in the Prometheus alerts:
https://github.com/ceph/ceph/blob/55e13ffde7e8bf1622cad8b16695acf0116fafc1/monitoring/ceph-mixin/prometheus_alerts.libsonnet#L618
How reproducible:
1. Fill up a pool.
2. Get the output of the "ceph health detail" command.
Example:
Alert triggered in the Sysdig UI:
name='ceph_health_detail', _sysdig_custom_metric='true', _sysdig_datasource='agent', agent_id='457203', agent_tag_cluster='mzoned18', agent_tag_cluster_type='acadia', host_hostname='dal1-qz2-sr2-rk036-s47', host_mac='3c:ec:ef:fc:68:d2', instance='localhost:9283', job='acadia-node-exporter', kube_cluster_name='mzoned18', name='POOL_NEARFULL', severity='HEALTH_WARN'
Actual results:
The alert carries "name='POOL_NEARFULL'"; it should be "name='POOL_NEAR_FULL'".
Expected results:
"name='POOL_NEAR_FULL'" should be returned.
Additional info:
This is a cosmetic change, but the mismatch confuses users about the results they should expect.
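To illustrate why the mismatch matters beyond naming: a PromQL `=` label matcher is an exact string comparison, so a selector written with one spelling never selects a series that carries the other. The following is a minimal Python sketch of that comparison, not Prometheus code. The `selector_matches` helper is hypothetical, and the label set is a subset of the Sysdig example above, with the metric name stored under `__name__`.

```python
# Subset of the labels from the Sysdig example above.
series_labels = {
    "__name__": "ceph_health_detail",
    "name": "POOL_NEARFULL",
    "severity": "HEALTH_WARN",
}


def selector_matches(labels, metric, matchers):
    """Mimic PromQL '=' matchers: each label must equal the given string exactly."""
    if labels.get("__name__") != metric:
        return False
    return all(labels.get(key, "") == value for key, value in matchers.items())


# Selector spelled the way the documentation spells the health check.
documented_spelling = selector_matches(
    series_labels, "ceph_health_detail", {"name": "POOL_NEAR_FULL"}
)
# Selector spelled the way "ceph health detail" actually reports it.
emitted_spelling = selector_matches(
    series_labels, "ceph_health_detail", {"name": "POOL_NEARFULL"}
)

print(documented_spelling, emitted_spelling)  # False True
```

Under these assumptions, only the selector that uses the emitted spelling selects the series.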
Updated by Laura Flores 8 months ago
- Tags set to low-hanging-fruit
Updated by Prashant D 8 months ago
The prometheus alert should be changed to POOL_NEARFULL instead of changing ceph warning to POOL_NEAR_FULL
{
  alert: 'CephPoolNearFull',
  'for': '5m',
  expr: 'ceph_health_detail{name="POOL_NEAR_FULL"} > 0',  <-------------- should be POOL_NEARFULL
  labels: { severity: 'warning', type: 'ceph_default' },
  annotations: {
    summary: 'One or more Ceph pools are nearly full%(cluster)s' % $.MultiClusterSummary(),
    description: "A pool has exceeded the warning (percent full) threshold, or OSDs supporting the pool have reached the NEARFULL threshold. Writes may continue, but you are at risk of the pool going read-only if more capacity isn't made available. Determine the affected pool with 'ceph df detail', looking at QUOTA BYTES and STORED. Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>). Also ensure that the balancer is active.",
  },
},
],
},
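A check along the following lines might catch this kind of drift between alert expressions and the names Ceph emits. It is only a sketch under stated assumptions: `EMITTED_CODES` lists just the one name confirmed in this issue (a real check would need the full set of codes reported by "ceph health detail"), and `unknown_codes` is a hypothetical helper, not part of Ceph or ceph-mixin.

```python
import re

# Assumption: only the name confirmed in this issue is listed here.
EMITTED_CODES = {"POOL_NEARFULL"}

# Capture the value of the name="..." matcher inside a ceph_health_detail selector.
NAME_MATCHER = re.compile(r'ceph_health_detail\{[^}]*\bname="([^"]+)"')


def unknown_codes(rules_text, emitted=EMITTED_CODES):
    """Return health-check names used in alert expressions but not in `emitted`."""
    referenced = set(NAME_MATCHER.findall(rules_text))
    return sorted(referenced - set(emitted))


# The expr line as quoted above, and the same line with the proposed spelling.
rule_before = """expr: 'ceph_health_detail{name="POOL_NEAR_FULL"} > 0',"""
rule_after = """expr: 'ceph_health_detail{name="POOL_NEARFULL"} > 0',"""

print(unknown_codes(rule_before))  # ['POOL_NEAR_FULL']
print(unknown_codes(rule_after))   # []
```

With the expression as quoted, the sketch flags POOL_NEAR_FULL; with the proposed spelling, it reports nothing.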
Updated by Laura Flores 8 months ago
Claiming this issue for the time being for Grace Hopper Open Source Day!
Updated by Laura Flores 8 months ago
Prashant D wrote:
The prometheus alert should be changed to POOL_NEARFULL instead of changing ceph warning to POOL_NEAR_FULL
[...]
Agreed
Updated by Laura Flores 7 months ago
- Status changed from New to Fix Under Review