Feature #48430
Add memory consumption of nodes to health checks (Status: open)
Description
During some tests using a (very small) virtual cluster I noticed that Ceph doesn't seem to 'notice' when a node runs out of available memory (including swap). The virtual node where this happened hosted an OSD, so the result was a large number of slow ops and stalled operations.
At least on Ubuntu it's possible to get the current memory consumption of a system, including swap, with "free -m", which seems to report a fairly accurate reading. The command reports the same values when used inside a container. My idea is that Ceph monitors the current system memory load of all nodes at regular intervals. If the amount of free memory on a node falls below a (user-)defined threshold, or the swap usage gets too large (for whatever reason, which could also be caused by a different process), the cluster health changes to a warning state. If a HEALTH_WARN is too much, the cluster could alternatively just log the problem.
In normal clusters with a very large amount of RAM available per node this check might seem a little unnecessary, but another data point might be helpful.
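The proposed check could be sketched roughly as below. This is only an illustration of the idea, not existing Ceph code; the threshold names and values are hypothetical placeholders for what would be user-configurable options. It reads /proc/meminfo (the same source "free" uses), so it reports the same values inside a container as on the host.

```python
#!/usr/bin/env python3
"""Sketch of the proposed health check: read available memory and swap
usage and compare them against user-defined thresholds. All names and
threshold values here are hypothetical, not part of any Ceph module."""

# Hypothetical thresholds; in Ceph these would be user-configurable.
MIN_FREE_MEM_MB = 512      # warn if available memory drops below this
MAX_SWAP_USED_MB = 1024    # warn if swap usage grows beyond this


def read_meminfo():
    """Parse /proc/meminfo into a dict of MiB values."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            # /proc/meminfo reports kB; convert to MiB
            info[key] = int(value.split()[0]) // 1024
    return info


def memory_health(info):
    """Return a list of warning strings; empty if the node looks healthy."""
    warnings = []
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    if available < MIN_FREE_MEM_MB:
        warnings.append(f"low available memory: {available} MiB")
    if swap_used > MAX_SWAP_USED_MB:
        warnings.append(f"high swap usage: {swap_used} MiB")
    return warnings
```

A monitor module would call something like memory_health(read_meminfo()) on each node at a fixed interval and raise a health warning (or just log) when the list is non-empty.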
Updated by Laura Flores almost 2 years ago
- Tags set to low-hanging-fruit
- Tags deleted (low-hanging-fruit)