Feature #48430
Add memory consumption of nodes to health checks (Status: open)
Description
During some tests using a (very small) virtual cluster I noticed that Ceph doesn't seem to 'notice' when a node runs out of available memory (including swap). The virtual node where this happened hosted an OSD, so the result was a large number of slow ops and stalled operations.
At least on Ubuntu it's possible to get the current memory consumption of a system, including swap, with "free -m", which seems to report a fairly accurate reading. The command reports the same values when used inside a container. My idea is that Ceph monitors the current system memory load of all nodes at regular intervals. If the amount of free memory on a node falls below a (user-)defined threshold, or the swap usage gets too large (for whatever reason, which could also be caused by a different process), the cluster health changes to a warning state. If a HEALTH_WARN is too much, the cluster could alternatively just log the problem.
In normal clusters with a very large amount of RAM available per node this check might seem a little unnecessary, but another data point might be helpful.
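The proposed check could be sketched roughly as below. This is only an illustration of the idea, not existing Ceph code; the threshold names and values are hypothetical placeholders for what would be user-configurable options. It reads /proc/meminfo (the same source "free" uses), so it reports the same values inside a container as on the host.

```python
#!/usr/bin/env python3
"""Sketch of the proposed health check: read available memory and swap
usage and compare them against user-defined thresholds. All names and
threshold values here are hypothetical, not part of any Ceph module."""

# Hypothetical thresholds; in Ceph these would be user-configurable.
MIN_FREE_MEM_MB = 512      # warn if available memory drops below this
MAX_SWAP_USED_MB = 1024    # warn if swap usage grows beyond this


def read_meminfo():
    """Parse /proc/meminfo into a dict of MiB values."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            # /proc/meminfo reports kB; convert to MiB
            info[key] = int(value.split()[0]) // 1024
    return info


def memory_health(info):
    """Return a list of warning strings; empty if the node looks healthy."""
    warnings = []
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    if available < MIN_FREE_MEM_MB:
        warnings.append(f"low available memory: {available} MiB")
    if swap_used > MAX_SWAP_USED_MB:
        warnings.append(f"high swap usage: {swap_used} MiB")
    return warnings
```

A monitor module would call something like memory_health(read_meminfo()) on each node at a fixed interval and raise a health warning (or just log) when the list is non-empty.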
Updated by Laura Flores almost 2 years ago
- Tags set to low-hanging-fruit
- Tags deleted (low-hanging-fruit)