Feature #63550

open

NUMA enhancements

Added by Kyle Bader 6 months ago. Updated 6 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Right now we support pinning an OSD process to the CPUs that correspond to a particular NUMA node, based either on the NUMA node responsible for the network adapter the OSD is listening on (osd_numa_prefer_iface=true), or on the NUMA node responsible for both the storage device and the network adapter the OSD is listening on (osd_numa_auto_affinity=true).
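
For illustration, either of the existing modes is enabled with a single option, e.g. in ceph.conf:

[osd]
osd_numa_auto_affinity = true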

It is not unusual for systems to have asymmetries. CPUs themselves employ chiplet designs that, depending on BIOS configuration, can result in 4 NUMA nodes per socket. Even in conventional dual-socket systems that present 2 NUMA nodes, there might be an uneven distribution of NVMe devices across NUMA nodes, or all of the network adapters might sit on a single NUMA node.

In these situations, we might want to employ a third NUMA strategy (osd_numa_distribute=true): divide the OSDs evenly across the NUMA nodes. This would mean that an OSD's tasks can be scheduled to any CPU on its NUMA node, and that memory allocation would prefer affinity with that node. This should improve CPU cache-line locality and reduce memory access latency.
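
As a rough sketch of the proposed assignment (hypothetical helper code, not existing Ceph functionality), OSDs could be spread round-robin across the NUMA nodes reported by sysfs:

# Hypothetical sketch: spread the local OSDs evenly across NUMA nodes.
import glob
import os

def numa_nodes():
    """Return {node_id: cpulist} parsed from /sys/devices/system/node."""
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        node_id = int(os.path.basename(path)[len("node"):])
        with open(os.path.join(path, "cpulist")) as f:
            nodes[node_id] = f.read().strip()   # e.g. "0-31,64-95"
    return nodes

def assign_osds(osd_ids, nodes):
    """Round-robin OSD ids over NUMA node ids."""
    node_ids = sorted(nodes)
    return {osd: node_ids[i % len(node_ids)] for i, osd in enumerate(sorted(osd_ids))}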

A cgroup would be created for each OSD, and we would set:

cpuset.cpus=[list of cpus in numa] # only schedule OSD tasks to CPUs on a single NUMA
cpuset.mems=1                      # prefer to allocate memory with NUMA affinity (not hard)
cpuset.memory_migrate=1            # move memory pages to NUMA
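
A minimal sketch of applying those settings via the cgroup v1 cpuset controller (paths and the osd.<id> group name are illustrative; assumes the controller is mounted at /sys/fs/cgroup/cpuset):

# Hypothetical sketch, assuming the cgroup v1 cpuset hierarchy at /sys/fs/cgroup/cpuset.
import os

def pin_osd(osd_id, node_id, cpulist, osd_pid):
    cg = "/sys/fs/cgroup/cpuset/osd.%d" % osd_id
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, "cpuset.cpus"), "w") as f:
        f.write(cpulist)        # only schedule OSD tasks on this node's CPUs
    with open(os.path.join(cg, "cpuset.mems"), "w") as f:
        f.write(str(node_id))   # allocate memory on this NUMA node
    with open(os.path.join(cg, "cpuset.memory_migrate"), "w") as f:
        f.write("1")            # migrate already-allocated pages to the node
    with open(os.path.join(cg, "tasks"), "w") as f:
        f.write(str(osd_pid))   # move the OSD process into the cgroup
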
#1

Updated by Kyle Bader 6 months ago

  • Tracker changed from Bug to Feature