Feature #63550
openNUMA enhancements
Description
Right now we support pinning an OSD process to the CPUs of a particular NUMA node, chosen either as the NUMA node responsible for the network adapter the OSD is listening on (osd_numa_prefer_iface=true), or as the NUMA node responsible for both the storage device and the network adapter the OSD is listening on (osd_numa_auto_affinity=true).
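For reference, a minimal sketch of how these existing options are toggled, assuming they are applied cluster-wide through the ceph CLI:

# either: follow the NUMA node of the network interface
ceph config set osd osd_numa_prefer_iface true
# or: derive affinity from both the storage device and the network interface
ceph config set osd osd_numa_auto_affinity true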
It is not unusual for systems to have asymmetries. CPUs themselves employ chiplet designs that, depending on BIOS configuration, can result in 4 NUMA nodes per socket. Even in conventional dual-socket systems that present 2 NUMA nodes, there might be an uneven distribution of NVMe devices across NUMA nodes, or all of the network adapters might sit on a single NUMA node.
In these situations, we might want to employ a third NUMA strategy (osd_numa_distribute=true): divide the OSDs evenly across the NUMA nodes. This would mean that an OSD's tasks can be scheduled to any CPU on its assigned NUMA node, and that its memory allocations would prefer affinity with that node. This should improve CPU cache locality and reduce memory access latency.
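A rough illustration of the intended assignment (a sketch only; detecting the node count via numactl, the round-robin mapping, and the placeholder OSD ids 0-5 are assumptions, not an implemented mechanism):

# spread OSDs round-robin across the available NUMA nodes
NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
for OSD in 0 1 2 3 4 5; do
    echo "osd.$OSD -> NUMA node $(( OSD % NODES ))"
done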
A cgroup would be created for each OSD, and we would set:

cpuset.cpus=[list of cpus in numa]   # only schedule OSD tasks to CPUs on a single NUMA node
cpuset.mems=1                        # prefer to allocate memory with NUMA affinity (not hard)
cpuset.memory_migrate=1              # move memory pages to the NUMA node
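A minimal sketch of applying those settings by hand with the cgroup v1 cpuset controller (the cgroup name, CPU range, node number, and $OSD_PID are illustrative assumptions):

# assumes the cpuset controller is mounted at /sys/fs/cgroup/cpuset (cgroup v1)
mkdir -p /sys/fs/cgroup/cpuset/ceph-osd.3                             # one cgroup per OSD
echo 16-31 > /sys/fs/cgroup/cpuset/ceph-osd.3/cpuset.cpus             # CPUs of NUMA node 1
echo 1     > /sys/fs/cgroup/cpuset/ceph-osd.3/cpuset.mems             # allocate from node 1
echo 1     > /sys/fs/cgroup/cpuset/ceph-osd.3/cpuset.memory_migrate   # migrate existing pages
echo "$OSD_PID" > /sys/fs/cgroup/cpuset/ceph-osd.3/tasks              # move the OSD process in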