# Enhanced Platform Awareness

Enhanced Platform Awareness (EPA) is a methodology and a set of enhancements across various layers of the orchestration stack. EPA focuses on discovering, scheduling and isolating server hardware capabilities. This document provides a detailed guide of how EPA applies to {{product}}, centring on the following technologies:

- **HugePage support**: In GA from Kubernetes v1.14, this feature enables the discovery, scheduling and allocation of HugePages as a first-class resource.
- **Real-time kernel**: Ensures that high-priority tasks are run within a predictable time frame, providing the low latency and high determinism essential for time-sensitive applications.
- **CPU pinning** (CPU Manager for Kubernetes (CMK)): In GA from Kubernetes v1.26, provides mechanisms for CPU pinning and isolation of containerised workloads.
- **NUMA topology awareness**: Ensures that CPU and memory allocation are aligned according to the NUMA architecture, reducing memory latency and increasing performance for memory-intensive applications.
- **Single Root I/O Virtualization (SR-IOV)**: Enhances networking by enabling virtualisation of a single physical network device into multiple virtual devices.
- **DPDK (Data Plane Development Kit)**: A set of libraries and drivers for fast packet processing, designed to run in user space, optimising network performance and reducing latency.

This document provides relevant links to detailed instructions for setting up and installing these technologies. It is designed for developers and architects who wish to integrate these technologies into their {{product}}-based networking solutions. The separate [how-to guide][howto-epa] for EPA includes the necessary steps to implement these features on {{product}}.

## HugePages

HugePages are a Linux kernel feature that enables the allocation of larger memory pages. This reduces the overhead of managing large amounts of memory and can improve performance for applications that require significant memory access.

### Key features

- **Larger memory pages**: HugePages provide larger memory pages (e.g., 2MB or 1GB) compared to the standard 4KB pages, reducing the number of pages the system must manage.
- **Reduced overhead**: By using fewer, larger pages, the system reduces the overhead associated with page table entries, leading to improved memory management efficiency.
- **Improved TLB performance**: The Translation Lookaside Buffer (TLB) stores recent translations of virtual memory to physical memory addresses. Using HugePages increases TLB hit rates, reducing the frequency of memory translation lookups.
- **Enhanced application performance**: Applications that access large amounts of memory can benefit from HugePages by experiencing lower latency and higher throughput due to reduced page faults and better memory access patterns.
- **Support for high-performance workloads**: Ideal for high-performance computing (HPC) applications, databases and other memory-intensive workloads that demand efficient and fast memory access.
- **Native Kubernetes integration**: Starting from Kubernetes v1.14, HugePages are supported as a native, first-class resource, enabling their discovery, scheduling and allocation within Kubernetes environments.

### Application to Kubernetes

The architecture for HugePages on Kubernetes integrates the management and allocation of large memory pages into the Kubernetes orchestration system.
Here are the key architectural components and their roles:

- **Node configuration**: Each Kubernetes node must be configured to reserve HugePages. This involves setting the number of HugePages in the node's kernel boot parameters.
- **Kubelet configuration**: The `kubelet` on each node must be configured to recognise and manage HugePages. This is typically done through the `kubelet` configuration file, specifying the size and number of HugePages.
- **Pod specification**: HugePages are requested and allocated at the pod level through resource requests and limits in the pod specification, as shown in the sketch after this list. Pods can request specific sizes of HugePages (e.g., 2MB or 1GB).
- **Scheduler awareness**: The Kubernetes scheduler is aware of HugePages as a resource and schedules pods onto nodes that have sufficient HugePages available. This ensures that pods with HugePages requirements are placed appropriately. Scheduler configurations and policies can be adjusted to optimise HugePages allocation and utilisation.
- **Node Feature Discovery (NFD)**: Node Feature Discovery can be used to label nodes with their HugePages capabilities. This enables scheduling decisions to be based on the available HugePages resources.
- **Resource quotas and limits**: Kubernetes enables the definition of resource quotas and limits to control the allocation of HugePages across namespaces. This helps in managing and isolating resource usage effectively.
- **Monitoring and metrics**: Kubernetes provides tools and integrations (e.g., Prometheus, Grafana) to monitor and visualise HugePages usage across the cluster. This helps in tracking resource utilisation and performance. Metrics can include HugePages allocation, usage and availability on each node, aiding in capacity planning and optimisation.
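For illustration, a minimal pod specification requesting 2MB HugePages might look like the following sketch (the names and sizes are arbitrary, and HugePages requests must equal their limits):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo          # illustrative name
spec:
  containers:
    - name: app
      image: ubuntu:24.04       # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: hugepage
          mountPath: /hugepages
      resources:
        requests:
          hugepages-2Mi: 128Mi
          memory: 256Mi
        limits:
          hugepages-2Mi: 128Mi  # HugePages requests and limits must match
          memory: 256Mi
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages       # backs the volume with the node's reserved HugePages
```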
## Real-time kernel

A real-time kernel ensures that high-priority tasks are run within a predictable timeframe, crucial for applications requiring low latency and high determinism. Note that a real-time kernel can also impede applications which were not designed with these considerations in mind.

### Key features

- **Predictable task execution**: A real-time kernel ensures that high-priority tasks are run within a predictable and bounded timeframe, reducing the variability in task execution time.
- **Low latency**: The kernel is optimised to minimise the time it takes to respond to high-priority tasks, which is crucial for applications that require immediate processing.
- **Priority-based scheduling**: Tasks are scheduled based on their priority levels, with real-time tasks being given precedence over other types of tasks to ensure they are processed promptly.
- **Deterministic behaviour**: The kernel guarantees deterministic behaviour, meaning the same task will have the same response time every time it is run, essential for time-sensitive applications.
- **Pre-emption**: The real-time kernel supports pre-emptive multitasking, allowing high-priority tasks to interrupt lower-priority tasks so that critical tasks are run without delay.
- **Resource reservation**: System resources (such as CPU and memory) can be reserved by the kernel for real-time tasks, ensuring that these resources are available when needed.
- **Enhanced interrupt handling**: Interrupt handling is optimised to ensure minimal latency and jitter, which is critical for maintaining the performance of real-time applications.
- **Real-time scheduling policies**: The kernel includes specific scheduling policies (e.g., `SCHED_FIFO`, `SCHED_RR`) designed to manage real-time tasks effectively and ensure they meet their deadlines.

These features make a real-time kernel ideal for applications requiring precise timing and high reliability.

### Application to Kubernetes

The architecture for integrating a real-time kernel into Kubernetes involves several components and configurations to ensure that high-priority, low-latency tasks can be managed effectively within a Kubernetes environment. Here are the key architectural components and their roles:

- **Real-time kernel installation**: Each Kubernetes node must run a real-time kernel. This involves installing a real-time kernel package and configuring the system to use it.
- **Kernel boot parameters**: The kernel boot parameters must be configured to optimise for real-time performance. This includes isolating CPU cores and configuring other kernel parameters for real-time behaviour.
- **Kubelet configuration**: The `kubelet` on each node must be configured to recognise and manage real-time workloads. This can involve setting specific `kubelet` flags and configurations.
- **Pod specification**: Real-time workloads are specified at the pod level through resource requests and limits. Pods can request dedicated CPU cores and other resources to ensure they meet real-time requirements.
- **CPU Manager**: Kubernetes’ CPU Manager is a critical component for real-time workloads. It enables the static allocation of CPUs to containers, ensuring that specific CPU cores are dedicated to particular workloads.
- **Scheduler awareness**: The Kubernetes scheduler must be aware of real-time requirements and prioritise scheduling pods onto nodes with available real-time resources.
- **Priority and preemption**: Kubernetes supports priority and preemption to ensure that critical real-time pods are scheduled and run as needed. This involves defining pod priorities and enabling preemption so that high-priority pods can displace lower-priority ones if necessary (see the sketch after this list).
- **Resource quotas and limits**: Kubernetes can define resource quotas and limits to control the allocation of resources for real-time workloads across namespaces. This helps manage and isolate resource usage effectively.
- **Monitoring and metrics**: Monitoring tools such as Prometheus and Grafana can be used to track the performance and resource utilisation of real-time workloads. Metrics include CPU usage, latency and task scheduling times, which help in optimising and troubleshooting real-time applications.
- **Security and isolation**: Security contexts and isolation mechanisms ensure that real-time workloads are protected and run in a controlled environment. This includes setting privileged containers and configuring namespaces.
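As a hedged illustration of pod priority and preemption (the class name and value are arbitrary), a `PriorityClass` and a pod that uses it might look like this:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-high         # illustrative name
value: 1000000                # higher values are scheduled first and may preempt lower ones
globalDefault: false
description: "Priority class for latency-sensitive, real-time workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: rt-demo               # illustrative name
spec:
  priorityClassName: realtime-high
  containers:
    - name: app
      image: ubuntu:24.04     # placeholder image
      command: ["sleep", "infinity"]
```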
## CPU pinning

CPU pinning enables specific CPU cores to be dedicated to a particular process or container, ensuring that the process runs on the same CPU core(s) every time, which reduces context switching and cache invalidation.

### Key features

- **Dedicated CPU cores**: CPU pinning allocates specific CPU cores to a process or container, ensuring consistent and predictable CPU usage.
- **Reduced context switching**: By running a process or container on the same CPU core(s), CPU pinning minimises the overhead associated with context switching, leading to better performance.
- **Improved cache utilisation**: When a process runs on a dedicated CPU core, it can take full advantage of the CPU cache, reducing the need to fetch data from main memory and improving overall performance.
- **Enhanced application performance**: Applications that require low latency and high performance benefit from CPU pinning as it ensures they have dedicated processing power without interference from other processes.
- **Consistent performance**: CPU pinning ensures that a process or container receives consistent CPU performance, which is crucial for real-time and performance-sensitive applications.
- **Isolation of workloads**: CPU pinning isolates workloads on specific CPU cores, preventing them from being affected by other workloads running on different cores. This is especially useful in multi-tenant environments.
- **Improved predictability**: By eliminating the variability introduced by sharing CPU cores, CPU pinning provides more predictable performance characteristics for critical applications.
- **Integration with Kubernetes**: Kubernetes supports CPU pinning through the CPU Manager (in GA since v1.26), which allows for the static allocation of CPUs to containers. This ensures that containers with high CPU demands have the necessary resources; a configuration sketch follows this list.
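For illustration only, enabling the static CPU Manager policy through the `kubelet` configuration file might look like the following sketch (the reserved cores are an assumption and depend on the node):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # enables exclusive core allocation for Guaranteed QoS pods
reservedSystemCPUs: "0,1"     # cores kept back for system daemons (node-specific assumption)
```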
### Application to Kubernetes

The architecture for CPU pinning in Kubernetes involves several components and configurations to ensure that specific CPU cores can be dedicated to particular processes or containers, thereby enhancing performance and predictability. Here are the key architectural components and their roles:

- **Kubelet configuration**: The `kubelet` on each node must be configured to enable CPU pinning. This involves setting specific `kubelet` flags to activate the CPU Manager.
- **CPU Manager**: Kubernetes’ CPU Manager is a critical component for CPU pinning. It allows for the static allocation of CPUs to containers, ensuring that specific CPU cores are dedicated to particular workloads. The CPU Manager can be configured with one of two policies: `none` or `static`. The static policy enables exclusive CPU core allocation to Guaranteed QoS (Quality of Service) pods.
- **Pod specification**: Pods must be specified to request dedicated CPU resources. This is done through resource requests and limits in the pod specification (see the sketch after this list).
- **Scheduler awareness**: The Kubernetes scheduler must be aware of the CPU pinning requirements. It schedules pods onto nodes with available CPU resources as requested by the pod specification. The scheduler ensures that pods with specific CPU pinning requests are placed on nodes with sufficient free dedicated CPUs.
- **NUMA topology awareness**: For optimal performance, CPU pinning should be aligned with NUMA (Non-Uniform Memory Access) topology. This ensures that memory accesses are local to the CPU, reducing latency. Kubernetes can be configured to be NUMA-aware, using the Topology Manager to align CPU and memory allocation with NUMA nodes.
- **Node Feature Discovery (NFD)**: Node Feature Discovery can be used to label nodes with their CPU capabilities, including the availability of isolated and reserved CPU cores.
- **Resource quotas and limits**: Kubernetes can define resource quotas and limits to control the allocation of CPU resources across namespaces. This helps in managing and isolating resource usage effectively.
- **Monitoring and metrics**: Monitoring tools such as Prometheus and Grafana can be used to track the performance and resource utilisation of CPU-pinned workloads. Metrics include CPU usage, core allocation and task scheduling times, which help in optimising and troubleshooting performance-sensitive applications.
- **Isolation and security**: Security contexts and isolation mechanisms ensure that CPU-pinned workloads are protected and run in a controlled environment. This includes setting privileged containers and configuring namespaces to avoid resource contention.
- **Performance tuning**: Additional performance tuning can be achieved by isolating CPU cores at the OS level and configuring kernel parameters to minimise interference from other processes. This includes setting CPU isolation and the `nohz_full` parameter, which reduces the number of scheduling-clock interrupts, improving energy efficiency and [reducing OS jitter][no_hz].
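A minimal sketch of a pod that would receive exclusive cores under the static policy (the names are arbitrary; integer CPU counts with equal requests and limits place the pod in the Guaranteed QoS class):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-demo           # illustrative name
spec:
  containers:
    - name: app
      image: ubuntu:24.04     # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "2"            # integer CPU count, required for exclusive cores
          memory: 1Gi
        limits:
          cpu: "2"            # equal requests and limits => Guaranteed QoS
          memory: 1Gi
```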
## NUMA topology awareness

NUMA (Non-Uniform Memory Access) topology awareness ensures that CPU and memory allocation are aligned according to the NUMA architecture, which can reduce memory latency and increase performance for memory-intensive applications.

The Kubernetes Memory Manager enables guaranteed memory (and HugePages) allocation for pods in the Guaranteed QoS (Quality of Service) class. The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod, and feeds the central manager (the Topology Manager) with these affinity hints. Based on both the hints and the Topology Manager policy, the pod is rejected or admitted to the node. Moreover, the Memory Manager ensures that the memory which a pod requests is allocated from a minimum number of NUMA nodes.

### Key features

- **Aligned CPU and memory allocation**: NUMA topology awareness ensures that CPUs and memory are allocated in alignment with the NUMA architecture, minimising cross-node memory access latency.
- **Reduced memory latency**: By ensuring that memory is accessed from the same NUMA node as the CPU, NUMA topology awareness reduces memory latency, leading to improved performance for memory-intensive applications.
- **Increased performance**: Applications benefit from increased performance due to optimised memory access patterns, which is especially critical for high-performance computing and data-intensive tasks.
- **Kubernetes Memory Manager**: The Kubernetes Memory Manager supports guaranteed memory allocation for pods in the Guaranteed QoS (Quality of Service) class, ensuring predictable performance.
- **Hint generation protocol**: The Memory Manager uses a hint generation protocol to determine the most suitable NUMA affinity for a pod, helping to optimise resource allocation based on NUMA topology.
- **Integration with Topology Manager**: The Memory Manager provides NUMA affinity hints to the Topology Manager. The Topology Manager then decides whether to admit or reject the pod based on these hints and the configured policy.
- **Optimised resource allocation**: The Memory Manager ensures that the memory requested by a pod is allocated from the minimum number of NUMA nodes, thereby optimising resource usage and performance.
- **Enhanced scheduling decisions**: The Kubernetes scheduler, in conjunction with the Topology Manager, makes informed decisions about pod placement to ensure optimal NUMA alignment, improving overall cluster efficiency.
- **Support for HugePages**: The Memory Manager also supports the allocation of HugePages, ensuring that large memory pages are allocated in a NUMA-aware manner, further enhancing performance for applications that require large memory pages.
- **Improved application predictability**: By aligning CPU and memory allocation with NUMA topology, applications experience more predictable performance characteristics, crucial for real-time and latency-sensitive workloads.
- **Policy-based management**: NUMA topology awareness can be managed through policies so that administrators can configure how resources should be allocated based on the NUMA architecture, providing flexibility and control.

### Application to Kubernetes

The architecture for NUMA topology awareness in Kubernetes involves several components and configurations to ensure that CPU and memory allocations are optimised according to the NUMA architecture. This setup reduces memory latency and enhances performance for memory-intensive applications. Here are the key architectural components and their roles:

- **Node configuration**: Each Kubernetes node must have NUMA-aware hardware. The system's NUMA topology can be inspected using tools such as `lscpu` or `numactl`.
- **Kubelet configuration**: The `kubelet` on each node must be configured to enable NUMA topology awareness. This involves setting specific `kubelet` flags to activate the Topology Manager (see the sketch after this list).
- **Topology Manager**: The Topology Manager is a critical component that coordinates resource allocation based on NUMA topology. It receives NUMA affinity hints from other managers (e.g., CPU Manager, Device Manager) and makes informed scheduling decisions.
- **Memory Manager**: The Kubernetes Memory Manager is responsible for managing memory allocation, including HugePages, in a NUMA-aware manner. It ensures that memory is allocated from the minimum number of NUMA nodes required. The Memory Manager uses a hint generation protocol to provide NUMA affinity hints to the Topology Manager.
- **Pod specification**: Pods can be specified to request NUMA-aware resource allocation through resource requests and limits, ensuring that they get allocated in alignment with the NUMA topology.
- **Scheduler awareness**: The Kubernetes scheduler works in conjunction with the Topology Manager to place pods on nodes that meet their NUMA affinity requirements. The scheduler considers NUMA topology during the scheduling process to optimise performance.
- **Node Feature Discovery (NFD)**: Node Feature Discovery can be used to label nodes with their NUMA capabilities, providing the scheduler with information to make more informed placement decisions.
- **Resource quotas and limits**: Kubernetes allows defining resource quotas and limits to control the allocation of NUMA-aware resources across namespaces. This helps in managing and isolating resource usage effectively.
- **Monitoring and metrics**: Monitoring tools such as Prometheus and Grafana can be used to track the performance and resource utilisation of NUMA-aware workloads. Metrics include CPU and memory usage per NUMA node, helping in optimising and troubleshooting performance-sensitive applications.
- **Isolation and security**: Security contexts and isolation mechanisms ensure that NUMA-aware workloads are protected and run in a controlled environment. This includes setting privileged containers and configuring namespaces to avoid resource contention.
- **Performance tuning**: Additional performance tuning can be achieved by configuring kernel parameters and using tools like `numactl` to bind processes to specific NUMA nodes.
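Purely as a sketch, a `kubelet` configuration that aligns CPU, memory and device allocation to a single NUMA node might look like this (the reserved values are placeholders and must match the node's actual resource reservations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
memoryManagerPolicy: Static               # NUMA-aware guaranteed memory allocation
topologyManagerPolicy: single-numa-node   # reject pods whose resources cannot fit on one NUMA node
topologyManagerScope: pod
reservedSystemCPUs: "0,1"                 # placeholder reservation
reservedMemory:                           # must equal the node's total memory reservations
  - numaNode: 0
    limits:
      memory: 1Gi
```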
## SR-IOV (Single Root I/O Virtualization)

SR-IOV enables a single physical network device to appear as multiple separate virtual devices. This can be beneficial for network-intensive applications that require direct access to the network hardware.

### Key features

- **Multiple Virtual Functions (VFs)**: SR-IOV enables a single physical network device to be partitioned into multiple virtual functions (VFs), each of which can be assigned to a virtual machine or container as a separate network interface.
- **Direct hardware access**: By providing direct access to the physical network device, SR-IOV bypasses the software-based network stack, reducing overhead and improving network performance and latency.
- **Improved network throughput**: Applications can achieve higher network throughput as SR-IOV enables high-speed data transfer directly between the network device and the application.
- **Reduced CPU utilisation**: Offloading network processing to the hardware reduces the CPU load on the host system, freeing up CPU resources for other tasks and improving overall system performance.
- **Isolation and security**: Each virtual function (VF) is isolated from others, providing security and stability. This isolation ensures that issues in one VF do not affect other VFs or the physical function (PF).
- **Dynamic resource allocation**: SR-IOV supports dynamic allocation of virtual functions, enabling resources to be adjusted based on application demands without requiring changes to the physical hardware setup.
- **Enhanced virtualisation support**: SR-IOV is particularly beneficial in virtualised environments, enabling better network performance for virtual machines and containers by providing them with dedicated network interfaces.
- **Kubernetes integration**: Kubernetes supports SR-IOV through the use of network device plugins, enabling the automatic discovery, allocation and management of virtual functions.
- **Compatibility with Network Functions Virtualization (NFV)**: SR-IOV is widely used in NFV deployments to meet the high-performance networking requirements of virtual network functions (VNFs), such as firewalls, routers and load balancers.
- **Reduced network latency**: As network packets can bypass the hypervisor's virtual switch, SR-IOV significantly reduces network latency, making it ideal for latency-sensitive applications.

### Application to Kubernetes

The architecture for SR-IOV (Single Root I/O Virtualization) in Kubernetes involves several components and configurations to ensure that virtual functions (VFs) from a single physical network device can be managed and allocated efficiently. This setup enhances network performance and provides direct access to network hardware for applications requiring high throughput and low latency. Here are the key architectural components and their roles:

- **Node configuration**: Each Kubernetes node with SR-IOV capable hardware must have the SR-IOV drivers and tools installed. This includes the SR-IOV network device plugin and associated drivers.
- **SR-IOV enabled network interface**: The physical network interface card (NIC) must be configured to support SR-IOV. This involves enabling SR-IOV in the system BIOS and configuring the NIC to create virtual functions (VFs).
- **SR-IOV network device plugin**: The SR-IOV network device plugin is deployed as a DaemonSet in Kubernetes. It discovers SR-IOV capable network interfaces and manages the allocation of virtual functions (VFs) to pods.
- **Device plugin configuration**: The SR-IOV device plugin requires a configuration file that specifies the network devices and the number of virtual functions (VFs) to be managed.
- **Pod specification**: Pods can request SR-IOV virtual functions by specifying resource requests and limits in the pod specification (see the sketch after this list). The SR-IOV device plugin allocates the requested VFs to the pod.
- **Scheduler awareness**: The Kubernetes scheduler must be aware of the SR-IOV resources available on each node. The device plugin advertises the available VFs as extended resources, which the scheduler uses to place pods accordingly. Scheduler configuration ensures pods with SR-IOV requests are scheduled on nodes with available VFs.
- **Resource quotas and limits**: Kubernetes enables the definition of resource quotas and limits to control the allocation of SR-IOV resources across namespaces. This helps manage and isolate resource usage effectively.
- **Monitoring and metrics**: Monitoring tools such as Prometheus and Grafana can be used to track the performance and resource utilisation of SR-IOV-enabled workloads. Metrics include VF allocation, network throughput and latency, helping optimise and troubleshoot performance-sensitive applications.
- **Isolation and security**: SR-IOV provides isolation between VFs, ensuring that each VF operates independently and securely. This isolation is critical for multi-tenant environments where different workloads share the same physical network device.
- **Dynamic resource allocation**: SR-IOV supports dynamic allocation and deallocation of VFs, enabling Kubernetes to adjust resources based on application demands without requiring changes to the physical hardware setup.
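The exact manifests depend on the CNI setup. Assuming, purely for illustration, a Multus meta-plugin with an SR-IOV network attachment named `sriov-net` and a device plugin advertising the extended resource `intel.com/intel_sriov_netdevice` (the resource name is set by the device plugin configuration), a pod requesting one VF might look like this sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-demo              # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net   # assumed NetworkAttachmentDefinition
spec:
  containers:
    - name: app
      image: ubuntu:24.04       # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          intel.com/intel_sriov_netdevice: "1"   # extended resource advertised by the plugin
        limits:
          intel.com/intel_sriov_netdevice: "1"
```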
## DPDK (Data Plane Development Kit)

The Data Plane Development Kit (DPDK) is a set of libraries and drivers for fast packet processing. It is designed to run in user space, so that applications can achieve high-speed packet processing by bypassing the kernel. DPDK is used to optimise network performance and reduce latency, making it ideal for applications that require high-throughput, low-latency networking, such as telecommunications, cloud data centres and network functions virtualisation (NFV).

### Key features

- **High performance**: DPDK can process millions of packets per second per core, using multi-core CPUs to scale performance.
- **User-space processing**: By running in user space, DPDK avoids the overhead of kernel context switches and uses HugePages for better memory performance.
- **Poll Mode Drivers (PMD)**: DPDK uses PMDs that poll for packets instead of relying on interrupts, which reduces latency.

### DPDK architecture

The main goal of the DPDK is to provide a simple, complete framework for fast packet processing in data plane applications. Anyone can use the code to understand some of the techniques employed, to build upon for prototyping or to add their own protocol stacks.

The framework creates a set of libraries for specific environments through the creation of an Environment Abstraction Layer (EAL), which may be specific to a mode of the Intel® architecture (32-bit or 64-bit), user-space compilers or a specific platform. These environments are created through the use of Meson files (required by Meson, the build automation tool that DPDK uses) and configuration files. Once the EAL library is created, the user may link with the library to create their own applications. Other libraries outside of the EAL, including the Hash, Longest Prefix Match (LPM) and rings libraries, are also provided. Sample applications are provided to help show the user how to use various features of the DPDK.
The DPDK implements a run-to-completion model for packet processing, where all resources must be allocated prior to calling data plane applications, which run as execution units on logical processing cores. The model does not support a scheduler and all devices are accessed by polling. The primary reason for not using interrupts is the performance overhead imposed by interrupt processing.

In addition to the run-to-completion model, a pipeline model may also be used by passing packets or messages between cores via rings. This enables work to be performed in stages and is potentially a more efficient use of code on cores. It is suitable for scenarios where each pipeline must be mapped to a specific application thread or when multiple pipelines must be mapped to the same thread.

### Application to Kubernetes

The architecture for integrating the Data Plane Development Kit (DPDK) into Kubernetes involves several components and configurations to ensure high-speed packet processing and low-latency networking. DPDK enables applications to bypass the kernel network stack, providing direct access to network hardware and significantly enhancing network performance. Here are the key architectural components and their roles:

- **Node configuration**: Each Kubernetes node must have the DPDK libraries and drivers installed. This includes setting up HugePages and binding network interfaces to DPDK-compatible drivers.
- **HugePages configuration**: DPDK requires HugePages for efficient memory management. Configure the system to reserve HugePages.
- **Network interface binding**: Network interfaces must be bound to DPDK-compatible drivers (e.g., `vfio-pci`) to be used by DPDK applications.
- **DPDK application container**: Create a container image with the DPDK application and the necessary libraries. Ensure that the container runs with appropriate privileges and mounts HugePages.
- **Pod specification**: Deploy the DPDK application in Kubernetes by specifying the necessary resources, including CPU pinning and HugePages, in the pod specification (see the sketch after this list).
- **CPU pinning**: For optimal performance, DPDK applications should use dedicated CPU cores. Configure CPU pinning in the pod specification.
- **SR-IOV for network interfaces**: Combine DPDK with SR-IOV to provide high-performance network interfaces. Allocate SR-IOV virtual functions (VFs) to DPDK pods.
- **Scheduler awareness**: The Kubernetes scheduler must be aware of the resources required by DPDK applications, including HugePages and CPU pinning, to place pods appropriately on nodes with sufficient resources.
- **Monitoring and metrics**: Use monitoring tools like Prometheus and Grafana to track the performance of DPDK applications, including network throughput, latency and CPU usage.
- **Resource quotas and limits**: Define resource quotas and limits to control the allocation of resources for DPDK applications across namespaces, ensuring fair resource distribution and preventing resource contention.
- **Isolation and security**: Ensure that DPDK applications run in isolated and secure environments. Use security contexts to provide the necessary privileges while maintaining security best practices.

[no_hz]: https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
[howto-epa]: /snap/howto/epa
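To tie these pieces together, a hedged sketch of a DPDK pod specification might look like the following (the image name is a placeholder, and 1GB HugePages are assumed to have been reserved at boot; combine with an SR-IOV VF request where applicable):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-demo                          # illustrative name
spec:
  containers:
    - name: app
      image: example.com/dpdk-app:latest   # placeholder image containing the DPDK application
      securityContext:
        privileged: true                   # often needed for device access; tighten where possible
      resources:
        requests:
          cpu: "4"                         # integer CPUs for exclusive cores (static CPU Manager policy)
          memory: 2Gi
          hugepages-1Gi: 4Gi
        limits:
          cpu: "4"
          memory: 2Gi
          hugepages-1Gi: 4Gi
      volumeMounts:
        - name: hugepage
          mountPath: /dev/hugepages
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
```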