PCIe Device NUMA Node Locality

January 10, 2020 by frankdenneman

During this Christmas break, I wanted to learn PowerCLI properly. As I’m researching the use-cases of new hardware types and workloads in the data center, I managed to produce a script to identify the PCIe Device to NUMA Node Locality within a VMware ESXi Host. The script set contains a script for the most popular PCIe Device types for data centers that can be assigned as a passthrough device. The current script set is available on Github and contains scripts for GPUs, NICs and (Intel) FPGAs.

PCIe Devices Becoming the Primary Units of Data Processing

Due to the character of new workloads, the PCIe device is quickly moving up from “just” being a peripheral device to become the primary unit for data processing. Two great examples of this development are the rise of General Purpose GPU (GPGPU), often referred to as GPU Compute, and the virtualization of the telecommunication space.

The concept of GPU computing implies using GPUs and CPUs together. In many new workloads, the processes of an application are executed on a few CPU cores, while the GPU, with its many cores, handles the computational intensive data-processing part. Another workload, or better said, a whole industry that leans heavily on the performance of PCIe devices, is the telecommunication industry. Virtual Network Functions (VNF) require platforms using SR-IOV capable NICs or SmartNICs to provide ultra-fast packet processing performance.

In both scenarios having insight into PCIe Device to processor locality is a must to provide the best performance to the application or avoid introducing menacing noisy neighbors that can influence the performance of other workloads active in the system.

PCIe Device NUMA Node Locality

The majority of servers used in VMware virtualized environments are two CPU socket systems. Each CPU socket accommodates a processor containing several CPU cores. A processor contains multiple memory controllers offering a connection to directly connected memory. An interconnect (Intel: QuickPath Interconnect (QPI) & UltraPath Interconnect (UPI), AMD: Infinity Fabric (IF)) connects the two processors and allows the cores within each processor to access the memory connected to the other processor. When accessing memory connected directly to the processor, it is called local memory access. When accessing memory connected to the other processor, it is called remote memory access. This architecture provides Non-Uniform Memory Access (NUMA) as access latency, and bandwidth differs between local memory access or remote memory access. Henceforth these systems are referred to as NUMA systems.

It was big news when the AMD Opteron and Intel Nehalem Processor integrated the memory controller within the processor. But what about PCIe devices in such a system? Since the Sandy Bridge Architecture (2009), Intel reorganized the functions critical to the core and grouped them in the Uncore, which is a “construct” that is integrated into the processor as well. And it is this Uncore that handles the PCIe bus functions. It provides access to NVMe devices, GPUs, and NICs. Below is a schematic overview of a 28 core Intel Sky lake processor showing the PCIe ports and their own PCIe root stack.

In essence, a PCIe device is hardwired to a particular port on a processor. And that means that we can introduce another concept to NUMA locality, which is PCIe locality. Considering PCIe locality when scheduling low-latency or GPU compute workload can be beneficial not only to the performance of the application itself but also to the other workloads active on the system.

For example, Machine Learning involves processing a lot of data, and this data flows within the system from the CPU and memory subsystem to the GPU to be processed. Properly written Machine Learning application routines minimize communication between the GPU and CPU once the dataset is loaded on the GPU, but getting the data onto the GPU typically turns the application into a noisy neighbor to the rest of the system. Imagine if the GPU card is connected to NUMA node 0, and the application is running on cores located in NUMA node 1. All that data has to go through the interconnect to the GPU card.

The interconnect provides more theoretical bandwidth than a single PCIe 3.0 device can operate at, ~40 GB/s vs. 15 GB/s. But we have to understand that interconnect is used for all PCIe connectivity and memory transfers by the CPU scheduler. If you want to explore this topic more, I recommend reviewing Amdahl’s Law – Validity of the single processor approach to achieving large scale computing capabilities – published in 1967. (Still very relevant) And the strongly related Little’s Law. Keeping the application processes and data-processing software components on the same NUMA node keeps the workloads from flooding the QPI/UPI/ AMD IF interconnect.

For VNF workloads, it is essential to avoid any latency introduced by the system. Concepts like VT-d (Virtualization Technology for Directed I/O) reduces the time spent in a system for IOs and isolate the path so that no other workload can affect its operation. Ensuring the vCPU operates within the same NUMA domain ensures that no additional penalties are introduced by traffic on the interconnect and ensures the shortest path is provided from the CPU to the PCIe device.

Constraining CPU Placement

The PCIe Device NUMA Node Locality script assists in obtaining the best possible performance by identifying the PCIe locality of GPU, NIC of FPGA PCIe devices within VMware ESXi hosts. Typically VMs running NFV or GPGPU workloads are configured with a PCI passthrough enabled device. As a result, these VMware PowerCLI scripts inform the user which VMs are attached directly to the particular PCIe devices.

Currently, the VMkernel schedulers do no provide any automatic placement based on PCIe locality. CPU placement can be controlled by associating the listed virtual machines with a specific NUMA node using an advanced setting.

Please note that applying this setting can interfere with the ability of the ESXi NUMA scheduler to rebalance virtual machines across NUMA nodes for fairness. Specify NUMA node affinity only after you consider the rebalancing issues.

The Script Set

The purpose of these scripts is to identify the PCIe Device to NUMA Node locality within a VMware ESXi Host. The script set contains a script for the most popular PCIe Device types for Datacenters that can be assigned as a passthrough device. The current script set contains scripts for GPUs, NICs, and (Intel) FPGAs.

Please note that these scripts only collect information and do not alter any configuration in any way possible.

Requirements

VMware PowerCLI
Connection to VMware vCenter
Unrestricted Script Execution Policy
Posh-SSH
Root Access to ESXi hosts

Please note that Posh-SSH only works on Windows version of PowerShell.

The VMware PowerCLI script primarily interfaces with the virtual infrastructure via a connection to the VMware vCenter Server. A connection (Connect-VIServer) with the proper level of certificates must be in place before executing these scripts. The script does not initiate any connect session itself. It assumes this is already in-place.

As the script extracts information from the VMkernel Sys Info Shell (VSI Shell) the script uses Posh-SSH to log into ESXi host of choice and extracts the data from the VSI Shell for further processing. The Posh-SSH module needs to be installed before running the PCIe-NUMA-Locality scripts, the script does not install Posh-SSH itself. This module can be installed by running the following command Install-Module -Name Posh-SSH (Admin rights required). More information can be found at https://github.com/darkoperator/Posh-SSH

Root access is required to execute a vanish command via the SSH session. It might be possible to use SUDO, but this has functionality has not been included in the script (yet). The script uses Posh-SSH keyboard-interactive authentication method and presents a screen that allows you to enter your root credentials securely.

Script Content

Each script consists of three stages, Host selection & logon, data collection, and data modeling. The script uses the module Posh-SSH to create an SSH connection and runs a vsish command directly on the node itself. Due to this behavior, the script creates an output per server and cannot invoke at the cluster level.

Host Selection & Logon

The script requires you to enter the FQDN of the ESXi Host, and since you are already providing input via the keyboard, the script initiates the SSH session to the host, requiring you to login with the root user account of the host. When using the GPU script, the input of the GPU vendor name is requested. The input can be, for example, NVIDIA, AMD, Intel, or any other vendor providing supported GPU devices. This input is not case-sensitive.

Data Collection

The script initiates an esxcli command that collects the PCIe address of the chosen PCIe device type. It stores the PCIe addresses in a simple array.

Data Modeling

The NUMA node information of the PCIe device is available in the VSI Shell. However, it is listed under the decimal value of the Bus ID of the PCIe address of the device. The part that follows is a collection of instructions converting the full address space into a double-digit decimal value. Once this address is available, it’s inserted in a VSISH command and execute on the ESXi host via the already opened SSH connection. The NUMA node, plus some other information, is returned by the host, and this data is trimmed to get the core value and store it in a PSobject. Throughout all the steps of the data modeling phase, each output of the used filter functions is stored in a PSObject. This object can be retrieved to verify if the translation process was executed correctly. Call $bdfOutput to retrieve the most recent conversion. (as the data of each GPU flows serially through the function pipeline, only the last device conversion can be retrieved by calling $bdfOutput.

The next step is to identify if any virtual machines registered on the selected host are configured with PCIe passthrough devices corresponding with the discovered PCIe addresses.

Output

A selection of data points is generated as output by the script:

PCIe Device	Output Values
GPU	PCI ID, NUMA Node, Passthrough Attached VMs
NIC	VMNIC name, PCI ID, NUMA Node, Passthrough Attached VMs
FPGA	PCI ID, NUMA Node, Passthrough Attached VMs

The reason why the PCI ID address is displayed is that when you create a VM, the vCenter UI displays the (unique) PCI-ID first to identify the correct card. An FPGA and GPU do not have a VMkernel label, such as the VMNIC label of a network card. No additional information about the VMs is provided, such as CPU scheduling locations or vNUMA topology, as these are expensive calls to make and can change every CPU Quorum (50 ms).

It’s recommended to review the CPU topology of the virtual machine and if possible to set the NUMA Node affinity following the instructions listed in VMware Resource Management Guide. Please note that using this advanced setting can impact the ability of the CPU and NUMA schedulers to achieve an optimal balance.

Using the Script Set

Step 1. Download the script by clicking the “Download” button on the Github repository
Step 2. Unlock scripts (Properties .ps1 file, General tab, select Unlock.)

Step 3. Open PowerCLI session.
Step 4. Connect to VIServer
Step 5. Execute script for example, the GPU script: .\PCIE-NUMA-Locality-GPU.ps1
Step 6. Enter ESXi Host Name
Step 7. Enter GPU Vendor Name

Step 8. Enter Root credentials to establish SSH session

Step 8. Consume output and possibly set NUMA Node affinity for VMs

Acknowledgments

This script set would not have been created without the guidance of @kmruddy and @lucdekens. Thanks, Valentin Bondzio, for verification of NUMA details and Niels Hagoort and the vSphere TM team for making their lab available to me.

vSphere 6.5+ DRS Pairwise Balancing

October 30, 2019 by frankdenneman

Or maybe I should have called this blog post, “I’m seeing an excessive number of DRS initiated vMotions on my newly upgraded 6.5 environment”. Recently I was part of a few conversations about the nature of DRS load balancing in systems running vSphere 6.5 and newer. It was noticed that more vMotion operations where occurring since running 6.5 and it’s highly likely that these operations occur due to the new DRS pairwise balancing functionality. Pairwise balancing was introduced by vSphere 6.5 and is focused on keeping the host resource utilization disparity within a certain threshold. As a result, DRS performs load-balancing operations if the difference between the lowest-utilized host and the highest-utilized host is a certain percentage. That percentage depends on your migration threshold. The default migration threshold uses a 20% tolerable difference in utilization.

Migration Threshold Level	Tolerable CPU/Memory usage difference between any two hosts in the cluster
1	Not Available (only Affinity violations and MM migrations allowed)
2	30%
3 (Default migration threshold)	20%
4	10%
5	5%

This new feature is needed as clusters keep on growing larger and larger. To determine if load-balancing operations are necessary, DRS calculates two metrics, the current host load standard deviation (CHLSTD) and the target host load standard deviation (THLSTD). Each host reports its load and DRS calculates the standard deviation of the host load metric across all the hosts in the cluster. DRS calculates a target host load balance for the cluster and as long as the current host load standard deviation is less than or equal to the target host load value, DRS will consider the cluster balanced. The migration threshold allows how far apart the CHLSTD and THLSTD before it triggers load balancing operations. The higher the aggressiveness of the migration threshold, the lower the difference between the CHLSTD and THLSTD is tolerated.

A situation can occur that a few hosts in a large cluster can experience a high resource utilization, while the majority of hosts are not. Due to the size of the cluster, the few high host load become just some statistical outliers than simply disappear as noise due to the vast number of hosts that experience (far) lower utilization. As a result, these outliers are missed as the calculate CHLSTD is below the threshold required to trigger load balancing.

By adding the functionality of pairwise balancing, and “simply” comparing the highest reported utilization with the lowest utilization, these outliers might be a thing of the past. That means that in certain cases, the DRS UI might report that the cluster is in a balanced state, yet load-balance operations still occur. This behavior can be attributed to pairwise balancing.

Please keep in mind that if you are using a migration threshold that is more aggressive than the default setting, the tolerable difference between hosts is reduced, more migrations are likely to occur.

So what happens when the tolerable difference is detected in the cluster? Does this mean that VMs are migrated from the highest utilized host to the lowest utilized host? Not necessarily. VMs can be migrated to any other host in the cluster. DRS still takes many different requirements into account when selecting a virtual machine migration for load-balancing purposes. Anti-affinity and affinity rules cannot be violated to obtain a better cluster load-balance, so these moves are not considered. Compatibility of hosts and VM configuration also impact migration options (typically a missing datastore or network portgroup are common reasons why particular hosts are overloaded and why other hosts are lower utilized), but also the “cost-benefit” of a VM migration is still taken into account. It still needs to make sense for the cluster balance to incur infrastructure costs and risk to move a particular VM.

If you recently updated your vCenter to 6.5/6.7 and are curious to see whether the vMotions are triggered due to Pairwise imbalance operations, you can use the online version of the DRS Dump Insight tool available at https://www.drsdumpinsight.vmware.com/. You can also run the DRS dump insight tool on-prem by installing one of the flings available here: https://flings.vmware.com/?utf8=%E2%9C%93&q=DRS+Dump+Insight&button=. Grep for “Pairwise Imbalance”.

If this behavior is not appreciated, and you do not want to alter the migration threshold, you can switch back to the old behavior by turning off pair-wise balancing by setting the cluster advanced option “CheckPairWiseImbalance to 0. (case-sensitive). Although this functionality was introduced by vSphere 6.5 and is active by default in all newer releases, we have backported this functionality to vSphere 6.0 u3.

One thing I would like to ask if you want to disable it, what are the reasons? I expect “too much vMotions”, but I would like to understand why a vMotion or a collection of vMotions is considered not desirable? The main goal is to get the VMs to a place where they have access to enough resources, why is that still a bad thing?

AMD EPYC Naples vs Rome and vSphere CPU Scheduler Updates

October 14, 2019 by frankdenneman

Recently AMD announced the 2nd generation of the AMD EPYC CPU architecture, the EPYC 7002 series. Most refer to the new CPU architecture using its internal codename Rome. When AMD introduced the 1st generation EPYC (Naples), they succeeded in setting a new record of core count and memory capacity per socket. However, due to the CPU multi-chip-module (MCM) architecture, it is not an apples-to-apples comparison when compared to an Intel Xeon architecture. As each chip module contains a memory controller, each module presents a standalone NUMA domain. This impacts OS scheduling decisions and, thus, virtual machine sizing. A detailed look can be found here in English or here translated by Grigory Pryalukhin in Russian. Rome is different, the new CPU architecture is more aligned with the single NUMA per Socket paradigm, and this helps with obtaining workload performance consistency. There are some differences between Xeons and Rome. In addition, we made some adjustments to the CPU scheduler to deal with this new architecture. Let’s take a closer look at the difference between Naples and Rome.

7 nanometer (7 nm) lithography process forcing a new architecture

Rome is using the new 7nm Zen 2 microarchitecture. A smaller lithography process (7nm vs. 14nm) allows CPU manufacturers to cram more CPU cores in a CPU package. However, there are more elements on a CPU chip than CPU cores alone, such as I/O and memory controllers. The scalability of I/O interfaces is limited, and therefore, AMD decided to use a separated and more massive 14nm die that contains the memory and I/O controllers. This die is typically revered to as the server I/O Die (sIOD). In the picture below, you see a side by side comparison of an unlidded Naples (left) and an unlidded Rome, exposing the core chiplet dies and the SIOD.

Naples Zeppelin vs. Rome Chiplet

The photo above provides a clear overview of the structure of the CPU package. The Naples CPU package contains four Zeppelin dies (black rectangles). A Zeppelin die provides a maximum of eight Zen cores. The cores are divided across two compute complexes (CCX). A Zeppelin of a 32 core EPYC contains 4 cores per CCX. When Simultaneous Multi-Threading (SMT) is enabled, a CCX offers eight threads. Each CCX is connected to the Scalable Data Fabric (SDF) through the Cache-Coherent Master (CCM) that is responsible for sending traffic cross CCXes. The SDF contains two Unified Memory Controllers (UMC) connecting the DRAM memory modules. Each UMC provides a memory channel to two DIMMs. Providing the memory capacity of 4 DIMMs in total. Due to the combination of Cores, cache, and memory controller, a Zeppelin is a NUMA domain. To access a “remote” on-package memory controller, the Infinity Fabric On Package Controller (IFOP) sets up and coordinates the data communication.

The Rome CPU package contains a 14nm I/O Die (the center black rectangle), and 8 chiplet dies (the smaller black rectangles). A Rome chiplet contains two CCX’es with each containing four cores and L3 cache, but no I/O components or the memory controllers. There is a small Infinity Fabric “controller” on each CCX that connects the CCX to the sIOD. As a result, every memory read beyond the local CCX L3 cache has to go to the sIOD. Even for a cache line (data from memory stored in the cache) that is stored in the LL3 cache of the CCX sharing the same Rome chiplet. A Chiplet is a part of the NUMA Domain.

NUMA Domain per Socket

As mentioned before, a NUMA Domain, typically called NUMA node, is a combination of CPU cores, cache, and memory capacity connected to a local memory controller. Intel architecture design uses a single NUMA Domain per Socket (NPS), AMD Naples offered four NPS, while Rome is back to a single NPS. Single NPS simplifies VM and application sizing while providing the best and consistent performance.

The bandwidth to local memory differs between each CPU architecture. The Intel Xeon Scalable Family provides a maximum of six channels of memory supporting a DDR4-2933 memory type. The Naples provides two memory channels to its locally connected memory, supporting a DDR4-2666 memory type. The Rome architecture provides eight memory channels to its locally connected memory, supporting a DDR4-3200 memory type. Please note that the memory controllers in the Rome architecture are located on the centralized die, handling all types of I/O and memory traffic, the Intel memory controllers are constructs isolated from any other traffic. Real-life application testing must be used to determine whether this architecture impacts memory bandwidth performance.

CPU Architecture	Local Channels	Mem Types	Peak transfer
Intel Xeon Scalable	6	DDR4-2666	127.8 GB/s
AMD EPYC v1 (Naples)	2	DDR4-2933	46.92 GB/s
AMD EPYC v2 (Rome)	8	DDR4-3200	204.8 GB/s

With a dual-socket system, there are typically two different distances with regards to memory access. Accessing memory connected to the local memory controller and accessing memory connected to the memory controller located on the other socket. With Naples, there are three different distances. The IFOP is used for intra-socket communication, while the Infinity Fabric Inter Socket (IFIS) controller takes care of routing traffic across sockets. As there are eight Zeppelins in a dual-socket system, not every Zeppelin is connected directly to each other and thus sometimes the memory access is routed through the IFIS first before hitting an IFOP to get to the appropriate Zeppelin.

Naples Memory Access	Hops
Local memory access within a Zeppelin	0
Intra-socket memory access between Zeppelins	1
Inter-socket memory access between Zeppelins with direct IFIS connection	1
Inter-socket memory access between Zeppelins with indirect connection (IFIS+Remote IFOP)	2

AMD Rome provides equidistant memory access within the die and a single hop connection between sockets. Every memory access within the socket, every cache line load within the socket has to go to the I/O die. Every remote memory and cache access goes across the Infinity Fabric between sockets. This is somewhat similar to the Intel architecture that we have been familiar with since Nehalem, which launched in 2008. Why somewhat? Because there is a difference in cache domain design.

The Importance of Cache in CPU Scheduling

Getting memory capacity as close to the CPU improves performance tremendously. That’s the reason why each CPU package contains multiple levels of cache. Each core has a small but extremely fast cache capacity for instructions and data (L1), a slightly larger but relatively slower (L2) cache. A third and larger cache (L3) capacity is shared amongst the cores in the socket (Intel paradigm). Every time when a core request data to be loaded, it makes sense to retrieve this from the closest source possible, typically this is cache. To get an idea of how fast cache is relative to local and remote memory, look at the following table:

System Event	Actual Latency	Human Scaled Latency
One CPU cycle (2.3 GHz)	0.4 ns	1 second
Level 1 cache access	1.6 ns	4 seconds
Level 2 cache access	4.8 ns	12 seconds
Level 3 cache access	15.2 ns	38 seconds
Remote level 3 cache access	63 ns	157 seconds
Local memory access	75 ns	188 seconds (3min)
Remote memory access	130 ns	325 seconds (5min)
Optane PMEM Access	350 ns	875 seconds (15min)
Optane SSD I/O	10 us	7 hours
NVMe SSD I/O	25 us	17 hours

Back in the day when you could disable the cache of the CPU, someone tested the effect of cache on loading Windows 95. With cache it took almost five minutes, without the use of the cache, it took over an hour. Cache performance is crucial to get the best performance. And because of this, the vSphere NUMA scheduler and the CPU scheduler work together to optimize workloads that communicate with each other often. As they are communicating, they typically use the same data sources. Therefore, if vSphere can run the workload on the same cores that share the cache, then this could improve performance tremendously. The challenge is that AMD uses a different cache domain design than Intel.

As depicted in the diagram above, Intel uses a 1:1:1 relationship model. One socket equals one NUMA domain and contains one Last Level Cache domain. As Intel is used in more than 98% of the dual-socket systems (info based on internal telemetry reports), our scheduling team obviously focused most of their efforts on this model. EPYC Naples introduced a 1:4:2 model, one socket, that contains four NUMA domains, and each NUMA domain contains two LLC domains. Rome provides a NUMA model similar to the XEON, with a single socket and single NUMA domain. However, each chiplet contains two separate LLC domains. A Rome CPU package contains eight chiplets, and thus, 16 different LLC domains exist within a socket & NUMA domain.

Relational Scheduling

vSphere uses this LLC domain as a target for its relational scheduling functionality. Relational scheduling is better known as Action-Affinity. Its actions have made most customers think that the NUMA scheduler was broken. As the scheduler is optimized for cache sharing, it can happen that a majority of vCPU is running on a single socket, while the cores of the other sockets are idling. When reviewing ESXTOP you might see an unbalanced number of VMs running on the same NUMA Host Node (NHN). As a result, the VMs running in this NUMA domain (or in ESX terminology NHN) might compete with CPU resources and thus experience increased %Ready time.

Side note: It is my opinion to test the difference of relational scheduling on the performance of the application. Do not test this with synthetic test software. Although %Ready time is something to avoid, some applications benefit more from low-latency and highly consistent memory access than being impacted by an increase of CPU scheduling latency.

Action-Affinity can lead to ready time on an Intel CPU architecture where more than eight cores share the same cache domain, imagine what impact it can have on AMD EPYC systems where the maximum number of cores per cache domain is four. In lower-core count AMD EPYC systems, the cores are disabled per CCX, reducing the scheduling domain any further.

As the majority of the data centers are running on Intel, vSphere is optimized for a CPU topology where the NUMA and LLC domain are of consistent scope, i.e. the same size. With AMD the scopes are different and thus the current CPU scheduler can make “sub-optimal” decisions, impact performance. What happens is that the NUMA scheduler dictates the client size, the number of vCPUs to run on a NUMA Home Node, but it’s up to the CPU scheduler discretion to decide which vCPU to run on which physical core. As there are multiple Cache domains within a NUMA client, it can happen that there is an extraordinary amount of vCPU migrations between the cache domains within the NUMA domain. And that means cold cache access and a very crowded group of cores.

Therefore, the CPU team worked very hard to introduce optimizations for the AMD architecture and these optimizations are released in the updates ESXi 6.5 Update 3 and ESXi 6.7 Update 2.

The fix informs the CPU scheduler about the presence of the multiple cache domains within the NUMA node, allowing it to schedule the vCPU more intelligently. The fix also introduces a automatic virtual NUMA client sizer. By default, a virtual NUMA architecture is exposed to the guest OS when the vCPU count exceeds the physical core count of the physical NUMA domain and if the vCPU count is no less than the numa.vcpu.min setting, which defaults to 9. A physical NUMA domain in Naples counts eight cores, and thus no virtual NUMA topology is exposed. With the patch, this is solved. What is crucial to note is that the virtual NUMA topology is determined at first boot by default. Therefore, existing VMs need to have its virtual NUMA topology reset to leverage this new functionality. This involves a power-down to remove the NUMA settings in the VMX.

When introducing Naples/Rome based systems in your virtual data center, it’s strongly recommended to deploy the latest update of your preferred vSphere platform version. This allows you to extract as much performance from your recent investment.

VMware Cloud on AWS on Virtually Speaking Podcast

April 9, 2019 by frankdenneman

Last week I had the pleasure of connecting again with my friends and colleagues Pete Flecha a.k.a PedroArrow and eternal sunshine John Nicholson. During the podcast, we discussed the road to Hybrid cloud, cloud mobility, multi-cloud operations, and the necessity of replatforming apps or not. It’s always fun hanging out with these guys especially when talking about cool things. Hope you enjoy the show as much as I did.

AMD EPYC and vSphere vNUMA

February 19, 2019 by frankdenneman

AMD is gaining popularity in the server market with the EPYC CPU platform. The EPYC CPU platform provides a high core count and a large memory capacity. If you are familiar with previous AMD generations, you know AMD’s method of operation is different than Intel’s. For reference, take a look at the article I wrote in 2011 about the 12-core 6100 Opteron code name Magny-Cours. EPYC provides an increase of scale but builds on the previously introduced principles. Let’s review the EPYC architecture and see how it can impact your VM sizing and ESXi configuration. (Please note that this article is NOT intended as a good/bad comparison between AMD and Intel, I’m just describing the architectural differences).

EPYC Architecture
The EPYC processor architecture is what AMD refers to as a Multi-Chip-Module (MCM). EPYC is designed to provide a high core count platform by combining multiple silicon dies within a CPU Package. A silicon die (named Zeppelin) is a wafer that contains the circuitry. In simple terms, it’s the component that contains CPU cores, memory cache, and various controllers. Regardless of the core-count, an EPYC CPU package always contains four Zeppelin dies. Comparing this to Intel Xeon, a Xeon CPU package is a single-chip-design which consist of a single silicon die containing all components. The reason why the difference in chip design is interesting is that impacts the logical grouping of compute resources. The size of the logical group, better known as a NUMA node, impacts scheduling decisions made by the CPU scheduler of the operating system (both the hypervisor kernel and possibly the guest operating system). It might be necessary to change some of the default settings of the ESXi host to alter scheduling behavior, these settings are covered in the last part of the article. Let’s continue to explore the architecture of the EPYC CPU.

AMD EPYC – image courtesy of wccftech.com

Compute Complex
The photo above provides a clear overview of the structure of the CPU package. The CPU package houses four Zeppelin dies. In the current generation, a Zeppelin die provides a maximum of eight Zen cores. The cores are divided across two compute complexes (CCX). A Zeppelin of a 32 core EPYC contains 4 cores per CCX. When Simultaneous Multi-Threading (SMT) is enabled within the BIOS, a CCX offers eight threads.

Each core has its own L1 (instruction (64KB) and data (32KB)) and L2 caches (4 MB total L2 cache). A Zeppelin has 16 MB L3 cache. Interestingly enough, each CCX has it’s own L3 Cache of 8MB, in turn, split up into four slices of 2 MB. The two CCXes within a Zeppelin die are connected to each other through an interconnect (Infinity Fabric). Adding hops to memory access is not beneficial to bandwidth and latency. Multiple tech-sites have performed in-depth testing on cache performance, and to quote Anandtech.com:

“The local “inside the CCX” 8 MB L3-cache is accessed with very little latency. But once the core needs to access another L3-cache chunk – even on the same die – unloaded latency is pretty bad: it’s only slightly better than the DRAM access latency.”

In essence, this means that you cannot think of the 64MB L3 cache as one single pool of cache capacity. Better is to approach it as eight 8MB capacity pools. This is important to realize if multiple workloads share the same data, the NUMA scheduler of ESXi attempts to place both workloads in the same NUMA node to optimize cache and memory performance for these workloads. It might happen that the L3 cache size is not sufficient enough. The option that impacts this behavior is called Action Affinity, more details about this setting can be found in the last part of the article.

Zeppelin Core Count
EPYC is offered in multiple SKUs. Next, to the 32 core count model, there are lower-core count models. Since the EPYC architecture always includes four Zeppelins, the difference in core count is created by disabling cores per CCX in a symmetrical way. For example, in a 24 core count EPYC, a single Zeppelin die would look like this.

The table shows the core count per Zeppelin of the three largest EPYC CPUs. The total cores per Zeppelin count can be used as a guideline for the vNUMA setting described later in this article

Cores	Cores per CCX	Total Cores per Zeppelin	Zeppelin Count
32	4	8	4
24	3	6	4
16	2	4	4

Infinity Fabric
The cores within a CCX communicate with memory (DIMMs) via an on-die memory controller through the infinity Fabric. The Infinity fabric is AMD’s proprietary system interconnect architecture that facilitates data and control transmission across all linked components. The Infinity Fabric consists of two communication planes; the Scalable Data Fabric (SDF) and the Infinity Scalable Control Fabric (SCF). The SCF is responsible for processing system control signals, such as thermal and power management. Although very important, we are more interested in the SDF which is responsible for transmitting data within the system. The rest of the article zooms into SDF design and its impact on scheduling decisions.

Each CCX is connected to the SDF through the Cache-Coherent Master (CCM) that is responsible for sending coherent data traffic cross CCXes. The SDF uses a Unified Memory Controller (UMC) to connect to DRAM memory modules. Each UMC provides a memory channel to two DIMMs. Providing the memory capacity of 4 DIMMs in total.

How does this design impact VM sizing? A Zeppelin is a NUMA node that contains a maximum of 8 cores (16 threads) with the memory capacity of four DIMMs. This design results in a single EPYC CPU package presents four NUMA nodes to the operating system.

Server Memory Capacity and NUMA
Intel moved from a 3 DIMMs per channel configuration (DPC) with 4 channels to a model with 6 channels and 2 DIMMs deep. This new model broke the capacity model cadence. For example, using 16 GB DIMMs, you had either 64 GB, 128GB or 192GB available per socket. Now with the scalable architecture, it’s either 96GB or 192GB. That is if you follow the high- performance best practice of populating all channels for maximum bandwidth availability. However, with the current DIMM pricing, a lot of customers cannot afford such a configuration.

With the EPYC, every Zeppelin has two memory channels. Each memory channel can drive two DIMMs. For good performance, each Zeppelin should be equipped with at least 1 DPC. That means that a proper performing dual socket EPYC system should be configured with 16 DIMMs. This configuration allows for a theoretical bandwidth of 42.6 GB/s while providing a (shallow) memory capacity of just the two DIMMs combined. This design results in a single EPYC CPU package presents four NUMA nodes to the operating system. If the minimum of 1DPC is used, the NUMA node size can be too small and thus the overall performance if the VM memory size exceeds the physical memory configuration of each Zeppelin. Servethehome published some benchmark tests about the performance difference between the different memory configurations of EPYC.

With NUMA, it’s important to understand the boundaries of your local memory domain and your remote memory domain. Traditionally the domains were easily demarcated by the CPU package core count and attached memory capacity. With EPYC, a new distinction has to be made between the different remote memory access types. It can be remote on-package memory access or remote socket memory access. The reason why this distinction has to be made is the impact on performance and consistency of application memory access. Having your VM and application span multiple NUMA nodes can introduce a very inconsistent response time.

Local Memory Access
Let’s start with the best and most consistent performance. When a core within the Zeppelin access local memory the path is as follows:

The presentation “Zeppelin an SOC for Multi-Chip Architectures” by AMD list the latency of local memory access within the Zeppelin at 90 nanoseconds.

Remote Memory Access On Package
A core can access memory attached to a different Zeppelin within the same CPU package. This is called remote on-package memory access or “on-package Die-to-Die” memory access. This means we are still using memory controllers within the same socket. In total the EPYC CPU has eight memory channels, but two are local to the Zeppelin. To access a “remote” on-package memory controller the Infinity Fabric On Package Controller (IFOP) sets up and coordinates the data communication.

In total each Zeppelin has 4 IFOPs, but actually, only three are needed since there are 3 other Zeppelins within the same CPU package.

To be more precise, the IO traverses an additional component before hitting the IFOP. This component is called the Coherent AMD socKet Extender (CAKE). It facilitates die-to-die or socket-to-socket memory transactions. This module translates the request and response formats used by the SDF transport layer to and from the serialized format used by the IFOP. What that means is that a few extra hops and CPU cycles are introduced when fetching data stored within DIMMs attached to other Zeppelins on the same die. AMD reports a latency of ~145ns.

Inter Package Remote Access
And then we have the chance that memory needs to be fetched from DIMMs attached to UMCs from a Zeppelin that is a part of another EPYC CPU package within the system (dual socket system). Instead of routing the traffic across the IFOP, the traffic is routed across Infinity Fabric Inter Socket (IFIS) controller. Package-to-package traffic has 8/9 of the bandwidth of IFOP traffic, resulting in a theoretical bandwidth of 37.9 GB/s. The reduction in bandwidth increases the chance of experiencing inconsistent performance. The increased length of the path, increments latency. AMD reports a latency of ~200ns.

Because there are two IFIS controllers per Zeppelin, not every Zeppelin within a dual socket system is directly connected to each other. In the worst case scenario, there are two hops. One hop from one package to the other package and an extra hop to go from one Zeppelin to the Zeppelin that is connected to the DIMM holding the data. Unfortunately, AMD as not shared latency data.

Remote Access Inter-package, die-to-die communication

VM Sizing
The key is to keep memory access as much local as possible. ESXi and most modern guest operating systems are optimized to deal with NUMA. However as with most things in life, for the most optimal performance, reduce distance and reduce any form of variation. Apply this to VM sizing and try to keep the vCPU count of a VM within the core count of NUMA domain. Same applies to VM memory capacity, try to fit this with the capacity of the NUMA node. If the VM cannot fit inside a NUMA node, there is no need to stress, ESXi has got the best NUMA scheduler in the business. To help ESXi to optimize for the EPYC architecture, some advanced settings might be necessary to adjust. As always, tests these settings in a non-revenue critical environment before applying them to production systems.

Virtual NUMA
Virtual NUMA (vNUMA) allows the operating system to understand the “physical” layout of the virtual machine. vNUMA presents the mapping of the VM vCPU to the physical NUMA nodes of the ESXi host. For example, if a VM has 12 vCPUs and the physical core count within a single NUMA node was 10 cores, ESXi would present the guest OS a topology of 2 NUMA nodes with each counting 6 cores. ESXi would group 6 vCPUs into a NUMA client and schedule these across the 10 CPU cores within a NUMA node.

When vNUMA was introduced, the highest core count of a CPU was 8 CPUs, thus the VMware engineers introduced a vNUMA threshold of 9 (numa.vcpu.min=9). Meaning that the VM needs to contain at least 9 vCPUs in order to generate the virtual NUMA topology.Considering the highest core-count of an EPYC system is eight cores per Zeppelin, you might want to adjust the vNUMA default threshold to resemble the physical layout of the used EPYC model.

For example, the EPYC 7401 contains 24 cores, 6 cores per Zeppelin and thus 6 cores per NUMA node. When using the default setting of numa.vcpu.min=9, an 8 vCPU VM is automatically configured like this.

A VPD is the virtual NUMA client that is exposed to the guest OS system, while a PPD is the NUMA client used by the VMkernel CPU scheduler. In this situation, the ESXi scheduler uses two physical NUMA nodes to satisfy CPU and memory requests while the guest OS perceives the layout as a Uniform Memory Access (UMA) system. In a UMA system, the access time to a memory location is independent of which processor makes the request, or which memory chip contains the transferred data). I.e., pretty much the same latency and bandwidth throughout the system. However, this is not the case as reported in this article above. Reading and writing remote CCX cache and remote memory (on-die) is slower than local memory even within the same Zeppelin. By setting the numa.vcpu.min=6, two VPDs are created, and thus the guest OS is made aware of the physical layout by the ESXi scheduler. The guest OS and the applications can optimize memory operations to attain consistent performance.

Action Affinity
When the ESXi scheduler detects multiple VMs communicating with each other, it can decide of placing them together on the same NUMA node to increase intra-NUMA node communication. This behavior is called action affinity, and it can increase performance by up to 30%. However, with the small NUMA nodes of max 8 CPUs, it can also lead to a lot of cache thrashing and remote memory access if the configured memory of the VMs cannot fit inside a single NUMA node. If this is the case, it might be helpful to test disabling the action affinity on the ESXi host. This is done by configuring the /Numa/LocalityWeightActionAffinity to 0 (KB 2097369).

What if the VM Memory Config Exceeds the Memory Capacity of the Physical NUMA Node?
I wrote an article about this situation back in 2017, and it’s featured in the vSphere 6.5 Host deep dive book. However, what happens if your VM memory configuration exceeds the physical capacity of a NUMA node. By default, the ESXi scheduler optimizes for local memory access and attempts to place as much memory along with the vCPU in the same NUMA node. Sometimes it can improve local memory access to creating multiple smaller NUMA clients.

For example, on an EPYC 7601 (32 core), the NUMA node contains 8 cores, and this server is equipped with 256 GB by using 16 x 16 GB DIMMs. A NUMA node has 4 DIMMs attached to it. Thus the NUMA node provides 8 cores and 64 GB. What happens if a VM is configured with 6 vCPUs and 96 GB? By default the NUMA scheduler attempts to store 64GB of VM memory inside the NUMA node, leaving 32 GB in a remote NUMA node. By enabling the VM advanced setting numa.consolidate = FALSE. It instructs the NUMA scheduler to distribute the VM configuration across the optimal number of NUMA nodes greater than 1. In this case, 2 NUMA clients are created, and this will schedule 3 vCPUs in each NUMA node.

Now the performance and the behavior of the application depends on its design. If you have a single-threaded application, this setting might not be helpful at all. However, if it’s a multi-threaded application, you might see some benefit. The only thing to do is to set the numa.vcpu.min equal to the number of vCPUs per virtual NUMA client to expose the vNUMA architecture to the guest OS and the application. The following command helps you to retrieve the NUMA configuration of the VM:

vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi “DICT.(displayname.|numa.|cores.|vcpu.|memsize.|affinity.)= .|numa:.|numaHost:.” “/$path/vmware.log”; echo -e; done

Please bear in mind that the ESXi CPU and NUMA scheduler do not use an SRAT (System Resource Allocation Table) to determine the distance of the individual NUMA nodes between each other. ESXi uses its own method to determine latency between the different NUMA nodes within the system. It uses these latency numbers for initial placement and attempts to schedule the NUMA clients of a VM as close to each other as possible. However, the ESXi scheduler does not leverage this information during load-balancing operations. This is work in progress. Adding a new first class metric to a heuristic is not a simple task and knowing the CPU engineers, they want to provide a system that is thoroughly improved by augmenting new code.

Increase NUMA Node Compute Sizing
For workloads that are memory latency sensitive with a low processor utilization, you can alter the way the NUMA scheduler sizes the NUMA client of that particular VM. The VM advanced setting numa.vcpu.preferHT=TRUE allows the NUMA scheduler to count threads instead of cores for NUMA node size configuration. For example, an 8 vCPU VM that uses this advanced setting and runs on an EPYC 7401 system (6 cores, 12 threads), is scheduled within a single Zeppelin.
If all workloads follow the same utilization pattern, you can alter the ESXi host setting by adding numa.PreferHT=1 to the ESXi host advanced configuration.

Channel-Pair Interleaving (1 NUMA node per socket)
The EPYC architecture can interleave the memory channels and thus present the cores of the four zeppelins as a single NUMA node. This setting requires that every channel is populated with equal memory size. Some vendors use a different name for it. For example, Dell calls this setting “Memory Die Interleaving”. Little to no data can be found about the performance impact of this setting, but keep in mind, software settings do not change the physical layout (and thus physics). Typically abstraction filters out the outliers and presents an average performance behavior. For NUMA benchmarking, please take a look at the article “AMD EPYC – STREAM, HPL, InfiniBand, and WRF Performance Study” located on the Dell website.

Research Your Workload Requirements
ESXi can handle complex NUMA architectures as the best. However, it’s always best to avoid complexity as possible. Determine if your workload can fit in a minimum number of small NUMA nodes when using the EPYC architecture? Can the workload handle inconsistent memory performance if it does exceed the NUMA node size of 8? The EPYC architecture is an excellent way of adding scale to the server platform but do remember that for real-life workload optimal performance is achieved when you take the NUMA configuration boundaries into account.

On Twitter some asked what my thoughts are about the EPYC CPU architecture? For every tech challenge, there is a solution. When looking at the architecture, I think EPYC is an excellent solution for small and medium-sized workloads. I expect that larger monolithic apps, that require consistent performance, are better off looking at different architectures. (My opinion, not VMware’s!)