Initial Placement of a vSphere Pod

March 6, 2020 by frankdenneman

Project Pacific transforms vSphere into a unified application platform. This new platform runs both virtual machines and Linux containers as native workload constructs. Just introducing Linux containers as a new workload object is not enough. To manage containers properly, you need a legitimate orchestrator. And on top of that, you need to make sure that existing services, such as DRS, can handle the different lifecycles of these objects. Containers typically have a much shorter lifecycle than virtual machines: where VMs “live” for years, containers have a far shorter life expectancy. And this massively different churn impacts the initial placement and load-balancing operations of resource management services.

Being able to run containers as first-class citizens in the VMkernel generates a couple of fascinating challenges by itself. As Michael Gasch highlighted in our VMworld 2018 session, “Running Kubernetes on vSphere Deep Dive: The Value of Running Kubernetes on vSphere (CNA1553BU)” a container is not a separate entity but a collection of Linux processes and objects.

Container Runtime for ESXi

The ESXi VMkernel is not a Linux operating system. The VMkernel hardware and process abstractions were built with the intent of servicing virtual machines, not to support Linux processes directly. To bridge that gap, Project Pacific introduces a container runtime for ESXi (CRX). The CRX provides a Linux Application Binary Interface (ABI) that allows you to execute a Linux application (container) as if it were running in the VMkernel directly. The beauty of the CRX is that it is completely isolated from any other process or UserWorld running on the ESXi host. How is that possible? By using our old friend, the VM construct.

The virtual machine aspect of a CRX instance is the use of the virtual machine monitor (VMM) and the configuration of the virtual hardware (VMX). The VMM provides the exception and interrupt handling for the VM. Inside the CRX instance, a CRX init process is active to provide communication between the CRX instance and VMkernel services.

But we need a Linux kernel to provide a Linux ABI for the container to run. What better Linux kernel to use than our own? VMware Photon was chosen as it is a VMware supported and maintained LTS Linux kernel. Photon is used as the base for the VCSA and other VMware products and has an extremely light footprint. Now the interesting part is that this kernel is not stored on and loaded from a separate disk. The bare Linux kernel is loaded directly into the memory space of the CRX instance when it is instantiated. Additionally, the CRX instance is stripped down; only the necessary devices and functionalities are enabled to make the CRX and kernel as lightweight and fast as possible. For example, the CRX only exposes paravirtualized devices to the Photon kernel.

On top of this base, a container runtime is active that allows us to spin up OCI compatible containers inside the CRX instance.

Inside the VMkernel, we introduced our implementation of the Kubelet, called the Spherelet. In short, the Spherelet turns the ESXi host into a Kubernetes worker node and acts as an extension to the Kubernetes control plane. The container runtime inside the CRX instance contains a Spherelet agent that allows communication between the Spherelet and the container runtime. The Spherelet agent provides the functionality that Kubernetes expects from a pod: health checks, mounting storage, setting up networking, controlling the state of the containers inside the pod, and providing an interactive endpoint for the Kubernetes command-line tool kubectl. The Spherelet agent is linked with libcontainer and understands how to launch containers using that method. Once containers are running inside the CRX instance, we refer to this group of objects as a native pod.

Two Captains on one Ship

Now that we understand how ESXi can run containers, we need to think about who controls what from a resource management perspective. Not only does Project Pacific introduce a way to run containers natively inside the VMkernel, it also introduces Kubernetes as an orchestrator of containers. In essence, that means it is adding a control plane to a platform that already has a control plane in place for VM workloads (vCenter + HostD), plus additional services such as DRS to simplify resource management.

At first sight, this should not be difficult, as Kubernetes does not manage and control VMs, so there is no overlap. However, you must have noticed that there is a duality at play. A vSphere pod is a combination of a VM and a group of containers. And to make it even more interesting, both control planes have similar constructs to control behavior and placement. In Kubernetes, you control the placement of containers on worker nodes by using labels and deployment policies. In DRS, you use affinity rules. In Kubernetes, you use requests and limits to specify resource entitlement, while vSphere uses Reservation, Shares, and Limits (RLS) settings. We also have to think about how the individual requests and limits of multiple containers running inside a single CRX instance will translate to the RLS settings of the VM. I will address the resource entitlement considerations in an upcoming article; what I want to explore in this article is the placement of containers and VMs when these two control planes are active.
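To make the Kubernetes side of this duality a bit more concrete, here is a minimal sketch using the Kubernetes Python client. The container name, image, and values are hypothetical and only serve to show where requests (roughly comparable to reservations, as discussed below) and limits are declared.

    from kubernetes import client

    # Hypothetical container spec: requests roughly map to vSphere reservations,
    # limits roughly map to vSphere limits.
    resources = client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},
        limits={"cpu": "1", "memory": "1Gi"},
    )
    container = client.V1Container(name="web", image="nginx:1.17", resources=resources)
    print(container.resources.requests)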

Initial Placement of a CRX Instance

When a developer deploys an application, he or she interacts with the Kubernetes API server, and the API server will trigger all sorts of events to various components that are present in the Kubernetes architecture. Project Pacific extended the Kubernetes API and introduced multiple controllers to interact with the vSphere platform. Therefore it seems Kubernetes is in charge. However, we cannot ignore the pure brilliance of the resource management capabilities of DRS and HostD. Whereas Kubernetes uses a rather brusque and coarse method, simply using requests (the equivalent of reservations) to match resource consumers (containers) with resource providers (worker nodes), vSphere is far more elegant with its ability to understand resource activity, its ability to translate idleness into a temporary priority adjustment, and the alignment of resource entitlement beyond a single host. And to make it even better, Project Pacific is using the scalable shares functionality that allows for instant readjustment of the priority of resource pools if new workloads (i.e., containers) are added to the vSphere namespace. An invention Duncan Epping and I so proudly created with the DRS engineering team back in 2013. Yet the Kubernetes architecture has a very elegant way to express business logic, to easily dictate the placement of containers based on labels, taints, and tolerations. Therefore, it makes sense to integrate or create a mesh of the functionality of both control planes.

Initial Placement Order

  1. The developer engages with the Kubernetes API server to deploy an application. Typically these deployments are submitted to the API server with the use of a YAML file that contains pod specifications.
  2. The deployment and pod specification are stored in the etcd server, and the API server publishes this event to a watch-list to which the kube-scheduler is subscribed (see the sketch after this list). Read this article for more information about event-based architectures.
  3. The kube-scheduler initiates the selection process of an adequate host. It filters the available worker node (ESXi host) list based on affinity rules, pod and node labels, and other nodeSelector constraints.
  4. It sends the curated list to DRS to pick a node. DRS selects the node based on its decision tree (VM resource entitlement, host state, host compatibility).
  5. Once the host is selected, the information is returned via the vCenter API server to the Pacific Scheduler.
  6. The scheduler stores it as an event in the etcd database.
  7. While the event is stored in the etcd database, vCenter issues a command to the HostD process on the selected ESXi host to power-on the virtual machine.
  8. HostD powers on the VM (VMX, VMM) and loads the Photon kernel into the memory address space of this virtual machine.
  9. HostD returns the VM ID of the newly created VM to the Pacific Scheduler Extension.
  10. The VM ID is stored in the etcd database and now the control plane node has enough data for the Spherelet to configure the pod.
  11. The vSphere Pod Lifecycle Controller is updated on the event and instructs the Spherelet to configure the pod.
  12. The Spherelet connects with HostD to configure the personality of the pod and configure networking and storage elements.
  13. The CRX container runtime initiates the start of the containers based on the pod specification.
  14. The Spherelet returns the state of the containers back to the Kubernetes control plane node to have it stored as an event in the etcd database.
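To make step 2 a little more tangible, the sketch below uses the Kubernetes Python client to subscribe to the same kind of watch-list the kube-scheduler uses; it simply prints pod events as the API server publishes them. It assumes a reachable cluster and a valid kubeconfig, and is illustrative only.

    from kubernetes import client, config, watch

    config.load_kube_config()              # assumes a valid kubeconfig is present
    v1 = client.CoreV1Api()

    # Subscribe to pod events, similar to how the kube-scheduler watches for
    # unscheduled pods published by the API server (step 2 above).
    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod, namespace="default", timeout_seconds=30):
        pod = event["object"]
        print(event["type"], pod.metadata.name, pod.spec.node_name)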

With this architecture, you have the best of both worlds: the expressiveness of the Kubernetes control plane while enjoying the elegance of the vSphere resource management capabilities. The next article on this topic dives into resource allocation based on container resource configuration settings. Please be aware that Project Pacific is still in beta and is not (yet) available as a finalized product. Stay tuned for more.

Filed Under: VMware

Multi-GPU and Distributed Deep Learning

February 19, 2020 by frankdenneman

More enterprises are incorporating machine learning (ML) into their operations, products, and services. Similar to other workloads, a hybrid-cloud strategy is used for ML development and deployment. A common strategy is using the excellent toolset and training data offered by public cloud ML services for generic ML capabilities. These ML activities typically improve an organization’s quality of service and increase its productivity. But the real differentiation lies in using the organization’s unique data and know-how to create what’s called differentiated machine learning. The data used is primarily generated by the organization’s own processes or through interaction with its customers. As a result, specific rules and regulations come into play when handling and storing that data. Another strong factor in determining where to deploy ML activities is data gravity. Placing compute close to where the data is generated provides a consistent (often high-performing) service. As a result, many organizations invest in the infrastructure needed to deploy ML and deep learning (DL) solutions.

Deep Learning

Deep learning is a subset of the more extensive collection of machine learning techniques. The critical difference between ML and DL is the way the data is presented to the solution. ML uses mathematical techniques and data to build predictive models. It uses labeled (structured) data to train the model, and once the model is trained accurately enough, it keeps on learning as new data is fed to it. Deep learning does not necessarily need structured or labeled data to create an accurate model that provides a predictive answer. It uses larger neural networks (layers of algorithms, imitating the brain’s neural network), and it needs to be fed vast amounts of data to provide an accurate prediction.

Interestingly, at one point, ML experiences a performance plateau regardless of the amount of incoming new data, while deep learning keeps on improving. For more information about this phenomenon, review the notes from Andrew Ng’s Coursera Deep Learning course or watch his 5-minute clip on YouTube: How Scale is Enabling Deep Learning.

In essence, the magic of deep learning is that it gets better with more data. The question then becomes: how do we create an infrastructure that is capable of feeding, transporting, and processing these vast amounts of data, while still being able to run non-ML/DL workloads?

Parallelism

The best way of dealing with massive amounts of data is to process it in parallel, and that’s where general-purpose computing on GPU (GPGPU) comes into play. A simple TensorFlow test compared the performance of a dual AMD Opteron 6168 system (2×12 cores) with that of a system with a consumer-grade NVIDIA GeForce 1070. The AMD system recorded 440 examples per second, while the GeForce processed 6500 examples per second. There are many performance tests available, but this one shows the power of a consumer-grade GPU versus a data center-grade CPU system.
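The numbers above come from the referenced test; if you want to get a feel for the gap on your own hardware, a minimal (and admittedly crude) TensorFlow 2 sketch such as the following compares matrix-multiplication throughput on CPU and GPU. The matrix size and run count are arbitrary choices, not part of the original benchmark.

    import time
    import tensorflow as tf

    def matmul_rate(device, n=4096, runs=10):
        # Time a batch of large matrix multiplications on the given device.
        with tf.device(device):
            a = tf.random.normal([n, n])
            b = tf.random.normal([n, n])
            tf.matmul(a, b).numpy()        # warm-up, forces execution
            start = time.time()
            for _ in range(runs):
                c = tf.matmul(a, b)
            c.numpy()                      # synchronize before stopping the clock
        return runs / (time.time() - start)

    print("CPU matmuls/sec:", matmul_rate("/CPU:0"))
    if tf.config.list_physical_devices("GPU"):
        print("GPU matmuls/sec:", matmul_rate("/GPU:0"))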

Today’s data center-focused GPUs have more than 5000 cores, all optimized to operate in parallel. These cores have access to 32 GB of high bandwidth memory (HBM2) with speeds up to 900 GB/s (theoretical bandwidth). According to the paper “Analysis of Relationship between SIMD-Processing Features Used in NVIDIA GPUs and NEC SX-Aurora TSUBASA Vector Processors” by Ilya V. Afanasyev et al., the achievable bandwidth on the tested NVIDIA Volta V100 was 809 GB/s. Getting all the data loaded into memory with consistent performance is one element that impacts virtual machine design. See “Machine Learning Workload and GPGPU NUMA Node Locality” for more information.

Although the improvement of processing speed is enormous, up to 10x over a CPU according to this performance study, sometimes this speed-up is not enough. After processing all the training examples in a dataset (called an epoch), a data scientist might make some adjustments as well and start another epoch to improve the prediction model.

It’s common to run multiple epochs before getting an adequately trained model (and, in the process, push lots of data through the system). Reducing training time allows the organization to deploy the trained model faster and start benefiting from their ML and DL initiatives sooner. A “simple” way to reduce training time is to use multiple GPU devices to increase parallelism.

Distributed Deep Learning Strategies

How do you scale out your training model across the multiple GPUs in your system? You add another layer of parallelism on top of the GPUs. Parallelism is a common strategy in distributed deep learning. There are two popular methods of parallelizing DL models: model parallelism and data parallelism.

Model parallelism
With model parallelism, a single model (Neural Network A) is split and distributed across different GPUs (GPU 0 and GPU 1). The same (full) training data will be processed by the different GPUs, depending on which layer is active. Models with a very large number of parameters, too big to fit inside a single device’s memory, benefit from this type of strategy.

Neural networks have data dependency: the output of the previous layer is the input of the next layer. Asynchronous processing of data can be used to reduce training time; however, model parallelism is more about having the ability to run large models.

Maybe model sequentiality would be a better name for this mode, as it primarily uses devices in sequential order. More often than not, a device is idling, waiting to receive the data from another device. Once the model part is trained on one device, it has to synchronize the outcome with the next layer, possibly handled by another device.
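A minimal PyTorch sketch of this strategy, assuming a VM with two GPUs exposed as cuda:0 and cuda:1; the layer sizes are arbitrary. Note how the activations have to hop from one device to the other between the two halves of the model.

    import torch
    import torch.nn as nn

    class ModelParallelNet(nn.Module):
        # First half of the network lives on GPU 0, second half on GPU 1.
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 10).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            # The intermediate activations cross the GPU-to-GPU path here
            # (PCIe or NVLink, depending on the hardware and configuration).
            return self.part2(x.to("cuda:1"))

    model = ModelParallelNet()
    out = model(torch.randn(32, 1024))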

This synchronization is interesting when designing your ML platform, as the data needed to run the model has to traverse the interconnect, either between devices within the ESXi system or between VMs (or containers) running on the platform. More about this in a later paragraph.

Data parallelism
Data parallelism is the most common strategy deployed. As covered in the previous article, “Machine Learning Workload and GPGPU NUMA node locality,” it is common to split up the entire training dataset into batches (batch 0 and batch 1). With data parallelism, these batches are sent to the multiple GPUs (GPU 0 and GPU 1). Each GPU loads a full replica of the model (Neural Network A) and runs its batch of training examples through the model.
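A minimal PyTorch sketch of data parallelism within a single VM; nn.DataParallel replicates the model on every visible GPU and splits each incoming batch across them. Layer and batch sizes are arbitrary.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
    if torch.cuda.device_count() > 1:
        # Replicate the full model on each GPU; every GPU processes its own
        # slice of the batch and the results are gathered afterwards.
        model = nn.DataParallel(model)
    model = model.to("cuda")

    batch = torch.randn(256, 1024).to("cuda")   # one batch of training examples
    out = model(batch)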

The models running on the GPUs must communicate with each other to share the results. Communication timing and patterns between the GPUs depend on the DL model (Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN)) and on the framework used (TensorFlow, PyTorch, MXNet).

Currently, there are a few projects active that are exploring the possibility of hybrid parallelization. This strategy uses both model and data parallelization strategies to minimize end-to-end training time.

Parallelism introduces communication between GPUs. Understanding the data flow is essential to build a system that can provide consistently high performance while ensuring the DL workloads are isolated enough and do not impact other workloads that are using the system. Various distributions of GPU resources are possible, such as a cluster of single-GPU systems or multi-GPU hosts. The next article focuses only on a single node with a multi-GPU configuration, to highlight the different in-system (on-node) interconnects.

On-Node Interconnect

vSphere allows for different multi-GPU configurations. A VM can be equipped with multiple GPUs configured as passthrough devices, configured with vGPUs with the help of NVIDIA drivers, or by using a Bitfusion solution. Details about the different solutions will be covered in a future article. But regardless of the chosen configuration, the application will be able to use multiple GPUs in a single VM.

When deploying deep learning models across multiple GPUs in a single VM, the ESXi host PCIe bus becomes an inter-GPU network that is used for loading the data from system memory into the device memory. Once the model is active, the PCIe bus is used for GPU to GPU communication for synchronization between models or communication between layers.

If two PCIe devices communicate with each other, the CPU is involved. Data coming from the source device is stored in system memory before being transferred to the destination device. The new Skylake architecture, with its updated IIO structure and additional mesh stops, improved the CPU-to-PCIe communication over the previous ring-based architecture featured on the Xeon v1 through v4. (Each mesh stop has a dedicated cache and traffic controller.)

CPU to GPU to CPU communication within a single NUMA node (Skylake Architecture)

To avoid this detour through the CPU and system memory, NVIDIA introduced GPUDirect in CUDA 4.0, allowing direct memory access between two devices. However, this requires a full topology view of the system, and this is something vSphere currently does not expose. As such, no direct PCI-to-PCI communication is available (yet).

At first glance, this lack of a topology view seems like an enormous bottleneck, but it does not necessarily mean an application performance slowdown. Modern frameworks optimize their GPU code to minimize communication. As a result, communication between devices is just a portion of the total time. Depending on the framework used and the parallelism strategy, the performance can still be close to bare-metal performance.
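If you want to see what a given environment actually offers, PyTorch exposes a simple peer-to-peer capability check; whether it returns True inside a VM depends on the topology the hypervisor exposes, as described above. A minimal sketch:

    import torch

    if torch.cuda.device_count() > 1:
        # True if GPU 0 can address GPU 1's memory directly (peer-to-peer),
        # False if traffic has to bounce through the CPU and system memory.
        print(torch.cuda.can_device_access_peer(0, 1))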

NVIDIA NVLink

In 2016, NVIDIA introduced the NVLink interconnect, a high-speed mesh network that allows GPUs to communicate directly with each other. NVLink is designed to replace the GPU-to-GPU communication across the PCIe lanes, and as a result, NVLink uses a separate interconnect. A new custom form factor, SXM2 (supported by vSphere), allows the GPU to interface with the NVIDIA High-Speed Signalling interconnect (NVHS). The NVHS allows the GPU to communicate with the other GPUs as well as access system memory directly. Currently, NVLink 2.0 (available on NVIDIA Tesla V100 GPUs) provides an aggregate maximum theoretical bidirectional bandwidth of 300 GB/s. (AMD does not have any equivalent to NVLink.)

NVIDIA V100 SXM2

Design Decisions

Data movement within an ML system (VM) can be substantial. Fetching the data from storage and storing it in system memory before dispatching it to multiple vGPUs can produce a significant load on the platform. Depending on the neural network, framework, and parallelism strategy, communication between GPUs can add additional load to the system. It’s key to understand this behavior before considering retrofitting your current platform with GPU devices or while designing your new vSphere clusters.

Depending on the purpose of the platform, it might be interesting to research the value of having a separate interconnect mesh for ML/DL workloads. It allows for incredible isolation that enables you to run other workloads on the ESXi host as well. Couple this with the ability to share multiple GPUs through the Bitfusion solution, and you can create a platform that provides consistently high performance for ML workloads to numerous data scientists.

Filed Under: AI & ML Tagged With: Machine Learning

Machine Learning Workload and GPGPU NUMA Node Locality

January 30, 2020 by frankdenneman

In the previous article, “PCIe Device NUMA Node Locality,” I covered the physical connection between the processor and the PCIe device and briefly touched upon machine learning workloads with regard to PCIe NUMA locality. This article zooms in on why it is important to consider PCIe NUMA locality.

General-Purpose Computing on Graphics Processing Units

New compute-intensive workloads take advantage of the new programming model called general-purpose computing on GPU (GPGPU). With GPGPU, the many cores integrated on modern GPUs are used to offload a vast number of (parallel) compute threads from the CPU. By adding another computational device with different characteristics, a heterogeneous compute architecture is born. GPUs are optimized for streaming sequential (or easily predictable) access patterns, while CPUs are designed for general access patterns and concurrency of threads. Combined, they form a GPGPU pipeline that is exceptionally well-suited to analyze data. The vSphere platform is well-suited to create GPGPU pipelines, and optimizations are provided to VMs, such as DirectPath I/O Access (also known as passthrough). Passthrough allows the application to interface with the accelerator device directly; however, data must be transferred from disk/network through the system (RAM) to the GPU. Controlling that data transfer is of interest to the overall performance of the platform for both GPGPU and non-GPGPU workloads.

A very popular GPGPU workload is machine learning (ML). Many ML workloads process gigabytes of data, sometimes even terabytes, and this data flows from the storage device up to the PCIe device. Finetuning the configuration and placement of the virtual machine running the ML workload can benefit the data scientist and other consumers of the platform. Not every ML workload is latency-sensitive, but most data scientists prefer to get the training done as quickly as possible. This allows them to perform more training iterations to fine-tune the model (also known as the neural network). Due to the movement of data through the system, an ML workload can quickly become the noisiest neighbor you ever saw in your system. But with the right guardrails in place, data scientists can take advantage of running their workload on a consistently performing platform, while the rest of the organization can consume resources from this platform as well.

Machine Learning Concepts

Oversimplified, ML is “using data to answer questions.” With traditional programming models, you create “rules” using the programming language and apply these rules to the input to get results (output). With ML training, you provide the input and the output to train the program to create the rules. This creates a predictive model that can be used to analyze previously unseen data and provide accurate answers. The key component of the entire ML process is data. This data is stored on a storage device and fetched to be used as input for the model to be trained on, or to use the trained model to provide results. Training a machine learning model is primarily done by a neural network of nodes that are executed by thousands of cores on GPUs. The nature of the cores (SIMT – Single Instruction, Multiple Threads) allows for extremely fast parallel processing, ideal for this sort of workload; hence you want to use GPUs for this task and not the serial-workload-optimized CPUs. The heavy lifting of the compute part is done by the GPU, but the challenge is getting the data to the costly GPU cores as fast and consistently as possible. If you do not keep the GPU cores fed with all the data they need, a large part of the GPU cores sit idle until new data shows up. And this is the challenge to overcome: handling large quantities of training data that flow from storage, through the host memory, into the VM memory before flowing into the memory of the GPU. High-speed storage systems with fast caching and fast paths between the storage, CPU, server memory, and PCIe device are necessary.

Anatomy of an ML Training Workload

The collection of training examples is called a dataset, and the golden rule is: the more data you can use during the training, the better the predictive model becomes. That means that the data scientist will unleash copious amounts of data on the system, data so large that it cannot fit inside the memory of the GPU device, perhaps not even in the memory assigned to the virtual machine. As a result, the data is stored on disk and retrieved in batches.

The data scientist typically finetunes the size of the batch set; finetuning a batch set size is considered an art form in the world of ML. You, the virtual admin, slowly graduating into an ML infrastructure engineer (managing and helping to design the ML platform), can help the data scientist by sizing the virtual machine correctly. Look at CPU consumption and determine the correct number of vCPUs necessary to push the workload. Once the GPU receives a batch, the workload is contained within the GPU. Rightsizing the VM can help to improve performance further, as the VM might fit inside a single NUMA node.

To understand the dataflow of an ML workload through the system, let’s get familiar with some neural network terminology. Most ML workloads use the Compute Unified Device Architecture (CUDA) for GPU programming, and when using a batch of the training data, the CUDA program takes the following steps:

1: Allocate space on the GPU device memory

2: Copy (batch set) input data to the device (aka Host to Device (HtoD))

3: Run the algorithm on the GPU cores

4: Copy output (results) back to host memory (aka Device to Host (DtoH))

During training, the program processes all the training examples in the dataset. This cycle is called an epoch. As mentioned before, a data scientist can decide to split up the entire dataset into smaller batch sets. The number of training examples used is called the batch size. The number of iterations is the number of passes (one per batch) the program needs to go through the entire dataset and complete a single epoch. For example, if a dataset contains 100,000 samples and each batch contains 1000 training examples, it takes 100 iterations to complete a single epoch. Each iteration uses the previously described CUDA loop. To get a better result, multiple epochs are run to get a better convergence of the training model. Within each epoch, the neural network tweaks its own parameters (the weights of each node); this finetuning provides a more accurate prediction result when the model is used during the inference operation. The interesting part is that the data scientist can also make some adjustments to the (hyper)parameters of the ML model. Simply put, a hyperparameter is a parameter whose value is set before the training process begins, such as the number of weights or the batch size. To verify whether this tuning was helpful, a new sequence of epochs is kicked off. A great series of videos about neural networks can be found here.
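The following PyTorch sketch strings these concepts together: the batch-size arithmetic from the example above and the four CUDA steps per iteration. The workload (a plain matrix multiplication) and all sizes are placeholders, not a real training loop.

    import torch

    dataset_size = 100_000
    batch_size = 1_000
    iterations_per_epoch = dataset_size // batch_size    # 100 iterations per epoch
    epochs = 10

    for epoch in range(epochs):
        for _ in range(iterations_per_epoch):
            batch = torch.randn(batch_size, 1024)        # batch staged in host (VM) memory
            batch = batch.to("cuda")                     # steps 1+2: allocate on device, copy HtoD
            result = batch @ batch.T                     # step 3: kernels run on the GPU cores
            result = result.cpu()                        # step 4: copy results back, DtoH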

Josh Simons and Justin Murray gave a 4-hour workshop on ML workloads on vSphere at VMworld last year. In this workshop, they stated that the typical values they saw were gigabytes of data (D), 10 to 100s of epochs (E), and 10 or more tuning cycles (T), which can be substantially more (in the 1000s) when researching new models. You can imagine that such data volumes can become a challenge in a shared system such as the hypervisor. Let’s take a look at why isolation can benefit both the ML workload and the other resident workloads on the system.

CPU Scheduler and NUMA optimizations

When the data is fetched from the storage device, it is stored in memory. The compute schedulers of the VMkernel are optimized to store the memory as close to the CPUs as possible. Let’s use the most popular server configuration in today’s data center, the dual-socket system. Each socket holds a processor, and within the processor memory controllers exist. Memory modules (DIMMs) attached to these memory controllers are considered local memory capacity. Both processors are connected to each other to allow each processor to access the memory connected to the other processor. Due to the difference in latency and bandwidth, this is considered non-uniform memory access (NUMA). For more information about NUMA, check out this series.

Let’s use the example of a 4 vCPU VM with 32 GB of memory, running on a host with 512 GB of memory and two processors containing 10 cores each. The dataset is 160 GB and cannot be stored in the VM memory, let alone the GPU device memory, so the data scientist sets the batch size to 16 GB. The program fetches 16 GB of training data from the datastore, and the NUMA scheduler ensures the data is stored within the local memory of the processor the four vCPUs run on. In this case, the vCPUs of the VM are scheduled on the cores of CPU 1 (NUMA node 1), and thus the NUMA scheduler requests the VMkernel memory scheduler to store it in the memory pages belonging to the memory address space managed by the memory controllers of CPU 1.

The VM is configured with a passthrough GPU, and the training data is pushed to the GPU. The problem is that the GPU is manually selected by the admin, and no direct relation is visible in the UI or command line; it just shows the type name and a PCI address. GPUs are PCIe devices, and they are hardwired to and controlled by a CPU.

The admin selected the first GPU in the list, and now the dataset is pushed directly from the VM memory to the GPU device memory to be used by the cores of the GPU. Data now flows through the interconnect to the PCIe controller of CPU 0 and on to the GPU device. Each batch that is retrieved from storage is stored in NUMA node 1 and then moved through the interconnect to the device. This is done for each iteration of each epoch, and it can happen thousands of times.

The problem is that the interconnect is used by the entire system. When the CPU scheduler needs to rebalance, it can reschedule a vCPU on cores belonging to a different CPU if this improves the overall resource availability for the active virtual machines. Memory can be transferred over to the new NUMA home node of that recently migrated virtual machine, or memory is simply accessed across the interconnect. The same applies to wide VMs, VMs that span multiple NUMA nodes; these wide VMs can end up accessing a lot of “remote” memory. Also, do not forget the data being handled by other PCIe devices. All network traffic has to flow from the NIC to a particular VM; for optimal performance, the kernel prefers to store that data in memory that is local to the vCPUs of that VM. The same goes for data coming from external storage devices: if the HBA or NIC is “hanging” off the other CPU, data has to flow through the interconnect. The interconnect is a highway shared by a lot of components and workloads. These operations can impact the performance of the ML workload, but the opposite is also true: pushing 1000 epochs of gigabytes of data to a GPU ensures other workloads will notice the presence of that workload, even if it has a small CPU and memory footprint. Remember, ML is “using data to answer questions.”

PTNumaTopology PowerCLI Module

To make sense of it all, I created a simple PowerCLI module with two functions that show the VMs that have a passthrough device configured. The output shows the VM name and the PCI address of the device so that you can relate that to what you see in the UI. The next column shows the NUMA node to which the PCIe device is connected. The next column indicates whether the advanced setting numa.affinity is set for that particular VM and its value. The last column shows the power state of the VM. To set the NUMA affinity, the VM has to be powered off.

To run the script, import the module (available on GitHub) and execute the Get-PTNumaTopology command. Specify the FQDN of the ESXi host, for example: Get-PTNumaTopology -esxhost sc2esx27.vslab.local. As the script needs to execute a command locally on the ESXi host, an SSH session is initiated. This results in a prompt for a (root) username and password in a separate login screen. (The GitHub page has a thorough walk-through of all the steps involved and a list of requirements.)

NUMA Affinity Advanced Setting

In most situations, it is not recommended to set any affinity setting, as it simply restricts the scheduler’s ability to generate an optimal balance between resource providers (CPUs) and consumers (vCPUs), both at the host level and the cluster level. However, since the VM is configured with a passthrough (PT) GPU, it cannot move to another host, and chances are a lot of data will flow to this device. Another assumption is that the host contains a small number of GPUs and thus a small number of VMs are active. If no other restrictions are configured, the CPU and NUMA schedulers can try to work “around” the affined VM and attempt to optimize the placement and resource consumption of the other active VMs. Hopefully, the isolation of these particular passthrough-enabled VMs reduces overall system load and thus evens out the restrictions it enforces. Testing this first before using it on the production workload is always recommended! For more information about the NUMA affinity setting, please consult the VMware Docs for your specific vSphere version; linked is the VMware Docs page for vSphere 6.7.

Why set numa.affinity and not use CPU pinning? First of all, CPU pinning is something that should not be done, ever. And even when you think you have a valid use case, chances are that CPU pinning will still reduce performance significantly. This topic is rearing its ugly head again, and I will soon post another article on why CPU pinning is just a bad idea. NUMA affinity creates a rule for the CPU scheduler to find a CPU core or HT within the boundaries of the CPU itself. Take the example of the 4 vCPU VM running on the 10-core CPU: with hyperthreading enabled, the CPU scheduler can schedule each of these four vCPUs on any of the 20 available logical processors. If the system is not over-utilized, it can use a complete core for a vCPU and find the optimal placement for that workload and for the others using the same CPU. With pinning, you restrict each vCPU to run only on a particular logical processor; if chosen incorrectly, you might have selected only HTs.

If you decide to set a NUMA affinity on a particular VM, the Get-PTNumaTopology function can help you set it correctly. As a failsafe, the script asks whether you would like to set the NUMA node affinity of a powered-off VM. Answer “N” to end the script and return to the command line. If you answer “Y” for yes, it will then ask you to provide the name of the VM. Please note that this setting can only be applied to a powered-off VM: setting an advanced setting means the system writes it to the VMX file, and the VMX file is locked while the VM is powered on. The next step is to provide the NUMA node you want to affine the vCPUs to. Use the same number listed in the PCI NUMA Node column behind the attached passthrough device.

Once the advanced setting is configured, the script shows the configured value. To verify whether the setting matches the NUMA node of the passthrough device, run the Get-PTNumaTopology command again. As it has closed the SSH connection after the last run, you are required to log in again with the root user account to retrieve the current settings.

Setting the NUMA node advanced option for a VM is something that should be done for specific reasons; do not use the script for all your virtual machines. The NUMA affinity setting applies to the placement of vCPUs only. The NUMA scheduler provides recommendations to the memory scheduler, but it is at the memory scheduler’s discretion where to store the data. The kernel is optimized to keep the memory as close to the vCPUs as possible, but sometimes it cannot fit that memory into that node, either because the VM configuration exceeds the total capacity of that node or because other active VMs are already using large amounts of that node’s memory. Setting the affinity is not a 100% guarantee that all the resources are local, but in the majority of use cases they will be. Isolating the workload within a specific NUMA node helps to provide consistent performance and reduces interconnect bandwidth consumption considerably. Enjoy using the script!

Font used in PowerShell environment: JetBrains Mono – available at – https://www.jetbrains.com/lp/mono/#intro

Filed Under: AI & ML, CPU, NUMA Tagged With: GPGPU, GPU, Machine Learning, NUMA, PCIe, VMware, vSphere

PCIe Device NUMA Node Locality

January 10, 2020 by frankdenneman

During this Christmas break, I wanted to learn PowerCLI properly. As I’m researching the use-cases of new hardware types and workloads in the data center, I managed to produce a script to identify the PCIe Device to NUMA Node Locality within a VMware ESXi Host. The script set contains a script for the most popular PCIe Device types for data centers that can be assigned as a passthrough device. The current script set is available on Github and contains scripts for GPUs, NICs and (Intel) FPGAs.

PCIe Devices Becoming the Primary Units of Data Processing

Due to the character of new workloads, the PCIe device is quickly moving up from “just” being a peripheral device to becoming the primary unit of data processing. Two great examples of this development are the rise of general-purpose computing on GPU (GPGPU), often referred to as GPU compute, and the virtualization of the telecommunication space.

The concept of GPU computing implies using GPUs and CPUs together. In many new workloads, the processes of an application are executed on a few CPU cores, while the GPU, with its many cores, handles the computationally intensive data-processing part. Another workload, or better said, a whole industry that leans heavily on the performance of PCIe devices, is the telecommunication industry. Virtual Network Functions (VNF) require platforms using SR-IOV capable NICs or SmartNICs to provide ultra-fast packet processing performance.

In both scenarios, having insight into PCIe device to processor locality is a must to provide the best performance to the application, or to avoid introducing menacing noisy neighbors that can influence the performance of other workloads active in the system.

PCIe Device NUMA Node Locality

The majority of servers used in VMware virtualized environments are two-CPU-socket systems. Each CPU socket accommodates a processor containing several CPU cores. A processor contains multiple memory controllers offering a connection to directly connected memory. An interconnect (Intel: QuickPath Interconnect (QPI) and UltraPath Interconnect (UPI), AMD: Infinity Fabric (IF)) connects the two processors and allows the cores within each processor to access the memory connected to the other processor. When accessing memory connected directly to the processor, it is called local memory access. When accessing memory connected to the other processor, it is called remote memory access. This architecture provides non-uniform memory access (NUMA), as access latency and bandwidth differ between local and remote memory access. Henceforth these systems are referred to as NUMA systems.

It was big news when the AMD Opteron and Intel Nehalem processors integrated the memory controller within the processor. But what about PCIe devices in such a system? Since the Sandy Bridge architecture, Intel has grouped the functions that sit outside the core in the Uncore, a “construct” that is integrated into the processor as well. And it is this Uncore that handles the PCIe bus functions. It provides access to NVMe devices, GPUs, and NICs. Below is a schematic overview of a 28-core Intel Skylake processor showing the PCIe ports and their own PCIe root stack.

Intel Skylake Mesh Architecture

In essence, a PCIe device is hardwired to a particular port on a processor. And that means that we can introduce another concept to NUMA locality, which is PCIe locality. Considering PCIe locality when scheduling low-latency or GPU compute workload can be beneficial not only to the performance of the application itself but also to the other workloads active on the system.

NUMA Locality Venn Diagram

For example, Machine Learning involves processing a lot of data, and this data flows within the system from the CPU and memory subsystem to the GPU to be processed. Properly written Machine Learning application routines minimize communication between the GPU and CPU once the dataset is loaded on the GPU, but getting the data onto the GPU typically turns the application into a noisy neighbor to the rest of the system. Imagine if the GPU card is connected to NUMA node 0, and the application is running on cores located in NUMA node 1. All that data has to go through the interconnect to the GPU card.

The interconnect provides more theoretical bandwidth than a single PCIe 3.0 device can operate at, ~40 GB/s vs. 15 GB/s, but we have to understand that the interconnect is used for all cross-node PCIe connectivity and memory transfers. If you want to explore this topic more, I recommend reviewing Amdahl’s Law – “Validity of the single processor approach to achieving large scale computing capabilities,” published in 1967 and still very relevant – and the strongly related Little’s Law. Keeping the application processes and data-processing software components on the same NUMA node keeps the workloads from flooding the QPI/UPI/AMD IF interconnect.
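As a quick refresher, Amdahl’s Law bounds the achievable speed-up by the fraction of the work that actually benefits from more parallel capacity; a minimal sketch with hypothetical numbers:

    def amdahl_speedup(parallel_fraction: float, units: int) -> float:
        # Overall speed-up when only part of the work benefits from more units.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / units)

    print(amdahl_speedup(0.95, 8))    # ~5.9x, despite an 8x increase in parallel capacity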

For VNF workloads, it is essential to avoid any latency introduced by the system. Concepts like VT-d (Virtualization Technology for Directed I/O) reduce the time spent in the system for I/Os and isolate the path so that no other workload can affect its operation. Keeping the vCPUs within the same NUMA domain ensures that no additional penalties are introduced by traffic on the interconnect and that the shortest path is provided from the CPU to the PCIe device.

Constraining CPU Placement

The PCIe Device NUMA Node Locality script assists in obtaining the best possible performance by identifying the PCIe locality of GPU, NIC, or FPGA PCIe devices within VMware ESXi hosts. Typically, VMs running NFV or GPGPU workloads are configured with a PCI passthrough-enabled device. As a result, these VMware PowerCLI scripts inform the user which VMs are attached directly to the particular PCIe devices.

Currently, the VMkernel schedulers do not provide any automatic placement based on PCIe locality. CPU placement can be controlled by associating the listed virtual machines with a specific NUMA node using an advanced setting.

Please note that applying this setting can interfere with the ability of the ESXi NUMA scheduler to rebalance virtual machines across NUMA nodes for fairness. Specify NUMA node affinity only after you consider the rebalancing issues.

The Script Set

The purpose of these scripts is to identify the PCIe device to NUMA node locality within a VMware ESXi host. The script set contains a script for the most popular PCIe device types for data centers that can be assigned as a passthrough device. The current script set contains scripts for GPUs, NICs, and (Intel) FPGAs.

Please note that these scripts only collect information and do not alter any configuration in any way possible.

Requirements

  • VMware PowerCLI
  • Connection to VMware vCenter
  • Unrestricted Script Execution Policy
  • Posh-SSH
  • Root Access to ESXi hosts

Please note that Posh-SSH only works on the Windows version of PowerShell.

The VMware PowerCLI script primarily interfaces with the virtual infrastructure via a connection to the VMware vCenter Server. A connection (Connect-VIServer) with the proper level of certificates must be in place before executing these scripts. The script does not initiate any connect session itself; it assumes this is already in place.

As the script extracts information from the VMkernel Sys Info Shell (VSI Shell), it uses Posh-SSH to log into the ESXi host of choice and extract the data from the VSI Shell for further processing. The Posh-SSH module needs to be installed before running the PCIe-NUMA-Locality scripts; the script does not install Posh-SSH itself. This module can be installed by running the following command: Install-Module -Name Posh-SSH (admin rights required). More information can be found at https://github.com/darkoperator/Posh-SSH

Root access is required to execute a vsish command via the SSH session. It might be possible to use sudo, but this functionality has not been included in the script (yet). The script uses the Posh-SSH keyboard-interactive authentication method and presents a screen that allows you to enter your root credentials securely.

Script Content

Each script consists of three stages: host selection & logon, data collection, and data modeling. The script uses the Posh-SSH module to create an SSH connection and runs a vsish command directly on the node itself. Due to this behavior, the script creates an output per server and cannot be invoked at the cluster level.

Host Selection & Logon

The script requires you to enter the FQDN of the ESXi host, and since you are already providing input via the keyboard, the script initiates the SSH session to the host, requiring you to log in with the root user account of the host. When using the GPU script, the input of the GPU vendor name is requested. The input can be, for example, NVIDIA, AMD, Intel, or any other vendor providing supported GPU devices. This input is not case-sensitive.

Data Collection

The script initiates an esxcli command that collects the PCIe address of the chosen PCIe device type. It stores the PCIe addresses in a simple array.

Data Modeling

The NUMA node information of the PCIe device is available in the VSI Shell. However, it is listed under the decimal value of the bus ID of the PCIe address of the device. The part that follows is a collection of instructions converting the full address into its decimal bus value. Once this value is available, it is inserted in a vsish command and executed on the ESXi host via the already opened SSH connection. The NUMA node, plus some other information, is returned by the host, and this data is trimmed to get the core value and store it in a PSObject. Throughout all the steps of the data modeling phase, each output of the used filter functions is stored in a PSObject. This object can be retrieved to verify whether the translation process was executed correctly. Call $bdfOutput to retrieve the most recent conversion (as the data of each GPU flows serially through the function pipeline, only the last device conversion can be retrieved by calling $bdfOutput).
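The module itself is written in PowerCLI, but the conversion is easy to illustrate in a few lines of Python; the PCIe address below is a made-up example.

    def pcie_bus_to_decimal(pcie_address: str) -> int:
        # "0000:af:00.0" -> bus id "af" (hexadecimal) -> 175 (decimal),
        # the value used to look up the device in the VSI shell.
        bus_hex = pcie_address.split(":")[1]
        return int(bus_hex, 16)

    print(pcie_bus_to_decimal("0000:af:00.0"))   # 175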

The next step is to identify if any virtual machines registered on the selected host are configured with PCIe passthrough devices corresponding with the discovered PCIe addresses.

Output

A selection of data points is generated as output by the script:

PCIe Device    Output Values
GPU            PCI ID, NUMA Node, Passthrough Attached VMs
NIC            VMNIC name, PCI ID, NUMA Node, Passthrough Attached VMs
FPGA           PCI ID, NUMA Node, Passthrough Attached VMs

The reason why the PCI ID is displayed is that when you create a VM, the vCenter UI displays the (unique) PCI ID first to identify the correct card. An FPGA and a GPU do not have a VMkernel label, such as the VMNIC label of a network card. No additional information about the VMs is provided, such as CPU scheduling locations or vNUMA topology, as these are expensive calls to make and can change every CPU scheduling quantum (50 ms).

It’s recommended to review the CPU topology of the virtual machine and, if possible, to set the NUMA node affinity following the instructions listed in the VMware Resource Management Guide. Please note that using this advanced setting can impact the ability of the CPU and NUMA schedulers to achieve an optimal balance.

Using the Script Set

  • Step 1. Download the script by clicking the “Download” button on the Github repository
  • Step 2. Unlock scripts (Properties .ps1 file, General tab, select Unlock.)
  • Step 3. Open PowerCLI session.
  • Step 4. Connect to VIServer
  • Step 5. Execute script for example, the GPU script: .\PCIE-NUMA-Locality-GPU.ps1
  • Step 6. Enter ESXi Host Name
  • Step 7. Enter GPU Vendor Name
  • Step 8. Enter Root credentials to establish SSH session

  • Step 9. Consume output and possibly set NUMA Node affinity for VMs

Acknowledgments

This script set would not have been created without the guidance of @kmruddy and @lucdekens. Thanks, Valentin Bondzio, for verification of NUMA details and Niels Hagoort and the vSphere TM team for making their lab available to me.

Filed Under: NUMA, VMware

vSphere 6.5+ DRS Pairwise Balancing

October 30, 2019 by frankdenneman

Or maybe I should have called this blog post “I’m seeing an excessive number of DRS-initiated vMotions in my newly upgraded 6.5 environment.” Recently I was part of a few conversations about the nature of DRS load balancing in systems running vSphere 6.5 and newer. It was noticed that more vMotion operations were occurring since running 6.5, and it is highly likely that these operations occur due to the new DRS pairwise balancing functionality. Pairwise balancing was introduced in vSphere 6.5 and is focused on keeping the host resource utilization disparity within a certain threshold. As a result, DRS performs load-balancing operations if the difference between the lowest-utilized host and the highest-utilized host exceeds a certain percentage. That percentage depends on your migration threshold. The default migration threshold uses a 20% tolerable difference in utilization.

Migration Threshold Level    Tolerable CPU/Memory usage difference between any two hosts in the cluster
1                            Not available (only affinity violations and MM migrations allowed)
2                            30%
3 (default)                  20%
4                            10%
5                            5%

This new feature is needed as clusters keep on growing larger and larger. To determine whether load-balancing operations are necessary, DRS calculates two metrics: the current host load standard deviation (CHLSTD) and the target host load standard deviation (THLSTD). Each host reports its load, and DRS calculates the standard deviation of the host load metric across all the hosts in the cluster. DRS calculates a target host load balance for the cluster, and as long as the current host load standard deviation is less than or equal to the target host load value, DRS considers the cluster balanced. The migration threshold determines how far apart the CHLSTD and THLSTD can be before DRS triggers load-balancing operations. The more aggressive the migration threshold, the smaller the tolerated difference between the CHLSTD and THLSTD.

A situation can occur where a few hosts in a large cluster experience high resource utilization while the majority of hosts do not. Due to the size of the cluster, the few highly loaded hosts become statistical outliers that simply disappear as noise among the vast number of hosts that experience (far) lower utilization. As a result, these outliers are missed, as the calculated CHLSTD stays below the threshold required to trigger load balancing.

By adding the functionality of pairwise balancing, and “simply” comparing the highest reported utilization with the lowest utilization, these outliers might be a thing of the past. That means that in certain cases, the DRS UI might report that the cluster is in a balanced state, yet load-balancing operations still occur. This behavior can be attributed to pairwise balancing.
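The following Python sketch illustrates the idea with made-up utilization figures; it is not the actual DRS algorithm or its metrics, just the outlier scenario described above: the cluster-wide standard deviation stays low while the pairwise gap clearly exceeds the default 20% threshold.

    import statistics

    # Hypothetical CPU utilization (%) of ten hosts; one host is overloaded.
    host_util = [35, 34, 36, 35, 35, 34, 36, 35, 35, 85]

    chlstd = statistics.pstdev(u / 100 for u in host_util)      # cluster-wide view
    pairwise_gap = (max(host_util) - min(host_util)) / 100      # pairwise view (6.5+)

    print(f"CHLSTD: {chlstd:.3f}, pairwise gap: {pairwise_gap:.2f}")
    if pairwise_gap > 0.20:    # default migration threshold (level 3)
        print("Pairwise imbalance detected -> DRS evaluates load-balancing moves")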

Please keep in mind that if you are using a migration threshold that is more aggressive than the default setting, the tolerable difference between hosts is reduced, and more migrations are likely to occur.

So what happens when the tolerable difference is detected in the cluster? Does this mean that VMs are migrated from the highest-utilized host to the lowest-utilized host? Not necessarily. VMs can be migrated to any other host in the cluster. DRS still takes many different requirements into account when selecting a virtual machine migration for load-balancing purposes. Anti-affinity and affinity rules cannot be violated to obtain a better cluster load balance, so those moves are not considered. Compatibility of hosts and VM configuration also impacts migration options (a missing datastore or network portgroup is a common reason why particular hosts are overloaded while others are lower utilized), and the “cost-benefit” of a VM migration is still taken into account. It still needs to make sense for the cluster balance to incur the infrastructure costs and risk of moving a particular VM.

If you recently updated your vCenter to 6.5/6.7 and are curious to see whether the vMotions are triggered by pairwise imbalance operations, you can use the online version of the DRS Dump Insight tool available at https://www.drsdumpinsight.vmware.com/. You can also run the DRS Dump Insight tool on-prem by installing one of the flings available here: https://flings.vmware.com/?utf8=%E2%9C%93&q=DRS+Dump+Insight&button=. Grep for “Pairwise Imbalance”.

If this behavior is not appreciated, and you do not want to alter the migration threshold, you can switch back to the old behavior by turning off pairwise balancing: set the cluster advanced option CheckPairWiseImbalance to 0 (case-sensitive). Although this functionality was introduced in vSphere 6.5 and is active by default in all newer releases, we have backported it to vSphere 6.0 U3.

One thing I would like to ask: if you want to disable it, what are your reasons? I expect “too many vMotions,” but I would like to understand why a vMotion, or a collection of vMotions, is considered undesirable. The main goal is to get the VMs to a place where they have access to enough resources; why is that a bad thing?

Filed Under: DRS, VMware
