SCHEDULING VSPHERE PODS
The previous article “Initial Placement of a vSphere Pod,” covered the internal construction of a vSphere Pod to show that it is a combination of a tailor-made VM construct and a group of containers. Both the Kubernetes and vSphere platforms contain a rich set of resource management controls, policies, and features that guide, control, or restrict scheduling of workloads. Both control planes use similar types of expressions, creating a (false) sense of unification in the context of managing a vSphere pod. This series of articles examines the overlap of the different control plane dialects, and how the new vSphere 7 platform allows the developer to use Kubernetes native expressions, while the vSphere Admin continues to use the familiar vSphere native resource management functionalities.
DEEP LEARNING TECHNOLOGY STACK OVERVIEW FOR THE VADMIN - PART 1
Introduction We are amid the AI “gold rush.” More organizations are looking to incorporate any form of machine learning (ML) or deep learning in their services to enhance customer experience, drive efficiencies in their processes or improve quality of life (healthcare, transportation, smart cities). Train Where Data is Generated One of the key elements that drive on-premise ML focused infrastructure growth is the reality of data gravity. Mentioned in the article “Multi-GPU and Distributed Deep Learning,” deep learning (DL) gets better with data. Consequently, data sets used for ML and DL training purposes are growing at a tremendous rate. These vast data sets need to be processed. Data transit, hosting, and the necessary compute cycles impact the overall OPEX budget. Additionally, data protection regulations such as data residency, data sovereignty, and data locality impact where data can be stored outside the place it is created. As a result, a lot of forward-leaning organizations are repatriating their AI platforms to run ML and DL workloads close to the systems that generate the data.
INITIAL PLACEMENT OF A VSPHERE POD
Project Pacific transforms vSphere into a unified application platform. This new platform runs both virtual machine and Linux containers as native workload constructs. Just introducing Linux containers as a new workload object is not enough. To manage containers properly, you need a legitimate orchestrator. And on top of that, you need to make sure that existing services, such as DRS, can handle the different lifecycles of these different objects. Containers typically have a shorter lifecycle than virtual machines, where VMs “live” for years, containers have a shorter life expectancy. And this massively different churn impacts initial placement and load-balancing operations of resource management services.
MULTI-GPU AND DISTRIBUTED DEEP LEARNING
More enterprises are incorporating machine learning (ML) into their operations, products, and services. Similar to other workloads, a hybrid-cloud model strategy is used for ML development and deployment. A common strategy is using the excellent toolset and training data offered by public cloud ML services for generic ML capabilities. These ML activities typically improve an organization’s quality of service and increase in productivity. But the real differentiation lies within using the organization’s unique data and know-how to create what’s called differentiated machine learning. The data used is primarily generated by own processes or through interaction with its customers. As a result, specific rules and regulations come into play when handling and storing that data. Another strong aspect of determining where to deploy ML activities is data gravity. Placing compute close to where the data is generated provides a consistent (often high-performing) service. As a result, many organizations invest in the infrastructure needed to deploy ML and deep learning (DL) solutions.
MACHINE LEARNING WORKLOAD AND GPGPU NUMA NODE LOCALITY
In the previous article “PCIe Device NUMA Node Locality” I covered the physical connection between the processor and the PCIe device briefly touched upon machine learning workloads with regards to PCIe NUMA locality. This article zooms in on why it is important to consider PCIe NUMA locality. General-Purpose Computing on Graphics Processing Units New compute-intensive workloads take advantage of the new programming model called general-purpose computing on GPU (GPGPU). With GPGPU, the many cores integrated on modern GPUs are used to offload a vast number of (parallel) compute threads from the CPU. By adding another computational device with different characteristics, a heterogeneous compute architecture is born. GPUs are optimized for streaming sequential (or easily predictable) access patterns, while CPUs are designed for general access patterns and concurrency of threads. Combined, they form a GPGPU pipeline, that is exceptionally well-suited to analyze data. The vSphere platform is well-suited to create GPGPU pipelines and optimizations are provided to VMs, such as DirectPath I/O Access (also known as Passthrough). Passthrough allows the application to interface with the accelerator device directly; however, data must be transferred from disk/network through the system (RAM) to the GPU. And controlling the data transfer is of interest to the overall performance of the platform for both GPGPU workload and non-GPGPU workload.
PCIE DEVICE NUMA NODE LOCALITY
During this Christmas break, I wanted to learn PowerCLI properly. As I’m researching the use-cases of new hardware types and workloads in the data center, I managed to produce a script to identify the PCIe Device to NUMA Node Locality within a VMware ESXi Host. The script set contains a script for the most popular PCIe Device types for data centers that can be assigned as a passthrough device. The current script set is available on Github and contains scripts for GPUs, NICs and (Intel) FPGAs.
VSPHERE 6.5+ DRS PAIRWISE BALANCING
Or maybe I should have called this blog post, “I’m seeing an excessive number of DRS initiated vMotions on my newly upgraded 6.5 environment”. Recently I was part of a few conversations about the nature of DRS load balancing in systems running vSphere 6.5 and newer. It was noticed that more vMotion operations where occurring since running 6.5 and it’s highly likely that these operations occur due to the new DRS pairwise balancing functionality. Pairwise balancing was introduced by vSphere 6.5 and is focused on keeping the host resource utilization disparity within a certain threshold. As a result, DRS performs load-balancing operations if the difference between the lowest-utilized host and the highest-utilized host is a certain percentage. That percentage depends on your migration threshold. The default migration threshold uses a 20% tolerable difference in utilization.
AMD EPYC NAPLES VS ROME AND VSPHERE CPU SCHEDULER UPDATES
Recently AMD announced the 2nd generation of the AMD EPYC CPU architecture, the EPYC 7002 series. Most refer to the new CPU architecture using its internal codename Rome. When AMD introduced the 1st generation EPYC (Naples), they succeeded in setting a new record of core count and memory capacity per socket. However, due to the CPU multi-chip-module (MCM) architecture, it is not an apples-to-apples comparison when compared to an Intel Xeon architecture. As each chip module contains a memory controller, each module presents a standalone NUMA domain. This impacts OS scheduling decisions and, thus, virtual machine sizing. A detailed look can be found here in English or here translated by Grigory Pryalukhin in Russian. Rome is different, the new CPU architecture is more aligned with the single NUMA per Socket paradigm, and this helps with obtaining workload performance consistency. There are some differences between Xeons and Rome. In addition, we made some adjustments to the CPU scheduler to deal with this new architecture. Let’s take a closer look at the difference between Naples and Rome.
60 MINUTES OF NUMA VMWORLD SESSION COMMANDS
Verify Distribution of Memory Modules with PowerCLI Get-CimInstance -CimSession $Session CIM_PhysicalMemory | select BankLabel, Description, @{n=‘Capacity in GB';e={$_.Capacity/1GB}} PowerCLI Script to Detect Node Interleaving Get-VMhost | select @{Name="Host Name";Expression={$_.Name}}, @{Name="CPU Sockets";Expression={$_.ExtensionData.Hardware.CpuInfo.NumCpuPackages}}, @{Name="NUMA Nodes";Expression={$_.ExtensionData.Hardware.NumaInfo.NumNodes}} Action-Affinity Monitoring Sched-Stats -t numa-migration Disable Action Affinity numa.LocalityWeightActionAffinity = 0 numa.PreferHT For more information on how to enable PreferHT: KB article 2003582 Host Setting: numa.PreferHT=1 VM Setting: numa.vcpu.PreferHT = TRUE
5 THINGS TO KNOW ABOUT PROJECT PACIFIC
During the keynote of the first day of VMworld 2019, Pat unveiled Project Pacific. In short, project Pacific transforms vSphere into a unified application platform. By deeply integrating Kubernetes into the vSphere platform, developers can deploy and operate their applications through a well-known control plane. Additionally, containers are now first-class citizens enjoying all the operations generally available to virtual machines. Although it might seem that the acquisition of Heptio and Pivotal kickstarted project Pacific, VMware has been working on project Pacific for nearly three years! Jared Rosoff, the initiator or the project and overall product manager, told me that over 200 engineers are involved as it affects almost every component of the vSphere platform.