frankdenneman.nl

vSphere 7 Cores per Socket and Virtual NUMA

December 2, 2021 by frankdenneman

I regularly meet with customers to discuss NUMA technology, and one topic that is always on the list is the Cores per Socket setting and its potential impact.

In vSphere 6.5, we made some significant adjustments to the scheduler that allowed us to decouple NUMA client creation from the Cores per Socket setting. Before 6.5, if the vAdmin configured the VM with a non-default Cores per Socket setting, the NUMA scheduler automatically aligned the NUMA client configuration to that Cores per Socket setting, regardless of whether this configuration was optimal for the physical surroundings it ran in.

To understand the impact of this behavior, we need to look at the components of the NUMA client and how they affect the NUMA scheduler’s options for initial placement and load balancing.

A NUMA client consists of two elements: a virtual component and a physical component. The virtual component is called the virtual proximity domain (VPD) and is used to expose the virtual NUMA topology to the guest OS in the virtual machine, so the operating system and its applications can apply optimizations for the NUMA topology they detect. The physical component is called the physical proximity domain (PPD). It is used by the NUMA scheduler in the VMkernel as a grouping construct to place a group of vCPUs on a NUMA node (physical CPU + memory) and to move that group of vCPUs between NUMA nodes for load-balancing purposes. You can think of the PPD as a form of affinity group: these vCPUs always stick together.
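
For reference, the building blocks described above map onto a handful of VM settings. Here is a minimal, illustrative sketch of the relevant .vmx entries (the values are examples only, and in most cases you should simply leave the vNUMA sizing to the scheduler):

    numvcpus = "16"
    cpuid.coresPerSocket = "2"
    numa.vcpu.maxPerVirtualNode = "8"

Here cpuid.coresPerSocket defines the socket layout the guest sees, while numa.vcpu.maxPerVirtualNode caps the number of vCPUs per virtual NUMA node and thus influences the VPD sizing.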

Please note that the CPU scheduler determines which vCPU will be scheduled on a CPU core. The NUMA scheduler only selects the NUMA node.

As depicted in the diagram, this VM runs on a dual-socket ESXi host. Each CPU socket contains a CPU package with 10 CPU cores. If this VM is configured with between 11 and 20 vCPUs, the NUMA scheduler creates two NUMA clients and distributes these vCPUs evenly across the two NUMA nodes. The guest OS is presented with a virtual NUMA topology by the VPDs that aligns with the physical layout. In other words, there will be two NUMA nodes, each with a number of CPUs inside it.

Prior to ESXi 6.5, if a virtual machine was created with 16 vCPUs and 2 Cores per Socket, the NUMA scheduler created 8 NUMA clients (16 vCPUs divided into groups of 2), and the VM therefore exposed eight NUMA nodes to the guest OS.

As a result, the guest OS and application could make the wrong process placement decisions or miss out on ideal optimizations, as you are shrinking resource domains such as memory ranges and cache domains. Luckily, we solved this behavior, and the Cores per Socket setting no longer influences the NUMA client configuration. But!

But a Cores per Socket setting other than 1 will still impact the distribution of the vCPUs of that virtual machine across the physical NUMA nodes, as Cores per Socket acts as a mini affinity rule-set. That means the NUMA scheduler has to distribute the vCPUs of the virtual machine across the NUMA nodes with the Cores per Socket setting still in mind. Not as drastically as before, where the result was also exposed to the guest OS, but it will impact overall balance if you start to do weird things. Let me show you some examples, starting with a typical Windows 2019 deployment. In my lab, I have a dual-socket Xeon system with 20 cores per socket, so I’m forced to equip the example VMs with more than 20 vCPUs to show you the effect of Cores per Socket on vNUMA.

In the first example, I deploy a Windows 2019 VM with the default settings and 30 vCPUs. That means it is configured with 2 Cores per Socket, resulting in a virtual machine configuration with 15 virtual sockets.
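
To see what the guest is actually presented with, a tool such as Sysinternals Coreinfo can be run inside the Windows VM. A quick sketch (illustrative; with the decoupling described above, the socket count and the NUMA node count no longer have to match):

    coreinfo -s -n

For this 30 vCPU VM with 2 Cores per Socket, that should report 15 sockets but only two NUMA nodes, as the vNUMA topology follows the physical layout of the dual-socket host.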

The effect can be seen by logging into the ESXi host via an SSH session and running the command “sched-stats -t numa-clients”. The groupID of the VM is 1405619, which I got from esxtop, and the command shows that 16 vCPUs are running on homeNode 0 and 14 vCPUs on homeNode 1. Maybe this should not be a problem, but I’ve seen things! Horrible things!
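
As a rough sketch of that workflow (the group ID will obviously differ per VM, and the exact output columns vary between ESXi builds):

    # in an SSH session on the ESXi host
    sched-stats -t numa-clients | grep 1405619

The output lists, per NUMA client, the home node it is placed on and the number of vCPU worlds it contains, which is where the 16/14 split shows up.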

This time I’ve selected 6 Cores per Socket, and now the NUMA scheduler distributes the vCPUs as follows:

At this point, the NUMA nodes are more severely imbalanced, which will impact the guest OS and the application. And if you do this at scale, for every VM in your data center, you create a scenario where you impact both the guest OS and the underlying fabric of connected devices. More NUMA rebalancing is expected, which occurs across the processor interconnects (QPI/Infinity Fabric). More remote memory is fetched, and all these operations impact PCIe traffic, networking traffic, HCI storage traffic, and GPU-based VDI or machine learning application performance.

The Cores per Socket setting was invented to solve a licensing problem. There are small performance gains to be made if the application is sensitive to cache performance and can keep its data in the physical cache. If you have such an application, please align your Cores per Socket setting with the actual physical layout. Using the last example, for the 30 vCPU VM on a dual-socket, 40-core host, the Cores per Socket setting should be 15, resulting in 2 virtual sockets.
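
As an illustrative sketch of that aligned configuration (30 vCPUs across 2 physical sockets = 15 Cores per Socket), the relevant .vmx entries would end up looking like this:

    numvcpus = "30"
    cpuid.coresPerSocket = "15"

In the vSphere Client, this is simply the Cores per Socket value in the CPU section of the VM’s hardware settings.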

But my recommendation is to avoid the management overhead for 99% of all your workloads and keep the default Cores per Socket setting. Overall, you have a higher chance of not impacting any underlying fabric or influencing scheduler decisions.

Update

Looking at the comments, I notice that there is some confusion. Keeping the default means that you will use 2 Cores per Socket for new Windows systems. The minor imbalance of vCPU distribution within the VM should not be noticeable; the Windows OS is smart enough to distribute processes efficiently between NUMA nodes set up like this. The compounded effect of many VMs within the hypervisor, the load-balancing mechanism of the NUMA scheduler, and the overall load frequencies (not every vCPU is active all of the time) means this will not be a problem.

For AMD EPYC users, please keep in mind that you are not working with a single chip surface anymore. EPYC is a multi-chip module architecture, and hence you should take notice of the cache boundaries of the CCXs within the EPYC package. The new Milan (v3) has 8 cores per CCX; Rome (v2) and Naples (v1) have 4. Please test the cache boundary sensitivity of your workload if you are running it on AMD EPYC systems, and note that not every workload is cache sensitive and not every combination of workloads responds the same way. This is due to the effect that load correlation and load synchronicity patterns can have on scheduling behavior and cache eviction patterns.
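
To illustrate the arithmetic only: on Rome (4 cores per CCX), a 12 vCPU VM spans exactly three CCXs, while a 14 vCPU VM always leaves one CCX partially occupied and potentially sharing its L3 cache with other workloads; on Milan, the same reasoning works in steps of 8. Treat this purely as a sizing illustration and let your own cache sensitivity testing decide whether it matters.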

Filed Under: NUMA

DRS threshold 1 does not initiate Load balancing vMotions

October 22, 2021 by frankdenneman

vSphere 7.0 introduces DRS 2.0 and its new load-balancing algorithm. In essence, the new DRS is completely focused on taking care of the needs of the VMs and does this at a more aggressive pace than the old DRS. As a result, DRS resorts to vMotioning a virtual machine sooner than the previous DRS, and this is something that a lot of customers are noticing. In highly consolidated clusters, you might see a lot of vMotions occur. I perceive this as an infrastructural service. However, some customers might see this as a turbulent or nervous environment and would rather see fewer vMotions. As a result, these customers like to dial down the DRS threshold, which is the right thing to do. But please be aware: if you still want DRS to have load-balancing functionality, do not slide the threshold all the way to the left.

With the setting “Conservative (1)”, DRS only triggers migrations to solve cluster constraints and violations. That means that if you put an ESXi host into maintenance mode, DRS moves the VMs off that host, or if an anti-affinity rule is violated due to a maintenance mode migration or an HA event, DRS moves that particular VM to solve the problem. But that’s it. No moves to improve VM happiness or to solve host imbalance. If you want to reduce the number of vMotions and still want DRS to get the best resources for the virtual machines, do not set DRS to the leftmost setting, but set it to setting 2.

Filed Under: DRS

Project Monterey and the need for Network Cycles Offload for ML Workloads.

October 6, 2021 by frankdenneman

VMworld has started, and that means a lot of new announcements. One of the most significant projects VMware is working on is project Monterey. Project Monterey allows the use of SmartNICs, also known as Data Processing Units, from various VMware partners within the vSphere platform. Today we use the CPU inside the ESXi host to run workloads and to process network operations. With the shift towards distributed applications, the CPUs inside the ESXi hosts need to spend more time processing network IO instead of application operations. This extra utilization impacts data center economics such as consolidation ratios and availability calculations. On top of this shift from monolithic applications to distributed applications comes the advent of machine learning-supported services in the enterprise data center.

As we all know, the enterprise data center is a goldmine of data. Business units within organizations are looking at that data; combined with machine learning, it can solve their business challenges. And so they use the data to train machine learning models. However, data stored in databases or coming from modern systems such as sensors or video systems cannot be fed directly into the vertical application stack used for machine learning model training. This data needs to be “wrangled” into shape. The higher the quality of the data, the better the model can generate a prediction or recommendation. As a result, the data flows from its source through multiple systems. Now you might say machine learning datasets are only a few hundred gigabytes in size, and that you’ve got databases that are a few terabytes. The difference is that a database sits nice and quietly on a datastore on an array somewhere and doesn’t go anywhere. An ML dataset moves from one system to another in an ML infrastructure like the one depicted below and gets transformed, copied, and versioned many times over. You need a lot of CPU horsepower to transform the data and continuously move it around!

One of the most frequent questions I get is why vSphere is such an excellent platform for machine learning. The answer is simple: data adjacency. We run an incredible number of database systems in the world. They hold the data of the organizations, which in turn holds the key to solving those challenges. Our platform provides the tools and capabilities to use modern technologies such as GPU accelerators and data processing units to help process the data. And data is the fuel of machine learning.

The problem is that we (the industry) haven’t gotten around to making a Tesla version of machine learning model training yet. We are in the age of gas-guzzling giants. People in the machine learning space are looking into improving techniques for model-training datasets: instead of using copious amounts of data, use more focused data points. But that’s a work in progress. In the meantime, we need to deal with this massive data stream flowing through many different systems and platforms that typically run within virtual machines and containers on top of the core platform, vSphere. Instead of overwhelming the x86 CPUs in the ESXi host with all the network traffic generated by sending those datasets between the individual components of the ML infrastructure, we need to offload it to another device in an intelligent way. And that’s where project Monterey comes into play.

There are many presentations about project Monterey and all its capabilities. I would suggest you start with “10 Things You Need to Know About Project Monterey” by Niels Hagoort (#MCL1833).

Filed Under: AI & ML

Machine Learning Infrastructure from a vSphere Infrastructure Perspective

August 12, 2021 by frankdenneman

For the last 18 months, I’ve been focusing on machine learning, especially on how customers can successfully deploy machine learning infrastructure on top of vSphere.

This space is exciting as it has so many great angles to explore. Besides the model training, a lot happens with the data. Data is transformed, and data is moved. Data sets are often hundreds of gigabytes in size. Although that doesn’t sound like much compared to modern databases, these data sets are transformed and versioned. Whereas massive databases nest on an array, these data sets travel through pipelines that connect multiple systems with different architectures, from data lakes to in-memory key-value stores. As a data center architect, you need to think about the various components involved. Where is the compute horsepower needed? How do you deal with an explosion of data? Where do you place particular storage platforms, what kind of bandwidth is needed, and do you always need extremely low-latency systems in your ML infrastructure landscape?

The different ML model engineering life cycle phases generate different functional and technical requirements, and the personas involved are not data center architecture-minded. Sure, they talk about ML infrastructure, but their concept of infrastructure is different from “our” concept of infrastructure. Typically, the lowest level of abstraction a data science team deals with is a container or a VM. Concepts such as availability zones, hypervisors, or storage areas are foreign to them. When investigating ML pipelines and other toolsets, technical requirements are usually omitted. This isn’t weird, as containers are more or less system-like processes, and you typically do not specify system resource requirements for system processes. But as an architect or a VI team that wants to shape a platform capable of dealing with ML workloads, you need to get a sense of what’s required.

I intend to publish a set of articles that describe where the two worlds of data center infrastructure and ML infrastructure interact, where the rubber meets the road. The series covers the different phases of the ML model engineering lifecycle and the kinds of requirements they produce. What is MLOps, and how does it differ from DevOps? Why is data lineage so crucial in today’s world, and how does this impact your infrastructure services? What types of personas are involved with machine learning, what are their tasks and roles in the process, and what type of infrastructure service can benefit them? And how can we map actual data and ML pipeline components onto an abstract process diagram full of beautiful terms such as data processing, feature engineering, model validation, and model serving?

I’m sure that I will introduce more topics along the way, but if you have any topic in mind that you want to see covered, please leave a comment!

Filed Under: AI & ML

CPU pinning is not an exclusive right to a CPU core!

June 11, 2021 by frankdenneman

People who configure CPU Affinity deserve a special place in hell.. pic.twitter.com/wIIZ0dHDRw

— Katarina Brookfield (@MrsBrookfield) June 10, 2021

Katarina posted a very expressive tweet about her love/hate (mostly hate) relationship with CPU pinning, and lately I have been in conversations with customers contemplating whether they should use CPU pinning.

The analogy that I typically use to describe CPU pinning is the story of the favorite parking space at your office parking lot. CPU pinning limits the compliant CPU “slots” that vCPU can be scheduled on. So think about that CPU slot as the parking spot closest to the entrance of your office. You have decided that you only want to park in that spot. Every day of the year, that’s your spot and nowhere else. The problem is, this is not a company-wide directive. Anyone can park in that spot; you have just limited yourself to that spot only. So it can happen that Bob arrives at the office first and, lazy as he is, parks as close to the office entrance as he can. Right in your spot. Now the problem with your self-imposed rule is that you cannot and will not park anywhere else. So when you show up (late to the party), you notice that Bob’s car is in YOUR parking spot, and the only thing you can do is drive circles in some holding pattern until Bob leaves the office again. The stupidest thing: it’s Sunday, and you and Bob are the only ones doing some work. You’re out there in the parking lot, driving circles until Bob leaves again, while Bob is inside the empty building waiting for you to get started.

CPU pinning is not an exclusive right for that vCPU to use that particular CPU slot (core or HT). It’s just a self-imposed limitation. If you want exclusive rights to a full core, check out the Latency Sensitivity setting.
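
To make the difference concrete, here is a minimal sketch of the two settings at the .vmx level (illustrative values; both are normally configured through the vSphere Client, via Scheduling Affinity and the Latency Sensitivity option respectively):

    sched.cpu.affinity = "0,1"
    sched.cpu.latencySensitivity = "high"

The first line only restricts which pCPUs the vCPUs are allowed to run on; any other VM can still be scheduled on pCPU 0 or 1. The second setting, combined with a full CPU reservation, is what actually grants a vCPU exclusive access to physical cores.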

Filed Under: Uncategorized
