frankdenneman Frank Denneman is the Machine Learning Chief Technologist at VMware. He is an author of the vSphere host and clustering deep dive series, as well as podcast host for the Unexplored Territory podcast. You can follow him on Twitter @frankdenneman

Decoupling of Cores per Socket from Virtual NUMA Topology in vSphere 6.5

7 min read

Some changes are made in ESXi 6.5 with regards to sizing and configuration of the virtual NUMA topology of a VM. A big step forward in improving performance is the decoupling of Cores per Socket setting from the virtual NUMA topology sizing.

Understanding elemental behavior is crucial for building a stable, consistent and proper performing infrastructure. If you are using VMs with a non-default Cores per Socket setting and planning to upgrade to ESXi 6.5, please read this article, as you might want to set a host advanced settings before migrating VMs between ESXi hosts. More details about this setting is located at the end of the article, but let’s start by understanding how the CPU setting Cores per Socket impacts the sizing of the virtual NUMA topology.

Cores per Socket

By default, a vCPU is a virtual CPU package that contains a single core and occupies a single socket. The setting Cores per Socket controls this behavior; by default, this setting is set to 1. Every time you add another vCPU to the VM another virtual CPU package is added, and as follows the socket count increases.

Virtual NUMA Toplogy

Since vSphere 5.0 the VMkernel exposes a virtual NUMA topology, improving performance by facilitating NUMA information to guest operating systems and applications. By default the virtual NUMA topology is exposed to the VM if two conditions are met:

  1. The VM contains 9 or more vCPUs
  2. The vCPU count exceeds the core count* of the physical NUMA node.

* When using the advanced setting “numa.vcpu.preferHT=TRUE”, SMT threads are counted instead of cores to determine scheduling options. Using the following one-liner we can explore the NUMA configuration of active virtual machines on an ESXi host:

vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"; echo -e; done

When using this one-liner after powering-on a 10-vCPU virtual machine on the dual E5-2630 v4 (10 cores per socket) ESXi 6.0 host the following NUMA configuration is shown:01-vmdumper-10vcpu
The VM is configured with 10 vCPUs (numvcpus). The cpuid.coresPerSocket = 1 indicates that it’s configured with one core per socket. The last entry summarizes the virtual NUMA topology of the virtual machine. The constructs virtual nodes and physical domains will be covered later in detail.

All 10 virtual sockets are grouped into a single physical domain, which means that the vCPUs will be scheduled in a single physical CPU package that typically is similar to a single NUMA node. To match the physical placement, a single virtual NUMA node is exposed to the virtual machine.

Microsoft Sysinternals tool CoreInfo exposes the CPU architecture in the virtual machine with great detail. (Linux machines contain the command numactl – – hardware and use lstopo -s to determine cache configuration). Each socket is listed and with each logical processor a cache map is displayed. Keep the cache map in mind when we start to consolidate multiple vCPUs into sockets.

Virtual NUMA topology

The vCPU count of the VM is increased to 16 vCPUs; as a consequence, this configuration exceeds the physical core count. MS Coreinfo provides the following insight:
Coreinfo uses an asterisk to represent the mapping of the logical processor to socket map and NUMA Node Map. In this configuration the logical processors in socket 0 to socket 7 belong to NUMA Node 0, the CPUs in socket 8 to socket 15 belong to NUMA 1. Please note that this screenshot does not contain the entire logical processor cache map overview. Running the vmdumper one-liner on the ESXi host, the following is displayed:
In the previous example in which the VM was configured with 10 vCPUs, the numa.autosize.vcpu.maxPerVirtualNode = “10”, in this scenario, the 16 vCPU VM, has a numa.autosize.vcpu.maxPerVirtualNode = “8”. The VMkernel symmetrically distributes the 16 vCPUs across multiple virtual NUMA Nodes. It attempts to fit as much vCPUs into the minimum number of virtual NUMA nodes, hence the distribution of 8 vCPU per virtual node. It actually states this “Exposing multicore topology with cpuid.coresPerSocket = 8 is suggested for best performance”.

Virtual Proximity Domains and Physical Proximity Domains

A virtual NUMA topology consists of two elements, the Virtual Proximity Domains (VPD) and the Physical Proximity Domains (PPD). The VPD is the construct what is exposed to the VM, the PPD is the construct used by NUMA for placement (Initial placement and load-balancing).

The PPD auto sizes to the optimal number of vCPUs per physical CPU Package based on the core count of the CPU package. Unless the setting Cores per Socket within a VM configuration is used. In ESXi 6.0 the configuration of Cores per Socket dictates the size of the PPD, up to the point where the vCPU count is equal to the number of cores in the physical CPU package. In other words, a PPD can never span multiple physical CPU packages.
The best way to perceive a proximity domain is to compare it to a VM to host affinity group, but in this context, it is there to group vCPU to CPU Package resources. The PPD acts like an affinity of a group of vCPUs to all the CPUs of a CPU package. A proximity group is not a construct that is scheduled by itself. It does not determine whether a vCPU gets scheduled on a physical resource. It just makes sure that this particular group of vCPUs consumes the available resources on that particular CPU package.

A VPD is the construct that exposes the virtual NUMA topology to the virtual machine. The number of VPDs depends on the number of vCPUs and the physical core count or the use of Cores per Socket setting. By default, the VPD aligns with the PPD. If a VM is created with 16 vCPUs on the test server two PPD’s are created. These PPD allow the VPDs and its vCPUs to map and consume physical 8 cores of the CPU package.
If the default vCPU settings are used, each vCPU is placed in its own CPU socket (Cores per Socket = 1). In the diagram, the dark blue boxes on top of the VPD represent the virtual sockets, while the light blue boxes represent vCPUs. The VPD to PPD alignment can be overruled if a non-default Cores per Socket setting is used. A VPD spans multiple PPDs if the number of the vCPUs and the Cores per Socket configuration exceeds the physical core count of a CPU package. For example, a virtual machine with 40 vCPUs and 20 Cores per Socket configuration on a host with four CPU packages containing each 10 cores, creates a topology of 2 VPD’s that each contains 20 vCPUs, but spans 4 PPDs.
The Cores per Socket configuration overwrites the default VPD configuration and this can lead to suboptimal configurations if the physical layout is not taken into account correctly.

Specifically, spanning VPDs across PPDs is something that should be avoided at all times. This configuration can render most CPU optimizations inside the guest OS and application completely useless. For example, OS and applications potentially encounter remote memory access latencies while expecting local memory latencies after optimizing thread placements. It’s recommended to configure the VMs Cores per Socket to align with the physical boundaries of the CPU package.

ESXi 6.5 Cores per Socket behavior

In ESXi 6.5 the CPU scheduler has got some interesting changes. One of the changes in ESXi 6.5 is the decoupling of Cores per Socket configuration and VPD creation to further optimize virtual NUMA topology. Up to ESXi 6.0, if a virtual machine is created with 16 CPUs and 2 Cores per Socket, 8 PPDs are created and 8 VPDs are exposed to the virtual machine.
The problem with this configuration is that it the virtual NUMA topology does not represent the physical NUMA topology correctly.
The guest OS is presented with 16 CPUs distributed across 8 sockets. Each pair of CPUs has its own cache and its own local memory. The operating system considers the memory addresses from the other CPU pairs to be remote. The OS has to deal with 8 small chunks of memory spaces and optimize its cache management and memory placement based on the NUMA scheduling optimizations. Where in truth, the 16 vCPUs are distributed across 2 physical nodes, thus 8 vCPUs share the same L3 cache and have access to the physical memory pool. From a cache and memory-centric perspective it looks more like this:
CoreInfo output of the CPU configuration of the virtual machine:
To avoid “fragmentation” of local memory, the behavior of VPDs and it’s relation to the Cores per Socket setting has changed. In ESXi 6.5 the size of the VPD is dependent on the number of cores in the CPU package. This results in a virtual NUMA topology of VPDs and PPDs that attempt to resemble the physical NUMA topology as much as possible. Using the same example of 16 vCPU, 2 Cores per Socket, on a dual Intel Xeon E5-2630 v4 (20 cores in total), the vmdumper one-liner shows the following output in ESXi 6.5:
As a result of having only two physical NUMA nodes, only two PPDs and VPDs are created. Please note that the Cores per Socket setting has not changed, thus multiple sockets are created in a single VPD.
A new line appears in ESXi 6.5; “NUMA config: consolidation =1”, indicating that the vCPUs will be consolidated into the least amount of proximity domains as possible. In this example, the 16 vCPUs can be distributed across 2 NUMA nodes, thus 2 PPDs and VPDs are created. Each VPD exposes a single memory address space that correlates with the characteristics of the physical machine.
The Windows 2012 guest operating system running inside the virtual machine detects two NUMA nodes. The CPU view of the task managers shows the following configuration:
The NUMA node view is selected and at the bottom right of the screen, it shows that virtual machine contains 8 sockets and 16 virtual CPUs. CoreInfo provides the following information:
With this new optimization, the virtual NUMA topology corresponds more to the actual physical NUMA topology, allowing the operating system to correctly optimize its processes for correct local and remote memory access.

Guest OS NUMA optimization

Modern applications and operating systems manage memory access based on NUMA nodes (memory access latency) and cache structures (sharing of data). Unfortunately most applications, even the ones that are highly optimized for SMP, do not balance the workload perfectly across NUMA nodes. Modern operating systems apply a first-touch-allocation policy, which means that when an application requests memory, the virtual address is not mapped to any physical memory. When the application accesses the memory, the OS typically attempts to allocate it on the local or specified NUMA if possible.

In an ideal world, the thread that accessed or created the memory first is the thread that processes it. Unfortunately, many applications use single threads to create something, but multiple threads distributed across multiple sockets access the data intensively in the future. Please take this into account when configuring the virtual machine and especially when configuring Cores per Socket. The new optimization will help to overcome some of these inefficiencies created in the operating system.

However, sometimes it’s required to configure the VM with a non-default Cores per Socket setting, due to licensing constraints for example. If you are required to set Cores per Socket and you want to optimize guest operating system memory behavior any further, then configure the Cores per Socket to align with the physical characteristics of the CPU package.

As demonstrated the new virtual NUMA topology optimizes the memory address space, providing a bigger more uniform memory slice that aligns better with the physical characteristics of the system. One element has not been thoroughly addressed and that is cache address space created by a virtual socket. As presented by CoreInfo, each virtual socket advertises its own L3 cache.

In the scenario of the 16 vCPU VM on the test system, configuring it with 8 Cores per socket, this configuration resembles both the memory and the cache address space of the physical CPU package the most. Coreinfo shows the 16 vCPUs distributed symmetrically across two NUMA nodes and two sockets. Each socket contains 8 CPUs that share L3 cache, similar to the physical world.

Word of caution!

Migrating VMs configured with Cores per Socket from older ESXi versions to ESXi 6.5 hosts can create PSODs and/or VM Panic
Unfortunately, there are some virtual NUMA topology configurations that cannot be consolidated properly by the GA release of ESXi 6.5 when vMotioning a VM from an older ESXi version.

If you have VMs configured with a non-default Cores per Socket setting or you have set the advanced parameter numa.autosize.once to False, enable the following advanced host configuration on the ESXi 6.5 host:
Numa.FollowCoresPerSocket = 1

A reboot of the host is not necessary! This setting makes ESXi 6.5 behave as ESXi 6.0 when creating the virtual NUMA topology. That means that the Cores per Socket setting determines the VPD sizing.

There have been some cases reported where the ESXi 6.5 crashes (PSOD). Test it in your lab and if your VM configuration triggers the error set the FollowCoresPerSocket setting as an advanced configuration.
Knowledge base article 2147958 has more information. I’ve been told that the CPU team is working on a permanent fix, I do not have insights when this fix will be released!

frankdenneman Frank Denneman is the Machine Learning Chief Technologist at VMware. He is an author of the vSphere host and clustering deep dive series, as well as podcast host for the Unexplored Territory podcast. You can follow him on Twitter @frankdenneman

7 Replies to “Decoupling of Cores per Socket from Virtual NUMA Topology…”

  1. Am I correct in assuming this improvement should just keep things humming smoother and more efficiently, and we should just watch out for those settings at the bottom of the post biting us while migrating?

  2. I experienced the crash and it was not fun. Thanks for this. If I use the corespersocket setting, will this guarantee a crash free migration?

  3. Hi Frank,

    Please write similar analogy for the new AMD EPYC processor, as the die structure has changed the NUMA concept. Would be more interested in seeing details around how big VMs 256 GB ram size will behave when fast mem pages will be asked by memory intensive applications.

  4. Hello Frank, first of all many thanks for the excellent job done with your articles and the book I’m going through. However there are a couple of details where I can’t get my head around. I have a VM with 64 vCPUs and 512GB RAM, it’s a massive db. ESXi is HP DL560 with 4 sockets (22 cores each + HT) and 1.5 TB RAM. According to your book I aimed at keeping the VPD onto as little psockets as possible. By disabling the Hot add CPU feature and by adding the preferHT setting I increased performance quite a lot. However while the VMWare KB 2003582 states how to implement the preferHT setting it does not mention something you do in your book:

    “Please remember to adjust the numa.autosize.vcpu.maxPerVirtualNode setting in the VM if it is already been powered-on once. This setting overrides the numa.vcpu.preferHT=TRUE setting”

    I read the above after I did the initial changes to the VM and I have now noticed that its numa.autosize.vcpu.maxPerVirtualNode value is 11. I don’t know why, however it does not matter, what I’m concerned with is.. following which criteria do I adjust that value?

    Shall I set it to 44 as it is the max number of logical cores in a physical socket? Or shall I disable it and let the system do its best decision? If yes how do I disable it? This is the current layout of the cpu resources of the vm:

    groupName groupID clientID homeNode affinity nWorlds vmmWorlds localMem remoteMem currLocal% cummLocal%
    vm.45125521 312845622 0 1 0xf 16 16 50170696 29880 99 99
    vm.30197740 218668959 0 2 0xf 4 4 8208392 49156 99 99
    vm.30197741 218668969 0 0 0xf 2 2 4034560 32768 99 99
    vm.54676301 368574028 0 0 0xf 8 8 16629320 18872 99 96
    vm.56780179 381439036 0 2 0xf 10 10 78430208 204800 99 86
    vm.56780179 381439036 1 2 0xf 10 10 66310144 11284480 85 73
    vm.56780179 381439036 2 2 0xf 10 10 77531136 63488 99 78
    vm.56780179 381439036 3 0 0xf 10 10 77551616 43008 99 73
    vm.56780179 381439036 4 1 0xf 10 10 75403264 94208 99 82
    vm.56780179 381439036 5 2 0xf 10 10 64526408 10971064 85 67
    vm.56780179 381439036 6 2 0xf 4 4 74442752 6144 99 86

    although performances have improved I’m not happy with the distribution of the cores. Specially considering that homeNode 3 is not used at all.
    Many thanks again Frank!

  5. PreferHT is used to consolidate the vCPUs as much as possible and create the fewest number of NUMA clients. In your situation, there are four NUMA nodes, 22 cores with each 384 GB of memory (assuming the DIMMs are equally distributed across sockets and offer the same capacity). With PreferHT, the NUMA scheduler takes the SMT capabilities into account, and therefore, each NUMA node can now accept NUMA clients similar to the HT thread count. In your situation, that is 44.
    With this theory, your VM of 64 vCPUs should be distributed across two NUMA clients, each NUMA client grouping 32 vCPUs.

    The advice of setting numa.autosize.vcpu.maxPerVirtualNode in the book is to propagate the virtual NUMA topology to the guest OS. This is typically recommended for VMs that exceed the memory capacity of a NUMA node while being able to all the vCPUs in a single NUMA node. In your scenario, setting it to 11 restricts the scheduler to prefer HT threads for sizing the NUMA client. 64/11 = 5.8; thus, it should round up to 6, but it doesn’t do this, so I expect other advanced settings are influencing the NUMA client configuration.

    Instead of diving into and wasting much time on figuring out this anomaly, my recommendation is to remove the setting numa.autosize.vcpu.maxPerVirtualNode, and allow the NUMA scheduler to align the NUMA client configuration to the PreferHT setting. Typically most customers do not use PreferHT on a virtual machine that has that many CPUs. PreferHT is an advanced setting designed to help workloads that are cache-intensive, but not CPU intensive. By scheduling all the vCPUs into the same NUMA node, it can leverage the same L1/L2/L3 cache and thus take advantage of the spatial locality of memory access. I suspect, you assigning 64 vCPUs is to have a lot of CPU resources available for the application.

    We decoupled the NUMA scheduling constructs from the user setting Cores per Socket (CPS). Thus the setting provides customers to align with licensing requirements while the NUMA scheduler can manage its constructs to optimize memory access. When setting it, it provides a socket topology, and the Guest OS translates that as cache scheduling domains. In your situation, with 64 vCPUs with PreferHT enabled and maxPerVirtualNode disabled, you should get two NUMA clients with each containing 32 vCPUs. Your CPS setting should be 32 cores per Socket. That will present a dual-socket configuration, and your Guest OS will create a Cache map, where 32 CPUs share the same cache. The application and Guest OS can now determine where to place the threads when they expect these threads to share and access the same memory blocks.

    Hope this clarifies it for you, if you decide to apply my recommendation, please test it first in a non-production environment and post the results as others might experience a similar situation and are helped by your experience.

  6. Frank many many thanks for your time on this! Really appreciated!

    Now I know I can simply delete the numa.autosize.vcpu.maxPerVirtualNode in favour of the preferHT. Since it is not clear why the numa clients count does not reflect the actual settings, I agree with you that is a waste of time investigating further. So I will proceed with your suggestion and report back to the community as soon as I get the green light for the next shutdown. Unfortunately for this very application I don’t avail of a test environment. Thanks also for the hint on the preferHT parameter. I chose it so vCPUs could stay on as less physical sockets as possible, else the other solution would have been to have the vCPUs spread over 4 physical sockets, like 4 vSockets x 16 cores and frankly I don’t know if that is a good idea. Also I will reserve VM’s RAM fully to further improve performance.

    Big thanks again!

Comments are closed.