
frankdenneman.nl


Decoupling of Cores per Socket from Virtual NUMA Topology in vSphere 6.5

December 12, 2016 by frankdenneman

Some changes were made in ESXi 6.5 with regard to the sizing and configuration of a VM's virtual NUMA topology. A big step forward in improving performance is the decoupling of the Cores per Socket setting from virtual NUMA topology sizing.

Understanding this elemental behavior is crucial for building a stable, consistent and properly performing infrastructure. If you are using VMs with a non-default Cores per Socket setting and are planning to upgrade to ESXi 6.5, please read this article, as you might want to set a host advanced setting before migrating VMs between ESXi hosts. More details about this setting are located at the end of the article, but let's start by understanding how the CPU setting Cores per Socket impacts the sizing of the virtual NUMA topology.

Cores per Socket

By default, a vCPU is a virtual CPU package that contains a single core and occupies a single socket. The setting Cores per Socket controls this behavior; by default it is set to 1. Every time you add another vCPU to the VM, another virtual CPU package is added, and the socket count increases accordingly.

Virtual NUMA Topology

Since vSphere 5.0 the VMkernel exposes a virtual NUMA topology, improving performance by facilitating NUMA information to guest operating systems and applications. By default the virtual NUMA topology is exposed to the VM if two conditions are met:

  1. The VM contains 9 or more vCPUs
  2. The vCPU count exceeds the core count* of the physical NUMA node.

* When using the advanced setting "numa.vcpu.preferHT=TRUE", SMT threads are counted instead of cores to determine scheduling options. Using the following one-liner, we can explore the NUMA configuration of active virtual machines on an ESXi host:

vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"; echo -e; done

When running this one-liner after powering on a 10-vCPU virtual machine on the dual E5-2630 v4 (10 cores per socket) ESXi 6.0 host, the following NUMA configuration is shown:
[Image: 01-vmdumper-10vcpu]
The VM is configured with 10 vCPUs (numvcpus). The cpuid.coresPerSocket = 1 indicates that it’s configured with one core per socket. The last entry summarizes the virtual NUMA topology of the virtual machine. The constructs virtual nodes and physical domains will be covered later in detail.

All 10 virtual sockets are grouped into a single physical domain, which means that the vCPUs will be scheduled in a single physical CPU package that typically is similar to a single NUMA node. To match the physical placement, a single virtual NUMA node is exposed to the virtual machine.

The Microsoft Sysinternals tool CoreInfo exposes the CPU architecture inside the virtual machine in great detail. (On Linux, use numactl --hardware, and use lstopo -s to determine the cache configuration.) Each socket is listed, and a cache map is displayed for each logical processor. Keep the cache map in mind when we start to consolidate multiple vCPUs into sockets.
[Image: 02-coreinfo]

Virtual NUMA topology

The vCPU count of the VM is increased to 16 vCPUs; as a consequence, this configuration exceeds the physical core count. MS Coreinfo provides the following insight:
[Image: 03-vcpu-coreinfo-1cps]
Coreinfo uses an asterisk to mark which socket and NUMA node each logical processor belongs to. In this configuration, the logical processors in socket 0 to socket 7 belong to NUMA node 0, while the CPUs in socket 8 to socket 15 belong to NUMA node 1. Please note that this screenshot does not contain the entire logical processor cache map overview. Running the vmdumper one-liner on the ESXi host displays the following:
[Image: 04-vmdumper-16vcpu]
In the previous example, in which the VM was configured with 10 vCPUs, numa.autosize.vcpu.maxPerVirtualNode = "10"; in this scenario, the 16-vCPU VM has numa.autosize.vcpu.maxPerVirtualNode = "8". The VMkernel symmetrically distributes the 16 vCPUs across multiple virtual NUMA nodes. It attempts to fit as many vCPUs as possible into the minimum number of virtual NUMA nodes, hence the distribution of 8 vCPUs per virtual node. The log actually states: "Exposing multicore topology with cpuid.coresPerSocket = 8 is suggested for best performance".
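The autosizing behavior described above can be sketched as a small calculation. This is a minimal illustration with a hypothetical helper name; the real VMkernel logic considers more inputs (memory size, preferHT, advanced settings, and so on):

```python
def autosize_vnuma(vcpus: int, cores_per_pnode: int) -> tuple[int, int]:
    """Sketch: distribute vCPUs symmetrically over the smallest number of
    virtual NUMA nodes such that no node exceeds the core count of a
    physical CPU package. Returns (node count, vCPUs per node)."""
    # Below 9 vCPUs that also fit in one package, no vNUMA topology is exposed.
    if vcpus < 9 and vcpus <= cores_per_pnode:
        return (1, vcpus)
    # Find the fewest nodes that divide the vCPUs evenly and fit per package.
    for nodes in range(1, vcpus + 1):
        if vcpus % nodes == 0 and vcpus // nodes <= cores_per_pnode:
            return (nodes, vcpus // nodes)
    return (vcpus, 1)

# 10 vCPUs on a 10-core package: one virtual node holding all 10 vCPUs.
print(autosize_vnuma(10, 10))  # (1, 10)
# 16 vCPUs exceed 10 cores: two nodes of 8, i.e. maxPerVirtualNode = 8.
print(autosize_vnuma(16, 10))  # (2, 8)
```

This reproduces the two observed outcomes: the 10-vCPU VM fits one virtual node, while the 16-vCPU VM is split into two nodes of 8.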

Virtual Proximity Domains and Physical Proximity Domains

A virtual NUMA topology consists of two elements: the Virtual Proximity Domains (VPD) and the Physical Proximity Domains (PPD). The VPD is the construct that is exposed to the VM; the PPD is the construct used by NUMA for placement (initial placement and load balancing).

The PPD auto-sizes to the optimal number of vCPUs per physical CPU package based on the core count of the CPU package, unless the Cores per Socket setting is used in the VM configuration. In ESXi 6.0, the Cores per Socket configuration dictates the size of the PPD, up to the point where the vCPU count is equal to the number of cores in the physical CPU package. In other words, a PPD can never span multiple physical CPU packages.
[Image: 05-physical-proximity-domain]
The best way to think of a proximity domain is as something similar to a VM-to-host affinity group, but in this context it exists to group vCPUs to CPU package resources. The PPD acts like an affinity of a group of vCPUs to all the CPUs of a CPU package. A proximity domain is not a construct that is scheduled by itself. It does not determine whether a vCPU gets scheduled on a physical resource. It just makes sure that this particular group of vCPUs consumes the available resources of that particular CPU package.

A VPD is the construct that exposes the virtual NUMA topology to the virtual machine. The number of VPDs depends on the number of vCPUs and the physical core count, or on the use of the Cores per Socket setting. By default, the VPD aligns with the PPD. If a VM is created with 16 vCPUs on the test server, two PPDs are created. These PPDs allow the VPDs and their vCPUs to map to and consume 8 physical cores of each CPU package.
[Image: 06-16-vcpu-vm-vpd-ppd]
If the default vCPU settings are used, each vCPU is placed in its own CPU socket (Cores per Socket = 1). In the diagram, the dark blue boxes on top of the VPD represent the virtual sockets, while the light blue boxes represent vCPUs. The VPD-to-PPD alignment can be overruled if a non-default Cores per Socket setting is used. A VPD spans multiple PPDs if the number of vCPUs and the Cores per Socket configuration exceed the physical core count of a CPU package. For example, a virtual machine with 40 vCPUs and a 20 Cores per Socket configuration, on a host with four CPU packages each containing 10 cores, creates a topology of 2 VPDs that each contain 20 vCPUs but span 4 PPDs.
[Image: 07-40-vcpu-2vpds-4ppds-vm]
The Cores per Socket configuration overrides the default VPD configuration, and this can lead to suboptimal configurations if the physical layout is not taken into account correctly.

Specifically, spanning VPDs across PPDs is something that should be avoided at all times. This configuration can render most CPU optimizations inside the guest OS and application completely useless. For example, the OS and applications may encounter remote memory access latencies while expecting local memory latencies after optimizing thread placement. It's recommended to configure the VM's Cores per Socket to align with the physical boundaries of the CPU package.
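The ESXi 6.0 layout arithmetic described above can be sketched as follows. The helper name is hypothetical; the point is only to show how a Cores per Socket value larger than the physical core count forces a VPD to span multiple PPDs:

```python
def vpd_ppd_layout(vcpus, cores_per_socket, phys_cores_per_package):
    """Sketch of the ESXi 6.0 behavior: Cores per Socket dictates VPD
    size, while a PPD can never exceed one physical CPU package.
    Returns (VPD count, PPD count, whether VPDs span PPDs)."""
    assert vcpus % cores_per_socket == 0
    vpds = vcpus // cores_per_socket                   # virtual NUMA nodes exposed
    ppd_size = min(cores_per_socket, phys_cores_per_package)
    ppds = -(-vcpus // ppd_size)                       # ceiling division
    spans = cores_per_socket > phys_cores_per_package  # VPD larger than a package
    return vpds, ppds, spans

# 40 vCPUs, 20 Cores per Socket, 10-core packages: 2 VPDs spanning 4 PPDs.
print(vpd_ppd_layout(40, 20, 10))  # (2, 4, True)
# Aligned: 16 vCPUs, 8 Cores per Socket on 10-core packages: 2 VPDs, 2 PPDs.
print(vpd_ppd_layout(16, 8, 10))   # (2, 2, False)
```

The second call shows the recommended case: Cores per Socket aligned to the package boundary, so each VPD maps cleanly onto one PPD.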

ESXi 6.5 Cores per Socket behavior

In ESXi 6.5, the CPU scheduler received some interesting changes. One of them is the decoupling of the Cores per Socket configuration from VPD creation to further optimize the virtual NUMA topology. Up to ESXi 6.0, if a virtual machine is created with 16 vCPUs and 2 Cores per Socket, 8 PPDs are created and 8 VPDs are exposed to the virtual machine.
[Image: 08-13-16vcpu-2cps]
The problem with this configuration is that the virtual NUMA topology does not correctly represent the physical NUMA topology.
[Image: 09-8-socket-16-vcpu-vm-virtual-numa-topology]
The guest OS is presented with 16 CPUs distributed across 8 sockets. Each pair of CPUs has its own cache and its own local memory, and the operating system considers the memory addresses of the other CPU pairs to be remote. The OS has to deal with 8 small chunks of memory space and optimize its cache management and memory placement based on its NUMA scheduling optimizations. In truth, the 16 vCPUs are distributed across 2 physical NUMA nodes, so 8 vCPUs share the same L3 cache and have access to the same physical memory pool. From a cache and memory-centric perspective, it looks more like this:
[Image: 10-8-socket-16-vcpu-virtual-and-physical-topology]
CoreInfo output of the CPU configuration of the virtual machine:
[Image: 11-16-esxi6-0-2-cps-coreinfo]
To avoid this "fragmentation" of local memory, the behavior of VPDs and their relation to the Cores per Socket setting has changed. In ESXi 6.5, the size of the VPD depends on the number of cores in the CPU package. This results in a virtual NUMA topology of VPDs and PPDs that resembles the physical NUMA topology as closely as possible. Using the same example of 16 vCPUs and 2 Cores per Socket on a dual Intel Xeon E5-2630 v4 (20 cores in total), the vmdumper one-liner shows the following output in ESXi 6.5:
[Image: 12-17-esxi6-5-2-cps-coreinfo]
As a result of having only two physical NUMA nodes, only two PPDs and VPDs are created. Please note that the Cores per Socket setting has not changed, thus multiple sockets are created in a single VPD.
[Image: 13-8-socket-16-vcpu-vm-virtual-numa-topology-esxi-6-5]
A new line appears in ESXi 6.5: "NUMA config: consolidation =1", indicating that the vCPUs will be consolidated into the fewest possible proximity domains. In this example, the 16 vCPUs can be distributed across 2 NUMA nodes, so 2 PPDs and 2 VPDs are created. Each VPD exposes a single memory address space that correlates with the characteristics of the physical machine.
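The consolidation behavior can be contrasted with the pre-6.5 behavior in a short sketch. The helper name is hypothetical; it only mirrors the rules described in this article (6.5 sizes VPDs by physical core count, while Cores per Socket still shapes the sockets presented inside each VPD):

```python
def esxi65_topology(vcpus, cores_per_socket, phys_cores_per_package):
    """Sketch of ESXi 6.5 consolidation: VPD sizing follows the physical
    core count; Cores per Socket only determines how many virtual sockets
    appear inside each VPD. Returns (VPDs, vCPUs per VPD, sockets per VPD)."""
    # Consolidate into the fewest nodes that divide evenly and fit a package.
    nodes = 1
    while vcpus % nodes != 0 or vcpus // nodes > phys_cores_per_package:
        nodes += 1
    vcpus_per_vpd = vcpus // nodes
    sockets_per_vpd = vcpus_per_vpd // cores_per_socket
    return nodes, vcpus_per_vpd, sockets_per_vpd

# 16 vCPUs, 2 Cores per Socket, 10-core packages:
# 2 VPDs of 8 vCPUs, each presenting 4 two-core virtual sockets
# (versus 8 VPDs on ESXi 6.0, where VPD count = 16 / 2 = 8).
print(esxi65_topology(16, 2, 10))  # (2, 8, 4)
```

Note how the Cores per Socket setting survives unchanged, which is why multiple virtual sockets appear inside a single VPD.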
[Image: 14-esxi-6-5-8-socket-16-vcpu-virtual-and-physical-topology]
The Windows 2012 guest operating system running inside the virtual machine detects two NUMA nodes. The CPU view of the task manager shows the following configuration:
[Image: 15-esxi6-5-2-cps-taskmanager]
The NUMA node view is selected, and the bottom right of the screen shows that the virtual machine contains 8 sockets and 16 virtual CPUs. CoreInfo provides the following information:
[Image: 16-esxi6-5-2-cps-coreinfo]
With this new optimization, the virtual NUMA topology corresponds more closely to the actual physical NUMA topology, allowing the operating system to correctly optimize its processes for local and remote memory access.

Guest OS NUMA optimization

Modern applications and operating systems manage memory access based on NUMA nodes (memory access latency) and cache structures (sharing of data). Unfortunately, most applications, even the ones that are highly optimized for SMP, do not balance the workload perfectly across NUMA nodes. Modern operating systems apply a first-touch allocation policy, which means that when an application requests memory, the virtual address is not yet mapped to any physical memory. When the application first accesses the memory, the OS typically attempts to allocate it on the local or specified NUMA node if possible.

In an ideal world, the thread that accessed or created the memory first is the thread that processes it. Unfortunately, many applications use single threads to create something, but multiple threads distributed across multiple sockets access the data intensively in the future. Please take this into account when configuring the virtual machine and especially when configuring Cores per Socket. The new optimization will help to overcome some of these inefficiencies created in the operating system.

However, sometimes it’s required to configure the VM with a non-default Cores per Socket setting, due to licensing constraints for example. If you are required to set Cores per Socket and you want to optimize guest operating system memory behavior any further, then configure the Cores per Socket to align with the physical characteristics of the CPU package.

As demonstrated, the new virtual NUMA topology optimizes the memory address space, providing a bigger, more uniform memory slice that aligns better with the physical characteristics of the system. One element has not been thoroughly addressed, and that is the cache address space created by a virtual socket. As presented by CoreInfo, each virtual socket advertises its own L3 cache.

In the scenario of the 16-vCPU VM on the test system, configuring it with 8 Cores per Socket resembles both the memory and the cache address space of the physical CPU package most closely. Coreinfo shows the 16 vCPUs distributed symmetrically across two NUMA nodes and two sockets. Each socket contains 8 CPUs that share L3 cache, similar to the physical world.
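Picking such an aligned value can be expressed as a tiny sketch (hypothetical helper name; it simply finds the fewest virtual sockets that each fit within one physical package, which is the alignment recommended above):

```python
def aligned_cores_per_socket(vcpus, phys_cores_per_package):
    """Sketch: pick a Cores per Socket value so that each virtual socket
    maps onto one physical CPU package, matching both the memory and the
    L3 cache domains of the host."""
    sockets = 1
    while vcpus % sockets != 0 or vcpus // sockets > phys_cores_per_package:
        sockets += 1
    return vcpus // sockets

# 16 vCPUs on dual 10-core packages: 8 Cores per Socket (2 virtual sockets).
print(aligned_cores_per_socket(16, 10))  # 8
```

For the test system this yields exactly the 2 x 8 layout that CoreInfo shows matching the physical topology.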
[Image: 17-single-socket-cache-constructs]

Word of caution!

Migrating VMs configured with a non-default Cores per Socket setting from older ESXi versions to ESXi 6.5 hosts can cause PSODs and/or VM panics.
Unfortunately, there are some virtual NUMA topology configurations that cannot be consolidated properly by the GA release of ESXi 6.5 when vMotioning a VM from an older ESXi version.

If you have VMs configured with a non-default Cores per Socket setting or you have set the advanced parameter numa.autosize.once to False, enable the following advanced host configuration on the ESXi 6.5 host:
Numa.FollowCoresPerSocket = 1

A reboot of the host is not necessary! This setting makes ESXi 6.5 behave like ESXi 6.0 when creating the virtual NUMA topology. That means the Cores per Socket setting determines the VPD sizing.

There have been some cases reported where ESXi 6.5 crashes (PSOD). Test it in your lab, and if your VM configuration triggers the error, set the FollowCoresPerSocket setting as an advanced configuration.
Knowledge base article 2147958 has more information. I've been told that the CPU team is working on a permanent fix; I have no insight into when this fix will be released.

Filed Under: VMware Tagged With: Cores per Socket, vCPU, virtual NUMA Topology, vSphere 6.5

VMware Cloud on AWS at re:Invent

November 22, 2016 by frankdenneman

Thank you for all the great feedback since we announced our partnership with Amazon Web Services (AWS) on October 13! We have seen a lot of interest in VMware Cloud on AWS (VMC) from customers, partners, industry analysts, and social media. Following on from the announcement in San Francisco, we went on to Barcelona for VMworld Europe, and had multiple sold-out sessions with our customers and partners in attendance. The #VMWonAWS hashtag on Twitter was pretty active as well, and we had our hands full answering all your questions!
The next stop is AWS re:Invent in Las Vegas. Unfortunately, I won't be at re:Invent, but the core VMware Cloud on AWS product team is. VMware is a Platinum Sponsor at re:Invent, and the team is eager to talk about the service offering, use cases, architecture, Tech Preview demos and a lot more.
Here are the top three ways to get the most out of re:Invent:
ENT317 – VMware and AWS Together – VMware Cloud on AWS
Thursday, Dec 1, 2:00 PM – 3:00 PM
Location: Venetian Level 3, Murano 3205 (please check exact location on the portal)
Speakers: Matt Dreyer, VMware Product Management, Paul Bockelman – AWS Sr. Solutions Architect
Description: VMware Cloud™ on AWS brings VMware's enterprise-class Software-Defined Data Center software to Amazon's public cloud, delivered as an on-demand, elastically scalable, cloud-based service that is sold, operated and supported by VMware, for any application and optimized for next-generation, elastic, bare-metal AWS infrastructure. This solution enables customers to use a common set of software and tools to manage both their AWS-based and on-premises vSphere resources consistently. Further, virtual machines in this environment have seamless access to the broad range of AWS services as well. This session will introduce this exciting new service and examine some of the use cases and benefits of the service. The session will also include a VMware Tech Preview that demonstrates standing up a complete SDDC cluster on AWS and various operations using standard tools like vCenter.
PTS205 – VMware Cloud on AWS
Wednesday, Nov 30, 1:30 PM – 1:45 PM
Location: Partner Theater – Expo Hall
Speaker: Marc Umeno, VMware Product Management
Description: Learn about how VMware and AWS are joining hands to deliver a new vSphere-based service running on next-generation, elastic, bare-metal AWS infrastructure with seamless integration with AWS services.
VMware Booth 2525 at Sands Expo, Hall D. Full exhibitor list and map is here
Tue Nov 29th 5-7 pm ; Wed Nov 30th 10:30am-6pm ; Thu Dec 1st 10:30am-6pm
Description: We have three demo pods: a) VMware Cloud on AWS, b) Networking & Security, and c) Cloud Management
Beyond the sessions and booth, you can also engage with us using the following means:
Sign up for Beta, news updates, or both
Follow us on Twitter @vmwarecloud
Ask a question, share a use case, or just give us a shout out using #VMWonAWS hashtag
Pick a 30 minute slot to talk to a member of the VMware Cloud on AWS product team 1:1
Thank you and we hope to see you there!

Filed Under: VMware

My thought on 800 page VCDX designs

November 15, 2016 by frankdenneman

Although I'm no longer participating in the VCDX program, I still hold it dear to my heart. Many aspiring VCDXes approach me and seek guidance on how to successfully pass the last part of the VCDX process: the defense.
Typically this starts with a discussion of the design itself, and particularly how many pages the design should comprise. I have heard stories about people advocating 800-page designs, and that makes me laugh, but mostly cry.
Let's go back to the essence of the program: the VCDX program was created to validate that someone is a skilled architect who can assist IT organizations in building a successful vSphere architecture. In short, it's a stamp of approval of your skill as an architect.
Now with that in mind, how many skilled architects hand in an 800-page vSphere design document to a customer? How many customers would accept that? We are not in the business of writing the next Lord of the Rings novel. I have worked on complex and massive architectures, and most designs didn't touch 150 pages.
When reviewing such 800-page designs, I noticed they are mostly a cut-and-paste of official documentation on how certain features work. It's imperative that you know the inner workings of the pillars and foundation of your architecture, but your design should not be a thesis or a showcase of your knowledge of the products.
A design should highlight the requirements, the constraints, and the chosen direction and technology. It should explain the workings of the technology used in a short and concise manner, explain how this technology meets the customer requirements, and note where certain constraints require you to deviate from the default settings. Thoroughly document the effect of the chosen design on the service levels of the applications and the architecture.
I feel that some people try to portray the defense as a herculean feat, and to be honest, if you haven't operated as an architect for multiple customers, it might feel that way. But if you are an architect who has worked on multiple designs, who recognizes the differences in risk-awareness culture between companies and how to cater to them, and who can drill down to the essence and explain why a certain requirement impacts a design decision and what effect this has on service levels or other requirements, you should be fine!
Try not to see it as the Mount Everest of your career; see passing the defense as a ceremony that validates your upward path toward being a great architect. Do what you've always been doing. If you have provided your customers with 100- to 200-page designs, keep doing that and submit such a design for your VCDX defense.

Filed Under: VCDX

Host not Ready Error When Installing NSX Agents

November 4, 2016 by frankdenneman

Management summary: make sure your NSX Controller is connected to a distributed vSwitch instead of a standard vSwitch.
During the install process of NSX, my environment refused to install the NSX agents on the host. When you prepare the host clusters for network virtualization, a collection of VIBs is installed on each ESXi node of the selected cluster. This process installs functionality such as Distributed Routing, Distributed Firewall and the user world agent that allows the distributed vSwitch to evolve into an NSX Virtual Switch.
Unfortunately, this process didn't go as smoothly as the other processes, such as installing the NSX Manager and deploying the NSX Controller. Each time I selected Install at Host Preparation (within vCenter, select Networking & Security > Installation > Host Preparation, select the cluster and click the Install link), the process returned the error "Host Not Ready". The recent tasks view showed that the task could not be completed:
[Image: nsx-agent-01]
The events view shows the following entry:
[Image: nsx-agent-02]
Not very helpful for troubleshooting the error. I followed KB article 2075600 (Installation Status appears as Not Ready in NSX) and made sure time and DNS were set up correctly, but unfortunately it didn't solve the problem. Then I started to dissect what Install at Host Preparation actually does and how the components connect to each other. This made me review the settings of the NSX Manager, and I discovered I had selected the port group designated for my management VMs on the standard switch instead of the distributed switch. It makes sense to connect it to a distributed switch; maybe this is the reason why many write-ups on how to install NSX assume this is basic knowledge and fail to list it as a requirement.
The UI allows you to select a standard vSwitch Port Group or a Distributed Port Group. Don’t make the same mistake I made and make sure you select the appropriate Distributed Port Group.
[Image: nsxcontroller-distributed-port-group]

Filed Under: Networking, VMware

VMware Cloud on AWS – Elastic DRS preview

October 18, 2016 by frankdenneman

The VMworld Europe keynote featured the future VMware Cloud on AWS service. In short, this service gives VMware customers instant scale and global reach delivered by AWS, while they continue to use their own skill set driving and operating VMware SDDC environments on-premises and in the cloud. It avoids the risk that comes with re-platforming or re-architecting the current application landscape to run on a different platform while providing the same service. In turn, it allows the IT organization to connect the current applications with AWS's vast service catalog and use services like RDS, Redshift, Glacier and many more.
One of the interesting features under tech preview is Elastic DRS. Elastic DRS helps to solve one of the toughest challenges an IT architect can face: capacity planning. Key elements of capacity planning are current and future resource demand, failure recovery capacity and maintenance capacity. Finding the right balance between maintaining workload performance and the CAPEX and OPEX downside of reserved failover capacity is difficult. By leveraging the IT-at-scale operations of AWS, Elastic DRS transforms vSphere clusters into an agility powerhouse.
The rapid scaling ability allows you to add additional hosts to the cluster. No more ordering new hardware or racking and stacking; just add the new host to the cluster with a right-click of the mouse. By using native metrics, DRS can detect that the cluster is running out of host resources and present a recommendation to add another host. Like regular DRS, you can also put Elastic DRS into automatic mode and allow it to add or remove hosts based on the observed load on the cluster.
[Image: elastic-drs]
Sometimes we forget how extremely complex running IT at super scale is. Automating the install, configuration and operation of one host is interesting; doing this by the dozen is already pushing the limits for a lot of IT organizations. Now think about doing this in more than a dozen datacenters around the world at the same time, while being required to do it instantly when a customer wants it. Undeniably impressive. When joining the team, learning about Elastic DRS was exciting; understanding how this works for all the customers in all the AWS datacenters around the world is just mind-blowing! IT-at-scale at its finest.
When you have ready-to-go ESXi hosts at your fingertips, you can do many cool things, for example allow DRS to aid and assist vSphere HA. Since ESXi 3.0, vSphere HA has ensured that workloads are restarted on the surviving hosts in the cluster. However, when a host outage is not temporary but permanent, application performance can be impacted over the longer term due to the reduction of available host resources. Auto remediation helps to address this challenge.
Auto remediation builds upon Elastic DRS and ensures that the available host resources remain consistent during an ESXi host outage. When a host failure is detected, auto remediation adds another host to the cluster, ensuring that workload performance will not be impacted in the long run by a host failure. If a partial (hardware) failure occurs, auto remediation ensures that VSAN operations complete before ejecting the degraded host.
Another benefit of this framework is the ability to retain similar levels of resources during maintenance. Typically during maintenance operations, hosts are patched and temporarily unavailable to run and service applications. Many IT organizations deal with this situation by either "oversizing" the cluster or by offering SLAs that provide a reduced service during maintenance hours. With Elastic DRS, the cluster size is not reduced during maintenance operations. This way, workloads are not impacted by a loss of resources and continue to perform as during normal operation.
To emphasize: this is a technical preview of a service that is not yet operational.
For more info about VMware Cloud on AWS, take a closer look.

Filed Under: VMware

