
Impact of CPU Hot Add on NUMA scheduling

On a regular basis, I receive the question whether CPU Hot-Add impacts the CPU performance of the VM. It depends on the vCPU configuration of the VM. CPU Hot-Add is not compatible with vNUMA: if hot-add is enabled, the virtual NUMA topology is not exposed to the guest OS, and this may impact application performance.

Please note that the vNUMA topology is only exposed when the vCPU count of the VM exceeds the core count of a physical CPU package. Thus, if the ESXi host contains two CPU packages with 10 cores each, the vNUMA topology is presented to the VM when the vCPU count is 11 or more.

vNUMA in a Nutshell
The benefit of a wide-VM is that the guest OS is informed about the physical grouping of the vCPUs. In the example of a 12 vCPU VM on a dual 10-core system, the NUMA scheduler creates 2 virtual proximity domains (VPDs), better known as NUMA clients, and distributes the 12 vCPUs equally across them. As a result, two load-balancing groups are created, each containing 6 vCPUs that are scheduled on a physical CPU package. A load-balancing group is internally referred to as a physical proximity domain (PPD). Please note that the PPD does not determine the scheduling of a vCPU on a specific HT or full core; a PPD can be seen as a vCPU-to-CPU-package affinity group.

From a memory perspective, the guest OS is presented with vNUMA-node-sized, separate address spaces. These address spaces are local to a subset of the vCPUs. As a result, a 12 vCPU 32 GB VM detects a system with two NUMA nodes, each containing 6 CPUs and a local address space of 16 GB. Contrary to popular belief, vNUMA does not expose the full CPU and memory architecture; a better way to describe it is that vNUMA presents a tailor-made world to the VM.

vNUMA to Physical mapping-1
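The carving is plain arithmetic. A minimal shell sketch for the 12 vCPU, 32 GB VM on the dual 10-core host:

VCPUS=12; MEM_GB=32; VPDS=2
echo "$(( VCPUS / VPDS )) vCPUs per virtual node"      # 6
echo "$(( MEM_GB / VPDS )) GB local memory per node"   # 16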

But what happens when the VM is configured with fewer vCPUs than the core count of the physical CPU package and CPU Hot-Add is enabled? Will there be a performance impact? The answer is no. The VPD configured for the VM fits inside a NUMA node, and thus the CPU scheduler and the NUMA scheduler optimize memory operations. It’s all about memory locality. Let’s use an application workload test to determine the behavior of the VMkernel CPU scheduling.

For this test, I installed DVD Store 3.0 and ran some test loads on the MS-SQL server. To determine the baseline, I logged in to the ESXi host via an SSH session and executed the command sched-stats -t numa-pnode. This command shows the CPU and memory configuration of each NUMA node in the system. The screenshot shows that the system is only running the ESXi operating system; hardly any memory is consumed. TotalMem indicates the total amount of physical memory in the NUMA node in KB; FreeMem indicates the amount of free physical memory in the NUMA node in KB.
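The baseline command as executed in the SSH session:

sched-stats -t numa-pnode   # per-NUMA-node CPU and memory configuration; TotalMem/FreeMem in KB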

01-Unload-ESXi-Host

An 8 vCPU 32 GB VM is created with CPU Hot-Add disabled. The NUMA scheduler has selected NUMA node 1 for initial placement, and the system consumes ~13759 MB (TotalMem minus FreeMem: 67108864 - 53019184 = 14089680 KB, roughly 13759 MB).

02-8vCPU
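The consumed amount follows directly from the numa-pnode output shown above (values reported in KB):

echo $(( (67108864 - 53019184) / 1024 ))   # TotalMem - FreeMem on node 1, converted to MB: ~13759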

The command memstats -r vm-stats -s name:memSize:allocTgt:mapped:consumed:touched -u mb allows us to verify the memory consumption of the VM.

03-VM memstats

The numbers are a close match. Please note that vm-stats does not include overhead memory, and that the VMkernel can consume some additional memory in the same NUMA node for other processes.

When hot-add is enabled (powering down the VM is necessary to enable this feature), nothing really changes. The memory for this VM is still allocated from a single NUMA node.
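The feature is recorded in the VM’s configuration file. A quick hedged check from the ESXi shell (the vmx path is a placeholder):

grep -i "vcpu.hotadd" /vmfs/volumes/datastore1/sql-vm/sql-vm.vmx   # shows vcpu.hotadd = "TRUE" when enabled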

04-8vCPU-hot-add

To get a better understanding of the CPU scheduling constructs at play here, the following command provides detailed insight into all the NUMA-related settings of the VM (command courtesy of Valentin Bondzio):

vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"; echo -e; done

05-vmdumper

It shows that hot-add is enabled and that the VM is configured with a single VPD that is scheduled on a single PPD. In plain language: the vCPUs of the VM are contained within a single physical NUMA node, and it is the responsibility of the NUMA scheduler to ensure that local physical memory is consumed. To verify whether the VM is consuming local memory, esxtop can be used (memory view, f, NUMA stats), but sched-stats -t numa-clients also provides a lot of insight.
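Both verification paths, as run from the ESXi shell:

sched-stats -t numa-clients   # per-NUMA-client placement: home node, local vs. remote memory
# In esxtop: press m (memory view), then f and enable the NUMA statistics
# fields (e.g. NHN for the home node, N%L for the percentage of local memory).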

06-8vCPU-hot-add-numa-client

As a result, you can conclude that enabling hot-add on a NUMA system does not lead to performance degradation as long as the vCPU count does not exceed the core count of the CPU package. That means hot-add can be enabled on VMs, but the instruction must be clear that vCPUs should only be added up to the threshold of the physical core count. Beyond that point, the VM becomes a wide-VM and vNUMA comes into play; with CPU hot-add enabled, it is sidelined.

What’s the impact of disregarding the physical NUMA topology? The key lies in the message that is written to the vmware.log of the VM after boot.

07-Forcing UMA

The VMkernel is forced into using UMA (Uniform Memory Access) on a NUMA architecture. As a result, memory is interleaved between the two physical NUMA nodes. In essence, it load-balances memory across the two nodes while ignoring the vCPU location. Let’s explore this behavior a bit more.

Christmas is coming early for this VM: it gets another 4 vCPUs. Hot-add is disabled again, and thus vNUMA is fully in play. The vmdumper command reveals the following:

08-12vCPU-vNUMA

The vCPUs are split into two virtual nodes (VPD0 and VPD1), each containing 6 vCPUs. After running the DVD Store query, the following memory allocation occurred:

09-Non-Uniform Memory Allocation

The guest OS (Windows 2012 R2) consumed some memory from node 1, while SQL consumed all of its memory from node 0. For people intimate with SQL resource management this might look like strange behavior, and it is. To display memory management at the VMkernel layer, I had to restrict SQL to run on only a subset of CPUs: the first 4 vCPUs, all of which were mapped to CPUs located in NUMA node 0. The NUMA scheduler ensured these CPUs consumed local memory.

After powering down the VM and enabling hot-add, the same test was run again. No NUMA architecture is exposed to the guest OS, and therefore a single memory address space is used by Windows. The memory scheduler follows the rules of UMA and interleaves memory between the two physical nodes. As the output shows, memory is consumed from both NUMA nodes in a very balanced manner. The problem is that the executing vCPUs are all located in NUMA node 0, so they have to fetch a lot of memory remotely, resulting in inconsistent and lower application performance.

10-UMA

Conclusion
Hot-add is a great feature as long as you stay within the confines of the CPU package, but expect performance degradation, or at least inconsistent performance, when going beyond the CPU core count.

This content will appear in the upcoming vSphere 6.5 Host Resources Deep Dive book I’m writing with Niels Hagoort (expected in the May timeframe). For updates about the book, please follow us on Twitter @hostdeepdive or like our page on Facebook.

Performance Study: DRS Cluster Management with Reservation and Shares

Last Friday, a new performance study about DRS cluster management was published. This paper covers the behavior of reservations and shares within a DRS cluster in depth.

It’s a great read! And to be honest, it’s always awesome to see a reference to the vSphere Clustering Deep Dive in official documentation.

Download it here:

http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/drs-cluster-mgmt-perf.pdf

vSphere 6.5 Host Deep Dive Update

Maybe you have noticed that no new content has appeared on the site for a while. The upcoming book “vSphere 6.5 Host Resources Deep Dive” is to blame for this situation.

Last year, Niels Hagoort and I started working on the companion book to the highly successful vSphere Clustering Deep Dive. We set out to write this book to refocus on the fundamental component of the virtual data center: the ESXi host. Today’s focal point is on the upper levels and overlays (the SDDC stack, NSX, cloud). These topics are exciting and take IT services to the next level, but we also understand that proper host design and management form the foundation for success.

As a result, this book explores the host resources CPU, memory, storage, and network in depth. Our goal is to provide you with an in-depth view of these four major host resources. Instead of showing you where to click to achieve a certain configuration, we explain the inner workings of these components and how various physical and virtual constructs interact with each other.

HostDeepDiveCover

We believe that this method provides a basis – a foundation on its own – that helps you to design and build the best possible architecture that aligns with the customer requirements each and every time.

As you can imagine, trying to write a fitting companion to the cluster deep dive is no small feat. Research, reverse engineering, and reading through a lot of academic papers consume most of our time besides our day jobs, hence progress is not as fast as we would like. Expect the book to be released between April and May this year.

Working on this book reminds me of the African proverb: “If you want to go quickly, go alone. If you want to go far, go together.” Although Niels and I generate the content, a lot of people are involved in ensuring the quality is up to par. Niels and I would like to acknowledge the following people:

Jane Rimmer (has the challenging task of restructuring our content into proper English)
Chris Gianos (Lead Engineer Intel Xeon microarchitecture)
Haoqiang Zheng (Principal Engineer CPU Scheduler VMkernel)
Valentin Bondzio (All-Star Badass GSS VMware)
Duncan Epping (Chief Technologist Storage BU VMware)
Marco van Baggum (Architect ITQ)
Myles Gray (Infrastructure Engineer Novosco)
Rutger Kosters (Solution Architect Rubrik)
Anthony Spiteri (Technical Evangelist Veeam Software)
Joop Carels (Sr. Solution Integrator Ericsson)

We expect to publish the book in print in the April/May timeframe; an ebook version is scheduled to appear at the end of this year. Throughout the writing process, we update the book’s Twitter account (@HostDeepDive) and Facebook page with sneak peeks and interesting reference material such as academic papers. Please subscribe to these channels to receive updates.

Decoupling of Cores per Socket from Virtual NUMA Topology in vSphere 6.5

Some changes have been made in ESXi 6.5 with regard to the sizing and configuration of the virtual NUMA topology of a VM. A big step forward in improving performance is the decoupling of the Cores per Socket setting from virtual NUMA topology sizing.

Understanding elemental behavior is crucial for building a stable, consistent and properly performing infrastructure. If you are using VMs with a non-default Cores per Socket setting and are planning to upgrade to ESXi 6.5, please read this article, as you might want to set a host advanced setting before migrating VMs between ESXi hosts. More details about this setting are located at the end of the article, but let’s start by understanding how the CPU setting Cores per Socket impacts the sizing of the virtual NUMA topology.


Cores per Socket

By default, a vCPU is a virtual CPU package that contains a single core and occupies a single socket. The setting Cores per Socket controls this behavior; by default, it is set to 1. Every time you add another vCPU to the VM, another virtual CPU package is added, and the socket count increases accordingly.


Virtual NUMA Topology

Since vSphere 5.0, the VMkernel exposes a virtual NUMA topology, improving performance by presenting NUMA information to guest operating systems and applications. By default, the virtual NUMA topology is exposed to the VM if two conditions are met:

  1. The VM contains 9 or more vCPUs
  2. The vCPU count exceeds the core count* of the physical NUMA node.

* When using the advanced setting “numa.vcpu.preferHT=TRUE”, SMT threads are counted instead of cores to determine scheduling options.
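Both thresholds correspond to per-VM advanced settings. A hedged sketch of the vmx entries (numa.vcpu.min is a real advanced option that defaults to 9, matching condition 1; the values shown are the defaults):

numa.vcpu.min = "9"          # minimum vCPU count before a virtual NUMA topology is exposed
numa.vcpu.preferHT = "FALSE" # set to TRUE to count SMT threads instead of cores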

Using the following one-liner we can explore the NUMA configuration of active virtual machines on an ESXi host:


vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"; echo -e; done


When using this one-liner after powering on a 10 vCPU virtual machine on the dual E5-2630 v4 ESXi 6.0 host, the following NUMA configuration is shown:

01-vmdumper-10vcpu

The VM is configured with 10 vCPUs (numvcpus). The entry cpuid.coresPerSocket = 1 indicates that it’s configured with one core per socket. The last entry summarizes the virtual NUMA topology of the virtual machine. The constructs virtual nodes and physical domains will be covered later in detail.

All 10 virtual sockets are grouped into a single physical domain, which means that the vCPUs will be scheduled in a single physical CPU package, which typically equates to a single NUMA node. To match the physical placement, a single virtual NUMA node is exposed to the virtual machine.

The Microsoft Sysinternals tool CoreInfo exposes the CPU architecture in the virtual machine in great detail. (On Linux machines, use numactl --hardware, and lstopo -s to determine the cache configuration.) Each socket is listed, and for each logical processor a cache map is displayed. Keep the cache map in mind when we start to consolidate multiple vCPUs into sockets.
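For reference, the guest-side commands mentioned above:

coreinfo             # Windows guest (Sysinternals): sockets, NUMA nodes, cache map
numactl --hardware   # Linux guest: NUMA nodes, their CPUs and memory
lstopo -s            # Linux guest: cache configuration (hwloc)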

02-coreinfo


Virtual NUMA topology

The vCPU count of the VM is increased to 16 vCPUs; as a consequence, this configuration exceeds the physical core count. MS Coreinfo provides the following insight:

03-vcpu-coreinfo-1cps

CoreInfo uses an asterisk to represent the mapping of each logical processor in the socket map and the NUMA node map. In this configuration, the logical processors in Socket 0 to Socket 7 belong to NUMA Node 0, and the CPUs in Socket 8 to Socket 15 belong to NUMA Node 1. Please note that this screenshot does not contain the entire logical processor cache map overview.

Running the vmdumper one-liner on the ESXi host, the following is displayed:

04-vmdumper-16vcpu

In the previous example, in which the VM was configured with 10 vCPUs, numa.autosize.vcpu.maxPerVirtualNode = “10”; in this scenario, the 16 vCPU VM has numa.autosize.vcpu.maxPerVirtualNode = “8”.

The VMkernel symmetrically distributes the 16 vCPUs across multiple virtual NUMA nodes. It attempts to fit as many vCPUs as possible into the minimum number of virtual NUMA nodes, hence the distribution of 8 vCPUs per virtual node. It actually states this explicitly: “Exposing multicore topology with cpuid.coresPerSocket = 8 is suggested for best performance”.
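The split follows from simple arithmetic. A minimal shell sketch for this host (10 cores per CPU package):

VCPUS=16; CORES_PER_PKG=10
NODES=$(( (VCPUS + CORES_PER_PKG - 1) / CORES_PER_PKG ))   # minimum number of virtual nodes: 2
echo $(( VCPUS / NODES ))                                  # 8 vCPUs per virtual node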


Virtual Proximity Domains and Physical Proximity Domains

A virtual NUMA topology consists of two elements: the Virtual Proximity Domains (VPDs) and the Physical Proximity Domains (PPDs). The VPD is the construct that is exposed to the virtual machine; the PPD is the construct used by the NUMA scheduler for placement (initial placement and load-balancing).

The PPD auto-sizes to the optimal number of vCPUs per physical CPU package based on the core count of the CPU package, unless the Cores per Socket setting within the VM configuration is used. In ESXi 6.0, the Cores per Socket configuration dictates the size of the PPD, up to the point where the vCPU count is equal to the number of cores in the physical CPU package. In other words, a PPD can never span multiple physical CPU packages.

05-physical-proximity-domain

The best way to perceive a proximity domain is to compare it to a VM-to-host affinity group; in this context, it groups vCPUs to CPU package resources. The PPD acts like an affinity of a group of vCPUs to all the CPUs of a CPU package. A proximity domain is not a construct that is scheduled by itself; it does not determine whether a vCPU gets scheduled on a physical resource. It just makes sure that this particular group of vCPUs consumes the available resources of that particular CPU package.

A VPD is the construct that exposes the virtual NUMA topology to the virtual machine. The number of VPDs depends on the number of vCPUs and the physical core count, or on the use of the Cores per Socket setting. By default, the VPD aligns with the PPD. If a VM is created with 16 vCPUs on the test server, two PPDs are created. These PPDs allow the VPDs and their vCPUs to map to and consume 8 physical cores of each CPU package.

06-16-vcpu-vm-vpd-ppd

If the default vCPU settings are used, each vCPU is placed in its own CPU socket (Cores per Socket = 1). In the diagram, the dark blue boxes on top of the VPD represent the virtual sockets, while the light blue boxes represent vCPUs.

The VPD-to-PPD alignment can be overruled if a non-default Cores per Socket setting is used. A VPD spans multiple PPDs if the configured number of Cores per Socket exceeds the physical core count of a CPU package.

For example, a virtual machine with 40 vCPUs and a 20 Cores per Socket configuration on a host with four CPU packages, each containing 10 cores, creates a topology of 2 VPDs, each containing 20 vCPUs, that span 4 PPDs.

07-40-vcpu-2vpds-4ppds-vm
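The same back-of-the-envelope arithmetic applied to this example:

VCPUS=40; CORES_PER_SOCKET=20; CORES_PER_PKG=10
echo "VPDs: $(( VCPUS / CORES_PER_SOCKET ))"  # 2 virtual proximity domains of 20 vCPUs each
echo "PPDs: $(( VCPUS / CORES_PER_PKG ))"     # 4 physical proximity domains of 10 cores each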

The Cores per Socket configuration overrides the default VPD configuration, and this can lead to suboptimal configurations if the physical layout is not taken into account correctly.

Specifically, spanning VPDs across PPDs should be avoided at all times. This configuration can render most CPU optimizations inside the guest OS and application completely useless. For example, the OS and applications may encounter remote memory access latencies where they expect local memory latencies after optimizing thread placement.

It’s recommended to configure the VM’s Cores per Socket to align with the physical boundaries of the CPU package.
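For example, on a host with 10 cores per package, a hedged vmx sketch for a 16 vCPU VM that keeps each virtual socket within one package (these are the standard vmx keys shown earlier by the vmdumper output):

numvcpus = "16"
cpuid.coresPerSocket = "8"   # 2 virtual sockets of 8 cores; each fits inside a 10-core package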


ESXi 6.5 Cores per Socket behavior

In ESXi 6.5, the CPU scheduler has received some interesting changes; expect an interesting blog post from Mark Achtemichuk (VMware performance team) soon. One of the changes in ESXi 6.5 is the decoupling of the Cores per Socket configuration and VPD creation to further optimize the virtual NUMA topology.

Up to ESXi 6.0, if a virtual machine is created with 16 vCPUs and 2 Cores per Socket, 8 PPDs are created and 8 VPDs are exposed to the virtual machine.

08-13-16vcpu-2cps

The problem with this configuration is that the virtual NUMA topology does not represent the physical NUMA topology correctly.

09-8-socket-16-vcpu-vm-virtual-numa-topology

The guest OS is presented with 16 CPUs distributed across 8 sockets. Each pair of CPUs has its own cache and its own local memory, and the operating system considers the memory addresses of the other CPU pairs to be remote. The OS has to deal with 8 small chunks of memory space and optimize its cache management and memory placement based on its NUMA scheduling optimizations. In truth, the 16 vCPUs are distributed across 2 physical NUMA nodes, so 8 vCPUs share the same L3 cache and have access to the same physical memory pool. From a cache and memory-centric perspective, it looks more like this:

10-8-socket-16-vcpu-virtual-and-physical-topology

CoreInfo output of the CPU configuration of the virtual machine:

11-16-esxi6-0-2-cps-coreinfo

To avoid this “fragmentation” of local memory, the behavior of VPDs and their relation to the Cores per Socket setting has changed. In ESXi 6.5, the size of the VPD depends on the number of cores in the CPU package. This results in a virtual NUMA topology of VPDs and PPDs that resembles the physical NUMA topology as closely as possible.

Using the same example of 16 vCPUs and 2 Cores per Socket, on a dual Intel Xeon E5-2630 v4 system (20 cores in total), the vmdumper one-liner shows the following output on ESXi 6.5:

12-17-esxi6-5-2-cps-coreinfo

As a result of having only two physical NUMA nodes, only two PPDs and VPDs are created. Please note that the Cores per Socket setting has not changed; thus, multiple sockets are created within a single VPD.

13-8-socket-16-vcpu-vm-virtual-numa-topology-esxi-6-5

A new line appears in ESXi 6.5: “NUMA config: consolidation = 1”, indicating that the vCPUs will be consolidated into the fewest proximity domains possible. In this example, the 16 vCPUs can be distributed across 2 NUMA nodes; thus, 2 PPDs and VPDs are created. Each VPD exposes a single memory address space that correlates with the characteristics of the physical machine.

14-esxi-6-5-8-socket-16-vcpu-virtual-and-physical-topology

The Windows 2012 guest operating system running inside the virtual machine detects two NUMA nodes. The CPU view of the task manager shows the following configuration:

15-esxi6-5-2-cps-taskmanager

The NUMA node view is selected, and the bottom right of the screen shows that the virtual machine contains 8 sockets and 16 virtual CPUs. CoreInfo provides the following information:

16-esxi6-5-2-cps-coreinfo

With this new optimization, the virtual NUMA topology corresponds more closely to the actual physical NUMA topology, allowing the operating system to correctly optimize its processes for local and remote memory access.


Guest OS NUMA optimization

Modern applications and operating systems manage memory access based on NUMA nodes (memory access latency) and cache structures (sharing of data).

Unfortunately, most applications, even the ones that are highly optimized for SMP, do not balance their workload perfectly across NUMA nodes. Modern operating systems apply a first-touch allocation policy, which means that when an application requests memory, the virtual address is not immediately mapped to any physical memory. When the application first accesses (touches) the memory, the OS typically attempts to allocate it on the local or specified NUMA node if possible.

In an ideal world, the thread that accessed or created the memory first is also the thread that processes it. Unfortunately, many applications use a single thread to create something, while multiple threads distributed across multiple sockets access the data intensively later on.
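On Linux, such placement can be enforced explicitly. A hedged example using numactl (./app is a placeholder for the application binary):

numactl --cpunodebind=0 --membind=0 ./app   # run on node 0 CPUs and allocate only node 0 memory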

Please take this into account when configuring the virtual machine, and especially when configuring Cores per Socket. The new optimization will help to overcome some of the inefficiencies created in the operating system.

However, sometimes it’s required to configure the virtual machine with a deviating Cores per Socket setting, due to licensing constraints for example.

If you are required to set Cores per Socket and you want to optimize guest operating system memory behavior any further, then configure the Cores per Socket to align with the physical characteristics of the CPU package.

As demonstrated, the new virtual NUMA topology optimizes the memory address space, providing a bigger, more uniform memory slice that aligns better with the physical characteristics of the system. One element has not been thoroughly addressed yet: the cache address space created by a virtual socket. As presented by CoreInfo, each virtual socket advertises its own L3 cache.

In the scenario of the 16 vCPU VM on the test system, configuring it with 8 Cores per Socket resembles both the memory and the cache address space of the physical CPU package most closely.

CoreInfo shows the 16 vCPUs distributed symmetrically across two NUMA nodes and two sockets. Each socket contains 8 CPUs that share an L3 cache, similar to the physical world.

17-single-socket-cache-constructs


Word of caution!

Migrating VMs configured with Cores per Socket from older ESXi versions to ESXi 6.5 hosts can cause PSODs and/or VM panics.

Unfortunately, there are some virtual NUMA topology configurations that cannot be consolidated properly by the GA release of ESXi 6.5 when vMotioning a VM from an older ESXi version.

If you have VMs configured with a non-default Cores per Socket setting, or you have set the advanced parameter numa.autosize.once to False, enable the following advanced host configuration on the ESXi 6.5 host:

Numa.FollowCoresPerSocket = 1

A reboot of the host is not necessary! This setting makes ESXi 6.5 behave like ESXi 6.0 when creating the virtual NUMA topology, meaning that the Cores per Socket setting determines the VPD sizing.
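A hedged sketch of setting the option from the ESXi shell, assuming the usual Section/Option path mapping for host advanced settings:

esxcli system settings advanced set -o /Numa/FollowCoresPerSocket -i 1   # no host reboot required
esxcli system settings advanced list -o /Numa/FollowCoresPerSocket      # verify the current value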

There have been some cases reported where the ESXi 6.5 host crashes (PSOD). Test it in your lab, and if your VM configuration triggers the error, set the FollowCoresPerSocket setting as an advanced configuration.

Knowledge base article 2147958 has more information. I’ve been told that the CPU team is working on a permanent fix; I have no insight into when this fix will be released!

VMware Cloud on AWS at re:Invent

Thank you for all the great feedback since we announced our partnership with Amazon Web Services (AWS) on October 13! We have seen a lot of interest in VMware Cloud on AWS (VMC) from customers, partners, industry analysts, and social media. Following the announcement in San Francisco, we went on to Barcelona for VMworld Europe and had multiple sold-out sessions with our customers and partners in attendance. The #VMWonAWS hashtag on Twitter was pretty active as well, and we had our hands full answering all your questions!

The next stop is AWS re:Invent in Las Vegas. Unfortunately, I won’t be at re:Invent, but the core group of the VMware Cloud on AWS product team is. VMware is a Platinum Sponsor at re:Invent, and the team is eager to talk about the service offering, use cases, architecture, Tech Preview demos, and a lot more.

Here are the top three ways to get the most out of re:Invent:

ENT317 – VMware and AWS Together – VMware Cloud on AWS
Thursday, Dec 1, 2:00 PM – 3:00 PM
Location: Venetian Level 3, Murano 3205 (please check exact location on the portal)
Speakers: Matt Dreyer, VMware Product Management, Paul Bockelman – AWS Sr. Solutions Architect
Description: VMware Cloud™ on AWS brings VMware’s enterprise-class Software-Defined Data Center software to Amazon’s public cloud, delivered as an on-demand, elastically scalable service that is sold, operated, and supported by VMware, for any application and optimized for next-generation, elastic, bare-metal AWS infrastructure. This solution enables customers to use a common set of software and tools to manage both their AWS-based and on-premises vSphere resources consistently. Furthermore, virtual machines in this environment have seamless access to the broad range of AWS services as well. This session will introduce this exciting new service and examine some of the use cases and benefits of the service. The session will also include a VMware Tech Preview that demonstrates standing up a complete SDDC cluster on AWS and various operations using standard tools like vCenter.

PTS205 – VMware Cloud on AWS
Wednesday, Nov 30, 1:30 PM – 1:45 PM
Location: Partner Theater – Expo Hall
Speaker: Marc Umeno, VMware Product Management
Description: Learn about how VMware and AWS are joining hands to deliver a new vSphere-based service running on next-generation, elastic, bare-metal AWS infrastructure with seamless integration with AWS services.

VMware Booth 2525 at Sands Expo, Hall D. Full exhibitor list and map is here
Tue Nov 29th 5-7 pm ; Wed Nov 30th 10:30am-6pm ; Thu Dec 1st 10:30am-6pm
Description: We have three demo pods: a) VMware Cloud on AWS, b) Networking & Security, and c) Cloud Management

Beyond the sessions and booth, you can also engage with us using the following means:

Sign up for Beta, news updates, or both
Follow us on Twitter @vmwarecloud
Ask a question, share a use case, or just give us a shout out using #VMWonAWS hashtag
Pick a 30 minute slot to talk to a member of the VMware Cloud on AWS product team 1:1

Thank you and we hope to see you there!
