
What if the VM Memory Config Exceeds the Memory Capacity of the Physical NUMA Node?

This week I had the pleasure to talk to a customer about NUMA use-cases and a very interesting config came up. They have a VM with a particular memory configuration that exceeds the ESXi host NUMA node memory configuration. This scenario is covered in the vSphere 6.5 Host Resources Deep Dive, excerpt below.

Memory Configuration
The scenario described happens in multi-socket systems that are used to host monster-VMs. Extreme memory footprint VMs are getting more common by the day. The system is equipped with two CPU packages. Each CPU package contains twelve cores. The system has a memory configuration of 128 GB in total. The NUMA nodes are symmetrically configured and contain 64 GB of memory each.

However, if the VM requires 96 GB of memory, a maximum of 64 GB can be obtained from a single NUMA node. This means that 32 GB of memory could become remote if the vCPUs of that VM can fit inside one NUMA node. In this case, the VM is configured with 8 vCPUs.
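
To put numbers on it, here is a quick back-of-the-envelope sketch (using the 96 GB / 64 GB figures from this scenario):

```python
# Back-of-the-envelope split for the scenario above: a 96 GB VM whose vCPUs
# fit inside one NUMA node that holds 64 GB of memory.
vm_memory_gb = 96            # configured VM memory
node_capacity_gb = 64        # memory capacity of a single physical NUMA node

local_gb = min(vm_memory_gb, node_capacity_gb)
remote_gb = vm_memory_gb - local_gb

print(f"local: {local_gb} GB, remote: {remote_gb} GB")   # local: 64 GB, remote: 32 GB
```

In practice the local share is a bit lower, as the VMkernel and VM overhead also consume memory in that node.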

From a vCPU perspective, the VM fits inside one NUMA node, and therefore the NUMA scheduler configures a single virtual proximity domain (VPD) and a single load-balancing group, which is internally referred to as a physical proximity domain (PPD), for this VM.

Example Workload
Running a SQL DB on this machine resulted in the following local and remote memory consumption. The VM consumes nearly 64 GB on its local NUMA node (clientID shows the location of the vCPUs) while it consumes 31 GB of remote memory.

In this scenario, it could be beneficial to the performance of the VM to rely on the NUMA optimizations that exist in the guest OS and application. The VM advanced setting numa.consolidate = FALSE instructs the NUMA scheduler to distribute the VM configuration across as many NUMA nodes as possible.
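
The setting ends up as an extra configuration entry of the VM. As an illustration only, a minimal pyVmomi sketch could apply it to a powered-off VM; the vCenter address, credentials, and inventory path below are placeholders, and the same setting can also be added through the vSphere Client under VM Options > Advanced > Configuration Parameters:

```python
# Illustrative pyVmomi sketch: add numa.consolidate = FALSE to a powered-off VM.
# Connection details and the inventory path are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()          # lab only; validate certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=context)

vm = si.content.searchIndex.FindByInventoryPath("Datacenter/vm/monster-vm")
spec = vim.vm.ConfigSpec(
    extraConfig=[vim.option.OptionValue(key="numa.consolidate", value="FALSE")]
)
vm.ReconfigVM_Task(spec=spec)                        # takes effect at the next power-on
Disconnect(si)
```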

In this scenario, the NUMA scheduler creates 2 load-balancing domains (PPDs) and allows for a more symmetrical configuration of 4 vCPUs per node.

Please note that a single VPD (VPD0) is created and as a result, the guest OS and the application only detect a single NUMA node. Local and remote memory optimizations are (only) applied by the NUMA scheduler in the hypervisor.

Whether or not the application can benefit from this configuration depends on its design. If it’s a multi-threaded application, the NUMA scheduler can allocate memory close to the CPU operation. However, if the VM is running a single-threaded application, you still might end up with a lot of remote memory access, as the physical NUMA node hosting the vCPU is unable to provide the memory demand by itself.

Test the behavior of your application before making the change to create a baseline. As always, use advanced settings only if necessary!

A vSphere Focused Guide to the Intel Xeon Scalable Family – Memory Subsystem

The Intel Xeon Scalable Family introduces a new platform (Purley). The most prominent change regarding system design is the memory subsystem.

More Memory Bandwidth and Consistency in Speed
The new memory subsystem supports the same number of DIMMs per CPU as the previous models. However, it’s wider and less deep. What I mean by that is that the last platform (Grantley) supported up to three DIMMs per channel (DPC) and made use of four channels. In total, the Grantley platform supported up to twelve DIMMs per CPU. Purley increases the number of channels from four to six but reduces the number of supported DIMMs per channel from three to two. Although this sounds like a potato-potato, tomato-tomato discussion, it provides a significant increase in bandwidth while ensuring consistency in speed during a scale-up exercise. Let’s take a closer look.
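
Before we do, here is the “wider and less deep” arithmetic in a nutshell, as a trivial sketch using the channel and DPC counts mentioned above:

```python
# Both platforms top out at 12 DIMM slots per CPU package; the shape of the layout differs.
grantley_dimms = 4 * 3   # 4 channels x 3 DIMMs per channel ("deeper")
purley_dimms   = 6 * 2   # 6 channels x 2 DIMMs per channel ("wider")
print(grantley_dimms, purley_dimms)   # 12 12
```

The win of the wider layout is not the number of DIMM slots; it is the two extra channels and, as the next sections show, the fact that 2 DPC does not force a speed reduction.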

DIMMs per Memory Channel
Depending on the DIMM slot configuration of the server board, multiple DIMMs are supported per channel. The E5-2600 V-series supports up to 3 DIMMs per channel (3 DPC). Using more DIMMs per channel provides the largest capacity, but unfortunately, it impacts the operational speed of memory.

A DIMM groups memory chips into ranks. DIMMs come in three rank configurations: single-rank, dual-rank, or quad-rank, denoted as (xR). With the addition of each rank, the electrical load on the channel increases, and as more ranks are used in a memory channel, memory speed drops. Therefore, in certain configurations, DIMMs will run slower than their listed maximum speeds. This reduction in speed occurs when 3 DIMMs per channel are used.

| Vendor | RDIMM 1 DPC | RDIMM 2 DPC | RDIMM 3 DPC | LRDIMM 1 DPC | LRDIMM 2 DPC | LRDIMM 3 DPC | Source |
|---|---|---|---|---|---|---|---|
| Cisco | 2400 MHz | 2400 MHz | 1866 MHz | 2400 MHz | 2400 MHz | 2133 MHz | Cisco PDF |
| Dell | 2400 MHz | 2400 MHz | 1866 MHz | 2400 MHz | 2400 MHz | 2133 MHz | Dell.com |
| Fujitsu | 2400 MHz | 2400 MHz | 1866 MHz | 2400 MHz | 2400 MHz | 1866 MHz | Fujitsu PDF |
| HP | 2400 MHz | 2400 MHz | 1866 MHz | 2400 MHz | 2400 MHz | 2400 MHz* | HP PDF |
| Performance Drop | 0 | 0 | 28% | 0 | 0 | 12%/28% | |

* HP claims no reduction of speed due to proprietary memory technology. I have not tested this.

Moving to 2 DIMMs per Channel Configuration
The Purley platform avoids this pitfall by reducing the supported number of DIMMs per channel. It supports up to 2 DIMMs per channel, maintaining the same performance regardless of the number of DIMMs per channel. However, reducing the number of DIMMs supported per channel severely impacts the total supported memory capacity per CPU. Intel solves this by adding more channels to the memory controller. For some organizations, this change removes the requirement to obtain LRDIMMs: some organizations avoid the steep performance reduction of 3 DPC by purchasing the (more) expensive LRDIMMs, while 2 DPC configurations do not affect the performance characteristics of the memory modules.

Six Memory Channel Support
The Purley platform supports up to six channels per CPU. As a result, the bandwidth increases and the support for high-capacity memory systems remains. By default, the new Xeon CPU supports up to 768 GB. Please review part 1 of this series, which covers the high memory capacity optimization option (M-suffix).

If all six channels are populated with DIMMs, the CPU interleaves memory access across the multiple memory channels. When creating a 1 DIMM per channel (1 DPC) configuration, the CPU forms one region (Region 0) and interleaves the memory access. Theoretically, this multiplies the data rate by exactly the number of channels present. A 2666 MT/s DIMM has a theoretical peak transfer rate of 21,300 MB/s. When all six DIMM slots are populated, the memory controller accesses each module in turn. Instead of writing all the data to a single DIMM, the data is written across the modules in one region in an alternating pattern, leveraging each channel’s bandwidth separately. That means that the memory controllers of a single Xeon CPU have access to a combined bandwidth of 127,800 MB/s. In theory, that means that a dual Xeon system has access to 256 GB per second (21,300 MB/s x 6 channels x 2 sockets). In theory!
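
For those who want to verify the math, here is a quick sketch of the arithmetic behind these numbers (peak transfer rate equals the transfer rate multiplied by 8 bytes per 64-bit channel):

```python
# Theoretical peak bandwidth figures used above (DDR4-2666, 64-bit channel = 8 bytes per transfer).
per_channel_mb_s = 2666 * 8            # 21,328 MB/s, rounded to 21,300 MB/s above
per_cpu_mb_s = per_channel_mb_s * 6    # ~127,968 MB/s for six channels (~127,800 MB/s above)
per_system_mb_s = per_cpu_mb_s * 2     # ~255,936 MB/s for a dual-socket system, roughly 256 GB/s
print(per_channel_mb_s, per_cpu_mb_s, per_system_mb_s)
```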

This all depends on the type of workload and the compute power that drives the workload. The Xeon’s cores have direct access to the six channels in the CPU package. One thread can never obtain 256 GB/s, due to the interconnect and the limited rate at which a single core can issue memory requests to feed the channels. Anandtech has an excellent write-up about this behavior.

Memory Configuration
As a result of the increase in channels and the design consideration of populating every DIMM slot to create a 1 DPC or 2 DPC configuration, a new vSphere system will likely have a different memory capacity configuration than your previous systems (inform your standards commission). Please note that the table lists the memory configuration of a single NUMA node.

| 6 x DIMM | 16 GB | 32 GB | 64 GB | 128 GB |
|---|---|---|---|---|
| 1 DPC | 96 GB | 192 GB | 384 GB | 768 GB |
| 2 DPC | 192 GB | 384 GB | 768 GB | 1536 GB * |

* M-suffix Xeon CPU required

The table above lists the memory configuration of a single NUMA node. Dual-CPU systems are the most common configuration for vSphere servers. That means that you can expect the major system integrators such as Dell and HP to offer the following configurations:

| Dual CPU Socket | 16 GB | 32 GB | 64 GB | 128 GB |
|---|---|---|---|---|
| 1 DPC | 192 GB | 384 GB | 768 GB | 1536 GB |
| 2 DPC | 384 GB | 768 GB | 1536 GB | 3072 GB * |

* M-suffix Xeon CPU required.

For completeness’ sake, the next table shows the configuration of a v4 system with a maximum of 2 DPC. These are very familiar configuration numbers; I guess we just need to get used to the new configuration standards such as 384, 768 and 1536 GB per system.

| Dual CPU Socket v4 | 16 GB | 32 GB | 64 GB | 128 GB |
|---|---|---|---|---|
| 1 DPC | 128 GB | 256 GB | 512 GB | 1024 GB |
| 2 DPC | 256 GB | 512 GB | 1024 GB | 2048 GB |

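The figures in these tables follow directly from channels x DIMMs per channel x DIMM size x sockets. Here is a small sketch that reproduces them, assuming 6 channels for Purley and 4 channels for the v4 (Grantley) generation:

```python
# Reproduce the capacity tables: DIMM size x channels x DIMMs per channel x sockets.
def capacity_gb(dimm_gb, channels, dpc, sockets=2):
    return dimm_gb * channels * dpc * sockets

dimm_sizes = [16, 32, 64, 128]
for dpc in (1, 2):
    purley = [capacity_gb(d, channels=6, dpc=dpc) for d in dimm_sizes]
    v4     = [capacity_gb(d, channels=4, dpc=dpc) for d in dimm_sizes]
    print(f"Purley {dpc} DPC: {purley} GB")   # 2 DPC -> [384, 768, 1536, 3072]
    print(f"v4     {dpc} DPC: {v4} GB")       # 2 DPC -> [256, 512, 1024, 2048]
```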

vSphere 6.5 supports up to 12 TB per host. As a result, the entire range of Intel Xeon Scalable CPUs with the extended memory feature is fully supported (8 CPUs x 1536 GB). An interesting data point: the vSphere 6.5 Configuration Maximums guide now lists a maximum number of NUMA nodes per system, and this limit is set to 16. The Intel Xeon Scalable family supports sub-NUMA clustering (similar to the Cluster-on-Die functionality), splitting the CPU package into two NUMA nodes. As a result, vSphere 6.5 supports a system equipped with 8 Intel Xeon Platinum 8176M processors, each fully loaded with 1.5 TB of memory and configured with sub-NUMA clustering. This setup creates one system offering 16 NUMA nodes, each fitted with 768 GB of local memory.

Take Caution of 8 DIMM System Board Designs
The introduction of Purley forces system integrators to redesign their system boards to support the new functionality. To support the full possibilities of the memory subsystem, system boards should be equipped with either 6 or 12 DIMM sockets per CPU. Some entry-level systems are designed with 8 DIMM slots. The Intel Xeon is designed to use six channels when creating a region, and this results in an unbalanced region design of 6+2. Region 0 consists of 6 DIMM slots, offering a theoretical peak transfer rate of 127,800 MB/s (when using 2666 MT/s DIMMs), while region 1 offers 42,600 MB/s. This results in inconsistent performance, something to definitely avoid. Thus it’s recommended to order these systems with the six-channel configuration in mind and only populate the first six DIMM slots per CPU.

Interconnect
The performance of a dual CPU system can be impacted by the interconnect between the CPU packages if you span VMs across the NUMA nodes (Wide-VMs). Purley introduces a new interconnect, the UltraPath Interconnect (UPI), which replaces the QuickPath Interconnect. The next article in this series provides an in-depth look at the UPI.

A vSphere Focused Guide to the Intel Xeon Scalable Family

Intel released the much-anticipated Skylake Server CPU this year. Moving away from the E5-2600 v-series moniker, Intel names the new iteration of its server CPU the Intel Xeon Scalable Family. On top of this, it uses precious-metal categories such as Platinum and Gold to identify different types and capabilities.

Upholding the tradition, the new Xeon family contains more cores than the previous Xeon generation. The new top-of-the-line CPU offers 28 cores on a single processor die, and memory speeds are now supported up to 2666 MHz. However, the biggest appeal for vSphere datacenters is the new “Purley” platform and its focus on increasing bandwidth between virtually every component. In this series, we are going to look at the new Intel Xeon Scalable family microarchitecture and which functions help to advance vSphere datacenters.

NUMA and vNUMA Focus
Instead of solely listing the speeds and feeds of the new architecture, I will be reviewing the new functionality with today’s vSphere VM configuration landscape in mind. In modern vSphere datacenters, small VMs and large VMs co-exist on a single server. Many VMs consume the interconnect between physical CPUs. Some VMs span multiple NUMA nodes (Wide-VMs) while others fit inside a single physical NUMA node. The VMkernel NUMA scheduler attempts to optimize local memory consumption as much as possible. Sometimes remote memory is unavoidable. Consolidation ratios increase each year, hence the focus on the interconnect. Yet single-threaded applications are still prevalent in many data centers; therefore, single-core improvements will not be ignored in this series. Designing a system that is bound to run a high consolidation ratio with a mix of small and large VMs is not an easy task.

Rebranding
The Xeon Scalable family introduces a new naming scheme. Gone are names such as the E5-2630 v4, E5-4660 v4 or E7-8894 v4. Now the Bronze, Silver, Gold, and Platinum classes indicate the range of overall performance, with Bronze representing the entry-level class comparable to the previous E3 series, while Platinum class CPUs provide the highest levels of scalability and the most cores possible. As of today, Intel offers 58 different CPU types within the Xeon Scalable family, i.e., 2 Bronze CPUs, 8 Silver, 6 Gold 51xx, 26 Gold 61xx and 16 Platinum CPUs.

| | Bronze | Silver | Gold 51xx | Gold 61xx | Platinum |
|---|---|---|---|---|---|
| Scalability | 2 | 2 | 2 | 2-4 | 2-8 |
| Max Cores | 8 | 12 | 14 | 22 | 28 |
| Max Base Frequency (GHz) | 1.7 | 2.6 | 3.6 | 3.5 | 3.6 |
| Max Memory Speed (MHz) | 2133 | 2400 | 2400 | 2666 | 2666 |
| UPI* Links | 2 | 2 | 2 | 3 | 3 |
| UPI Speed (GT/s) | 9.6 | 9.6 | 10.4 | 10.4 | 10.4 |

* The Xeon Scalable family introduces a new processor interconnect, the UltraPath Interconnect (UPI), which replaces the QuickPath Interconnect. The next article in this series provides an in-depth look at the UPI.

Integrations and Optimizations
Intel uses suffixes to indicate particular integrations or optimizations.

| Suffix | Function | Integration / Optimization | Availability |
|---|---|---|---|
| F | Fabric | Integrated Intel® Omni-Path Architecture | Gold 61xx, Platinum |
| M | Memory Capacity | 1.5 TB Support per Socket | Gold 61xx, Platinum |
| T | High Tcase | Extended Reliability (10-Year Use) | Silver, Gold, Platinum |

Omni-Path Architecture
The new Xeon family offers on-die Omni-Path Architecture that allows for 100 Gbps connectivity. In line with the industry effort to remove as many moving parts and components as possible, the signal is not routed through the socket and motherboard; the new architecture provides a direct connection to the processor.

Image by ServeTheHome.com

The always excellent Serve The Home has published a nice article about the F-type Xeons. Unfortunately, the current vSphere version does not support the Omni-Path Architecture.

Total Addressable Memory
Intel hard-coded the addressable memory capacity on the CPU. As a result, non-M CPUs will not function if more than 768 GB of RAM is present in the DIMM sockets connected to their memory controllers. If you tend to scale up your servers during their lifecycle, consider this limitation. If you are planning to run monster VMs that require more than 768 GB of RAM and want to avoid spanning them across NUMA nodes, consider obtaining “M” designation CPUs.

Why wouldn’t you just buy M-designated CPUs in the first place, you might wonder? Well, the M badge comes with a near-3K USD price hike. Comparing the 6142 (16 cores at 2.6 GHz) and the 6140 (18 cores at 2.3 GHz): the list price of the 6142 is $2,946, while the 6142M is $5,949. The price difference for the 6140 is similar; the vanilla 6140 costs $2,445, the M-badged version $5,448. But with current RAM prices, we are talking about a minimum price tag of $100,000 for 1.5 TB of memory PER socket!

Extended Reliability
For specific use-cases, Intel provides CPUs with an extended reliability of up to 10 years. As you can imagine, these CPUs do not operate at top speeds. The fastest T-enabled CPU runs at a 2.6 GHz base frequency.

Similar, but not identical CPU packaging
When reviewing the spectrum of available CPUs, one noticeable thing is the availability of identical core-count CPUs across the precious-metal classes, for example, the 12-core CPU package. It’s available in Silver, Gold 51xx, Gold 61xx and Platinum. It’s available with an extended Tcase (Intel Xeon Silver 4116), and the Intel Xeon Gold 6126 is also available as the 6126T and 6126F. One has to dig a little bit further to determine the added benefits of selecting a Gold version over a Silver version.

| Processor | Silver 4116 | Gold 5118 | Gold 6126 | Gold 6136 | Gold 6146 | Platinum 8158 |
|---|---|---|---|---|---|---|
| Cores | 12 | 12 | 12 | 12 | 12 | 12 |
| Base Frequency (GHz) | 2.10 | 2.30 | 2.60 | 3.00 | 3.20 | 3.00 |
| Max Turbo Frequency (GHz) | 3.00 | 3.20 | 3.70 | 3.70 | 4.20 | 3.70 |
| TDP (W) | 85 | 105 | 125 | 150 | 165 | 150 |
| L3 Cache (MB) | 16.5 | 16.5 | 19.25 | 24.75 | 24.75 | 24.75 |
| # of UPI Links | 2 | 2 | 3 | 3 | 3 | 3 |
| Scalability | 2S | 4S | 4S | 4S | 4S | 8S |
| # of AVX-512 FMA Units | 1 | 1 | 2 | 2 | 2 | 2 |
| Max Memory Size (GB) | 768 | 768 | 768 | 768 | 768 | 768 |
| Max Memory Speed (MHz) | 2400 | 2400 | 2666 | 2666 | 2666 | 2666 |
| Memory Channels | 6 | 6 | 6 | 6 | 6 | 6 |

Part 2: Memory Subsystem available

Why the Recently Reported Intel HT Bug is Not in Your Data Center

Yesterday I tweeted out the warning message about the HT bug of Skylake and Kaby Lake processors posted on debian.org.

https://lists.debian.org/debian-devel/2017/06/msg00308.html

My tweet got a LOT of retweets, and many people replied with concerns about their systems. I believe most data centers will not suffer from this bug, as it is only present on Skylake and Kaby Lake processors, which are not yet common in the data center.

What is the Bug?
According to the warning: Unfixed Skylake and Kaby Lake processors could, in some situations, dangerously misbehave when hyper-threading is enabled. Disable hyper-threading immediately in BIOS/UEFI to work around the problem. Read this advisory for instructions about an Intel-provided fix:

https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v5-spec-update.html

Unlikely Present in Your Data Center
The reason why I believe most systems in data centers are not hit by this bug is that it solely applies to E3 Xeons from the Skylake microarchitecture. E3 CPUs are designed to operate in a single-socket system; they have no QuickPath Interconnect and are therefore unable to form a symmetric multiprocessing system.

The current E5 (dual-socket) systems are based on the Broadwell microarchitecture. The Skylake server microarchitecture is expected to appear within the next couple of months, and according to the report, it will include the fix when the product launches. If you are running a NUC in your lab, you might want to check whether your system is hit by this bug:

http://ark.intel.com/products/codename/82879/Kaby-Lake
http://ark.intel.com/products/codename/37572/Skylake

The link below will forward you to a Perl script that can help detect whether your system is affected or not. Many thanks to Uwe Kleine-König for suggesting and writing this script.
https://lists.debian.org/debian-devel/2017/06/msg00309.html
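
For the curious, the gist of such a check is to look at the CPU family/model and whether hyper-threading is active. Below is a rough Python sketch of that idea; the model numbers are my assumption of the affected Skylake/Kaby Lake client parts, and it is no substitute for the Debian script or Intel’s specification updates, which also take stepping and microcode revision into account:

```python
# Rough illustration only: flag Skylake/Kaby Lake client parts with hyper-threading enabled.
# Assumed model numbers; the real check also considers stepping and microcode revision.
SUSPECT_MODELS = {78, 94, 142, 158}   # Skylake 0x4E/0x5E, Kaby Lake 0x8E/0x9E

def parse_cpuinfo(path="/proc/cpuinfo"):
    info = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                info.setdefault(key.strip(), value.strip())   # keep the first CPU's values
    return info

cpu = parse_cpuinfo()
affected = (int(cpu.get("cpu family", 0)) == 6
            and int(cpu.get("model", 0)) in SUSPECT_MODELS
            and int(cpu.get("siblings", 1)) > int(cpu.get("cpu cores", 1)))  # HT active
print("possibly affected" if affected else "most likely not affected")
```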

Impact of CPU Hot Add on NUMA scheduling

On a regular basis, I receive the question whether CPU Hot-Add impacts the CPU performance of the VM. It depends on the vCPU configuration of the VM. CPU Hot-Add is not compatible with vNUMA; if hot-add is enabled, the virtual NUMA topology is not exposed to the guest OS, and this may impact application performance.

Please note that the vNUMA topology is only exposed when the vCPU count of the VM exceeds the core count of the physical CPU package. Thus, if the ESXi host contains two CPU packages with 10 cores each, the vNUMA topology is presented to the VM if the vCPU count equals 11 or more.

vNUMA in a Nutshell
The benefit of a wide-VM is that the guest OS is informed about the physical grouping of the vCPUs. In the example of a 12 vCPU VM on a dual 10-core system, the NUMA scheduler creates 2 virtual proximity domains (VPDs), better known as NUMA clients, and distributes the 12 vCPUs equally across them. As a result, a load-balancing group is created containing 6 vCPUs that are scheduled on a physical CPU package. A load-balancing group is internally referred to as a physical proximity domain (PPD). Please note that the PPD does not determine the scheduling of a vCPU on a specific HT or full core; a PPD can be seen as a vCPU-to-CPU affinity group.

From a memory perspective, the guest OS is presented with vNUMA node-sized, separate address spaces. These address spaces are local to a subset of the vCPUs. As a result, a 12 vCPU 32 GB VM detects a system with two NUMA nodes, each containing 6 CPUs and a local address space of 16 GB. Contrary to popular belief, vNUMA does not expose the full CPU and memory architecture; a better way to describe it is that vNUMA shows a tailor-made world to the VM.
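
The sizing of that tailor-made world follows a simple pattern: split the vCPUs and the memory address space evenly across as many virtual nodes as needed, given the physical core count per package. Here is a small sketch of that pattern, using the dual 10-core host from the example; the real scheduler also takes cores-per-socket settings and other advanced options into account, so treat this purely as an illustration:

```python
# Sketch: how a VM is carved into virtual NUMA nodes on a host with 10 cores per package.
import math

def vnuma_layout(vcpus, vm_memory_gb, cores_per_package=10):
    if vcpus <= cores_per_package:
        return None                              # fits in one package: no vNUMA topology exposed
    nodes = math.ceil(vcpus / cores_per_package)
    return {"virtual_nodes": nodes,
            "vcpus_per_node": vcpus // nodes,
            "memory_per_node_gb": vm_memory_gb / nodes}

print(vnuma_layout(12, 32))   # {'virtual_nodes': 2, 'vcpus_per_node': 6, 'memory_per_node_gb': 16.0}
```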

vNUMA to Physical mapping-1

But what happens when the VM is configured with fewer vCPUs than the core count of the physical CPU package and CPU Hot-Add is enabled? Will there be a performance impact? The answer is no. The VPD configured for the VM fits inside a NUMA node, and thus the CPU scheduler and the NUMA scheduler optimize memory operations. It’s all about memory locality. Let’s make use of an application workload test to determine the behavior of the VMkernel CPU scheduling.

For this test, I’ve installed DVD Store 3.0 and ran some test loads on the MS-SQL server. To determine the baseline, I logged in to the ESXi host via an SSH session and executed the command sched-stats -t numa-pnode. This command shows the CPU and memory configuration of each NUMA node in the system. This screenshot shows that the system is only running the ESXi operating system; hardly any memory is consumed. TotalMem indicates the total amount of physical memory in the NUMA node in KB. FreeMem indicates the amount of free physical memory in the NUMA node in KB.

01-Unload-ESXi-Host

An 8 vCPU 32 GB VM is created with CPU hot add disabled. The NUMA scheduler has selected NUMA node 1 for initial placement, and the system consumes ~13,759 MB on that node ((67,108,864 KB - 53,019,184 KB) / 1,024 ≈ 13,759 MB).
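
The consumed value is simply TotalMem minus FreeMem of that NUMA node, converted from KB to MB; a one-liner to reproduce it:

```python
# Reproduce the ~13,759 MB figure from the sched-stats output (values in KB).
total_mem_kb, free_mem_kb = 67_108_864, 53_019_184
print(round((total_mem_kb - free_mem_kb) / 1024))   # 13759 MB
```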

02-8vCPU

The command memstats -r vm-stats -s name:memSize:allocTgt:mapped:consumed:touched -u mb allows us to verify the memory consumption of the VM.

03-VM memstats

The numbers are a close match. Please note that vm-stats does not include overhead memory and that the VMkernel can consume some additional memory in the same NUMA node for other processes.

When hot-add is enabled (powering down the VM is necessary to enable this feature), nothing really changes. The memory for this VM is still allocated from a single NUMA node.

04-8vCPU-hot-add

To get a better understanding of the CPU scheduling constructs at play here, the following command provides detailed insight into all the NUMA-related settings of the VM (command courtesy of Valentin Bondzio):

vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"; echo -e; done

05-vmdumper

It shows that hot-add is enabled and that the VM is configured with a single VPD that is scheduled on a single PPD. In normal language, the vCPUs of the VM are contained within a single physical NUMA node. It’s the responsibility of the NUMA scheduler to ensure local physical memory is consumed. To verify whether the VM is consuming local memory, esxtop can be used (memory, f, NUMA stats), but sched-stats -t numa-clients also provides a lot of insight.

06-8vCPU-hot-add-numa-client

As a result, you can conclude that enabling hot-add on a NUMA system does not lead to performance degradation as long as the vCPU count does not exceed the core count of the CPU package. That means that hot-add can be enabled on VMs, but the instruction must be clear that adding vCPUs should only happen up to the threshold of the physical core count. After that point, the VM becomes a wide-VM and vNUMA comes into play; and in the case of CPU hot-add, it’s sidelined.

What’s the impact of disregarding the physical NUMA topology? The key lies within the message that’s entered in the VMware.log of the VM after boot.

07-Forcing UMA

The VMkernel is forced into using UMA (Uniform Memory Access) on a NUMA architecture. As a result, memory is interleaved between the two physical NUMA nodes. In essence, it’s load-balancing memory across two nodes while ignoring the vCPU location. Let’s explore this behavior a bit more.

Christmas is coming early for this VM and it gets another 4 vCPUs. Hot-add is disabled again, and thus vNUMA is fully in play. The vmdumper command reveals the following:

08-12vCPU-vNUMA

The vCPUs are split up into two virtual nodes (VPD0 and VPD1), each containing 6 vCPUs. After running the DVD Store query, the following memory allocation happened:

09-Non-Uniform Memory Allocation

The guest OS (Windows 2012 R2) consumed some memory from node 1, while SQL consumed all of its memory from node 0. For people intimate with SQL resource management this might seem like strange behavior, and it is. To display memory management at the VMkernel layer, I had to restrict SQL to run on only a subset of the vCPUs. I allowed SQL to run on the first 4 vCPUs; all of these were mapped to CPUs located in NUMA node 0. The NUMA scheduler ensured these CPUs consumed local memory.

After powering down the VM and enabling hot-add, the same test was run again. No NUMA architecture is exposed to the guest OS, and therefore a single memory address space is used by Windows. The memory scheduler follows the rules of UMA and interleaves memory between the two physical nodes. And as the output shows, memory is consumed from both NUMA nodes in a very balanced manner. The problem is that the executing vCPUs are all located in NUMA node 0; therefore, they have to fetch a lot of memory from the remote node, resulting in inconsistent and lower application performance.

10-UMA

Conclusion
Hot-add is a great feature when you stay within the confines of the CPU package, but expect performance degradation, or at least inconsistent performance, when going beyond the CPU core count.

This content will appear in the upcoming vSphere 6.5 Host Resources Deep Dive book I’m writing with Niels Hagoort (expected May time-frame). For updates about the book, please follow us on Twitter @hostdeepdive or like our page on Facebook.
