
NUMA, Hyperthreading and NUMA.PreferHT

I received a lot of questions about Hyperthreading and NUMA in ESX 4.1 after writing the ESX 4.1 NUMA scheduling article. A common misconception is that Hyperthreading is ignored and therefore not used on a NUMA system. This is not entirely true: due to the improved Hyperthreading code on Nehalem processors, the CPU scheduler is programmed to use the HT feature more aggressively than in previous releases of ESX. I think the main reason this misconception exists is the way the NUMA load balancer handles vCPU placement of vSMP virtual machines. Before continuing, let’s get our CPU element nomenclature aligned. I’ve created a diagram showing all the elements:

NUMA and CPU elements

The Nehalem Hyperthreading feature is officially called Simultaneous Multithreading (SMT); the terms HT and SMT are interchangeable.

1. An Intel Nehalem processor is often called a CPU or package.
2. An Intel Nehalem processor contains 4 cores in one package.
3. Each core contains 2 threads if Hyperthreading is enabled.
4. An SMT thread equals a logical processor.
5. A logical processor is shown in esxtop as a PCPU.
6. A vCPU is scheduled on a PCPU.
7. NUMA = Non-Uniform Memory Access (each processor has its own local memory assigned).
8. LLC = Last Level Cache: shared by the cores, it is the last on-die cache before turning to local memory.
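
The nomenclature above boils down to simple multiplication. As a minimal sketch (the Topology class and its field names are my own illustration, not something ESX exposes), this is how the number of cores and PCPUs follows from the number of packages:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of a host, using the terms from the list above."""
    packages: int           # physical processors (sockets)
    cores_per_package: int  # 4 on the Nehalem described above
    threads_per_core: int   # 2 with Hyperthreading enabled, 1 without

    @property
    def cores(self) -> int:
        return self.packages * self.cores_per_package

    @property
    def pcpus(self) -> int:
        # Every SMT thread is a logical processor, which esxtop shows as a PCPU.
        return self.cores * self.threads_per_core


host = Topology(packages=2, cores_per_package=4, threads_per_core=2)
print(host.cores, host.pcpus)  # 8 16
```

With Hyperthreading disabled (threads_per_core=1) the same host would show 8 PCPUs instead of 16.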

NUMA load balancer virtual machine placement
During placement of a vSMP virtual machine, the NUMA load balancer assigns a single vCPU per CPU core and “ignores” the availability of SMT threads. As a result, a 4-way vSMP virtual machine will be placed on four cores. In ESX 4.1 this virtual machine can be placed on one processor or on two processors, depending on the number of cores per processor and on whether the advanced option numa.vcpu.maxPerMachineNode is set.

When a virtual machine contains more vCPUs than the number of cores in the processor, the virtual machine will span multiple processors (wide VM). The default policy is to span the virtual machine across as few processors (NUMA nodes) as possible, but this can be overridden by an advanced option called numa.vcpu.maxPerMachineNode, which defines the maximum number of vCPUs of a virtual machine per NUMA client. As always, only use advanced options if you know the full impact of the setting on your environment. But I digress; let’s go back to NUMA and Hyperthreading.

Now the key thing to understand is that the SMT threads are ignored by the NUMA load balancer only during placement. It is then up to the CPU scheduler to decide how it will schedule the vCPUs within the core. It can allow the vCPU to use the full core or schedule it on an SMT thread, depending on the workload, resource entitlement, the number of active vCPUs and the available pCPUs in the system.
Because SMT threads share resources within a core, running a vCPU on an SMT thread results in lower performance than running it on a dedicated single core. The ESX scheduler is therefore designed to spread the load across all the cores in the NUMA node or in the server.
Basically, if the workload is low it will try to schedule the vCPU on a complete core; if that’s not possible, it will schedule the vCPU on an SMT thread.

As mentioned before, running a vCPU on an SMT thread will not offer the same progress as running on a complete core; therefore a different charging scheme is used for each scenario. This charging scheme is used to keep track of the delivered resources and to check whether the VM gets its entitled resources. More on this topic can be found in the article “Reservations and CPU scheduling”.

NUMA.preferHT=One NUMA node to rule them all?
Although the CPU scheduler can decide how to schedule the vCPU within the core, it will only schedule one vCPU of a vSMP virtual machine onto one core. Scott Drummonds’ article about numa.preferHT might offer a solution. Setting the advanced parameter numa.preferHT=1 allows the NUMA load balancer to assign vCPUs to SMT threads and, if possible, “contain” a vSMP VM in a single NUMA node. However, the number of vCPUs must be less than or equal to the number of pCPUs within the NUMA node.
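
To make the sizing rule concrete, here is a minimal sketch of the check. The function name and parameters are hypothetical; the assumption is that without preferHT only cores count as placement slots, while with preferHT every SMT thread (pCPU) in the node counts.

```python
def fits_in_one_numa_node(vcpus: int, cores_per_node: int,
                          threads_per_core: int = 2,
                          prefer_ht: bool = False) -> bool:
    """Return True if all vCPUs of a VM can be placed inside one NUMA node.

    Assumption: the default placement counts cores only; with preferHT
    each SMT thread in the node is counted as well.
    """
    slots = cores_per_node * (threads_per_core if prefer_ht else 1)
    return vcpus <= slots


# 8-way VM on a quad-core Nehalem node with Hyperthreading enabled
print(fits_in_one_numa_node(8, cores_per_node=4))                  # False
print(fits_in_one_numa_node(8, cores_per_node=4, prefer_ht=True))  # True
```

A 9-way VM would not fit in either case, because the node only offers 8 pCPUs.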

By placing all vCPUs within a processor, a virtual machine with an “intensive-cache-footprint” workload can benefit from a “warmed-up” cache. The vCPUs can fetch the memory from the Last Level Cache instead of turning to local memory, resulting in lower latency. And this is exactly why this setting might not be beneficial to most environments.

The numa.preferHT setting is a CPU scheduler-wide setting, which means that the NUMA load balancer will place every vSMP virtual machine inside a processor, i.e. both cache-intensive workloads and low-cache-footprint workloads. Currently the ESX 4.1 CPU scheduler does not detect different workloads, so it cannot distinguish virtual machines from each other and select an appropriate placement method, i.e. place the virtual machine within one processor and use SMT threads, or use wide-VM NUMA placement and “isolate” a vCPU per core.

It is crucial to know that placing all vCPUs on one processor doesn’t guarantee that the virtual machine has all its memory in local memory. The main goal is to use the LLC as much as possible, but if there is a cache miss (memory not available in cache) the vCPU will fetch it from local memory. The VMkernel tries to keep memory as local as possible, but if there is not enough room in local memory, it will place the memory in remote memory. Storing memory in remote memory is still faster than swapping it out to disk, but inter-socket communication is noticeably slower than intra-socket communication.

This brings me to the migration of virtual machines between NUMA nodes. If a virtual machine’s home node is more heavily loaded than other NUMA nodes, the virtual machine will be migrated to a less loaded NUMA node. During the migration phase, local memory turns into remote memory. This newly remote memory is moved gradually because moving memory has high overhead.

Using the numa.preferHT option forces you to scope the maximum amount of memory assigned to a virtual machine and the consolidation ratio. Having multiple virtual machines traverse the QuickPath Interconnect to fetch memory stored in remote memory defeats the purpose of containing the virtual machines inside a processor.

ESX 4.1 NUMA Scheduling

VMware has made some changes to the CPU scheduler in ESX 4.1; one of the changes is the support for wide virtual machines. A wide virtual machine contains more vCPUs than the total number of available cores in one NUMA node. Wide VMs will be discussed after a quick rehash of NUMA.

NUMA stands for Non-Uniform Memory Access, which translates into a variance of memory access latencies. Both AMD Opteron and Intel Nehalem are NUMA architectures. A processor and memory form a NUMA node. Access to memory within the same NUMA node is considered local access, access to the memory belonging to the other NUMA node is considered remote access.

NUMA Local and Remote Memory

Remote memory access is slower because the memory access has to traverse an interconnect link, which introduces additional hops. As with many other techniques and protocols, more hops equal more latency, so keeping remote access to a minimum is key to good performance. (More info about NUMA scheduling in ESX can be found in my previous article “Sizing VMs and NUMA nodes”.)

If ESX detects it is running on a NUMA system, the NUMA load balancer assigns each virtual machine to a NUMA node (home node). Because soft affinity rules are assigned, the memory scheduler preferentially allocates memory for the virtual machine from its home node. In previous versions (ESX 3.5 and 4.0) the complete virtual machine is treated as one NUMA client. The total number of vCPUs of a NUMA client cannot exceed the number of CPU cores of a package (physical CPU installed in a socket), and all vCPUs must reside within the NUMA node.
NUMA Client

If the total number of vCPUs of the virtual machine exceeds the number of cores in the NUMA node, the virtual machine is not treated as a NUMA client and thus not managed by the NUMA load balancer.

Misalignment of NUMA client on NUMA node

Because the VM is not a NUMA client of the NUMA load balancer, no NUMA optimization is performed by the CPU scheduler. This means that the vCPUs can be placed on any CPU core and memory comes from either a single CPU or all CPUs in a round-robin manner. Wide virtual machines tend to be scheduled on all available CPUs.

Spanning VM as NON-NUMA Client

The ESX 4.1 CPU scheduler supports wide virtual machines. If the ESX 4.1 CPU scheduler detects a virtual machine containing more vCPUs than available cores in one NUMA node, it will split the virtual machine into multiple NUMA clients. At the virtual machine’s power-on, the CPU scheduler determines the number of NUMA clients that need to be created so each client can reside within a NUMA node. Each NUMA client contains as many vCPUs as fit inside a NUMA node.
The CPU scheduler ignores Hyper-Threading; it only counts the available number of cores per NUMA node. An 8-way virtual machine running on a four-CPU quad-core Nehalem system is split into two NUMA clients. Each NUMA client contains four vCPUs. Although the Nehalem CPU has 8 threads (4 cores plus 4 HT “threads”), the CPU scheduler still splits the virtual machine into multiple NUMA clients.

8 vCPU VM splitting into two NUMA Clients
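
The difference between the old and the new behavior can be sketched in a few lines. This is only an illustration of the rules described above, not the VMkernel algorithm: the function name and the wide_vm_support flag are hypothetical, and how a remainder of vCPUs is distributed is my assumption (each client is simply filled up to the core count).

```python
from typing import List, Optional


def numa_clients(vcpus: int, cores_per_node: int,
                 wide_vm_support: bool = True) -> Optional[List[int]]:
    """Return the vCPU count of each NUMA client of a VM.

    Returns None when the VM is not managed by the NUMA load balancer,
    which is the ESX 3.5/4.0 behavior for a VM with more vCPUs than
    cores in a node. SMT threads are deliberately not counted.
    """
    if vcpus <= cores_per_node:
        return [vcpus]              # the whole VM is a single NUMA client
    if not wide_vm_support:
        return None                 # pre-4.1: no NUMA optimization at all
    full, rest = divmod(vcpus, cores_per_node)
    # Assumption: fill each client up to the core count, remainder last.
    return [cores_per_node] * full + ([rest] if rest else [])


print(numa_clients(8, cores_per_node=4))                         # [4, 4]
print(numa_clients(8, cores_per_node=4, wide_vm_support=False))  # None
```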

The advantage of wide VM
The advantage of a wide VM is the improved memory locality: instead of allocating memory pages randomly from any CPU, memory is allocated from the NUMA nodes the virtual machine is running on.

While reading the excellent whitepaper: “VMware vSphere: The CPU Scheduler in VMware ESX 4.1 VMware vSphere 4.1 whitepaper” one sentence caught my eye:

However, the memory is interleaved across the home nodes of all NUMA clients of the VM.

This means that the NUMA scheduler uses an aggregated memory locality of the VM to the set of NUMA nodes. Call it memory vicinity. The memory scheduler receives a list (called a node mask) of the NUMA nodes the virtual machine is scheduled on.

NUMA Node Mask

The memory scheduler will preferentially allocate memory for the virtual machine from this set of NUMA nodes, but it can distribute pages across all the nodes within this set. This means that there is a possibility that the CPU from NUMA node 1 uses memory from NUMA node 2.

Initially this looks like no improvement compared to the old situation, but fortunately supporting wide VMs makes a big difference. Wide-VM support stops large VMs from being scattered all over the CPUs with no memory locality at all. Instead of distributing the vCPUs all over the system, using a node mask of NUMA nodes enables the VMkernel to make better memory allocation decisions for the virtual machines spanning the NUMA nodes.

Reservations and CPU scheduling

Most of my resource management articles focus more on the behavior of memory management than on CPU management, mainly because the memory scheduler within ESX is such an interesting, complex system, comprising memory allocation, swapping and reclamation, with algorithms such as Idle Memory Tax and mechanisms like ballooning and swapping. But lately CPU scheduling seems to attract more and more of my attention. The discussion Duncan and I had prior to the posting of his article about CPU limits sparked my interest in how CPU scheduling works when setting reservations. So, in addition to Duncan’s excellent article, I want to take a closer look at how the ESX CPU scheduler handles CPU reservations and shares, and show why CPU scheduling is fairer than memory management.

Similar to memory, the resource allocation settings (reservations, shares and limits) can be set at the CPU level. Limits and shares behave similarly for CPU and memory. Reservations act differently; let’s take a quick look at the resource allocation settings:

Shares: Shares indicate the proportional value of the entity on the same hierarchical level. If everything else (reservations, limits and active utilization) is equal, the virtual machine that is allocated twice as many shares as another virtual machine is entitled to consume twice as many CPU cycles.

Limit: A limit is a mechanism to restrict physical resource usage of the virtual machine. A limit ensures that the VM will never receive more CPU cycles than specified, even if extra cycles are available on the host.

Reservation: A reservation is a guarantee of the specified amount of physical resources regardless of the total number of shares in its environment.

Now reservations act differently when set on CPU than when set on memory. When the virtual machine does not use its reserved CPU cycles, these cycles are redistributed to other active virtual machines, so unused reservations are not wasted. This is contrary to memory management, where reserved memory is not reclaimed by the scheduler once the virtual machine has touched the pages.

By redistributing available CPU cycles and not letting the virtual machine hoard CPU resources, the VMkernel tries to properly divide the resources and achieve better fairness among virtual machines and improve utilization of the resources. To achieve both goals and divide the CPU resources among virtual machines the CPU scheduler calculates a MHzPerShare metric. This metric tries to identify which virtual machines are “ahead” of their entitlement and which virtual machines are “behind” and do not fully utilize their entitlement.

MHzPerShare = MHzUsed / Shares

MHzUsed is the current utilization of the virtual machine measured in Megahertz.
Shares is the current configured amount of shares of the virtual machine.

For example, if a virtual machine is using 2500 MHz and has 1000 shares, its MHzPerShare value is 2.5. The VMkernel calculates the MHzPerShare value of each active virtual machine, and the virtual machine with the lowest MHzPerShare value has the highest priority to run on the CPU. If the virtual machine with the lowest MHzPerShare value decides not to use its right to allocate the cycles, the cycles can be used by the virtual machine with the next lowest MHzPerShare value.

ESX CPU Scheduler MHzPerShare distribution
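
As a small illustration of the metric, the sketch below computes MHzPerShare for a few virtual machines and sorts them by priority. Only the 2500 MHz / 1000 shares case comes from the text; the VM names and the other two sets of numbers are made up for the example.

```python
def mhz_per_share(mhz_used: float, shares: int) -> float:
    """MHzPerShare = MHzUsed / Shares."""
    return mhz_used / shares


# name -> (MHz used, shares); vm-b and vm-c are hypothetical values
vms = {
    "vm-a": (2500, 1000),
    "vm-b": (1000, 2000),
    "vm-c": (1500, 1000),
}

ratios = {name: mhz_per_share(used, shares) for name, (used, shares) in vms.items()}

# Lowest MHzPerShare first: that VM has the highest priority to run.
priority_order = sorted(ratios, key=ratios.get)

print(ratios["vm-a"])    # 2.5
print(priority_order)    # ['vm-b', 'vm-c', 'vm-a']
```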

Although not shown, reservations play an important part in this calculation. As mentioned before, reservations overrule shares and guarantee the amount of physical resources regardless of the number of shares. This means that the virtual machine can always use the CPU cycles specified in its reservation, even if the virtual machine has a greater MHzPerShare value. So how exactly do reservations and shares interact with each other when it comes to calculating the MHzPerShare value?

For example:

In a 6 GHz system, three virtual machines are powered on. VM1 is running a memory-intensive app and doesn’t really care much about CPU cycles; it is configured with 1000 CPU shares and no reservation. The two other virtual machines run CPU-intensive apps and are currently competing for resources. VM2 has a reservation of 2250 MHz and the default setting of 1000 shares. The other CPU-intensive virtual machine, VM3, is equipped with 2 vCPUs and therefore receives 2000 shares, but the administrator didn’t set any reservation.

Now VM1 is running at 500 MHz; with its 1000 shares, its MHzPerShare value equals 0.5. Because VM2 is in need of CPU cycles, it immediately utilizes its reservation and “occupies” all 2250 MHz; its MHzPerShare value equals 2.25 (2250/1000).
ESX CPU scheduler free MHz distribution

Now because VM3 doesn’t have any reservation and is in need of CPU cycles, the VMkernel looks at its MHzPerShare value to decide how many CPU cycles it can use before distributing excess CPU cycles to other virtual machines. The kernel will distribute cycles to VM3 until it reaches the same MHzPerShare value as VM2, which is 2.25. In theory this means that the VMkernel will allocate 2000 x 2.25 = 4500 MHz before looking at another VM. Because the CPU scheduler has already allocated 500 MHz to VM1 and 2250 MHz to VM2 of the available 6 GHz, it can allocate 3250 MHz to VM3.
ESX CPU Scheduler MHzPerShare value

Because VM2 has a reservation, it can allocate up to its reservation even though VM3 initially has a lower MHzPerShare value (0) and the CPU cycle requirements of VM1 are met at 500 MHz. However, due to the fairness principle, VM2’s own MHzPerShare value influences the VMkernel’s decision on how many cycles to allocate to VM3 before considering allocating additional cycles to VM2 again.
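
The arithmetic of this first scenario can be checked in a few lines. This is only a worked calculation with the numbers from the example, not a model of the scheduler; the variable names are my own.

```python
HOST_MHZ = 6000

vm1_used = 500                      # VM1 is satisfied at 500 MHz
vm2_used = 2250                     # VM2 claims its full reservation
vm2_shares = 1000
vm3_shares = 2000

vm2_ratio = vm2_used / vm2_shares   # MHzPerShare of VM2

# VM3 may consume cycles until it reaches the MHzPerShare value of VM2 ...
vm3_fair_target = vm3_shares * vm2_ratio

# ... but the host cannot hand out more than what is left.
vm3_available = HOST_MHZ - vm1_used - vm2_used
vm3_allocated = min(vm3_fair_target, vm3_available)

print(vm2_ratio)        # 2.25
print(vm3_fair_target)  # 4500.0
print(vm3_allocated)    # 3250
```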

Now, for some reason, the application in VM3 levels out at 2000 MHz, VM1 is still using 500 MHz and VM2 is in desperate need of extra CPU cycles. No settings are changed, so VM1 and VM2 have 1000 shares each and VM2 has a reservation of 2250 MHz; VM3 has 2000 shares and no reservation.

The VMkernel will satisfy the request of VM1, resulting in a MHzPerShare value of 0.5. VM2 claims its reservation and utilizes 2250 MHz, resulting in a MHzPerShare value of 2.25. VM3 can allocate up to 4500 MHz before reaching the MHzPerShare value of VM2, but stops consuming at 2000 MHz, ending up with a MHzPerShare value of 2000/2000 = 1. This means that 1250 MHz of the 6 GHz host is still available.

The CPU scheduler will shop around with these available cycles and see which VM is interested. The VMkernel offers the cycles to the virtual machines in increasing order of MHzPerShare. First it asks VM1 (0.5); because its CPU request is satisfied, it forfeits its claim. VM3 (1) also forfeits its claim, because its application has leveled out at 2000 MHz, so VM2 happily accepts the remaining cycles and its resource usage increases to 3500 MHz.
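
The “shopping around” can be sketched as a single pass over the virtual machines in increasing MHzPerShare order. This is a simplification under stated assumptions: the function and the 'demand' field are hypothetical, and VM2’s demand of 6000 MHz is an arbitrary stand-in for “desperate need”. With the numbers of this scenario, VM1 and VM3 are already satisfied, so the spare 1250 MHz ends up with VM2.

```python
def offer_spare_cycles(spare_mhz, vms):
    """Hand out spare MHz in increasing order of MHzPerShare.

    vms maps a VM name to a dict with 'used', 'shares' and 'demand' (MHz).
    A VM whose demand is already met forfeits its claim.
    Returns the MHz left over after every VM has been asked once.
    """
    order = sorted(vms, key=lambda name: vms[name]["used"] / vms[name]["shares"])
    for name in order:
        vm = vms[name]
        wanted = max(vm["demand"] - vm["used"], 0)
        granted = min(wanted, spare_mhz)
        vm["used"] += granted
        spare_mhz -= granted
    return spare_mhz


vms = {
    "VM1": {"used": 500,  "shares": 1000, "demand": 500},
    "VM2": {"used": 2250, "shares": 1000, "demand": 6000},  # assumed demand
    "VM3": {"used": 2000, "shares": 2000, "demand": 2000},
}

spare = 6000 - sum(vm["used"] for vm in vms.values())
left = offer_spare_cycles(spare, vms)

print(spare)                # 1250
print(vms["VM2"]["used"])   # 3500
print(left)                 # 0
```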

So there you have it: shares and reservations interact, or even battle, with each other to allocate CPU cycles to the virtual machines. Shares are perceived by many as an inferior resource allocation setting; hopefully this demonstrates the power of shares. In combination with utilization, they can become a very important factor in ESX resource management.

Sizing VMs and NUMA nodes

Note: This article describes NUMA scheduling on the ESX 3.5 and ESX 4.0 platforms. vSphere 4.1 introduced wide virtual machines; information about this can be found in my new article: ESX 4.1 NUMA scheduling

With the introduction of vSphere, VM configurations with 8 vCPUs and 255 GB of memory are possible. While I haven’t seen many VMs with more than 32 GB, I receive a lot of questions about 8-way virtual machines. With today’s CPU architecture, VMs with more than 4 vCPUs can experience a decrease in memory performance when used on NUMA-enabled systems. While the actual percentage of performance decrease depends on the workload, avoiding a performance decrease must always be on the agenda of any administrator.

Does this mean that you should stay clear of creating large VMs? There is no need to if the VM needs that kind of computing power, but the reason I’m writing this is that I see a lot of IT departments applying the same configuration policy they use for physical machines. A virtual machine gets configured with multiple CPUs or loads of memory because it might need them at some point during its lifecycle. While this method saves time and hassle and avoids office politics, this policy can create unnecessary latency for large VMs. Here’s why:

NUMA node
Most modern CPUs, such as Intel’s new Nehalem and AMD’s veteran Opteron, are NUMA architectures. NUMA stands for Non-Uniform Memory Access, but what exactly is NUMA? Each CPU gets assigned its own “local” memory; CPU and memory together form a NUMA node. An OS will try to use its local memory as much as possible, but when necessary the OS will use remote memory (memory within another NUMA node). Memory access time can differ depending on the memory location relative to a processor, because a CPU can access its own memory faster than remote memory.

Figure 1: Local and Remote memory access

Accessing remote memory will increase latency; the key is to avoid this as much as possible. So how can you maximize memory locality?

VM sizing pitfall #1, vCPU sizing and Initial placement.

ESX is NUMA aware and will use the NUMA CPU scheduler when detecting a NUMA system. On non-NUMA systems the ESX CPU scheduler spreads load across all sockets in a round-robin manner. This approach improves performance by utilizing as much cache as possible. When using a vSMP virtual machine in a non-NUMA system, each vCPU is scheduled on a separate socket.
On NUMA systems, the NUMA CPU scheduler kicks in and uses the NUMA optimizations to assign each VM to a NUMA node; the scheduler tries to keep the vCPUs and memory located in the same node. When a VM has multiple vCPUs, all the vCPUs will be assigned to the same node and will reside in the same socket, to support memory locality as much as possible.

Figure 2: NON-NUMA vCPU placement

Figure 3: NUMA vCPU placement

At this moment, AMD and Intel offer quad-core CPUs, but what if the customer decides to configure an 8-vCPU virtual machine? If a VM cannot fit inside one NUMA node, the vCPUs are scheduled in the traditional way again and are spread across the CPUs in the system. The VM will not benefit from the local memory optimization, and it’s possible that the memory will not reside locally, adding latency by crossing the inter-socket connection to access the memory.

VM sizing pitfall #2: VM configured memory sizing and node local memory size
The NUMA scheduler will assign all vCPUs to a NUMA node, but what if the configured memory of the VM is greater than the local memory of the NUMA node? Not aligning the VM’s configured memory with the local memory size will stop the ESX kernel from using NUMA optimizations for this VM. You can end up with the VM’s memory scattered all over the server.
So how do you know how much memory every NUMA node contains? Typically each socket is assigned the same amount of memory; the physical memory (minus service console memory) is divided between the sockets. For example, 16 GB will be assigned to each NUMA node on a two-socket server with 32 GB of total physical memory. A quick way to confirm the local memory configuration of the NUMA nodes is firing up esxtop. Esxtop will only display NUMA statistics if ESX is running on a NUMA server. The first number lists the total amount of machine memory in the NUMA node that is managed by ESX; the statistic displayed within the round brackets is the amount of machine memory in the node that is currently free.

Figure 4: esxtop memory totals
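
As a back-of-the-envelope check before sizing a VM, the even split can be sketched as follows. The function names are hypothetical, and the sketch deliberately ignores service console memory and per-VM overhead memory, so treat the result as an upper bound and confirm the real numbers in esxtop.

```python
def local_memory_per_node_gb(total_gb: float, sockets: int) -> float:
    """Assume physical memory is divided evenly between the sockets."""
    return total_gb / sockets


def fits_local_memory(vm_memory_gb: float, total_gb: float, sockets: int) -> bool:
    """True if the VM's configured memory fits inside one NUMA node."""
    return vm_memory_gb <= local_memory_per_node_gb(total_gb, sockets)


# Two-socket server with 32 GB of physical memory
print(local_memory_per_node_gb(32, 2))   # 16.0
print(fits_local_memory(12, 32, 2))      # True
print(fits_local_memory(24, 32, 2))      # False
```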

Let’s explore the NUMA statistics in esxtop a little more based on this example. This system is an HP BL460c with two quad-core Nehalem processors and 64 GB of memory. As shown, each NUMA node is assigned roughly 32 GB. The first node has 13 GB free; the second node has 372 MB free. It looks like it will run out of memory space soon; luckily the VMs on that node can still access remote memory. When a VM has a certain amount of memory located remotely, the ESX scheduler migrates the VM to another node to improve locality. It’s not documented what threshold must be exceeded to trigger the migration, but it’s considered poor memory locality when a VM has less than 80% of its memory mapped locally, so my “educated” guess is that it will be migrated when the VM drops below 80%. Esxtop memory NUMA statistics show the memory location of each VM. Start esxtop, press m for the memory view, press f to customize esxtop and press f to select the NUMA Statistics.

Figure 5: Customizing esxtop

Figure 6 shows the NUMA statistics of the same ESX server with a fully loaded NUMA node, the N%L field shows the percentage of mapped local memory (memory locality) of the virtual machines.

Figure 6: esxtop NUMA statistics

It shows that a few VMs access remote memory. The man pages of esxtop explain all the statistics:

Metric         Explanation
NHN            Current home node for the virtual machine
NMIG           Number of NUMA migrations between two snapshots. It includes balance migrations and inter-node VM swaps performed for locality balancing and load balancing
NRMEM (MB)     Current amount of remote memory being accessed by the VM
NLMEM (MB)     Current amount of local memory being accessed by the VM
N%L            Current percentage of memory being accessed by the VM that is local
GST_NDx (MB)   The guest memory being allocated for the VM on NUMA node x, where “x” is the node number
OVD_NDx (MB)   The VMM overhead memory being allocated for the VM on NUMA node x
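
To show how these fields relate, here is a minimal sketch that derives a locality percentage from NLMEM and NRMEM and flags VMs below 80%. The formula is my assumption of how N%L relates to the two memory counters, the 80% threshold is the “poor locality” mark mentioned earlier, and the VM names and values are made up.

```python
POOR_LOCALITY_THRESHOLD = 80.0  # percent, the "poor locality" mark


def percent_local(nlmem_mb: float, nrmem_mb: float) -> float:
    """Assumed relation: N%L = NLMEM / (NLMEM + NRMEM) * 100."""
    total = nlmem_mb + nrmem_mb
    if total == 0:
        return 100.0  # nothing mapped yet, so nothing is remote
    return 100.0 * nlmem_mb / total


# name -> (NLMEM in MB, NRMEM in MB); hypothetical sample values
samples = {"vm-a": (4096, 0), "vm-b": (3072, 1024), "vm-c": (1024, 3072)}

for name, (local_mb, remote_mb) in samples.items():
    pct = percent_local(local_mb, remote_mb)
    status = "poor locality" if pct < POOR_LOCALITY_THRESHOLD else "ok"
    print(f"{name}: N%L = {pct:.0f} ({status})")
```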

Transparent page sharing and memory locality.
So how about transparent page sharing (TPS)? It can increase latency if a VM on node 0 shares its pages with a VM on node 1. Luckily VMware thought of that, and TPS across nodes is disabled by default to ensure memory locality. TPS still works, but will share identical pages only within a node. The memory saved by sharing pages system-wide does not outweigh the performance hit of accessing remote memory.

Figure 7: NUMA TPS boundaries

This behavior can be changed by altering the setting VMkernel.Boot.sharePerNode. As with most default settings in ESX, only change this setting if you are sure that it will benefit your environment; 99.99% of all environments will benefit from the default setting.

Take away
With the introduction of vSphere ESX 4, the software layer surpasses some abilities current hardware techniques can offer. ESX is NUMA aware and tries to ensure memory locality, but when a VM is configured outside the NUMA node limits, ESX will not apply NUMA node optimizations. While a VM will still run correctly without NUMA optimizations, it can experience slower memory access. While the actual percentage of performance decrease depends on the workload, avoiding a performance decrease where possible must always be on the agenda of any administrator.

To quote the resource management guide:

The NUMA scheduling and memory placement policies in VMware ESX Server can manage all VM transparently, so that administrators do not need to address the complexity of balancing virtual machines between nodes explicitly.

While this is true, administrators must not treat the ESX server as a black box; with this knowledge administrators can make informed decisions about their resource policies. This information can help to adopt a scale-out policy (multiple smaller VMs) for some virtual machines instead of a scale up policy (creating large VMs) if possible.

Besides the preference for a scale-up or scale-out policy, a virtual environment will profit when administrators choose to keep the VMs as agile as possible. My advice to each customer is to configure the VM to reflect its current and near-future workload and to actively monitor its habits. Creating the VM with a configuration which might be suitable for the workload somewhere in its lifetime can have a negative effect on performance.

Get notification of these blogs postings and more DRS and Storage DRS information by following me on Twitter: @frankdenneman


© 2017
