ESX 4.1 NUMA Scheduling

33 Replies to “ESX 4.1 NUMA Scheduling”

duncan says:

September 13, 2010 at 1:55 pm

nice article again frank!
Yellow-Bricks says:

September 13, 2010 at 1:56 pm

nice article again.
Gabrie van Zanten says:

September 13, 2010 at 2:16 pm

Hi Frank,
Very good post!
You say the number of cores and the number of NUMA clients is decided at startup of the VM. When a VM would VMotion from 4 core to 6 core CPU, would this not change the number of NUMA clients?
Gabrie
1. Frank Denneman says:
  
  September 13, 2010 at 9:47 pm
  
  Hi Gabrie,
  Thanks for the compliment.
  The NUMA scheduler is a part of the local CPU and Memory scheduler and not of the global (DRS) scheduler, so the local scheduler calculates the amount of NUMA clients and assigns the NUMA nodes to the virtual machine. As the virtual machine is being migrated to the new host, the local CPU scheduler must assign the amount of NUMA clients. The CPU scheduler tries to load as many vCPU in a NUMA client, so my guess is that CPU scheduler creates a NUMA client containing 6 vCPU’s and a 2 vCPU numa client.
Arnim van Lieshout says:

September 13, 2010 at 3:25 pm

Great article Frank.
Just the information I was looking for as I’m looking into deploying 8vcpu vms.
I want to raise one question:
As you said in your article the CPU scheduler determines at boot time the number of NUMA clients.
So let’s asume I have 2 sockets packed with 12 cores each. If I boot up VM1 which has 8vcpu’s it’ll fit inside 1 NUMA client. The same is true for VM2, which also has 8vcpu’s.
Now let’s asume that both vm’s are heavy workloads and are scheduled all the time.
I now boot up VM3 (and VM4, ect,etc) with 8vcpu’s. The CPU scheduler determines the vm will fit into one NUMA client again.
When the CPU scheduler now wants to schedule this VM3 it cannot be scheduled together with VM1 and VM2 since I only have 2x 4 cores unoccupied. Does VM3 just need to wait for VM1 or VM2 to finish before it can be scheduled or will the CPU scheduler split this VM3 into 2 NUMA client (2x 4vcpu) so it can be scheduled simultaneously?
When the CPU scheduler only determines the number of NUMA clients at startup, I’m concluding that I won’t benefit from a 12core system over an 8core system if I only want to run 8vcpu machines! Is this correct?
Thanks for sharing.
-Arnim
Matt Liebowitz says:

September 13, 2010 at 3:32 pm

Frank – great article! I had two questions for follow-up, one really more of a question/feature request.
1) Have you noticed that ESX seems to heavily favor NUMA nodes 0 and 1? I’ve seen in 4 socket environments (4 x 6 core, 128GB per host) that NUMA nodes 0 and 1 have almost no memory left in them while nodes 2 and 3 have nearly all of their memory available. I thought ESX performed migrations between NUMA nodes if that situation occurred but I’ve seen the behavior on at least 5 different ESX hosts.
2) Any idea when DRS will become NUMA aware and migrate VMs based on whether it could improve memory locality by moving VMs to another host? Seems like it would be easy enough to do. Then again I think we’ve all been waiting for a CPU Ready aware DRS so having one that is also NUMA aware is a pipe dream.
Keep up the great work…
NiTRo says:

September 13, 2010 at 11:51 pm

Really nice article indeed ! About vmotion, my guess is the NUMA clients “split” is decided at VMM start rather that at the VM start so for each vmotion, NUMA clients are recalculated again. Am I right Frank ?
Craig Risinger says:

September 17, 2010 at 8:25 pm

Just to be clear then: ESX 4.1 lessens but does not eliminate the performance penalty of having a VM wider than a single NUMA node.
If you have 4 NUMA nodes and your VM needs two of them, with ESX 4.1 the probability that any given CPU access to memory will be to a NUMA-local address is 1/2 instead of 1/4. (In practice there are probably factors that change those odds but that’s the first-order estimate.) For example, say the VM sits on nodes A and B but not C or D. When vCPU0 running on node A accesses memory, the memory could be on A or B (either equally likely) but–with 4.1–not on C or D.
Agreed?
Frank Denneman says:

September 17, 2010 at 9:56 pm

Craig,
Totally! The memory can be supplied from one of the nodes listed in the node mask of the virtual machine. In your example, the node mask contains CPU A and B. C and D are not listed in the node mask and therefore the Memory scheduler uses a “soft” affinity rule for memory allocation in node A and node B.
its a soft affinity rule, because in the case of memory “shortage” in both nodes, the memory scheduler still can allocate remote memory. Remote memory is slower than local memory but still “way” better than ballooning, compressing or swapping.
Pingback: 5 days to vote | frankdenneman.nl
Pingback: Sizing VMs and NUMA nodes | frankdenneman.nl
Craig Stewart says:

September 21, 2010 at 4:07 pm

Superb article Frank, clarifies a lot of my fears around a 3rd party vendor asking us to implement 8-vCPU SQL and application VMs on 2 x 6 core westmere. What’s worse is their application is heavily memory driven to reduce latency so I have highlighted this particular issue to them as it could be a major problem.
When I asked them why the 8-vCPU machine, their answer was “because you have Enterprise Plus Licensing and you can”. Needless to say I have since put them directly in touch with the VMware ISV certification team as I think they have a bit of learning to do about the downside of insisting on 8-vCPU machines.
Great work as always Frank
Cheers
Craig
Brandon says:

September 22, 2010 at 11:28 am

Great article, sorry for hijacking..
@Matt Liebowitz – On a 4 NUMA Node box (AMD) I have seen the issue of NUMA nodes 0 and 1 being heavily loaded (high cpu ready time) when nodes 2 and 3 are almost unused. I have logged NUMEROUS SR’s with VMware with no luck (they KNOW there is an issue). My best option was to actually reduce the amount of memory in my servers, thus making the NUMA nodes smaller, which then made the scheduler use nodes 2/3 more.
There is a command line, which changes the default behaviour of the scheduler, whereby it will round robin initially placement of VM’s, however I don’t find that this command line works that well to be honest.
Now that Intel boxes are NUMA, im sure they will get A LOT more complaints about how poor the NUMA scheduler is when compared to NON NUMA scheduling, eg you dont get HOT CPU’s on old school Intel boxes, all CPU’s run very evenly.
Pingback: Technology Short Take #4 - blog.scottlowe.org - The weblog of an IT pro specializing in virtualization, storage, and servers
Pingback: NUMA, Hyperthreading and NUMA.PreferHT | frankdenneman.nl
Stu says:

October 11, 2010 at 2:08 am

@Arnim – I was pondering this same question recently (albeit with reference to Westmere CPU’s (6 cores each) and 3 x 4 vCPU VM’s heavily loaded with a properly multithreaded application.
Prior to ESX 4.0, with the cell scheduler, if you leave the default cell size of 4 (on a 6-core CPU) then as detailed in http://kb.vmware.com/kb/1007361 you would get 2 VM’s running entirely within a node and 1 VM split across nodes.
But as we all know (thanks in no small part to posts like this from Frank), the cell scheduler is gone in 4.x. Since all the VM’s in this scenario would be identified as NUMA clients, I guess several questions are raised:
1) Is the decision to run a VM as a NUMA client based purely on number of vCPU’s of the VM, or is system load taken into account? I’m leaning towards purely vCPU, otherwise you would get different behaviour depending on whether all 3 were powered on simultaneously or if 2 were powered on and under load before the 3rd was powered on.
2) Is the NUMA scheduler so strict that it will only schedule a client within a node, even though it would result in an unbalanced system load? From everything I have read, this appears to be the case.
3) Can a running VM be migrated from one scheduler to another? This capability would provide a solution, however I have no idea if that’s what happens or even if it’s technically possible.
Depending on the answer, you may find you’d actually be better off creating VM’s with 3vCPUs when you have 6-core CPUs. Don’t be fooled into thinking that a 3vCPU machine is not SMP – the ‘S’ in SMP simply denotes that all CPU’s have access to the same resources, not that there needs to be an even number of CPUs :). An ASMP system is denoted by specific CPUs doing specific things.
Finding the answer to this should keep Frank busy at VMworld, although it wouldnt surprise me if he knew if off the top of his head 😉
Brandon says:

November 16, 2010 at 12:12 pm

Bump, wondering if anybody have any updates at all, have things improved in 4.1 in regards to NUMA boxes and poor VM distribution between NUMA nodes? Ive moved on to an Intel only site (old intel no NUMA), but am about to deploy some 48 core AMD boxes, so would be interested to know if the situation has changed
Pingback: Impact of oversized virtual machines part 1 | frankdenneman.nl
Pingback: ESX 4.1 NUMA Scheduling « vmhackers
Pingback: Impact of oversized virtual machines part 3 | frankdenneman.nl
Pingback: AMD Magny-Cours and ESX | frankdenneman.nl
Pingback: VMware vSphere: Manage for Performance Course Experience « TheSaffaGeek
Ed Grigson says:

March 22, 2011 at 11:15 am

@Brandon – we’re experiencing the same issue. HP BL460c blades (Nehalem) and one NUMA node is at 99% memory utilisation whereas the other is only 25%. The memory locality of one VM is at only 54% even though there’s enough free memory on the other node for it to be migrated but it never does. We’re only on v4.0u1 admittedly – upgrading and trying the round robin fix are my next steps.
Pingback: %RDY & NUMA : ESX 4.0 ou 4.1 ? - Hypervisor.fr
lutfi says:

June 21, 2011 at 2:59 pm

Hi Frank,
Really useful article indeed =)
I just want to know, can we know how many vCPU will we have in a host? Let’s say I have a Xeon X5675 host (has 6 core, 12 threads) and a X5570 (4 core, 8 threads). How many vCPU available that can be assigned to a single VM?
Many thanks in advance
Jeremy says:

July 14, 2011 at 7:57 pm

We have had lots of issues with poor NUMA balancing on Four CPU 12-core AMD hosts (48 cores). I have been up and down with our TAM and GSS and they have discovered why the nodes aren’t being rebalanced. During vMotion the world is not entirely initialized on the destination host, so the NUMA round-robin isn’t invoked and most/all VMs keep getting placed in NUMA node 0/1. They recognize the issue but getting answers on when its getting fixed has been difficult.
Pingback: Get NUMA information across hosts using PowerCLI « vPeeling
Invisible says:

November 18, 2011 at 9:10 pm

Frank,
First of all – great article. I’ve done quite a reading recently about this topic and by observing my production environment (4.1 U1, HP BL460 G6, 2 QC CPUs, 96GB of RAM – 48GB on each CPU) got questions I could not find answers yet:
For example – I have bunch of VMs running on that host. Number of VCPUs of any particular VM does not exceed number of cores: I have mostly 2 and some 4 VCPU VMs, RAM of VMs fit entirely into NUMA node – most of VMs have 4GB or 6GB of RAM, however there are couple of 16GB VMs.
Yet I see that some VMs display very poor memory locality (less than 30% – one memory intensive VM with 16GB of RAM has 13% of local memory access according to esxtop) but at the same time ESX server is not memory overcommited and there is enough memory available on each NUMA node.
Why does this happen and why CPU/NUMA schedule does not try to improve memory locality? Also, according to esxtop there was no migration of the VM on another NUMA node despite the fact that other node has enough free memory available.
vSphere 4.1 Resource Management Guide, page 20 says following:
—
In undercommitted systems, the ESX CPU scheduler spreads load across all sockets by default. This improves performance by maximizing the aggregate amount of cache available to the running virtual CPUs. As a result, the virtual CPUs of a single SMP virtual machine are spread across multiple sockets (unless each socket is also a NUMA node, in which case the NUMA scheduler restricts all the virtual CPUs of the virtual machine to reside on the same socket.)
In some cases, such as when an SMP virtual machine exhibits significant data sharing between its virtual CPUs, this default behavior might be sub-optimal. For such workloads, it can be beneficial to schedule all of the virtual CPUs on the same socket, with a shared last-level cache, even when the ESX/ESXi host is undercommitted. In such scenarios, you can override the default behavior of spreading virtual CPUs across packages by including the following configuration option in the virtual machine’s .vmx configuration file: sched.cpu.vsmpConsolidate=”TRUE”.
To find out if a change in this parameter helps with performance, please do proper load testing.
—–
Does sched.cpu.vsmpConsolidate=”TRUE” setting has any effect on NUMA systems and memory locality?
Further down in the document, page 86:
—
When memory is allocated to a virtual machine, the ESX/ESXi host preferentially allocates it from the home node.
The NUMA scheduler can dynamically change a virtual machine’s home node to respond to changes in system load. The scheduler might migrate a virtual machine to a new home node to reduce processor load imbalance. Because this might cause more of its memory to be remote, the scheduler MIGHT MIGRATE the virtual machine’s memory dynamically to its new home node to improve memory locality.
—
I highlighted the interesting point above in all caps. One more quote – page 87:
—
Unless a virtual machine’s home node changes, it uses only local memory, avoiding the performance penalties associated with remote memory accesses to other NUMA nodes.
Rebalancing is an effective solution to maintain fairness and ensure that all nodes are fully used. The rebalancer might need to move a virtual machine to a node on which it has allocated little or no memory. In this case, the virtual machine incurs a performance penalty associated with a large number of remote memory accesses.
ESX/ESXi can eliminate this penalty by transparently migrating memory from the virtual machine’s original node to its new home node:
1 The system selects a page (4KB of contiguous memory) on the original node and copies its data to a page in the destination node.
2 The system uses the virtual machine monitor layer and the processor’s memory management hardware to seamlessly remap the virtual machine’s view of memory, so that it uses the page on the destination node for all further references, eliminating the penalty of remote memory access.
—
So my questions are:
What are factors influencing the decision of scheduling of the particular VM on the particular NUMA node and migration of a VM between NUMA nodes?
Do other settings on ESX server affect memory locality? I disabled large pages on ESX server to improve TPS ratio (memory utilization on ESX server declined from 70% to 40%) could it play a role into poor locality?
VM’s home node has never changed (NMIG=0 for all VMs) yet some VMs are using remote memory and at the same time there is enough free memory available on the home NUMA node. Even when VM is using remote memory, the percentage of that remote memory usage has not declined after time – above mentioned 4KB page copying mechanism does not work. Why?
On page 114 there is mentioning of VM’s advanced setting:
numa.mem.interleave – Specifies whether the memory allocated to a virtual machine is statically interleaved across all the NUMA nodes on which its constituent NUMA clients are running. By default, the value is TRUE.
If I put FALSE on that setting will that mean that VM won’t use remote memory?
Here – http://download.virtuallyghetto.com/hidden_vmx_params.html I also found some other NUMA related undocumented parameters:
numa.mem.firstAccess
numa.mem.nextAccess
I would be interested if anyone can comment on these.
——————–
Anyway, why I wrote this essay – most of our apps are memory intensive and performance depends on it. I have not yet seen what performance penalties are associated with the remote memory access, but two academic research papers I found give quite a serious thought base:
Measuring NUMA effects with the STREAM benchmark
http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.3225v1.pdf
A Case for NUMA-aware Contention Management on Multicore Systems
http://www.sfu.ca/~sba70/files/atc11-blagodurov.pdf
Pingback: CPU Sockets VS. Cores/Socket Plus Numa Nodes Explained | Computer Maintenance, Repair and Optimization!
Pingback: So what is NUMA?! « allthingsvIT
Pingback: The Impact Of NUMA On Virtualizing Business Critical Applications « vArchitect Musings
Hamler says:

February 17, 2013 at 10:05 am

Hello, I’m new here, but I’m surprised about how good is your work Frank. So, let me ask you a question. If I have a server with 4 sockets and 6 cores per socket, and heavy loaded CPU Virtual Machines, what is the best? 4 VM with 6 vCPU or 3 with 8 vCPU. Please I need to know this. Thanks you very much.
Vishal says:

February 28, 2013 at 11:47 am

Hi Frank
can you pls share your thoughts on vNUMA with by defualt enabled on more than 8 cores VM.
Rgds
Vishal