There seems to be a lot of confusion about this BIOS setting; I receive lots of questions on whether to enable or disable Node Interleaving. I guess the term “enable” makes people think it is some sort of performance enhancement. Unfortunately, the opposite is true, and it is strongly recommended to keep the default setting and leave Node Interleaving disabled.
Node interleaving option only on NUMA architectures
The Node Interleaving option exists on servers with a non-uniform memory access (NUMA) system architecture; the Intel Nehalem and AMD Opteron are both NUMA architectures. In a NUMA architecture multiple nodes exist. Each node contains a CPU and memory and is connected to the other nodes via a NUMA interconnect. A pCPU uses its onboard memory controller to access its own “local” memory and reaches the remaining “remote” memory via the interconnect. Because memory can exist in these different locations, the system experiences “non-uniform” memory access times.
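If you want to see this node layout for yourself, the sketch below lists the NUMA nodes that a Linux kernel exposes under /sys/devices/system/node. This is purely illustrative for comparable hardware running Linux; ESX consumes the ACPI SRAT directly and does not use this interface.

```python
# Minimal sketch: list NUMA nodes and their CPUs/memory as exposed by Linux
# under /sys/devices/system/node. Illustrative only; ESX reads the ACPI SRAT
# itself and does not go through this sysfs view.
import glob
import os
import re

for node_path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_path)
    with open(os.path.join(node_path, "cpulist")) as f:
        cpus = f.read().strip()                      # e.g. "0-3" or "0,2,4,6"
    mem_kb = 0
    with open(os.path.join(node_path, "meminfo")) as f:
        for line in f:
            m = re.search(r"MemTotal:\s+(\d+) kB", line)
            if m:
                mem_kb = int(m.group(1))
    print(f"{node}: CPUs {cpus}, memory {mem_kb // 1024} MB")
```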
Node interleaving disabled equals NUMA
With the default setting of Node Interleaving (disabled), the system builds a System Resource Allocation Table (SRAT). ESX uses the SRAT to understand which memory bank is local to a pCPU and tries* to allocate local memory to each vCPU of the virtual machine. By using local memory, the CPU can use its own memory controller, does not have to compete for access to the shared interconnect (bandwidth), and reduces the number of hops needed to access memory (latency).
* If the local memory is full, ESX will resort to storing memory in remote memory, because this will always be faster than swapping it out to disk.
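The toy model below captures that preference order: local memory first, remote memory second, and only then swap to disk. The place_page function is my own illustration of the behaviour described above, not VMware's actual allocator.

```python
# Conceptual sketch of the placement preference: local first, remote second,
# swap last. Not ESX's real algorithm, just the ordering described above.

def place_page(home_node, free_pages):
    """Return where a page for a VM homed on `home_node` would land.

    `free_pages` maps node id -> number of free pages on that node.
    """
    if free_pages.get(home_node, 0) > 0:
        free_pages[home_node] -= 1
        return f"local (node {home_node})"
    for node, free in free_pages.items():
        if node != home_node and free > 0:
            free_pages[node] -= 1
            return f"remote (node {node})"
    return "swapped to disk"

# Example: node 0 is exhausted, node 1 still has two free pages.
free = {0: 0, 1: 2}
print([place_page(0, free) for _ in range(3)])
# ['remote (node 1)', 'remote (node 1)', 'swapped to disk']
```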
Node interleaving enabled equals UMA
If Node interleaving is enabled, no SRAT will be built by the system and ESX will be unaware of the underlying physical architecture.
ESX treats the server as a uniform memory access (UMA) system and perceives the available memory as one contiguous area. This introduces the possibility of storing memory pages in remote memory, forcing the pCPU to transfer data over the NUMA interconnect each time the virtual machine wants to access that memory.
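With interleaving enabled, the memory map is striped across the nodes, so page placement is effectively round-robin and unrelated to which node the accessing pCPU sits in. The rough sketch below (interleaved_node is a hypothetical helper, and the striping granularity is simplified compared to real firmware) shows that on a two-node system roughly half of all accesses end up remote.

```python
# Rough illustration of interleaved placement: pages striped round-robin
# across nodes, regardless of where the accessing pCPU lives.

def interleaved_node(page_number, num_nodes):
    """Node that backs a given page when memory is striped across all nodes."""
    return page_number % num_nodes

num_nodes = 2
home_node = 0                     # node of the pCPU running the virtual machine
pages = range(1000)
remote = sum(1 for p in pages if interleaved_node(p, num_nodes) != home_node)
print(f"{remote / len(pages):.0%} of accesses hit remote memory")   # ~50% on 2 nodes
```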
By leaving Node Interleaving disabled, ESX can use the System Resource Allocation Table to select the optimal placement of memory pages for the virtual machines. Therefore it’s recommended to leave this setting disabled, even though it may sound as if you are preventing the system from running more optimally.
Get notified of these blog postings and more DRS and Storage DRS information by following me on Twitter: @frankdenneman
Excellent explanation of how node interleaving relates to NUMA, the impact that it has on the hypervisor’s knowledge of the underlying hardware, and the (potential) effect on VM performance. You’ve covered NUMA fairly extensively so you’ve probably mentioned this elsewhere, but how noticeable is the performance impact when remote memory is involved? Or is the performance impact dependent on too many variables to be estimated? Just curious.
Thanks Scott,
The impact on performance depends on many factors, such as load on the core, load per pCPU, distance to the remote memory (think 4- to 8-pCPU systems), amount of remote memory, load on the interconnect and bandwidth of the interconnect. Not only can this impact the throughput and performance of the virtual machine, it can also impact the remaining virtual machines.
Bruce Herndon posted a VMmark test on NUMA performance on VROOM back in 2007. While it might be a bit dated, it shows the impact of NUMA optimizations.
The paper VMware vSphere: The CPU Scheduler in VMware ESX 4.1 documents the performance improvements of Wide VMs. Performance improvements of up to 17% are recorded, which is quite remarkable as Wide VMs only ensure the minimum use of NUMA nodes. See http://frankdenneman.nl/2010/09/esx-4-1-numa-scheduling/ for more info about Wide VMs.
Great post again !!!
Purely hypothetical: the memory of CPU0 is full and remote memory is used from CPU1. Now a new VM is started on CPU1 and demands more memory. What will ESX do? Swap out to disk for the new VM, or swap out the remote memory to disk?
Gabrie
Thanks Gabrie
“Purely hypothetical: the memory of CPU0 is full and remote memory is used from CPU1.”
Normally ESX will try to reschedule virtual machines to improve memory locality; this check is done every 2 seconds. Before you ask, I cannot disclose the exact threshold metric / decision tree used by ESX. But let’s continue with your question.
“Now a new VM is started on CPU1 and demands more memory. What will ESX do? Swap out to disk for the new VM, or swap out the remote memory to disk?”
Because memory demand is higher than the amount of available memory, default contention rules will be enforced. The VMkernel will look at resource entitlement to decide how much memory is backed by physical resources and how much memory will be ballooned, compressed or swapped.
I probably read that VROOM post, but I’ll review it again. Thanks for the response, Frank!
Hi
Not sure if that is what I meant to ask. Say VM1 (6GB) uses 4GB of memory on NUMA node 1 and, because NUMA node 1 was full, also 2GB of memory from NUMA node 2. Now VM2 is started and ESX has to decide where to get 2GB for VM2. Of course ballooning will kick in etc., but will it make a difference to ESX between the “remote” 2GB of VM1 and the possible use of the “local” 2GB for VM2?
Gabrie
@Gabe: Short answer is NO.
Long answer is: ESX uses “idle memory” to determine which VM to reclaim memory from. It will never use NUMA locality, as that says absolutely nothing about how often (if at all) the memory is currently accessed, and as such could lead to severe performance issues. You are trying to make a relationship between components that isn’t there.
Thanks a lot Frank
I have been reading all your posts and found them very very informative. Your articles have helped me a lot. Thanks a lot. I am from India.
Already ordered vSphere 4.1 HA & DRS Technical Deepdive from Amazon.
Thanks Frank for such a simplified explanation and thorough understanding of the vNUMA thing..
Regards
Satish Kumar
Thanks for clarifying this Frank; for some reason this topic feels like a double negative, and it’s difficult to know what to choose.