
Impact of oversized virtual machines part 3

January 3, 2011 by frankdenneman

In part 1 of this series on the impact of oversized virtual machines, NUMA architecture, memory overhead reservations and share levels were reviewed; part 2 zoomed in on the impact of memory overhead reservations and share levels on HA and DRS. This part looks at CPU scheduling, memory management and the impact oversized virtual machines have on the environment when a bootstorm occurs.
Multiprocessor virtual machine
In most cases, adding more CPUs to a virtual machine does not automatically increase the throughput of the application, because many workloads cannot take advantage of all the available CPUs. Sharing resources and scheduling these processes introduces additional overhead.
For example, a four-way virtual machine is not four times as productive as a single-CPU system. If the application is unable to scale, it will not benefit from the additional available resources.

Progress
Although relaxed co-scheduling reduces the requirement for the VMkernel to schedule all vCPUs of the virtual machine simultaneously, periodically scheduling the unused or idle vCPUs is still necessary to keep the progress of each vCPU in the virtual machine acceptably synchronized.

Esxtop also provides scheduling statistics for SMP virtual machines:

%CRUN: All VCPUs want to run at once. CRUN is the amount of time between when a PCPU is told to run a certain VCPU on an SMP VM and when it is actually able to run that VM. This should be almost 0.

%CSTOP: If a VCPU gets ahead of another VCPU of the same SMP VM, then we ask the faster VCPU to stop until the other one can catch up. The time spent in this stopped state is CSTOP.

Single thread application
Only applications with multiple threads that can be scheduled in parallel benefit from multiprocessor systems. A single-threaded application can only be scheduled on one CPU at a time and will not benefit from the multiple CPUs available. The guest OS can migrate the thread between the available CPUs, introducing unnecessary overhead such as interrupts, context switches and cache misses.

Timer interrupts
In older guest operating systems, unused virtual CPUs still take timer interrupts, which consumes a small amount of additional CPU. Please refer to the KB article “High CPU Utilization of Inactive Virtual Machines – KB1077”.

Configured memory
Oversizing the memory configuration of a virtual machine can impact the performance of the virtual machine itself or, even worse, impact the other active virtual machines on the host and in the cluster. Using memory reservations on oversized virtual machines makes matters go from bad to worse.

Application memory management
Excess memory is a problem when the application uses this memory opportunistically; in other words, the application is hoarding memory. Java, SAP and often Oracle workloads assume they can use all the memory they detect. Because ESX cannot determine which memory is important to the virtual machine, it always backs the virtual machine's memory pages with physical pages. Besides creating a large memory footprint on the physical level, these kinds of applications add a third level of memory management as well.

Due to this additional management level, the guest OS does not understand which pages are important and which are not. And because the guest OS isn't aware, it cannot return inactive pages to the balloon driver when requested, impacting the performance of the application during contention even more.

Setting a memory reservation at the virtual machine level guarantees the availability of physical memory and secures a certain level of application performance (if memory bound). However, setting memory reservations at the virtual machine level impacts the virtual infrastructure, and the larger the memory reservation, the larger the impact. Visit “Impact of memory reservation” for more info.

To avoid these effects, it is recommended to monitor the behavior of the application over time and to tune the configuration of the virtual machine and its reservation to get proper performance while limiting the impact of its configured memory and memory reservation.

NUMA node
If the virtual machines mentioned in the previous paragraph are configured with more memory than is available in their home NUMA node, the system needs to fetch the memory from remote NUMA nodes. Accessing memory on remote nodes introduces latency and generally reduces the throughput of the vCPU. ESX does not communicate any NUMA information to the guest OS, so both the guest OS and the application are unaware of the non-uniform latency characteristics of the underlying platform and are therefore unable to prioritize which memory they will use.

If the virtual machine uses all the available memory of a NUMA node, it leads to a higher degree of remote memory access for all the other active virtual machines scheduled on that node, resulting in higher memory latencies, lower throughput for those virtual machines and eventually an inter-node migration. For more information about NUMA nodes, please read the articles Sizing VMs and NUMA nodes and ESX 4.1 NUMA Scheduling.

Attempt to configure the virtual machine with less memory than is available in a single NUMA node.
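As a rough illustration, the sketch below (Python, using assumed host values, so treat the numbers as placeholders) checks whether a virtual machine's configured memory fits within a single NUMA node; anything larger than the node size forces remote memory access.

# Hypothetical sizing check: does the configured memory of a VM fit inside a
# single NUMA node? The host values below are assumptions for illustration.

HOST_MEMORY_GB = 128   # total physical memory of the host (assumed)
NUMA_NODES = 2         # number of NUMA nodes in the host (assumed)

node_size_gb = HOST_MEMORY_GB / NUMA_NODES   # memory local to one node

def fits_in_numa_node(vm_memory_gb):
    """Return True if the VM's configured memory fits in one NUMA node."""
    return vm_memory_gb <= node_size_gb

for vm_gb in (16, 48, 96):
    verdict = "fits locally" if fits_in_numa_node(vm_gb) else "forces remote memory access"
    print(f"{vm_gb} GB VM on a {node_size_gb:.0f} GB node: {verdict}")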

Swap file
During boot, a swap file is created that equals the virtual machine's configured memory minus the configured memory reservation. If no memory reservation is set, the virtual machine swap file (.vswap) equals the configured memory. Large virtual machines generate an additional requirement for storing these large swap files, reducing the consolidation ratio of virtual machines per VMFS datastore.
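A minimal sketch of that sizing rule, using hypothetical virtual machine names and sizes, shows how quickly oversized configurations consume datastore space through their swap files:

# Minimal sketch of the .vswap sizing rule described above:
# swap file size = configured memory - memory reservation.
# VM names and sizes are hypothetical.

vms = [
    # (name, configured memory GB, memory reservation GB)
    ("oversized-db", 64, 0),
    ("web-frontend", 8, 0),
    ("reserved-app", 32, 32),   # fully reserved: no swap file space needed
]

total_swap_gb = 0
for name, configured_gb, reservation_gb in vms:
    vswap_gb = configured_gb - reservation_gb
    total_swap_gb += vswap_gb
    print(f"{name}: .vswap = {vswap_gb} GB")

print(f"Total swap file footprint on the datastore: {total_swap_gb} GB")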

Bootstorms

A bootstorm is the occurrence of powering on a multitude of virtual machines simultaneously.

Virtual infrastructures running versions prior to ESX 4.1 can encounter memory contention when a bootstorm of virtual machines running Windows occurs. Windows checks how much memory is available to the OS by zeroing out the pages it detects. Transparent Page Sharing (TPS) will collapse these pages, but not immediately: TPS is a cycle-driven process that tries to make a pass over the virtual machine's memory within a timeframe of 3600 seconds, and the level of contention affects the speed of the TPS process. During a bootstorm, this zero-out behavior and the delayed TPS process can introduce contention. Usually this contention is short-lived. Unfortunately, during the startup phase of the guest OS the balloon driver is not yet loaded, and this situation can lead to compressing (10% of configured memory) and swapping useless data straight to disk.
ESXTOP will display swapped-out memory but, due to the nature of the data, will show little to no swap-in activity.
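As a back-of-the-envelope illustration (all numbers hypothetical), the sketch below estimates the transient memory demand a host faces during a bootstorm before TPS has had a chance to collapse the zeroed pages:

# Back-of-the-envelope sketch of the transient memory demand during a
# bootstorm, before TPS collapses the pages Windows zeroes at boot.
# All numbers are hypothetical.

VM_COUNT = 50
CONFIGURED_MEMORY_GB = 16   # oversized configuration per VM
HOST_MEMORY_GB = 256

# Each Windows VM temporarily touches close to its full configured memory
# while zeroing pages during boot.
touched_gb = VM_COUNT * CONFIGURED_MEMORY_GB
shortfall_gb = max(0, touched_gb - HOST_MEMORY_GB)

print(f"Memory touched during the bootstorm: {touched_gb} GB")
print(f"Shortfall the host must compress or swap until TPS catches up: {shortfall_gb} GB")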
ESX 4.1 uses a new technique called zero-page sharing. An in-depth post about this cool new technique will follow shortly.
End-note
This post concludes the three-part series about the impact of oversized virtual machines. The reason I wrote these articles is that I know many organizations still size their virtual machines based on assumed peak loads happening somewhere in the (late) future of that service or application. Many organizations are using the same policy or method they used for physical machines. The beauty of using virtual machines is the flexibility an organization has when it comes to determining the size of a machine during its lifecycle. Leverage these mechanisms and incorporate them in your service catalog and daily operations. Size the virtual machine according to its current or near-future workload.

Filed Under: CPU, Memory Tagged With: Bootstorms, NUMA, VMware

Node Interleaving: Enable or Disable?

December 28, 2010 by frankdenneman

There seems to be a lot of confusion about this BIOS setting; I receive lots of questions on whether to enable or disable node interleaving. I guess the term “enable” makes people think it is some sort of performance enhancement. Unfortunately the opposite is true, and it is strongly recommended to keep the default setting and leave node interleaving disabled.

Node interleaving option only on NUMA architectures
The node interleaving option exists on servers with a non-uniform memory access (NUMA) system architecture; the Intel Nehalem and AMD Opteron are both NUMA architectures. In a NUMA architecture multiple nodes exist. Each node contains a CPU and memory and is connected to the other nodes via a NUMA interconnect. A pCPU uses its onboard memory controller to access its own “local” memory and connects to the remaining “remote” memory via the interconnect. Because memory can exist in different locations, the system experiences “non-uniform” memory access times.

Node interleaving disabled equals NUMA
By using the default setting of node interleaving (disabled), the system builds a System Resource Allocation Table (SRAT). ESX uses the SRAT to understand which memory banks are local to which pCPU and tries* to allocate local memory to each vCPU of the virtual machine. By using local memory, the CPU can use its own memory controller, does not have to compete for access to the shared interconnect (bandwidth) and reduces the number of hops needed to access memory (latency).

* If the local memory is full, ESX resorts to storing memory on a remote node, because this is always faster than swapping it out to disk.

Node interleaving enabled equals UMA
If Node interleaving is enabled, no SRAT will be built by the system and ESX will be unaware of the underlying physical architecture.

ESX will treat the server as a uniform memory access (UMA) system and perceive the available memory as one contiguous area. This introduces the possibility of storing memory pages in remote memory, forcing the pCPU to transfer data over the NUMA interconnect each time the virtual machine wants to access that memory.

By leaving node interleaving disabled, ESX can use the System Resource Allocation Table to select the most optimal placement of memory pages for the virtual machines. Therefore it is recommended to leave this setting disabled, even though “disabled” sounds as if you are preventing the system from running more optimally.

Get notification of these blogs postings and more DRS and Storage DRS information by following me on Twitter: @frankdenneman

Filed Under: NUMA, VMware Tagged With: node interleaving, NUMA, VMware

Impact of oversized virtual machines part 2

December 17, 2010 by frankdenneman

In part 1 of this series on the impact of oversized virtual machines, NUMA architecture, memory overhead reservations and share levels were reviewed; part 2 zooms in on the impact of memory overhead reservations and share levels on HA and DRS.
Impact of memory overhead reservation on HA Slot size
The VMware High Availability admission control policy “Host failures cluster tolerates” calculates a slot size to determine the maximum number of virtual machines that can be active in the cluster without violating failover capacity. This admission control policy determines the HA cluster slot size from the largest CPU reservation and the largest memory reservation plus its memory overhead reservation. If the virtual machine with the largest reservation (which could be an appropriately sized reservation) is oversized, its memory overhead reservation can still substantially impact the slot size.
The HA admission control policy “Percentage of Cluster Resources Reserved” calculates the memory component of its mechanism by summing the reservation plus the memory overhead of each virtual machine, allowing the memory overhead reservation to have an even bigger impact on admission control than the calculation done by the “Host failures cluster tolerates” policy.
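To make the difference concrete, here is a hedged sketch of both calculations; the reservation and overhead figures are invented for illustration and the slot-size logic is simplified to the rules described above:

# Hedged sketch of the two admission control calculations described above.
# Reservation and memory overhead numbers are hypothetical.

vms = [
    # (CPU reservation MHz, memory reservation MB, memory overhead reservation MB)
    (0,    0,    180),   # right-sized VM, no reservation
    (2000, 8192, 420),   # oversized VM with a large memory reservation
    (500,  1024, 210),
]

# "Host failures cluster tolerates": the slot size is driven by the largest
# CPU reservation and the largest memory reservation plus its overhead.
cpu_slot_mhz = max(cpu for cpu, _, _ in vms)
mem_slot_mb = max(mem + ovh for _, mem, ovh in vms)
print(f"Slot size: {cpu_slot_mhz} MHz / {mem_slot_mb} MB")

# "Percentage of Cluster Resources Reserved": the memory component sums the
# reservation plus the memory overhead of every powered-on VM.
reserved_mem_mb = sum(mem + ovh for _, mem, ovh in vms)
print(f"Reserved memory counted by the percentage policy: {reserved_mem_mb} MB")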
DRS initial placement
DRS uses a worst-case scenario during initial placement. Because DRS cannot determine the resource demand of a virtual machine that is not running, it assumes that both the memory demand and the CPU demand are equal to the configured size. Oversizing virtual machines therefore decreases the options for finding a suitable host. If DRS cannot guarantee that the full 100% of the resources provisioned for this virtual machine can be used, it will vMotion other virtual machines away so that it can power on this single virtual machine. If not enough resources are available, DRS will not allow the virtual machine to be powered on.
Shares and resource pools
When placing a virtual machine inside a resource pool, its shares are relative to the other virtual machines (and resource pools) inside that pool. Shares are relative to all the other components sharing the same parent; an easier way to put it is to call it the sibling share level. Therefore the numeric share values are not directly comparable across pools, because they are children of different parents.

By default a resource pool is configured with a share amount equal to that of a 4 vCPU, 16GB virtual machine. As mentioned in part 1, shares are relative to the configured size of the virtual machine, implicitly stating that size equals priority.
Now let's take a look again at the image above. The three virtual machines are reparented to the cluster root, next to resource pool 1 and resource pool 2. Suppose they are all 4 vCPU, 16GB machines; their share values are interpreted in the context of the root pool and they will receive the same priority as resource pool 1 and resource pool 2. This is not only wrong, but also dangerous in a denial-of-service sense: a virtual machine running at the same level as resource pools can suddenly find itself entitled to nearly all cluster resources.
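A small sketch makes the share math explicit. It assumes the commonly documented Normal share defaults (1000 CPU shares per vCPU, 10 memory shares per MB) and the hypothetical layout above of two resource pools and three root-level virtual machines:

# Sketch of sibling share levels at the cluster root. Assumed Normal-level
# defaults: 1000 CPU shares per vCPU, 10 memory shares per MB.

def vm_shares(vcpus, memory_gb):
    """Return (cpu_shares, memory_shares) for a VM at the Normal share level."""
    return vcpus * 1000, memory_gb * 1024 * 10

# A default resource pool receives the same shares as a 4 vCPU / 16 GB VM.
pool_cpu, pool_mem = vm_shares(4, 16)

# Three 4 vCPU / 16 GB VMs reparented to the cluster root become siblings of
# the two resource pools, so each single VM gets the same slice of the
# cluster as an entire pool full of virtual machines.
vm_cpu, vm_mem = vm_shares(4, 16)
root_siblings_cpu = [pool_cpu, pool_cpu, vm_cpu, vm_cpu, vm_cpu]  # 2 pools + 3 VMs

share = vm_cpu / sum(root_siblings_cpu)
print(f"Each root-level VM: {vm_cpu} CPU shares ({share:.0%} of the root level)")
print(f"Each resource pool (with all its VMs): {pool_cpu} CPU shares, the same slice")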
Because of this default share distribution, we always recommend avoiding placing virtual machines at the same level as resource pools. Unfortunately, a virtual machine may end up reparented to the cluster root level when it is manually migrated using the GUI; the current workflow defaults to the cluster root level instead of the virtual machine's current resource pool. Because of this, it is recommended to increase the number of shares of the resource pool to reflect its priority level. More info about shares on resource pools can be found in Duncan's post on yellow-bricks.com.
Go to Part 3: Impact of oversized virtual machine.

Filed Under: DRS, VMware Tagged With: DRS, HA, memory overhead reservation, Shares, VMware

Enhanced vMotion Compatibility

December 14, 2010 by frankdenneman

Enhanced vMotion Compatibility (EVC) has been available for a while now, but it seems to be adopted slowly. Recently VMguru.nl featured an article, “Challenge: vCenter, EVC and dvSwitches”, which illustrates another case where the customer did not enable EVC when creating the cluster. There seems to be a lot of misunderstanding about EVC and the impact it has on the cluster when enabled.
What is EVC?
VMware Enhanced VMotion Compatibility (EVC) facilitates VMotion between different CPU generations through use of Intel Flex Migration and AMD-V Extended Migration technologies. When enabled for a cluster, EVC ensures that all CPUs within the cluster are VMotion compatible.
What is the benefit of EVC?
Because EVC allows you to migrate virtual machines between different generations of CPUs, you can mix older and newer server generations in the same cluster and still migrate virtual machines between these hosts with VMotion. This makes adding new hardware into your existing infrastructure easier and helps extend the value of your existing hosts.

EVC forces newer processors to behave like old processors

Well, this is not entirely true; EVC creates a baseline so that all the hosts in the cluster advertise the same feature set. The EVC baseline does not disable the features, but indicates that a specific feature is not available to the virtual machine.
Now it is crucial to understand that EVC only focuses on CPU features, such as SSE or AMD 3DNow! instructions, and not on CPU speed or cache levels. Hardware virtualization optimization features such as Intel VT FlexMigration or AMD-V Extended Migration and memory management unit virtualization such as Intel EPT or AMD RVI are still available to the VMkernel even when EVC is enabled. As mentioned before, EVC only focuses on the availability of features and instructions of the existing CPUs in the cluster, for example SIMD instructions such as the SSE instruction set.
Let's take a closer look. When selecting an EVC baseline, it applies the baseline feature set of the selected CPU generation and exposes only those features. When an ESX host joins the cluster, only those CPU instructions that are new and unique to that specific CPU generation are hidden from the virtual machines. For example, if the cluster is configured with an Intel Xeon Core i7 baseline, it makes the standard Intel Xeon Core 2 features plus SSE4.1, SSE4.2, POPCNT and RDTSCP available to all the virtual machines; when an ESX host with a Westmere (32nm) CPU joins the cluster, additional CPU instruction sets such as AES/AES-NI and PCLMULQDQ are suppressed.
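One way to observe this from inside a Linux guest is to check which instruction-set flags the virtual CPU advertises. The sketch below assumes a Linux guest and uses the standard /proc/cpuinfo flag names for the features mentioned above:

# Hedged sketch: inside a Linux guest, /proc/cpuinfo lists the instruction
# set extensions the (EVC-masked) virtual CPU exposes. The flag names are
# the standard Linux names for the features mentioned above.

INTERESTING_FLAGS = {
    "sse4_1": "SSE4.1",
    "sse4_2": "SSE4.2",
    "popcnt": "POPCNT",
    "rdtscp": "RDTSCP",
    "aes": "AES-NI",          # hidden under an Intel Xeon Core i7 baseline
    "pclmulqdq": "PCLMULQDQ", # hidden under an Intel Xeon Core i7 baseline
}

exposed = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            exposed = set(line.split(":", 1)[1].split())
            break

for flag, name in INTERESTING_FLAGS.items():
    state = "exposed" if flag in exposed else "hidden by the EVC baseline (or not supported)"
    print(f"{name:>10}: {state}")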

As mentioned in the various VMware KB articles, it is possible, but unlikely, that an application running in a virtual machine would benefit from these features, and that application performance would be lower as a result of using an EVC mode that does not include them.
DRS-FT integration and building block approach
When EVC is enabled in vSphere 4.1, DRS is able to select an appropriate ESX host for placing FT-enabled virtual machines and to load-balance these virtual machines, resulting in a more balanced cluster, which likely has a positive effect on the performance of the virtual machines. More info can be found in the article “DRS-FT integration”.
Equally interesting is the building block approach: by enabling EVC, architects can use predefined sets of hosts and resources and gradually expand the ESX clusters. Not every company buys compute power by the truckload; with EVC, clusters can grow by adding ESX hosts with newer processor versions.
One potential caveat is mixing hardware of different major generations in the same cluster. As Irfan Ahmad so eloquently put it, “not all MHz are created equal”: newer major generations offer better performance per CPU clock cycle, creating a situation where a virtual machine receiving 500 MHz on one ESX host is migrated to another ESX host where those 500 MHz are equivalent to only 300 MHz on the original host in terms of application-visible performance. This increases the complexity of troubleshooting performance problems.
Recommendations?
No performance loss is likely when enabling EVC. By enabling EVC, DRS-FT integration will be supported and organizations will be more flexible in expanding clusters over longer periods of time; therefore I recommend enabling EVC on clusters. But will it be a panacea for the stream of new major CPU generation releases? Unfortunately not! One possibility is to treat the newest hardware (major releases) as a higher tier of service than the older hardware and, because of this, create new clusters.

Filed Under: DRS Tagged With: EVC, VMware

HA and DRS Technical deepdive available

December 6, 2010 by frankdenneman

After spending almost a year on writing, drawing and editing, the moment Duncan and I have been waiting for finally arrived… Our new book, the vSphere 4.1 HA and DRS Technical Deepdive, is available on CreateSpace and Amazon.com.
Early this year Duncan approached me and asked if I was interested in writing a book together on HA and DRS; without hesitation I accepted the honor. Before discussing the contents of the book I would like to take the opportunity to thank our technical reviewers for their time, their wisdom and their input: Anne Holler (VMware DRS Engineering), Craig Risinger (VMware PSO), Marc Sevigny (VMware HA Engineering) and Bouke Groenescheij (Jume.nl). And a very special thanks to Scott Herold for writing the foreword!
But most of all I would like to thank Duncan for giving me the opportunity to work together with him on creating this book. The in-depth discussions we had are without a doubt the most difficult I have ever experienced and were very interesting, but most of all fun! Thanks!
Now let's take a look at the book. Please note that we are still working on an electronic version of the book and we expect to finish it early 2011. This is the description of the book that is up on CreateSpace:
About the authors:
Duncan Epping (VCDX 007) is a Consulting Architect working for VMware as part of the Cloud Practice. Duncan works primarily with Service Providers and large Enterprise customers. He is focused on designing Public Cloud Infrastructures and specializes in bc-dr, vCloud Director and VMware HA. Duncan is the owner of Yellow-Bricks.com, the leading VMware blog.
Frank Denneman (VCDX 029) is a Consulting Architect working for VMware as part of the Professional Services Organization. Frank works primarily with large Enterprise customers and Service Providers. He specializes in Resource Management, DRS and storage. Frank is the owner of frankdenneman.nl which has recently been voted number 6 worldwide on vsphere-land.com
VMware vSphere 4.1 HA and DRS Technical Deepdive zooms in on two key components of every VMware based infrastructure and is by no means a “how to” guide. It covers the basic steps needed to create a VMware HA and DRS cluster, but, more importantly, explains the concepts and mechanisms behind HA and DRS, which will enable you to make well-educated decisions. This book will take you into the trenches of HA and DRS and will give you the tools to understand and implement, for example, HA admission control policies, DRS resource pools and host affinity rules. On top of that, each section contains basic design principles that can be used for designing, implementing or improving VMware infrastructures.
Coverage includes:
• HA node types
• HA isolation detection and response
• HA admission control
• VM Monitoring
• HA and DRS integration
• DRS imbalance algorithm
• Resource Pools
• Impact of reservations and limits
• CPU Resource Scheduling
• Memory Scheduler
• DPM
vSphere 4.1 HA and DRS Technical deepdive cover
We hope you will enjoy reading it as much as we did writing it. Thanks,
Eric Sloof received a proof copy of the book and shot a video about it.

Filed Under: Miscellaneous Tagged With: VMware, vSphere 4.1 HA and DRS technical deepdive
