
Memory reclamation, when and how?

June 11, 2010 by frankdenneman

After discussing the performance problem presented by @heiner_hardt with Duncan, we dug into the exact moment the VMkernel decides which reclamation technique to use, and into the specific behaviors of those techniques. This article supplements Duncan’s article on Yellow-bricks.com.
Let’s begin with when the VMkernel decides to reclaim memory, and then look at how it reclaims it. Host physical memory is reclaimed based on four “free memory states”, each with a corresponding threshold. Based on the threshold, the VMkernel chooses which reclamation technique it will use to reclaim memory from virtual machines.

Free memory state   Threshold   Reclamation technique
High                6%          None
Soft                4%          Ballooning
Hard                2%          Ballooning and swapping
Low                 1%          Swapping

The High memory state has a threshold of 6%, which means that 6% of the ESX host physical memory minus the service console memory must be free. When the virtual machines use less than 94% of the host physical memory, the VMkernel does not reclaim memory because there is no need to. But when memory usage starts to fall towards the free memory threshold, the VMkernel will try to balloon memory. The VMkernel selects the virtual machines with the largest amounts of idle memory (detected by the idle memory tax process) and asks them to select their idle memory pages. To do this, the guest OS needs to swap those pages out, so if the guest is not configured with sufficient swap space, ballooning can become problematic. Linux behaves particularly badly in this situation: when its swap space is full, it invokes the OOM (out-of-memory) killer and starts to randomly kill processes.
Back to the VMkernel: in the High and Soft states, ballooning is favored over swapping. If the ESX server cannot reclaim enough memory by ballooning before it reaches the Hard state, it turns to swapping. Swapping has proven to be a sure thing within a limited amount of time. Unlike the balloon driver, which tries to understand the needs of the virtual machine and lets the guest decide whether and what to swap, the swap mechanism simply picks pages at random from the virtual machine. This impacts the performance of the virtual machine, but it helps the VMkernel to survive.
The fun thing is that even before the VMkernel detects that free memory is reaching the Soft threshold, it starts to request pages through the balloon driver (vmmemctl), because it takes time for the guest OS to respond to the vmmemctl driver with suitable pages. By starting early, the VMkernel tries to avoid reaching the Soft state or worse. So you can sometimes see ballooning occurring before the Soft state is reached (between 6% and 4% free memory).
One exception is the virtual machine memory limit: if a limit is set on the virtual machine, the VMkernel always tries to balloon or swap pages of the virtual machine after it reaches its limit, even if the ESX host has enough free memory available.
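
To make the thresholds concrete, here is a minimal sketch of the selection logic, assuming only the percentages from the table above; it is an illustration, not actual VMkernel code.

```python
# Minimal sketch of the free-memory-state logic described above.
# Assumption: thresholds are percentages of host physical memory
# (minus service console memory); this is not actual VMkernel code.
def reclamation_technique(free_pct: float) -> str:
    """Map the percentage of free host memory to a reclamation technique."""
    if free_pct > 6.0:
        return "none"                     # High state
    if free_pct > 4.0:
        return "ballooning"               # Soft state
    if free_pct > 2.0:
        return "ballooning and swapping"  # Hard state
    return "swapping"                     # Low state

# Note: in practice ballooning can already start slightly above the Soft
# threshold, because the VMkernel requests pages ahead of time.
```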

Filed Under: Memory Tagged With: ballooning, VMware

Reservations and CPU scheduling

June 8, 2010 by frankdenneman

Most of my resource management articles focus more on the behavior of memory management than on CPU management, mainly because the memory scheduler within ESX is such an interesting, complex system: it comprises memory allocation, swapping and reclamation, with algorithms such as idle memory tax and mechanisms like ballooning and swapping. But lately CPU scheduling has been attracting more and more of my attention. The discussion Duncan and I had prior to posting his article about CPU limits sparked my interest in how CPU scheduling works when setting reservations. So, in addition to Duncan’s excellent article, I want to take a closer look at how the ESX CPU scheduler handles CPU reservations and shares, and show why CPU scheduling is fairer than memory management.
Similar to memory, the resource allocation settings (reservations, shares and limits) can be set at the CPU level. Limits and shares behave similarly for CPU and memory; reservations act differently. Let’s take a quick look at the resource allocation settings:

Shares: Shares indicate the proportional value of the entity on the same hierarchical level. If everything else is equal (reservations, limits and active utilization), a virtual machine that is allocated twice as many shares as another virtual machine is entitled to consume twice as many CPU cycles.

Limit: A limit is a mechanism to restrict physical resource usage of the virtual machine. A limit ensures that the VM will never receive more CPU cycles than specified, even if extra cycles are available on the host.

Reservation: A reservation is a guarantee of the specified amount of physical resources, regardless of the total number of shares in its environment.

Now, reservations act differently when set on CPU than when set on memory. When a virtual machine does not use its reserved CPU cycles, these cycles are redistributed to other active virtual machines, so unused reservations are not wasted. This is contrary to memory management, where the memory is not reclaimed by the scheduler once the virtual machine has touched the pages.
By redistributing available CPU cycles and not letting virtual machines hoard CPU resources, the VMkernel tries to divide the resources properly, achieving better fairness among virtual machines and improving utilization of the resources. To achieve both goals and divide the CPU resources among virtual machines, the CPU scheduler calculates a MHzPerShare metric. This metric identifies which virtual machines are “ahead” of their entitlement and which virtual machines are “behind” and do not fully utilize their entitlement.

MHzPerShare = MHzUsed / Shares

MHzUsed is the current utilization of the virtual machine, measured in megahertz.
Shares is the currently configured number of shares of the virtual machine.
For example, if a virtual machine is using 2500 MHz and has 1000 shares, its MHzPerShare value is 2.5. The VMkernel calculates the MHzPerShare number of each active virtual machine, and the virtual machine with the lowest MHzPerShare value has the highest priority to run on the CPU. If the virtual machine with the lowest MHzPerShare value decides not to use its right to allocate the cycles, the cycles can be used by the virtual machine with the next lowest MHzPerShare value.
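
As a quick illustration, here is a small sketch of the metric and the resulting priority order; the VM names and utilization numbers are made up for the example.

```python
# Sketch: computing MHzPerShare and ranking VMs by scheduling priority.
# The VM names and numbers below are hypothetical.
vms = {
    "VM1": {"mhz_used": 2500, "shares": 1000},  # MHzPerShare = 2.5
    "VM2": {"mhz_used": 1000, "shares": 2000},  # MHzPerShare = 0.5
}

def mhz_per_share(vm: dict) -> float:
    return vm["mhz_used"] / vm["shares"]

# Lowest MHzPerShare first: that VM is furthest behind its entitlement
# and gets the highest priority for the next available cycles.
priority = sorted(vms, key=lambda name: mhz_per_share(vms[name]))
print(priority)  # ['VM2', 'VM1']
```
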
[Figure: ESX CPU scheduler MHzPerShare distribution]
Although not shown in the formula, reservations play an important part in this calculation. As mentioned before, reservations overrule shares and guarantee the amount of physical resources regardless of the number of shares. This means that a virtual machine can always use the CPU cycles specified in its reservation, even if it has a greater MHzPerShare value. So how exactly do reservations and shares interact with each other when it comes to calculating the MHzPerShare value?
For example:
In a 6 GHz system, one virtual machine is running and two more are powered on. VM1 runs a memory-intensive app and doesn’t care much about CPU cycles; it is configured with 1000 CPU shares and no reservation. The two other virtual machines run CPU-intensive apps and are currently competing for resources. VM2 has a reservation of 2250 MHz and the default share setting of 1000 shares. The other CPU-intensive virtual machine, VM3, is equipped with two vCPUs and therefore receives 2000 shares, but the administrator didn’t set any reservation.
Now VM1 is running at 500 MHz; with its 1000 shares, its MHzPerShare value equals 0.5. Because VM2 is in need of CPU cycles, it immediately utilizes its reservation and “occupies” all 2250 MHz, so its MHzPerShare value equals 2.25 (2250/1000).
[Figure: ESX CPU scheduler free MHz distribution]
Because VM3 doesn’t have a reservation and is in need of CPU cycles, the VMkernel looks at its MHzPerShare value to decide how many CPU cycles it can use before distributing excess CPU cycles to other virtual machines. The kernel will distribute cycles to VM3 until it reaches the same MHzPerShare value as VM2, which is 2.25. In theory this means that the VMkernel will allocate 2000 x 2.25 = 4500 MHz before looking at another VM. Because the CPU scheduler has already allocated 500 MHz to VM1 and 2250 MHz to VM2 of the available 6 GHz, it can allocate 3250 MHz to VM3.
[Figure: ESX CPU scheduler MHzPerShare value]
Because VM2 has a reservation, it can allocate up to its reservation even though VM3 initially has a lower MHzPerShare value (0) and the CPU cycle requirements of VM1 are met at 500 MHz. However, due to the fairness principle, VM2’s own MHzPerShare value influences the VMkernel’s decision about how many cycles to allocate to VM3 before considering allocating additional cycles to VM2 again.
Now suppose that for some reason the application in VM3 levels out at 2000 MHz, VM1 is still using 500 MHz, and VM2 is in desperate need of extra CPU cycles. No settings are changed, so VM1 and VM2 have 1000 shares each, VM2 has a reservation of 2250 MHz, and VM3 has 2000 shares and no reservation.
The VMkernel satisfies the request of VM1, resulting in a MHzPerShare value of 0.5. VM2 claims its reservation and utilizes 2250 MHz, resulting in a MHzPerShare value of 2.25. VM3 could allocate up to 4500 MHz before reaching the MHzPerShare value of VM2, but it stops consuming above 2000 MHz, ending up with a MHzPerShare value of 2000/2000 = 1. This means that inside the 6 GHz host, 1250 MHz remains available.
The CPU scheduler will shop around with these available cycles and see which VM is interested. The VMkernel offers the cycles to the virtual machines in increasing order of MHzPerShare, so first it asks VM1 (0.5); because VM1’s CPU request is already satisfied, it forfeits its claim. VM3 (1.0) also forfeits, since its application levels out at 2000 MHz. So VM2 happily accepts the remaining cycles, and its resource usage increases to 3500 MHz.
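
A rough way to check the arithmetic of this scenario is the greedy sketch below. It is a simplification that reproduces the end state of the example (the real scheduler works incrementally), and the demand figures are assumptions based on the story above.

```python
# Simplified greedy approximation of the scenario above; not the real
# ESX scheduler. Demands are assumptions: VM2 wants as much as it can get.
HOST_MHZ = 6000
vms = {
    "VM1": {"demand": 500,  "shares": 1000, "reservation": 0},
    "VM2": {"demand": 6000, "shares": 1000, "reservation": 2250},
    "VM3": {"demand": 2000, "shares": 2000, "reservation": 0},
}

# Step 1: reservations are honored first.
alloc = {n: min(v["demand"], v["reservation"]) for n, v in vms.items()}
free = HOST_MHZ - sum(alloc.values())

# Step 2: hand out the remaining MHz in increasing order of MHzPerShare,
# capping each VM at its demand.
for name in sorted(vms, key=lambda n: alloc[n] / vms[n]["shares"]):
    take = min(vms[name]["demand"] - alloc[name], free)
    alloc[name] += take
    free -= take

print(alloc)  # {'VM1': 500, 'VM2': 3500, 'VM3': 2000}
```
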
So there you have it: shares and reservations interact, or even battle, with each other when allocating CPU cycles to virtual machines. Shares are perceived by many as an inferior resource allocation setting; hopefully this demonstrates their power. In combination with utilization, shares can become a very important factor in ESX resource management.

Filed Under: CPU, Memory Tagged With: CPU scheduling, MHzperShare, VMware

Virtual Machine memory overhead

May 31, 2010 by frankdenneman

Every virtual machine running on an ESX host consumes some memory overhead in addition to the current usage of its configured memory. This extra space is needed by ESX for internal VMkernel data structures, such as the virtual machine frame buffer and the mapping table for memory translation (mapping virtual machine physical memory to machine memory). Two kinds of virtual machine overhead exist:
Static overhead
Static overhead is the minimum overhead that is required to start the virtual machine. DRS and the VMkernel use this metric for admission control and VMotion calculations. The destination host must be able to back the virtual machine reservation plus the static overhead, otherwise the VMotion will fail.
Dynamic overhead
Once the virtual machine has started up, the virtual machine monitor (VMM) can request additional memory space. The VMM will request the space, but the VMkernel is not required to supply it. If the VMM does not obtain the extra memory space, the virtual machine will continue to function, but this can lead to performance degradation. The VMkernel treats virtual machine overhead reservation the same as VM-level memory reservation, and it will not reclaim this memory once it has been used.
Overhead memory used in admission control
As mentioned before, DRS and the VMkernel will not allow a virtual machine to be powered on if its reservations cannot be guaranteed. This means that the effective memory reservation for a virtual machine is the user-configured memory reservation (VM-level reservation) plus the overhead reservation.
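
In other words, admission control checks against the sum of the two reservations. A minimal sketch, with made-up example numbers:

```python
# Sketch of the admission control arithmetic described above.
# The example numbers (1024 MB reservation, 100 MB overhead) are hypothetical.
def effective_reservation_mb(vm_reservation_mb: int, overhead_mb: int) -> int:
    """The amount of memory the host must be able to guarantee at power-on."""
    return vm_reservation_mb + overhead_mb

print(effective_reservation_mb(1024, 100))  # 1124 MB must be unreserved
```
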
Resource pool memory reservations
This means that during the design phase of a resource pool, the memory overhead of a virtual machine must be included in the calculation of the memory reservation specified on the resource pool. The behavior of dynamic overhead must also be taken into account.
Table 3.2 of the vSphere Resource Management Guide lists the overhead memory of virtual machines: VMware vSphere Online Library – Table 3.2 overhead memory
Please be aware that memory overheads grow with each new release of ESX, so keep this in mind when upgrading to a new version: verify the documentation of the virtual machine memory overhead and check the specified memory reservation on the resource pool.

Filed Under: Memory Tagged With: memory overhead, VMware

Re: Swapping

May 26, 2010 by frankdenneman

Recently we had a discussion about swapping. As Duncan mentioned in his article “Swapping”, swapped memory might not have an impact on the performance of the virtual machine.
There are scenarios in which pages can be swapped out without causing performance problems. One common scenario is a bootstorm, i.e. the startup of many virtual machines at once. Bootstorms can happen when a host failure occurs and High Availability powers on the virtual machines on other hosts, but they are also frequently encountered in Windows shops after Patch Tuesday, when the operations team needs to stay within a limited maintenance window.
When a virtual machine’s guest OS starts, there is a period of time before VMware Tools is loaded and the vmmemctl (balloon driver) is operational. During this time the operating system can access a large portion of its configured memory. Windows systems are notorious for this, as they tend to touch every page up to the end of their configured memory. Unfortunately, page sharing through Transparent Page Sharing (TPS) is also at a minimum: redundant memory pages are not collapsed immediately when a virtual machine is started. TPS is a VMkernel background process that uses a cycle of 60 minutes (Mem.ShareScanTime) to scan a virtual machine for page sharing opportunities.
During a bootstorm, many virtual machines are powered on at the same time, all claiming lots of memory or even their maximum configured memory (Windows). This behavior leads to a spike in memory usage, and without the help of the balloon driver and TPS, the ESX host needs to resort to swapping out memory.
During Windows startup, Windows touches every page, which forces ESX to back all of them with machine memory (physical memory). These pages are filled with useless information, and chances are they will never be accessed by the virtual machine again. ESX does not proactively swap memory back in when the memory pressure disappears: the pages remain swapped out until they are accessed by the virtual machine, at which point ESX swaps them back in.
Swapping during the bootstorm will delay the boot process, but these swapped out pages will not cause any performance problems during normal operation.
As mentioned in Duncan’s “Swapping” article, there are a few metrics that indicate whether a virtual machine is swapping or has swapped before. When encountering swapped memory, check the metrics SWCUR (swap current) and SWTGT (swap target). If a bootstorm occurred, SWCUR is likely to be higher than SWTGT.
SWTGT indicates the desired amount of memory to be swapped out, which ESX determines from the resource entitlement calculation of the virtual machine. If there is no memory pressure, the swap target equals 0, but because pages remain in the swap file until accessed, SWCUR will still show the remaining swapped-out pages.
If memory contention does occur, ESX will attempt to make SWCUR equal to SWTGT (the swap target).
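
As a rule-of-thumb sketch (my assumption, not an official VMware diagnostic), the two metrics can be interpreted like this:

```python
# Hedged heuristic for interpreting the swap metrics discussed above.
def interpret_swap(swcur_mb: float, swtgt_mb: float) -> str:
    if swcur_mb == 0:
        return "No swapped memory."
    if swtgt_mb == 0:
        # Pages were swapped out earlier (e.g. during a bootstorm) and have
        # simply not been touched since; no current memory pressure.
        return "Leftover swap from past pressure; likely harmless."
    # A non-zero target means ESX currently wants memory swapped out.
    return "Active memory contention; ESX is driving SWCUR towards SWTGT."
```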

Filed Under: Memory Tagged With: Memory Swapping, SWCUR, SWTGT, VMware

Resource pools memory reservations

May 18, 2010 by frankdenneman

After publishing the article “impact of memory reservations” I received a lot of questions about setting memory reservations at the resource pool level. It seems there are several aspects of resource pools and memory reservations that are often misunderstood.
Because reservations are used by the VMkernel and DRS resource schedulers as well as (HA) admission control, the behavior of reservations can be very confusing. Before addressing memory reservations on resource pools, let’s look at which mechanisms use reservations and when reservations are used.
When are reservations actually used besides admission control?
If a cluster is under-committed, a VM’s resource entitlement is the same as its demand; in other words, the VM is allocated whatever it wants to consume within its configured limit.
When a cluster is overcommitted, the cluster experiences more resource demand than its current capacity. At this point DRS and the VMkernel allocate resources based on the resource entitlement of the virtual machine. Resource entitlement is covered later in this article.
Is there any difference between resource pool level and virtual machine level memory reservation?
To keep it short: VM-level reservations can be rather evil. A VM-level reservation hoards memory once it has been used by the virtual machine. Even if the virtual machine becomes idle, the VMkernel will not reclaim this memory and return it to the free memory set. This means that ESX can start swapping and ballooning when no free memory is available for other virtual machines, while the owning VMs aren’t even using their claimed reserved memory. It also influences the slot size of High Availability; for more information about HA slot sizes, please visit the HA deep dive page at yellow-bricks.com. For more information about virtual machine level memory reservations, please read the article “impact of memory management“.
Behavior of resource pool memory reservation
Now, setting a memory reservation at the resource pool level has its own weaknesses, but it is much fairer and more in line with the whole idea of consolidation and sharing than virtual machine memory reservations. RP-level reservations are immediately active but are not claimed: they only subtract the specified amount of memory from the unreserved capacity of the cluster.
RP reservations are used when children of the resource pool use memory and the system is under contention. Reservations are not wasted, and the resources can be used by other virtual machines. Be aware that using and reserving are two distinct concepts! Virtual machines can use a resource, but they cannot also reserve it if it is already reserved by another item.
Resource pool memory reservations work almost like CPU reservations: they won’t let any resources go to waste. And to top it off, resource pool reservations don’t flow down to virtual machines, so they do not influence HA slot sizes. Unfortunately, this can lead to (temporary) performance loss if a host failover occurs: when virtual machines are restarted by HA, they are not restarted in the correct resource pool but in the root resource pool, which can lead to starvation. Until DRS is invoked, the virtual machines have to do without any memory reservations.
How to use resource pool memory reservation?
Ok, so two popular strategies exist when it comes to setting memory reservations at the resource pool level:
1. CPU and memory reservations within the resource pool are never overcommitted, i.e. the configured memory of all VMs (40 GB) equals the reservation (40 GB).
2. A percentage of cluster resources is reserved, i.e. the resource pool memory reservation (20 GB) is less than the configured memory of the virtual machines inside the RP (40 GB).
The process of divvying up the reservation is rather straightforward if the memory reservation equals the configured memory of the virtual machines inside the resource pool: all pages touched by the virtual machines are backed by machine pages, and each resource entitlement is at least as large as its memory reservation.
What I find more interesting is what happens if the resource pool is configured with a memory reservation that is less than the combined configured memory of its virtual machines. In that case DRS divvies up the memory reservation based on the virtual machines’ resource entitlement.
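
As a toy illustration (the entitlement numbers are made up; the real computation is described below), dividing a 20 GB pool reservation in proportion to entitlement could look like this:

```python
# Hypothetical sketch: dividing a 20 GB resource pool reservation among its
# children in proportion to their resource entitlements (made-up numbers).
rp_reservation_gb = 20
entitlement_gb = {"VM1": 8, "VM2": 4, "VM3": 4}

total = sum(entitlement_gb.values())
divvy = {vm: rp_reservation_gb * e / total for vm, e in entitlement_gb.items()}
print(divvy)  # {'VM1': 10.0, 'VM2': 5.0, 'VM3': 5.0}
```
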
So how is resource entitlement calculated?
A virtual machine’s resource entitlement is based on various statistics and some estimation techniques. DRS computes a resource entitlement for each virtual machine based on the virtual machine’s and resource pool’s configured shares, reservations and limits, as well as the current demands of the virtual machines and resource pools, the memory size, the working set and the degree of current resource contention.
By setting a reservation at the resource pool level, the virtual machines that are actively using memory profit the most from this mechanism. Basically, if no reservation is set at the VM level, the RP reservation is granted to all virtual machines inside the resource pool that are actively using memory.
DRS and the VMkernel calculate the resource pool and virtual machine share levels and use these to determine the virtual machines’ priority. Please read the article “the resource pool priority-pie paradox” for more information about share levels.
Besides the share level, both the active utilization (working set) and the configured memory size are taken into account when calculating the resource entitlement. Virtual machines that are idling aren’t competing for resources, so they won’t get any new resources. If the memory is also idle, the allocation gets adjusted by the idle memory tax. Idle memory tax uses a progressive tax rate: the more idle memory a VM has, the more tax it generates, which is why the configured memory size is also taken into account. (Nice ammo if your customer wants to configure the DHCP server with 64 GB of memory!)
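
To show roughly how a progressive idle memory tax shifts entitlement, here is a simplified sketch. It follows the general shape of the published ESX idle-memory-tax model (idle pages are charged at a higher rate than active ones), but the function name and the 75% tax rate used here are illustrative assumptions.

```python
# Simplified sketch of a progressive idle memory tax, loosely following the
# shares-per-page idea from ESX memory management; not the exact VMkernel math.
def effective_shares(shares: int, active_fraction: float,
                     tax_rate: float = 0.75) -> float:
    """Idle pages cost more than active ones, so a VM hoarding idle memory
    ends up with fewer effective shares per page when competing."""
    idle_fraction = 1.0 - active_fraction
    idle_cost = 1.0 / (1.0 - tax_rate)   # each idle page counts 4x at 75% tax
    return shares / (active_fraction + idle_cost * idle_fraction)

# A mostly idle VM loses ground against an active one with equal shares:
print(effective_shares(1000, active_fraction=0.9))  # ~769
print(effective_shares(1000, active_fraction=0.2))  # ~294
```
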
When we create a “Diva” VM (a term coined by Craig Risinger), that is, a VM with VM-level reservations, this allocation setting is passed to the VMkernel. The VMkernel subtracts the specified amount from the reservation pool of the RP and will not share it with others; the Diva VM is a special creature.
As stated above, RP memory reservations flow more freely than VM-level reservations; they do not claim or hoard memory. So when setting a resource pool reservation, reservations are just one part of the computation of the virtual machines’ resource entitlement. When the host is overcommitted, the memory usage of each virtual machine is either above or below its resource entitlement. If the memory usage exceeds the resource entitlement, memory is ballooned or swapped from the virtual machine until it is at or below its entitlement.
Disclosure
Now before you think I fabricated this article all by myself: I am happy to admit that I’m in the lucky position to work for VMware and to call some of the world’s brightest minds my colleagues. Kit Colbert, Carl Waldspurger and Chirag Bhatt took the time to explain this theory very thoroughly to me. Luckily, my colleagues and good friends Duncan Epping and Craig Risinger helped me decipher some out-of-this-world emails from the crew above and participated in some excellent discussions.

Filed Under: DRS, Memory Tagged With: memory reservation, resource pool, VMware

