
vSphere 7 DRS Scalable Shares Deep Dive

May 27, 2020 by frankdenneman

You are one tickbox away from completely overhauling the way you look at resource pools. Yes, you can still use them as folders (sigh), but with the newly introduced Scalable Shares option in vSphere 7, you can turn resource pools into more or less Quality of Service classes. Sounds interesting, right? Let’s first take a look at the traditional behavior of a resource pool, the challenges it introduced, and how this new method of resource distribution works. To understand that, we first have to look at the basics of how DRS distributes unreserved resources.

Compute Resource Distribution

A cluster is the root of the resource pool tree. The cluster embodies the collection of consumable resources of all the ESXi hosts in the cluster. Let’s use the example of a small cluster of two hosts. After overhead reduction, each host provides 50 GHz of CPU and 50 GB of memory. As a result, the cluster offers 100 GHz and 100 GB of memory for consumption.

A resource pool provides an additional level of abstraction, allowing the admin to manage pools of resources instead of micro-managing each VM or vSphere Pod individually. A resource pool is a child object of a cluster. In this scenario, two resource pools exist: a resource pool (RP) with the name HighShares and an RP with the name NormalShares.

The HighShares RP is configured with a high CPU shares level and a high memory shares level; the NormalShares RP is configured with a normal CPU shares level and a normal memory shares level. As a result, the HighShares RP receives 8000 CPU shares and 327680 memory shares, while the NormalShares RP receives 4000 CPU shares and 163840 memory shares. This creates a 2:1 ratio between the two RPs.

In this example, eight VMs, each with two vCPUs and 32 GB of memory, are placed in the cluster: six in the HighShares RP and two in the NormalShares RP. If contention occurs, the cluster awards 2/3 of the cluster resources to the HighShares RP and 1/3 to the NormalShares RP. The next step for each RP is to divide the awarded resources among its child objects, which can be another level of resource pools or workload objects such as VMs and vSphere Pods. If all VMs are 100% active, the HighShares RP is entitled to 66 GHz and 66 GB of memory, while the NormalShares RP gets 33 GHz and 33 GB of memory.

And this is perfect, because the distribution of resources follows the desired ratio “described” by the number of shares. However, it doesn’t capture the actual intent of the user. Many customers use resource pools to declare the relative priority of workloads compared to the workloads in the other RPs, meaning that every VM in the HighShares RP should be twice as important as the VMs in the NormalShares RP. The traditional behavior does not work that way, as it simply passes along the awarded resources.

In our example, each of the six VMs in the HighShares RP gets 1/6 of 2/3 of the cluster resources. In other words, 16% of 66 GHz & 66 GB = ~11 GHz & ~11 GB, while each of the two VMs in the NormalShares RP gets 1/2 of 1/3 of the cluster resources: 50% of 33 GHz & 33 GB = ~16 GHz and ~16 GB. In essence, the lower-priority RP provides more resources per individual workload. This phenomenon is called the priority pie paradox.
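To make the arithmetic concrete, here is a minimal Python sketch (not DRS code) that reproduces the worst-case split above, assuming all VMs are 100% active and all share values are left at their defaults:

```python
# Minimal sketch: worst-case resource split with classic, static RP shares.
# Numbers match the two-host example above.
cluster = {"cpu_ghz": 100, "mem_gb": 100}

# RP name -> CPU shares, memory shares, number of identical VMs inside
pools = {
    "HighShares":   {"cpu_shares": 8000, "mem_shares": 327680, "vms": 6},
    "NormalShares": {"cpu_shares": 4000, "mem_shares": 163840, "vms": 2},
}

total_cpu = sum(p["cpu_shares"] for p in pools.values())
total_mem = sum(p["mem_shares"] for p in pools.values())

for name, p in pools.items():
    rp_cpu = cluster["cpu_ghz"] * p["cpu_shares"] / total_cpu
    rp_mem = cluster["mem_gb"] * p["mem_shares"] / total_mem
    # Inside the RP, siblings with equal shares split the pool evenly.
    print(f"{name}: RP {rp_cpu:.1f} GHz / {rp_mem:.1f} GB, "
          f"per VM {rp_cpu / p['vms']:.1f} GHz / {rp_mem / p['vms']:.1f} GB")

# Output: HighShares VMs get ~11 GHz / ~11 GB each, NormalShares VMs
# ~16.7 GHz / ~16.7 GB each -- the priority pie paradox.
```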

Scalable Shares
To solve this problem and align resource pool sizing more with the intent of many of our customers, we need a new method: a technique that auto-scales the shares of an RP to reflect the workloads deployed inside it. Nice for VMs, necessary for high-churn containerized workloads. (See vSphere Supervisor Namespace for more information about vSphere Pods and vSphere namespaces.) This new functionality is included in vSphere 7 and is called Scalable Shares. (Nice backstory: the initial idea was developed by Duncan Epping and me, not on the back of a napkin, but on some in-flight magazine found on the plane on our way to Palo Alto back in 2012. It felt like a tremendous honor to receive a patent award for it. It’s even more rewarding to see people rave about the new functionality.)

Enable Scalable Shares
Scalable shares functionality can be enabled at the cluster level and the individual resource pool level.

It’s easier to enable it at the cluster level, as each child RP automatically inherits the scalable shares functionality. You can also leave it “unticked” at the cluster level and enable scalable shares on each individual resource pool. In that case, the share value of each RP inside that specific resource pool is automatically adjusted. Setting it at this level is mainly intended for service providers, as they want to carve up the cluster at the top level and assign static portions to customers while providing a self-service IaaS layer beneath it.

When enabling scalable shares at the cluster level, nothing visible really happens. The UI shows that the functionality is enabled, but it does not automatically change the depicted share values. These remain the static values derived from the share level setting (High/Normal/Low).

We have to trust the system to do its thing. And typically, that’s what you want anyway; we don’t expect you to keep staring at dynamically changing share values. But to prove it works, it would be nice if we could see what happens under the covers. And you can, although this is not something we expect you to do during normal operations. To get the share values, you can use the vSphere Managed Object Browser (MOB). William (of course, who else) has written extensively about the MOB. Please remember that it’s disabled by default, so follow William’s guidance on how to enable it.

To make the scenario easy to follow, I grouped the VMs of each RP on a separate host. The six VMs deployed in the HighShares RP run on host ESXi01; the two VMs deployed in the NormalShares RP run on host ESXi02. I did this because when you create a resource pool tree on a cluster, the RP tree is copied to the individual hosts inside the cluster, but only the RPs associated with VMs running on that particular host. Therefore, when reviewing the resource pool tree on ESXi01, we will only see the HighShares RP, and when we look at the resource pool tree of ESXi02, it will only show the NormalShares RP. To view the RP tree of a host, open up a browser, ensure the MOB is enabled, and go to

https://<ESXi-name-or-ipaddress>/mob/?moid=ha%2droot%2dpool&doPath=childConfiguration

Thanks to William for tracking this path for me. When reviewing ESXi01 before enabling scalable shares, we see the following:

  • ManagedObjectReference:ResourcePool: pool0 (HighShares)
  • CpuAllocation: Share value 8000
  • MemoryAllocation: Share value: 327680

I cropped the image for ESXi02, but here we can see the NormalShares RP defaults:

  • ManagedObjectReference:ResourcePool: pool1 (NormalShares)
  • CpuAllocation: Share value 4000
  • MemoryAllocation: Share value: 163840

Resource Pool Default Shares Value

If you wonder how these numbers are chosen: an RP is internally sized as a 4 vCPU, 16 GB virtual machine. With the normal setting (the default), you get 1000 CPU shares for each vCPU and ten memory shares for each MB (16384 x 10). The high share setting awards 2000 shares for each vCPU and twenty memory shares for each MB. The low share setting leaves you with 500 shares per vCPU and five memory shares for each MB.
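As a quick sanity check, the defaults above can be expressed in a few lines of Python (a sketch of the documented values, not of any vSphere API):

```python
# Sketch of the default share values described above. An RP is internally
# sized like a 4 vCPU / 16 GB VM; the share level sets the multiplier.
CPU_SHARES_PER_VCPU = {"low": 500, "normal": 1000, "high": 2000}
MEM_SHARES_PER_MB   = {"low": 5,   "normal": 10,   "high": 20}

def default_rp_shares(level, vcpus=4, mem_mb=16 * 1024):
    return (vcpus * CPU_SHARES_PER_VCPU[level],
            mem_mb * MEM_SHARES_PER_MB[level])

print(default_rp_shares("high"))    # (8000, 327680)  -> HighShares RP
print(default_rp_shares("normal"))  # (4000, 163840)  -> NormalShares RP
```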

When enabled, we can see that scalable shares has done its magic. The share value of HighShares is now 24000 for CPU and 3932160 for memory. How is this calculated?

  1. Each VM is set to normal share value.
  2. Each VM has 2 vCPUs ( 2 x 1000 shares = 2000 CPU shares)
  3. Each VM has 32 GB of memory = 327680 shares.
  4. There are six VMs inside the RP, and they all run on ESXi01:
  5. Sum of CPU shares active in RP: 2000 + 2000 + 2000 + 2000 + 2000 + 2000 = 12000
  6. Sum of Memory shares active in RP: 327680 + 327680 + 327680 + 327680 + 327680 + 327680 = 1966080
  7. The result is multiplied by the ratio defined by the share level of the resource pools.

The ratio between the three values (High:Normal:Low) is 4:2:1. That means that the ratio between high and normal is 2:1; thus, the HighShares RP is awarded 12000 x 2 = 24000 CPU shares and 1966080 x 2 = 3932160 memory shares.
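The following Python sketch models this calculation. It is a simplified model that happens to reproduce the numbers in these scenarios (summing the child shares and multiplying by the share-level weight relative to the lowest level among the sibling RPs); the actual DRS implementation may differ:

```python
# Assumption: simplified model of scalable shares, not the real DRS code.
# RP share value = sum of child shares * (RP level weight / lowest sibling weight),
# with High:Normal:Low = 4:2:1.
LEVEL_WEIGHT = {"low": 1, "normal": 2, "high": 4}

def scalable_rp_shares(rp_level, child_shares, sibling_levels):
    base = min(LEVEL_WEIGHT[lvl] for lvl in sibling_levels)
    return sum(child_shares) * LEVEL_WEIGHT[rp_level] // base

# Six normal 2 vCPU / 32 GB VMs in HighShares, two in NormalShares:
vm_cpu, vm_mem = 2 * 1000, 32 * 1024 * 10   # 2000 CPU shares, 327680 memory shares
levels = ["high", "normal"]
print(scalable_rp_shares("high",   [vm_cpu] * 6, levels))  # 24000
print(scalable_rp_shares("high",   [vm_mem] * 6, levels))  # 3932160
print(scalable_rp_shares("normal", [vm_cpu] * 2, levels))  # 4000
print(scalable_rp_shares("normal", [vm_mem] * 2, levels))  # 655360
```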

To verify, the MOB shows the adjusted values of the NormalShares RP: 2 x 2000 = 4000 CPU shares and 2 x 327680 = 655360 memory shares.

If we look at the worst-case-scenario allocation of each VM (if every VM in the cluster is 100% active), we notice that the VM allocations increase in the HighShares RP and decrease in the NormalShares RP. VM7 and VM8 now get a maximum of 7 GB instead of 16 GB, while the allocations of VMs 1 to 6 increase by roughly 3 GHz and 3 GB each. Easily spotted: the worst-case-scenario allocation is now modeled after the RP share level ratio.
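Plugging the scaled share values into the same worst-case arithmetic shows where the ~7 GB and ~14 GHz/GB figures come from (again, just a sketch of the math above):

```python
# Worst-case split with the scaled share values:
# HighShares 24000 CPU / 3932160 memory shares, NormalShares 4000 / 655360.
cpu_total, mem_total = 24000 + 4000, 3932160 + 655360
print(100 * 24000 / cpu_total / 6, 100 * 3932160 / mem_total / 6)  # ~14.3 GHz / ~14.3 GB for VMs 1-6
print(100 * 4000 / cpu_total / 2,  100 * 655360 / mem_total / 2)   # ~7.1 GHz / ~7.1 GB for VM7 and VM8
```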

What if I adjust the share level at the RP level? The NormalShares RP is downgraded to a low memory share level; the CPU shares remain the same. The RP itself now receives 81920 shares, establishing a 4:1 ratio compared to the HighShares RP (327680 vs. 81920). The interesting thing is that the MOB shows the same value as before: 655360 memory shares. Why? Because it simply sums the shares of the entities in the RP.

As a test, I’ve reduced the memory shares of VM7 from 327680 to 163840. The MOB indicates a drop from 655360 to 491520 (327680 + 163840), proving that the share value is the sum of the shares of the child objects.

Please note that this is a fundamental change in behavior. With non-scalable shares, RP share values are only relative at the sibling level. That means that a VM inside a resource pool competes for resources with the other VMs at the same level inside that resource pool. Now, a VM with an absurdly high share value (custom-set or a monster VM) impacts the resource distribution of the whole cluster. The resource pool share value is a summation of its child objects. Inserting a monster VM into a resource pool automatically increases the share value of the resource pool; therefore, the entire group of workloads inside it benefits.

I corrected the share value of VM7 back to the default of 327680 to verify the ratio of the increase occurring on the HighShares RP. The ratio between low and high is 4:1, and therefore the adjusted memory shares of HighShares should be 1966080 x 4 = 7864320.

What if we return NormalShares to the normal share value, similar to the beginning of this test, but add another high-share-value RP to the environment? For this test, we add VM9 and VM10, both equipped with two vCPUs and 32 GB of memory. For test purposes, they are affined with ESXi01, similar to the HighShares RP VMs. The MOB on ESXi01 shows the following values for the new RP HighShares-II: 8000 CPU shares and 1310720 memory shares, following the 2:1 ratio.

If we look at the worst-case-scenario allocation of each VM, we notice that the allocation decreases for all the VMs in the HighShares and NormalShares RPs. VMs 1 to 6 each get 1/6 of 66% of the cluster resources (11 GHz & 11 GB), while VM7 and VM8 get 50% of 11% of the cluster resources, i.e. 5.5 GHz and 5.5 GB each. The new VMs 9 and 10 can each allocate up to 11 GHz and 11 GB, the same as the VMs in the HighShares RP, following the RP share level ratio.

What happens if we remove the HighShares-II RP and move VM9 and VM10 into a new LowShares RP? This creates a situation with three RPs, each with a different share level, providing us with a 4:2:1 ratio. The MOB view of ESXi01 shows that the LowShares RP share value is not modified, while the HighShares RP shares are quadrupled.

The MOB view of ESXi02 shows that the share value of the NormalShares RP is now doubled, following the 4:2:1 ratio exactly.

This RP design results in the following worst-case-scenario allocation distribution:

VMs as Siblings

The last scenario I want to highlight is a VM deployed at the same level as the resource pools, a common occurrence. Without scalable shares, this could be catastrophic, as a monster VM could cast a shadow over a resource pool. A (normal share value) VM with 16 vCPUs and 128 GB would receive 16000 CPU shares and 1310720 memory shares. In the pre-scalable-shares era, it would dwarf a normal-share-value RP with its 4000 CPU shares and 163840 memory shares. Now, with scalable shares bubbling up the number of shares of an RP’s child objects, the playing field is evened out. It doesn’t completely solve the problem, but it reduces the damage. As always, the recommendation is to commit to a single object type per level: once you use resource pools, provision only resource pools at that level. Do not mix VMs and RPs on the same level, especially when you are in the habit of deploying monster VMs. As an example, I deployed the VM “High-VM11” at the same level as the resource pools, and DRS placed it on ESXi02, where the NormalShares RP lives in this scenario. Its share level is set to high, so it receives 4000 shares for its two vCPUs and 655360 shares for its memory configuration, matching the RP config, which needs to feed the needs of the two VMs inside it.
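A quick back-of-the-envelope comparison (using the share defaults listed earlier) shows why scalable shares reduces, but does not remove, the damage:

```python
# Back-of-the-envelope: a "monster" VM deployed next to resource pools
# competes with them at the sibling level. Memory shares only.
monster_vm_mem = 128 * 1024 * 10        # 16 vCPU / 128 GB VM, normal level -> 1310720
static_rp_mem  = 16 * 1024 * 10         # pre-scalable normal RP default    -> 163840
scaled_rp_mem  = 2 * (32 * 1024 * 10)   # RP with two 32 GB VMs summed      -> 655360

print(monster_vm_mem / static_rp_mem)   # 8.0  -> the RP is dwarfed
print(monster_vm_mem / scaled_rp_mem)   # 2.0  -> still skewed, but far less
```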

I hope this write-up helps you understand how outstanding Scalable Shares is, turning share levels more or less into QoS levels. Is it perfect? Not yet, as it is not bulletproof against VMs being provisioned out of place. My recommendation is to explore VEBA (the VMware Event Broker Appliance) and generate a function to automatically move root-deployed VMs into a general RP, avoiding the mismatch.

Closing Notes

Please note that in the scenarios I used, I constrained the placement of all VMs of an RP to a single host. In everyday environments, this situation will not exist, and RPs will not be tied to a single host. The settings I used are meant to demonstrate the inner workings of scalable shares and must not be seen as endorsements or as a description of normal vSphere behavior. The platform was heavily tuned to provide an uncluttered view and make it more comprehensible.

Worst-case-scenario numbers describe a situation that is highly unlikely to occur: every VM being 100% active at the same time. It helps to highlight resource distribution while explaining a mechanism. Typically, resource demand ebbs and flows between different workloads, so the examples used in these scenarios are not indicative of the expected resource allocation when using resource pools and shares alone.

Filed Under: DRS Tagged With: DRS, resource pools, vSphere 7

DRS Migration Threshold in vSphere 7

May 8, 2020 by frankdenneman

DRS in vSphere 7 is equipped with a new algorithm. The old algorithm measured the load of each host in the cluster and tried to keep the difference in load between hosts within a specific range. That range, the target host load standard deviation, was tuned via the migration threshold.

The new algorithm is designed to find an efficient host for the VM, one that can provide the resources the workload demands while considering the potential behavior of other workloads. Throughout a series of articles, I will explain the actions of the algorithm in more detail.

Due to these changes, the behavior of the migration threshold changed as well. In general, the migration threshold still acts as the metaphorical gas pedal: by sliding it to the right, you press the gas pedal to the floor. You tell DRS to become more aggressive, to reduce or relax certain thresholds. Underneath the covers, things changed a lot. There is no host load standard deviation to compare anymore; instead, DRS needs to understand workload demand changes and how much host inefficiency to tolerate. Let’s take a closer look:

Migration Threshold 

Other than greatly improving the text describing each migration threshold setting, the appearance and behavior of the Migration Threshold (MT) have not changed much in vSphere 7. By default, the slider is set to the moderate “3” setting. There are two more aggressive load-balancing settings, 4 and 5, with 5 being the most aggressive. To the left of the default, there are two settings, 1 and 2; however, only setting 2 generates load-balancing recommendations. If you set the slider to 1, the most conservative option, DRS only produces migration recommendations that are considered mandatory. Mandatory moves are generated for a few events; the three most common reasons are a host being placed into maintenance mode, a Proactive HA evacuation, or a VM violating an (anti-)affinity rule. The remainder of this article refers to the slider settings by their numbers.

Selecting the Appropriate Migration Threshold Setting 

MT setting 2 is intended for a cluster with mostly stable workloads. Stable in the sense of workload variation, not failure rates ;). When workloads generate a continuous, steady load, it makes less sense to move them around aggressively. MT3 suits a healthy mix of stable and bursty workloads, while MT4 and MT5 are designed to react to workload spikes and bursts.

  

Headroom threshold 

To influence balancing moves, DRS uses a threshold that identifies a particular level of utilized capacity within a host, i.e., how much headroom remains. To point out, this is not a strict admission control function.

The threshold identifies a point where overall host utilization starts to introduce (some) performance loss to the associated workloads on the host. As the new DRS algorithm is designed to consider VM demand, it is focused on finding the best host possible. DRS compares the overall host utilization of each host and migrates VMs to help workloads have enough room to burst.  

By default, DRS starts to consider a host less efficient when the host load exceeds 50%. Once the threshold is surpassed, DRS examines possible migrations to find workloads a more efficient host. Similar to the old algorithm, the improvement in efficiency needs to exceed the cost of the migration. When you select a more aggressive migration threshold setting (MT4, MT5), the tolerated host load is lowered to 30%. From that point on, DRS takes a particular level of inefficiency into account and starts to analyze other hosts to understand whether specific workloads would benefit from placement on another host. Another way to put it is that DRS attempts to provide 70% headroom in this situation. As a result, you will notice more workload migrations when selecting MT4 or MT5.
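As a conceptual sketch only (the 50% and 30% figures come from the paragraph above; the decision logic is heavily simplified and not the actual DRS algorithm):

```python
# Conceptual sketch: once host utilization exceeds the headroom threshold,
# DRS starts evaluating candidate migrations. 50% is the default, 30%
# applies to the aggressive MT4/MT5 settings, per the text above.
def headroom_threshold(migration_threshold: int) -> float:
    return 0.30 if migration_threshold >= 4 else 0.50

def should_evaluate_moves(host_utilization: float, migration_threshold: int) -> bool:
    return host_utilization > headroom_threshold(migration_threshold)

print(should_evaluate_moves(0.40, 3))  # False: under the 50% default threshold
print(should_evaluate_moves(0.40, 5))  # True: MT5 tolerates only 30% load
```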

Demand Estimation 

To find an efficient host, DRS needs to understand the demand of the vSphere Pod or the VM. DRS uses a number of statistics, provided by the ESXi hosts, to get a proper understanding of the workload demand. DRS is designed to be conservative, as it doesn’t want to move a virtual machine based on an isolated event. Taking care of a sudden increase in demand is the role of the ESXi host; when this becomes structural behavior, it is DRS’s task to find the right spot for the VM.

Please note that a vSphere Pod will not be migrated. Although DRS does not load-balance vSphere Pods to get better overall capacity usage, it keeps track of a vSphere Pod’s demand, since it could affect the performance of other workloads (VMs and Pods) running on the host. For more info about vSphere Pods, please read the article “Initial Placement of a vSphere Pod“.

By default, DRS calculates an average demand over the last 40 minutes for each workload. This period depends on the VM memory configuration and the migration threshold. We learned that DRS needs to use a shorter history for smaller VMs to better catch the behavior of vSphere Pods or VMs with a smaller memory footprint. Below is an overview of the Migration Threshold settings and the number of minutes used to determine the demand for each workload.
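A rough sketch of the averaging itself is shown below; the 40-minute default comes from the text, while the per-setting window lengths live in the overview table and are left as a parameter here:

```python
# Sketch: averaging workload demand over a history window. Only the
# 40-minute default is taken from the text; window length is a parameter.
from collections import deque

class DemandEstimator:
    def __init__(self, window_minutes: int = 40, sample_interval_min: int = 1):
        self.samples = deque(maxlen=window_minutes // sample_interval_min)

    def add_sample(self, demand_mhz: float) -> None:
        self.samples.append(demand_mhz)

    def average_demand(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

est = DemandEstimator(window_minutes=40)
for demand in (800, 1200, 950):
    est.add_sample(demand)
print(est.average_demand())  # ~983 MHz
```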

Cost-Benefit 

DRS needs to consider the state of the cluster and the workload demand of all the vSphere Pods and VMs, and it needs to ensure that whatever it does does not interfere with the primary goal of the virtual infrastructure: providing resources to workloads. Every move consumes CPU resources, absorbs network bandwidth, and, in some cases, can affect memory-sharing benefits. DRS must weigh all the possibilities. With DRS in vSphere 7, that functionality is called cost-benefit filtering. Moving the migration threshold to the right reduces certain cost-filtering aspects while relaxing some benefit requirements. This allows DRS to become more responsive to bursts. As a result, you will notice more workload migrations when selecting MT4 or MT5.
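A conceptual sketch of cost-benefit filtering might look like the following; the per-MT margins are made-up illustration values, not the real tuning:

```python
# Conceptual sketch: recommend a move only when the estimated efficiency
# gain outweighs the migration cost. The margin per MT setting is an
# assumption for illustration only.
def accept_move(benefit: float, cost: float, migration_threshold: int) -> bool:
    # Higher MT settings relax the required benefit margin.
    required_margin = {1: 2.0, 2: 1.5, 3: 1.25, 4: 1.1, 5: 1.0}[migration_threshold]
    return benefit > cost * required_margin

print(accept_move(benefit=120, cost=100, migration_threshold=3))  # False
print(accept_move(benefit=120, cost=100, migration_threshold=5))  # True
```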

DRS Responsibility 

Please remember that vSphere consists of many layers of schedulers all working closely together. DRS is responsible for placement, finding the best host for that particular vSphere pod or VM. However, it is the responsibility of the actual ESXi host schedulers to ensure the workload gets scheduled on the physical resources. Consider DRS as the host or hostess of the restaurant, escorting you to a suitable table, while the ESXi schedulers are the cooks and the waiters.  

Individual workload behavior can change quite suddenly, or there is an abrupt change in resource availability. DRS needs to coordinate and balance all the workloads and their behavior in any possible scenario. The new DRS algorithm is completely redesigned, and I think it’s an incredible step forward.  

But a new algorithm, with new tweakable parameters, also means that we can expect different behavior. It’s expected that you will see more vMotions compared to the old algorithm, regardless of MT selection. A future article will explain the selection process of the algorithm. As always, it’s recommended to test the new software in a controlled environment. Get to understand its behavior and test out which migration threshold fits your workload best. 

Filed Under: DRS Tagged With: DRS, VMware, vSphere7

SIOC on datastores backed by a single datapool

December 6, 2012 by frankdenneman

Duncan posted an article today in which he brings up the question: should I use many small LUNs or a couple of large LUNs for Storage DRS? In this article, he explains the differences between Storage I/O Control (SIOC) and Storage DRS and why they work well together. To re-emphasize: the goal of Storage DRS load balancing is to fix long-term I/O imbalances, while SIOC addresses short-term bursts and loads. SIOC is all about managing the queues, while Storage DRS is all about intelligent placement and avoiding bottlenecks.
Julian Wood makes an interesting remark, and both Duncan and I hear this remark often when discussing SIOC. Don’t get me wrong, I’m not picking on Julian; he merely made a frequently used argument.

“There is far less benefit in using Storage IO Control to load balance IO across LUNs ultimately backed by the same physical disks than load balancing across separate physical storage pools. “

Well, when you look at the way SIOC works, I tend to disagree with this statement. As stated before, SIOC manages queues: the queues to the datastores used by the virtual machines in the virtual datacenter. Typically, these virtual machines differ in workload type, in peak moments, and in importance to the organization. With the use of disk shares, important virtual machines can be assigned a higher priority within the disk queue. When contention occurs, and this is important to realize, when contention occurs these business-critical virtual machines get prioritized over other virtual machines. Not all important virtual machines generate a constant stream of I/O, while other virtual machines, maybe with a lower priority, do generate a constant stream of I/O. Disk shares allow the high-priority, low-I/O virtual machines to get a foot in the door and get those I/Os to the datastore and back. Without SIOC and disk shares, you need to start thinking about increasing the queue depth of each host and about smart placement of these virtual machines (both high and low I/O load) to avoid the high I/O loads ending up on the same host. These placement adjustments might impact DRS load-balancing operations, possibly affecting other virtual machines along the way. Investing time in creating and managing a matrix of possible VM-to-datastore placements is not the way to go in this time of rapidly expanding datacenters.
Because SIOC is a datastore-wide scheduler, SIOC determines the queue depth of the ESXi hosts connected to the datastore that run virtual machines on that datastore. Hosts with higher-priority virtual machines get “deeper” queue depths to the datastore, and hosts with lower-priority virtual machines running on the datastore receive shorter queue depths. To be more precise, SIOC calculates the datastore-wide latency, and each local host scheduler determines the queue depth for its queues to the datastore.
But remember queue depth changes only occur when there is contention, when the datastore exceeds the SIOC latency threshold. For more info about SIOC latency read “To which Host level latency statistic is the SIOC threshold related”
Coming back to the argument, I firmly believe that SIOC has benefits in a shared disk pool structure; between the VMM and the datastore, a lot of queues exist.
vSphere 5.1 VMObservedLatency
Because SIOC takes the average device latency of all hosts connected to the datastore into account, it understands the overall picture when determining the correct queue depth for the virtual machines. Keep in mind that queue-depth changes occur only during contention. Now, the best part of SIOC in 5.1 is the automatic latency threshold computation. By leveraging the SIOC injector, it understands the peak value of a datastore and adjusts the SIOC threshold accordingly. The SIOC threshold is set to 90% of the peak value, therefore having an excellent understanding of the performance capability of the datastore. This is done on a regular basis, so it keeps the actual workload in mind. This dynamic system will give you far more performance benefit than statically setting the queue depth and DSNRO for each host.
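A small sketch of both mechanisms, the 90%-of-peak threshold and a proportional-share split of the device queue across hosts, purely as a conceptual model rather than the actual SIOC scheduler:

```python
# Conceptual model only: automatic latency threshold (90% of the peak
# capability observed by the injector) and a proportional-share split of
# the datastore queue across hosts during contention.
def auto_latency_threshold(peak_capability_ms: float) -> float:
    return 0.9 * peak_capability_ms

def per_host_queue_depth(host_shares: dict, total_queue_depth: int) -> dict:
    total = sum(host_shares.values())
    return {host: max(1, round(total_queue_depth * shares / total))
            for host, shares in host_shares.items()}

print(auto_latency_threshold(30.0))                              # 27.0 ms
print(per_host_queue_depth({"esx01": 3000, "esx02": 1000}, 64))  # {'esx01': 48, 'esx02': 16}
```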
One of the main reasons for creating multiple datastores backed by a single disk pool is to create a multi-pathing environment. Together with advanced multi-pathing policies and LUN-to-controller-port mappings, you can get the most out of your storage subsystem. With SIOC, you can manage your queue depths dynamically and automatically, based on an understanding of actual performance levels, while having the ability to prioritize at the virtual machine level.

Filed Under: SIOC Tagged With: DRS, SIOC, Storage DRS

DRS and memory balancing in non-overcomitted clusters

August 10, 2012 by frankdenneman

First things first, I normally do not recommend changing advanced settings. Always try to tune system behavior by changing the settings provided by the user interface or try to understand system behavior and how it aligns with your design.
The “problem”
DRS load balancing recommendations could be sub-optimal when no memory overcommitment is preferred.
Some customers prefer not to use memory overcommitment. Their clusters contain (just) enough memory capacity to ensure all running virtual machines have their memory backed by physical memory. Nowadays, it is not uncommon to see virtual machines with fairly high allocated (consumed) memory, and due to the use of large pages on hosts with recent CPU architectures, little to no memory is shared. A common scenario with this design is a host memory load of 80-85% consumed. In this situation, DRS recommendations may have a detrimental effect on performance, as DRS does not consider consumed memory but active memory.
DRS behavior
When analyzing the requirements of a virtual machine during load balancing operations, DRS calculates the memory demand of the virtual machine.
The main memory metric used by DRS to determine memory demand is active memory. Active memory represents the working set of the virtual machine, which signifies the number of actively used pages in RAM. By using the working-set estimation, the memory scheduler determines which of the allocated memory pages are actively used by the virtual machine and which allocated pages are idle. To accommodate a sudden, rapid increase of the working set, 25% of the idle consumed memory is included. Memory demand also includes the virtual machine’s memory overhead.
Let’s use an 8 GB virtual machine as an example of how DRS calculates memory demand. The guest OS running in this virtual machine has touched 50% of its memory size since it was booted, but only 20% of its memory size is active. This means that the virtual machine has consumed 4096 MB and that 1638.4 MB is active.

As mentioned, DRS accommodates a percentage of the idle consumed memory to allow for a sudden increase in memory use. To calculate the idle consumed memory, the active memory (1638.4 MB) is subtracted from the consumed memory (4096 MB), resulting in 2457.6 MB. By default, DRS includes 25% of the idle consumed memory, i.e. 614.4 MB.

The virtual machine has a memory overhead of 90 MB. The memory demand DRS uses in its load-balancing calculation is as follows: 1638.4 MB + 614.4 MB + 90 MB = 2342.8 MB. This means that DRS will select a host that has 2342.8 MB available for this machine and for which the move improves the load balance of the cluster.
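Expressed as a small Python sketch of the formula above (the 25% idle inclusion and the 90 MB overhead come straight from the example):

```python
# Memory demand as described above: active memory, plus 25% of the idle
# consumed memory, plus the VM's memory overhead (all in MB).
def drs_memory_demand(consumed_mb, active_mb, overhead_mb, idle_pct=0.25):
    idle_consumed = consumed_mb - active_mb
    return active_mb + idle_pct * idle_consumed + overhead_mb

# The 8 GB VM example: 50% consumed, 20% active, 90 MB overhead.
print(drs_memory_demand(consumed_mb=4096, active_mb=1638.4, overhead_mb=90))  # 2342.8 MB
```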
DRS and corner stone of virtualization resource overcommitment
Resource sharing and overcommitment of resources are primary elements of virtualization. When designing a virtual infrastructure, it is a challenge to build the environment in such a way that it can handle the virtual machine workloads while improving server utilization. Because not every workload is equal, applying resource allocation settings such as shares, reservations, and limits can make a distinction in priority.
DRS is designed with this cornerstone in mind, and that makes DRS sometimes a hard act to follow. DRS is all about solving imbalance and providing enough resources to the virtual machines, aligned with their demand. This means that DRS balances workloads based on demand and trusts its core assumption that overcommitment is allowed. It then relies on the host-local scheduler to figure out the priority of the virtual machines. And this behavior is sometimes not in line with the common perception of DRS.
A common perception is that DRS is about optimizing performance. This is partially true. As mentioned before, DRS looks at the demand of the VM and will try to match the activity of the virtual machines with the available resources in the cluster. As it relies on resource allocation settings, it assumes that priority is defined for each virtual machine and that the host-local schedulers can reclaim memory safely. For this reason, the DRS memory imbalance metric is tuned to focus on VM active memory to allow efficient sharing of host memory resources. This allows running with less cluster memory than the sum of all running virtual machine memory sizes, reclaiming idle consumed memory from lower-priority virtual machines for other virtual machines’ active workloads.
Unfortunately, DRS does not know when the environment is designed to avoid overcommitment. Based on its input, it can place a virtual machine on a host with virtual machines that have lots of idle consumed memory lying around, instigating memory reclamation. In most cases, this reclamation is hardly noticeable due to the use of the balloon driver. However, when all hosts are highly utilized, ballooning might not be as responsive as required, forcing the kernel to compress memory and swap. This means that migrations for the sole purpose of balancing active memory are not useful in environments like these and, if the target host memory is highly consumed, can cause a performance impact on the migrating virtual machine as it waits to obtain memory, and on the other virtual machines on the target host as they do the processing required to reclaim their idle memory.
The solution? You might want to change the 25% idle consumed memory setting
The solution I recommend starting with is to lower the migration threshold by moving the slider to the left. This allows the DRS cluster to tolerate a higher imbalance and makes DRS more conservative when recommending migrations.
If this is not satisfactory, I would suggest changing the DRS advanced option called IdleTax. Please note that this DRS advanced option is not the same setting as the VMkernel memory setting Mem.IdleTax.
The DRS IdleTax advanced option (default 75) controls how much consumed idle memory should be added to active memory when estimating memory demand. The calculation is as follows: 100 - IdleTax. Default calculation: 100 - 75 = 25.
This means that the smaller the value of IdleTax, the more consumed idle memory is added to the active memory by DRS for load balancing.
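In sketch form, with the idle fraction derived from IdleTax as described (the 8 GB VM example from earlier is reused; the lowered IdleTax value of 25 is just an illustration):

```python
# The idle fraction used in the demand estimate follows from IdleTax:
# idle_pct = (100 - IdleTax) / 100. Lowering IdleTax folds more idle
# consumed memory into the demand estimate.
def idle_pct_from_idletax(idle_tax: int) -> float:
    return (100 - idle_tax) / 100

consumed, active, overhead = 4096, 1638.4, 90   # the 8 GB VM example
for idle_tax in (75, 25):                       # default vs. a lowered value
    demand = active + idle_pct_from_idletax(idle_tax) * (consumed - active) + overhead
    print(idle_tax, round(demand, 1))           # 75 -> 2342.8, 25 -> 3571.6
```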

Be aware that the value of IdleTax is a heuristic, tuned to facilitate memory overcommitment; tuning it to a lower value is appropriate for environments not using overcommitment. Note that the option is set per cluster, and would need to be changed for all DRS clusters as appropriate.
Again, try a lower migration threshold setting first and monitor whether it provides satisfactory results before changing this advanced option.

Filed Under: DRS Tagged With: DRS, IdleTax, Memory, Over commitment

Impact of oversized virtual machines part 2

December 17, 2010 by frankdenneman

In part 1 of this series of posts on the impact of oversized virtual machines, the NUMA architecture, memory overhead reservations, and share levels were reviewed. Part 2 zooms in on the impact of memory overhead reservations and share levels on HA and DRS.
Impact of memory overhead reservation on HA Slot size
The VMware High Availability admission control policy “Host failures cluster tolerates” calculates a slot size to determine the maximum number of virtual machines that can be active in the cluster without violating failover capacity. This admission control policy determines the HA cluster slot size from the largest CPU reservation and the largest memory reservation plus its memory overhead reservation. If the virtual machine with the largest reservation (which could be an appropriately sized reservation) is oversized, its memory overhead reservation can still have a substantial impact on the slot size.
The HA admission control policy “Percentage of Cluster Resources Reserved” calculates the memory component of its mechanism by summing the reservation plus the memory overhead of each virtual machine, allowing the memory overhead reservation to have an even bigger impact on admission control than in the calculation done by the “Host failures cluster tolerates” policy.
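A simplified sketch of both calculations; real HA admission control also applies minimum slot sizes and other refinements not shown here:

```python
# Simplified sketch of the two admission-control calculations described
# above. vms: list of dicts with cpu_reservation_mhz, mem_reservation_mb,
# mem_overhead_mb.
def slot_size(vms):
    cpu_slot = max(vm["cpu_reservation_mhz"] for vm in vms)
    mem_slot = max(vm["mem_reservation_mb"] + vm["mem_overhead_mb"] for vm in vms)
    return cpu_slot, mem_slot

def reserved_memory_component(vms):
    return sum(vm["mem_reservation_mb"] + vm["mem_overhead_mb"] for vm in vms)

vms = [
    {"cpu_reservation_mhz": 500,  "mem_reservation_mb": 0,    "mem_overhead_mb": 150},
    {"cpu_reservation_mhz": 1000, "mem_reservation_mb": 2048, "mem_overhead_mb": 400},  # oversized VM
]
print(slot_size(vms))                   # (1000, 2448) -- one large VM drives the slot
print(reserved_memory_component(vms))   # 2598 MB
```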
DRS initial placement
DRS uses a worst-case scenario during initial placement. Because DRS cannot determine the resource demand of a virtual machine that is not running, it assumes that both the memory demand and the CPU demand are equal to its configured size. Oversizing virtual machines therefore decreases the options for finding a suitable host for the virtual machine. If DRS cannot guarantee that the full 100% of the resources provisioned for this virtual machine can be used, it will vMotion virtual machines away so that it can power on this single virtual machine. If there are not enough resources available, DRS will not allow the virtual machine to be powered on.
Shares and resource pools
When placing a virtual machine inside a resource pool, its shares are relative to the other virtual machines (and resource pools) inside that pool. Shares are relative to all other components sharing the same parent; an easier way to put it is to call it the sibling share level. Therefore, the numeric share values are not directly comparable across pools, because they are children of different parents.

By default, a resource pool is configured with a share amount equal to that of a 4 vCPU, 16 GB virtual machine. As mentioned in part 1, shares are relative to the configured size of the virtual machine, implicitly stating that size equals priority.
Now let’s take another look at the image above. The three virtual machines are reparented to the cluster root, next to resource pool 1 and resource pool 2. Suppose they are all 4 vCPU, 16 GB machines; their share values are interpreted in the context of the root pool, and they will receive the same priority as resource pool 1 and resource pool 2. This is not only wrong, but also dangerous in a denial-of-service sense: a virtual machine running on the same level as resource pools can suddenly find itself entitled to nearly all cluster resources.
Because of this default share distribution process, we always recommend avoiding placing virtual machines at the same level as resource pools. Unfortunately, a virtual machine might be reparented to the cluster root level when manually migrating it using the GUI; the current workflow defaults to the cluster root level instead of using its current resource pool. Because of this, it is recommended to increase the number of shares of the resource pool to reflect its priority level. More info about shares on resource pools can be found in Duncan’s post on yellow-bricks.com.
Go to Part 3: Impact of oversized virtual machine.

Filed Under: DRS, VMware Tagged With: DRS, HA, memory overhead reservation, Shares, VMware

