Disabling MinGoodness and CostBenefit

Over the last couple of months I’ve seen recommendations popping up to change the MinGoodness and CostBenefit settings to zero on a DRS cluster (KB1017291). This usually happens after a maintenance window: hosts were placed in maintenance mode, and afterwards the hosts remain unevenly loaded and DRS won’t migrate virtual machines to the less loaded hosts.

With these adaptive algorithms disabled, DRS considers every move, and the virtual machines are distributed aggressively across the hosts. Although this sounds very appealing, the MinGoodness and CostBenefit calculations were created for a reason. Let’s explore the DRS algorithm and see why this setting should only be used temporarily and not as a permanent setting.

DRS load balance objectives

The primary objective of DRS is to provide virtual machines with their required resources. If a virtual machine is getting the resources it requests (its dynamic entitlement), then there is no need to find a better spot for it. If a virtual machine does not get the resources specified in its dynamic entitlement, DRS will consider moving it, depending on additional factors.

This means that DRS allows certain situations where the administrator feels the cluster is unbalanced, such as an uneven virtual machine count on the hosts inside the cluster. I’ve seen situations where one host was running 80% of the load while the other hosts were running only a couple of virtual machines. This particular cluster consisted of big hosts, each containing 1TB of memory, while the entire virtual machine memory footprint was no more than 800GB. One host could easily run all the virtual machines and provide the resources they were requesting.

This particular scenario illustrates the biggest misunderstanding about DRS: it is not primarily designed to distribute virtual machines equally across the hosts in the cluster. It distributes the load as efficiently as possible across the resources to provide the best performance for the virtual machines. And this is the key to understanding why DRS does or does not generate migration recommendations. Efficiency! Moving virtual machines around costs CPU cycles, memory resources and, to a smaller extent, datastore operations (stunning and unstunning the virtual machines). In the most extreme case, load balancing itself can endanger the performance of virtual machines by consuming resources for migrations that the virtual machines need themselves. That is the worst-case scenario, but the main point is that the load balancing process costs resources that could also be used by virtual machines providing their services, which is the primary reason the virtual infrastructure exists. To manage and contain the resource consumption of load balancing operations, the MinGoodness and CostBenefit calculations were created.

CostBenefit
DRS calculates the cost, benefit and risk of a move. Cost: how many resources does it take to move a virtual machine by vMotion? A virtual machine that is constantly updating a large memory footprint costs more CPU cycles and network traffic than a virtual machine with a medium memory footprint that has been idling for a while. Benefit: how many resources will the move free up on the source host, and what will the impact be on the normalized entitlement of the destination host? The normalized entitlement is the sum of the dynamic entitlements of all the virtual machines running on a host, divided by the capacity of that host. Risk: a prediction of how the workload might change on both the source and destination host, and whether the outcome of moving the candidate virtual machine is still positive when the workload changes.
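The normalized entitlement described above is simple arithmetic, and a minimal sketch can make it concrete. The `Host` class, the field names and the numbers below are illustrative assumptions, not DRS internals; the only part taken from the text is the formula itself (sum of dynamic entitlements divided by host capacity).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """Hypothetical host record; the fields are assumptions for illustration."""
    name: str
    capacity_gb: float                 # resource capacity of the host
    vm_entitlements_gb: List[float]    # dynamic entitlement of each VM on the host


def normalized_entitlement(host: Host) -> float:
    """Sum of the VMs' dynamic entitlements divided by the host capacity."""
    return sum(host.vm_entitlements_gb) / host.capacity_gb


# Made-up numbers: a 1024 GB host whose VMs are entitled to 640 GB in total.
busy = Host("esx01", 1024.0, [256.0, 384.0])
print(normalized_entitlement(busy))  # 0.625
```

A value well below 1.0 means the host can satisfy every entitlement it carries, which is why a host that looks "full" by virtual machine count may still give DRS no reason to act.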

MinGoodness
To understand which host the virtual machine must move to, DRS uses the normalized entitlement of the host as the key metric and will only consider hosts that have a lower normalized entitlement than the source host. MinGoodness helps DRS understand what effect the move has on the overall cluster imbalance.
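As a rough sketch of that selection rule, the function below keeps only the hosts whose normalized entitlement is lower than that of the source host. The function name, the host names and the values are hypothetical, and sorting the result is only a convenience for reading the output, not something the text attributes to DRS.

```python
from typing import Dict, List


def candidate_destinations(source: str, normalized: Dict[str, float]) -> List[str]:
    """Return hosts with a lower normalized entitlement than the source host.

    `normalized` maps host name -> normalized entitlement (illustrative input).
    """
    source_load = normalized[source]
    candidates = [
        host for host, load in normalized.items()
        if host != source and load < source_load
    ]
    # Least-loaded first, purely for readability of the result.
    return sorted(candidates, key=lambda host: normalized[host])


cluster = {"esx01": 0.8, "esx02": 0.3, "esx03": 0.9, "esx04": 0.5}
print(candidate_destinations("esx01", cluster))  # ['esx02', 'esx04']
```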

DRS awards every move a CostBenefit and a MinGoodness rating, and these are linked together. DRS will only recommend a move with a negative CostBenefit rating if the move has a highly positive MinGoodness rating. Due to the metrics used, CostBenefit ratings are usually more conservative than MinGoodness ratings, so they tend to overrule the decision to move a virtual machine to a host with a lower normalized entitlement because of the cost or risk involved in that particular move.
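The way the two ratings interact can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the integer rating scale from -2 to +2, the `high_goodness` threshold and the function name are hypothetical, and the real DRS rating logic is not documented in this article. The only rule taken from the text is that a negative CostBenefit rating needs a highly positive MinGoodness rating to pass.

```python
def should_recommend(goodness: int, cost_benefit: int, high_goodness: int = 2) -> bool:
    """Hypothetical combined filter on a made-up integer scale of -2..+2.

    goodness:     how much the move improves the cluster balance
    cost_benefit: benefit of the move weighed against its cost and risk
    """
    if goodness <= 0:
        # Assumption: a move that does not improve balance is never recommended.
        return False
    if cost_benefit >= 0:
        # The move pays for itself, so any improvement in balance is enough.
        return True
    # Negative cost/benefit: only a highly positive goodness rating can justify it.
    return goodness >= high_goodness


print(should_recommend(goodness=1, cost_benefit=1))   # True
print(should_recommend(goodness=1, cost_benefit=-1))  # False
print(should_recommend(goodness=2, cost_benefit=-1))  # True
```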

When MinGoodness and CostBenefit are set to zero, DRS calculates the cluster imbalance and recommends any move* that improves the balance of the normalized entitlement of the hosts within the cluster, without considering the resource cost involved. In oversized environments, where resource supply is abundant, setting these options temporarily should not create a problem. In environments where resource demand rivals resource supply, setting these options can create resource starvation.
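To illustrate what "any move that improves the balance" means once the cost is ignored, here is a minimal sketch. It assumes the cluster imbalance is measured as the standard deviation of the hosts' normalized entitlements and that all hosts have equal capacity, so a virtual machine's load can be shifted directly in normalized terms. Both assumptions, like the function names and the numbers, are simplifications for illustration only.

```python
from statistics import pstdev
from typing import Dict


def imbalance(normalized: Dict[str, float]) -> float:
    """Assumed imbalance metric: standard deviation of normalized entitlements."""
    return pstdev(list(normalized.values()))


def after_move(normalized: Dict[str, float], src: str, dst: str, load: float) -> Dict[str, float]:
    """Return a copy of the cluster state with `load` shifted from src to dst."""
    moved = dict(normalized)
    moved[src] -= load
    moved[dst] += load
    return moved


def improves_balance(normalized: Dict[str, float], src: str, dst: str, load: float) -> bool:
    """With cost and risk ignored, a move qualifies as soon as imbalance drops."""
    return imbalance(after_move(normalized, src, dst, load)) < imbalance(normalized)


cluster = {"esx01": 0.8, "esx02": 0.2, "esx03": 0.2}
print(improves_balance(cluster, "esx01", "esx02", 0.2))  # True
print(improves_balance(cluster, "esx02", "esx01", 0.1))  # False
```

Note that nothing in this check asks what the vMotion itself costs, which is exactly the information the MinGoodness and CostBenefit thresholds normally bring to the decision.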

*The number of recommendations is limited by the MaxMovesPerHost calculation. This article contains more information about MaxMovesPerHost.

Recommendation
My recommendation is to use this advanced option sparingly: only when the host load is extremely unbalanced and DRS does not provide any migration recommendations, which typically happens after hosts in the cluster were placed in maintenance mode. Permanently activating this advanced option is similar to lobotomizing the DRS load balancing algorithm. It can do more harm in the long run, as you might see virtual machines in an almost-constant state of vMotion.

Comments

  1. PunchingClouds says

    Great stuff Frank. I like the level of detail explanation on this.

  2. Chris Beggs says

    Thanks for the article Frank, as always, very detailed with great insights.

  3. karlochacon says

    I’ve seen the scenario below….but the only thing I say to myself is why vmware just to be careful move these VMs around the other hosts…. I mean the host you mentioning fails and it has 10 VMs and the others have 2 or 3 VMs… if the host with more VMs fails we are going to have 10 VMs down for X minutes instead of having those 15 VMs among these 3 hosts….just a thought you know

    “This means that DRS allow certain situations where the administrator feels like the cluster is unbalanced, such as an uneven virtual machine count on hosts inside the cluster. I’ve seen situations where one host was running 80% of the load while the other hosts where running a couple of virtual machines. This particular cluster was comprised of big hosts, each containing 1TB memory while the entire virtual machine memory footprint was no more than 800GB. One host could easily run all virtual machines and provide the resources the virtual machines were requesting.”