DRS-FT INTEGRATION

Another new feature of vSphere 4.1 is the DRS-Fault Tolerance integration. vSphere 4.1 allows DRS not only to perform initial placement of Fault Tolerance (FT) virtual machines, but also to migrate the primary and secondary virtual machines during DRS load-balancing operations. In vSphere 4.0, DRS is disabled on the FT primary and secondary virtual machines. When FT is enabled on a virtual machine in 4.0, the existing virtual machine becomes the primary virtual machine and is powered on onto its registered host; the newly spawned virtual machine, called the secondary virtual machine, is automatically placed on another host. DRS refrains from generating load-balancing recommendations for both virtual machines.

The new DRS integration removes both the initial placement and the load-balancing limitation. DRS is able to select the most suitable host for initial placement and to generate migration recommendations for the FT virtual machines based on the current workload inside the cluster. This results in a better balanced cluster, which likely has a positive effect on the performance of the FT virtual machines. In vSphere 4.0 an anti-affinity rule prohibited the FT primary and secondary virtual machines from running on the same ESX host; vSphere 4.1 adds the possibility to create a VM-Host affinity rule ensuring that the FT primary and secondary virtual machines do not run on ESX hosts in the same blade chassis if the design requires this. For more information about VM-Host affinity rules, please visit this article.

Not only does the DRS-FT integration have a positive impact on the performance of the FT-enabled virtual machines, and arguably all other VMs in the cluster, it also reduces the impact of FT-enabled virtual machines on the virtual infrastructure. For example, DPM is now able to move the FT virtual machines to other hosts if DPM decides to place the current ESX host in standby mode; in vSphere 4.0, DPM needs to be disabled on at least two ESX hosts because of the DRS disable limitation which I mentioned in this article. Because DRS is able to migrate the FT-enabled virtual machines, DRS can evacuate all virtual machines automatically when an ESX host is placed into maintenance mode. The administrator does not need to manually select an appropriate ESX host and migrate the virtual machines to it; DRS automatically selects a suitable host to run the FT-enabled virtual machines. This reduces the need for manual operations and for creating very "exciting" operational procedures on how to deal with FT-enabled virtual machines during the maintenance window.

DRS-FT integration requires EVC to be enabled on the cluster. Many companies do not enable EVC on their ESX clusters, based on either FUD about performance loss or the argument that they do not intend to expand their clusters with new types of hardware and therefore run homogeneous clusters. The advantages and improvements DRS-FT integration offers, both in performance and in reduced complexity of cluster design and operational procedures, shed some new light on the discussion about enabling EVC in a homogeneous cluster. If EVC is not enabled, vCenter reverts to vSphere 4.0 behavior and enables the DRS disable setting on the FT virtual machines.
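To summarize the decision described above, here is a minimal sketch of the EVC-dependent behavior: with EVC enabled, the FT pair is treated as a normal DRS candidate; without it, vCenter falls back to the vSphere 4.0 behavior and disables DRS on the pair. The class and field names are my own illustrative assumptions, not vCenter code.

```python
# Minimal sketch (not vCenter's actual logic) of the EVC check described above.
from dataclasses import dataclass

@dataclass
class Cluster:
    evc_enabled: bool

@dataclass
class FtPair:
    primary: str        # name of the FT primary virtual machine
    secondary: str      # name of the FT secondary virtual machine
    drs_enabled: bool = False

def apply_drs_ft_policy(cluster: Cluster, pair: FtPair) -> FtPair:
    """Enable DRS placement/migration for an FT pair only when EVC is enabled on the cluster."""
    pair.drs_enabled = cluster.evc_enabled
    return pair

pair = apply_drs_ft_policy(Cluster(evc_enabled=True), FtPair("vm01", "vm01-secondary"))
print(pair.drs_enabled)  # True: DRS may place and migrate both FT virtual machines
```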

LOAD BASED TEAMING

In vSphere 4.1 a new network load-balancing algorithm, Load Based Teaming (LBT), is available on distributed virtual switch dvPort groups. The option "Route based on physical NIC load" takes the virtual machine network I/O load into account and tries to avoid congestion by dynamically reassigning and balancing the virtual switch port to physical NIC mappings. The three existing load-balancing policies, Port-ID, MAC-based and IP-hash, use a static mapping between virtual switch ports and the connected uplinks. The VMkernel assigns a virtual switch port during the power-on of a virtual machine; this virtual switch port is then assigned to a physical NIC based on either a round-robin or a hashing algorithm, but none of these algorithms takes the overall utilization of the physical NIC into account. This can lead to a scenario where several virtual machines mapped to the same physical adapter saturate the physical NIC and fight for bandwidth while the other adapters are underutilized. LBT solves this by remapping virtual switch ports to another physical NIC when congestion is detected. After the initial virtual switch port to physical port assignment is completed, Load Based Teaming checks the load on the dvUplinks at a 30-second interval and dynamically reassigns port bindings based on the current network load and the level of saturation of the dvUplinks. The VMkernel marks a dvUplink as congested if its transmit (Tx) or receive (Rx) network traffic exceeds a 75% mean over a 30-second period (the mean being the sum of the observations divided by the number of observations). The 30-second interval is used to avoid MAC address flapping issues with the physical switches. Although a 30-second interval is used, it is recommended to enable PortFast (trunk fast) on the physical switches, and all switches must be part of the same layer 2 domain.
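The sketch below illustrates the congestion check described above: every 30 seconds the mean Tx/Rx utilization of each dvUplink is evaluated and, if either direction exceeds 75%, a port is remapped to a less loaded uplink. The 75% and 30-second values come from the text; everything else (data structures, the "move one port to the least loaded uplink" policy, all names) is my own simplification, not VMware's implementation.

```python
# Illustrative sketch of the LBT behavior described above, not VMware's implementation.
SATURATION_THRESHOLD = 0.75   # 75% mean utilization over the interval
CHECK_INTERVAL_S = 30         # seconds between checks; also avoids MAC flapping

def mean(samples):
    return sum(samples) / len(samples)

def is_congested(uplink):
    """uplink = {'tx': [...], 'rx': [...]} with utilization samples between 0.0 and 1.0."""
    return (mean(uplink["tx"]) > SATURATION_THRESHOLD or
            mean(uplink["rx"]) > SATURATION_THRESHOLD)

def rebalance(port_to_uplink, uplinks):
    """Move one dvPort off a congested uplink onto the least loaded uplink (assumed policy)."""
    least_loaded = min(uplinks, key=lambda u: mean(uplinks[u]["tx"]) + mean(uplinks[u]["rx"]))
    for port, uplink in list(port_to_uplink.items()):
        if uplink != least_loaded and is_congested(uplinks[uplink]):
            port_to_uplink[port] = least_loaded   # remap the dvPort binding
            break                                 # move one port, then re-evaluate next interval
    return port_to_uplink
```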

DPM SCHEDULED TASKS

vSphere 4.1 introduces a lot of new features and enhancements of existing features; one of the hidden gems, I believe, is the DPM enable and disable scheduled task. The DPM "Change cluster power settings" scheduled task allows the administrator to enable or disable DPM via an automated task. If the admin selects the option DPM off, vCenter disables all DPM features on the selected cluster and all hosts in standby mode are powered on automatically when the scheduled task runs. This option removes one of the biggest obstacles to implementing DPM. One of the main concerns administrators have is the incurred (periodic) latency when enabling DPM. If DPM places an ESX host in standby mode, it can take up to five minutes before DPM decides to power up the ESX host again. During this (short) period of time, the environment experiences latency or performance loss, and usually this latency occurs in the morning. It is common for DPM to place ESX hosts in standby mode during the night due to the decreased workloads; when the employees arrive in the morning the workload increases and DPM needs to power on additional ESX hosts. The period between 7:30 and 10:00 is recognized as one of the busiest periods of the day, and during that period the IT department wants its computing power lock, stock and ready to go.

This scheduled task gives administrators the ability to disable DPM before the workforce arrives. Because the ESX hosts remain powered on until the administrator or a DPM scheduled task enables DPM again, another schedule can be created to enable DPM after the period of high workload demand ends. To create a scheduled task to disable DPM, open vCenter, go to Home > Management > Scheduled Tasks (Ctrl-Shift-T) and select the task "Change cluster power settings". Select the default power management for the cluster, On or Off, and configure the task. For example, by scheduling a DPM disable task every weekday at 7:00, the administrator ensures that all ESX hosts are powered on before 8 o'clock every weekday, in advance of the morning peak, rather than having to wait for DPM to react to the workload increase. By scheduling the DPM disable task more than one hour in advance of the morning peak, DRS has time to rebalance the virtual machines across all active hosts inside the cluster and the Transparent Page Sharing process can collapse the memory pages shared by the virtual machines on the ESX hosts. By powering up all ESX hosts early, the ESX cluster will be ready to accommodate the load increase.
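As a mental model of the schedule described above, here is a small sketch of the policy: DPM off on weekday mornings ahead of the peak, back on later. The 7:00 disable time comes from the example in the text; the 20:00 re-enable time is purely an assumption for illustration. In practice this is configured as two vCenter "Change cluster power settings" scheduled tasks, not code.

```python
# Sketch of the DPM scheduling policy described above (assumed evening re-enable time).
from datetime import datetime, time

DISABLE_AT = time(7, 0)    # weekday mornings, ahead of the 7:30-10:00 peak (from the article)
ENABLE_AT = time(20, 0)    # assumed evening time at which DPM may consolidate hosts again

def desired_dpm_state(now: datetime) -> str:
    """Return 'off' during weekday business hours, 'on' otherwise."""
    is_weekday = now.weekday() < 5   # Monday=0 .. Friday=4
    if is_weekday and DISABLE_AT <= now.time() < ENABLE_AT:
        return "off"
    return "on"

print(desired_dpm_state(datetime(2010, 8, 2, 7, 30)))   # 'off' -> all hosts stay powered on
print(desired_dpm_state(datetime(2010, 8, 2, 22, 0)))   # 'on'  -> DPM may place hosts in standby
```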

VM TO HOSTS AFFINITY RULE

VMware vSphere 4.1 introduces a new affinity rule, called "Virtual Machines to Hosts" (VM-Host). This new rule is available in vSphere 4.1 DRS clusters in addition to the existing (anti-)affinity rule, which is now called the VM-VM affinity rule. The new VM-Host affinity rule provides the ability to place a group of virtual machines on a subset of hosts inside the cluster. The new rule can be very useful in blade system environments and for honoring ISV license requirements. Rules can be created to ensure that virtual machines run on ESX hosts in different blade chassis for availability reasons, or the complete opposite: limit the virtual machines to ESX hosts inside one blade chassis to optimize network speeds by keeping network traffic inside the blade chassis. VM-Host rules are also very useful to fulfill the requirements of special ISV license models, for example restricting Oracle database virtual machines to run only on ESX hosts which are licensed by Oracle.

Difference between VM-Host affinity rules and VM-VM rules
The VM-Host affinity rule differs from the VM-VM rule: a VM-Host (anti-)affinity rule specifies the (anti-)affinity between a group of virtual machines and a group of ESX hosts inside the cluster, whereas a VM-VM (anti-)affinity rule only specifies the (anti-)affinity between individual virtual machines.

Components
A virtual machine to host affinity rule consists of three components (sketched in code further below):
• Virtual machine DRS group
• ESX host DRS group
• Designation – "Must" affinity/anti-affinity or "Should" affinity/anti-affinity
Virtual machine DRS groups and ESX host DRS groups are quite self-explanatory, so let's dive into the designation component straight away.

Designations
Two different types of VM-Host rules are available: a VM-Host affinity rule can either be a "must" rule or a "should" rule. The "must" rule is mandatory for HA, DRS and DPM; it confines the virtual machines to, or prevents them from running on, the ESX hosts specified in the ESX host DRS group. The "should" rule is a preferential rule for DRS and DPM and expresses a preference: DRS and DPM use their best effort to confine or prevent the virtual machines from running on the ESX hosts they are affined to, but DRS and DPM can violate "should" rules if honoring them compromises certain key operations. HA is not aware of preferential rules because DRS does not communicate these rules to HA. HA, DRS and DPM must take the mandatory rules into account when generating or executing operations and will never take any action that results in the violation of a mandatory affinity rule. Because of this, mandatory rules place more constraints on VM mobility, making it more difficult for DRS to balance load and enforce resource allocation policies; HA and DPM operations are constrained as well. For example, mandatory rules will:
• Limit DRS in selecting hosts to load-balance the cluster
• Limit HA in selecting hosts to power up the virtual machines
• Limit DPM in selecting hosts to power down
Due to this limiting behavior, it is recommended to use mandatory rules sparingly and only for specific cases, such as licensing requirements. Preferential rules can be used to meet availability requirements such as separating virtual machines between blade chassis.
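As a quick mental model of the three components and the two designation types listed above, here is a small sketch. These are not vSphere API objects; all class and field names are illustrative assumptions, and the example values come from the Oracle scenario discussed next.

```python
# Sketch of the VM-Host rule components described above; illustrative only.
from dataclasses import dataclass
from enum import Enum

class Designation(Enum):
    MUST_RUN_ON = "must affinity"            # mandatory: HA, DRS and DPM all obey it
    MUST_NOT_RUN_ON = "must anti-affinity"
    SHOULD_RUN_ON = "should affinity"        # preferential: DRS/DPM best effort, unknown to HA
    SHOULD_NOT_RUN_ON = "should anti-affinity"

@dataclass
class VmHostRule:
    vm_group: set[str]       # the virtual machine DRS group
    host_group: set[str]     # the ESX host DRS group
    designation: Designation

oracle_rule = VmHostRule(
    vm_group={"vm01", "vm03", "vm11", "vm20"},
    host_group={"ESX07", "ESX08", "ESX15", "ESX16"},
    designation=Designation.MUST_RUN_ON,
)
```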
DRS and mandatory rules
DRS takes mandatory rules into account when generating load-balancing recommendations. If a rule is created and the current virtual machine placement is in violation of the rule, DRS creates a priority one recommendation (five stars) and executes the recommendation if DRS is set to fully automatic. DRS will not generate recommendations that violate the rule; it will not migrate virtual machines to or from an ESX host in violation of the rule, not even when the source ESX host is placed into maintenance mode. VMotion will also reject the operation if it detects that the operation violates the mandatory rule. If a reservation is set on the virtual machine, DRS takes both the reservation and the mandatory affinity rule into account: both requirements must be satisfied during placement or power-on. If DRS is unable to honor either one of the requirements, the virtual machine is not powered on or migrated to the proposed destination host. For example, if a new rule is created and the current virtual machine placement is in violation of the rule, the virtual machine can only be migrated to a new host if its memory reservation can be satisfied on that host; if this is not possible, DRS will not generate the recommendation. If a rule is created that conflicts with another active rule, the older rule overrules the newer rule and DRS disables the new rule. As you can imagine, mandatory affinity rules can complicate troubleshooting in certain scenarios, for example when trying to find out why a virtual machine is not migrated from a highly utilized host to an alternative, lightly utilized host in the cluster.

DPM
DPM does not place an ESX host into standby mode if doing so would violate a mandatory rule, and it will power on ESX hosts if these are needed to meet the requirements of the mandatory rules.

High Availability
Due to the DRS-HA integration in vSphere 4.1, HA respects mandatory (must) rules. During an ESX host failure event, HA asks DRS to supply the list of hosts and places the virtual machines only on compatible hosts, i.e. the hosts that are allowed by the mandatory rules. HA is unaware of the preferential (should) rules, so HA might unknowingly violate such a rule during placement of virtual machines after an ESX failure, but the violation will be corrected by the next DRS invocation.

Let's take a look at a configuration which I think is going to be widely implemented soon, the Oracle "must" affinity rule (a small sketch of the resulting host filtering follows at the end of this section):
1. Place all Oracle virtual machines in a Cluster VM DRS group (vm01, vm03, vm11, vm20).
2. Place all Oracle-licensed ESX hosts in a Cluster Host DRS group (ESX07, ESX08, ESX15, ESX16).
3. Select "Must run on Host in Group".
In this scenario, DRS never places, migrates, or recommends placement of a host-affined virtual machine on a host that is not listed in the Cluster Host DRS group (ESX01-ESX06 & ESX09-ESX14). This means that DRS will never ever place the virtual machine on an unlicensed host: not for maintenance mode, not for DPM power savings and not after an ESX host failure event. This virtual machine to host affinity rule makes it possible to run Oracle inside big clusters without having to license all the ESX hosts. I have been involved in a few projects where the Oracle license was a constraint. Normally, separate smaller clusters were deployed for Oracle database virtual machines, increasing both the OPEX and CAPEX of the environment. These rules allow the Oracle virtual machines to run inside the cluster with other virtual machines without having to license all the ESX hosts inside the cluster, making the lives of both the architect and the administrator easier. vSphere 4.1, you gotta love it!
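The sketch below illustrates the host filtering that the Oracle "must" rule produces, combined with the reservation check mentioned earlier: mandatory rules narrow the candidate list first, then the memory reservation must fit on the remaining hosts. This is a conceptual illustration only, not DRS/HA code; host and VM names come from the example above, and the capacity figures are invented.

```python
# Hedged sketch of how a mandatory VM-Host rule constrains the compatible host list.
all_hosts = {f"ESX{i:02d}" for i in range(1, 17)}            # ESX01 .. ESX16
oracle_vms = {"vm01", "vm03", "vm11", "vm20"}                 # Cluster VM DRS group
oracle_hosts = {"ESX07", "ESX08", "ESX15", "ESX16"}           # Cluster Host DRS group

def compatible_hosts(vm, unreserved_mb_per_host, vm_reservation_mb=0):
    """Hosts on which DRS/HA may place the VM: the mandatory rule filters first,
    then the memory reservation must be satisfiable on the candidate host."""
    candidates = oracle_hosts if vm in oracle_vms else all_hosts
    return {h for h in candidates
            if unreserved_mb_per_host.get(h, 0) >= vm_reservation_mb}

# Even for maintenance mode, DPM actions or an HA restart, vm11 is only ever considered
# for the licensed hosts, and only those with enough unreserved memory.
capacity = {"ESX07": 2048, "ESX08": 512, "ESX15": 8192, "ESX16": 0}   # invented figures
print(compatible_hosts("vm11", capacity, vm_reservation_mb=1024))      # {'ESX07', 'ESX15'}
```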

VMWARE FAULT TOLERANCE AND DPM

Some of the requirements of the design I am working on are to be as "green" as possible and to offer the highest level of redundancy for business continuity. Enter VMware Fault Tolerance (FT) and Distributed Power Management (DPM)! When mixing multiple features, the requirements of one feature can have an impact on, or even worse become a constraint of, the other feature. DPM works together with DRS to VMotion virtual machines onto fewer ESX hosts when the resource demand drops below a specific threshold. In the current release of vSphere, DRS does not consider FT-enabled virtual machines during load-balancing operations and will not migrate FT-enabled virtual machines automatically; because of this, DPM cannot power down the hosts running them until the administrator manually VMotions the primary or secondary virtual machines to another ESX host. Fortunately, when enabling DPM on the cluster, you can disable DPM at the ESX host level. Due to the current limitations of DRS with VMware Fault Tolerance, it is recommended to disable DPM on at least two ESX hosts to act as hosts for the FT-enabled virtual machines.

MEMORY RECLAMATION, WHEN AND HOW?

After discussing the performance problem presented by @heiner_hardt with Duncan, we discussed the exact moment the VMkernel decides which reclamation technique it will use and the specific behaviors of the reclamation techniques. This article supplements Duncan's article on Yellow-bricks.com. Let's begin with when the kernel decides to reclaim memory and then look at how the kernel reclaims memory. Host physical memory is reclaimed based on four "free memory states", each with a corresponding threshold. Based on the threshold, the VMkernel chooses which reclamation technique it will use to reclaim memory from virtual machines.
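The sketch below maps the four free-memory states to the reclamation techniques they trigger. The 6/4/2/1 percent thresholds are the values commonly cited for ESX 4.x; treat both the percentages and the exact boundary behavior as assumptions here and verify them against the documentation of your release.

```python
# Conceptual sketch of the free-memory states and their reclamation techniques.
# Thresholds (high 6%, soft 4%, hard 2%, low 1%) are assumed from the ESX 4.x documentation.
def reclamation_technique(free_fraction: float) -> str:
    """Map the fraction of free host memory to a memory state and reclamation technique."""
    if free_fraction >= 0.06:
        return "high: no active reclamation (TPS runs continuously regardless of state)"
    if free_fraction >= 0.04:
        return "soft: ballooning"
    if free_fraction >= 0.02:
        return "hard: ballooning and hypervisor swapping"
    # below the low (1%) threshold the VMkernel also blocks VMs exceeding their target allocation
    return "low: swapping, and blocking VMs that consume more than their target allocation"

print(reclamation_technique(0.05))    # soft: ballooning
print(reclamation_technique(0.015))   # low: swapping, and blocking ...
```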

RESERVATIONS AND CPU SCHEDULING

Most of my resource management articles focus more on the behavior of memory management than on CPU management, mainly because the memory scheduler within ESX is such an interesting, complex system, comprising memory allocation, swapping and reclamation, with algorithms such as Idle Memory Tax and mechanisms like ballooning and swapping. But lately CPU scheduling has been attracting more and more of my attention. The discussion Duncan and I had prior to posting his article about how CPU limits actually work sparked my interest in how CPU scheduling behaves when reservations are set, so in addition to Duncan's excellent article, I want to take a closer look at how the ESX CPU scheduler handles CPU reservations and shares and show why CPU scheduling is fairer than memory management. Similar to memory, the resource allocation settings reservations, shares and limits can be set at the CPU level. Limits and shares behave similarly for CPU and memory; reservations act differently. Let's take a quick look at the resource allocation settings:
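To make the "fairer" claim concrete before diving into the settings, here is a hedged sketch of the key difference: CPU time that is reserved but not demanded in a scheduling interval is simply handed to other virtual machines, whereas reserved memory that a VM has touched once stays protected and is not given back. The numbers, names and the simple redistribution loop are illustrative only and ignore shares and limits.

```python
# Simplified sketch: unused CPU reservation is redistributed to VMs that demand cycles.
def distribute_cpu(capacity_mhz, vms):
    """vms: {name: {'reservation': MHz, 'demand': MHz}}. Returns MHz allocated per VM."""
    alloc = {n: min(v["reservation"], v["demand"]) for n, v in vms.items()}   # guaranteed part
    spare = capacity_mhz - sum(alloc.values())
    for n, v in vms.items():                                                  # hand out spare cycles
        extra = min(max(v["demand"] - alloc[n], 0), spare)
        alloc[n] += extra
        spare -= extra
    return alloc

vms = {"idle_vm": {"reservation": 2000, "demand": 100},
       "busy_vm": {"reservation": 0,    "demand": 2500}}
print(distribute_cpu(3000, vms))  # idle_vm gets 100 MHz, busy_vm gets its full 2500 MHz

# Memory behaves differently: once idle_vm has touched its reserved memory, that memory
# remains backed by machine memory and is not reclaimed for busy_vm, even when idle.
```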

VIRTUAL MACHINE MEMORY OVERHEAD

Every virtual machine running on an ESX host consumes some memory overhead in addition to the current usage of its configured memory. This extra space is needed by ESX for internal VMkernel data structures like the virtual machine frame buffer and the mapping table for memory translation (mapping physical virtual machine memory to machine memory). Two kinds of virtual machine overhead exist:

Static overhead
Static overhead is the minimum overhead that is required for the virtual machine to start up. DRS and the VMkernel use this metric for admission control and VMotion calculations. The destination host must be able to back the virtual machine reservation and the static overhead, otherwise the VMotion will fail.

Dynamic overhead
Once the virtual machine has started up, the virtual machine monitor (VMM) can request additional memory space. The VMM will request the space, but the VMkernel is not required to supply it. If the VMM does not obtain the extra memory space, the virtual machine will continue to function, but this can lead to performance degradation. The VMkernel treats virtual machine overhead reservation the same as VM-level memory reservation and will not reclaim this memory once it has been used.

Overhead memory used in admission control
As mentioned before, DRS and the VMkernel will not allow the virtual machine to be powered on if reservations cannot be guaranteed; this means that the effective memory reservation for a virtual machine is the user-configured memory reservation (VM-level reservation) plus the overhead reservation.

Resource pool memory reservations
This means that during the design phase of a resource pool, the memory overhead of a virtual machine must be included in the calculation of the memory reservation specified on the resource pool. The behavior of dynamic overhead must also be taken into account. Table 3.2 of the vSphere Resource Management Guide (VMware vSphere Online Library – Table 3.2, Overhead Memory) lists the overhead memory of virtual machines. Please be aware that memory overhead grows with each new release of ESX, so keep this in mind when upgrading to a new version: verify the documented virtual machine memory overhead and check the specified memory reservation on the resource pool.
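A small sketch of the admission-control arithmetic described above: the effective reservation is the user-configured VM-level reservation plus the static overhead reservation, and the target host must be able to back both. The 180 MB overhead figure used here is a placeholder, not a value from Table 3.2; the real numbers depend on the configured memory size and vCPU count.

```python
# Sketch of the effective reservation and the admission-control check described above.
def effective_reservation_mb(vm_reservation_mb: int, static_overhead_mb: int) -> int:
    return vm_reservation_mb + static_overhead_mb

def admission_check(host_unreserved_mb: int, vm_reservation_mb: int, static_overhead_mb: int) -> bool:
    """True if the host can guarantee the VM-level reservation plus the overhead reservation."""
    return host_unreserved_mb >= effective_reservation_mb(vm_reservation_mb, static_overhead_mb)

# Example: a VM with a 2048 MB reservation and a (placeholder) 180 MB static overhead
print(effective_reservation_mb(2048, 180))   # 2228 MB must be unreserved on the target host
print(admission_check(2200, 2048, 180))      # False: power-on / VMotion is refused
print(admission_check(4096, 2048, 180))      # True
```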

RE: SWAPPING

Recently we had a discussion about swapping. As Duncan mentioned in his article "Swapping", swapped memory might not have an impact on the performance of the virtual machine; there are scenarios where pages can be swapped out without experiencing performance problems. One common scenario is a bootstorm, i.e. the startup of many virtual machines at once. Bootstorms can happen when a host failure occurs and High Availability powers on the virtual machines on other hosts, but they are also frequently encountered in Windows shops after Patch Tuesday, when the operations team needs to stay within a limited maintenance window. When a virtual machine guest OS starts, there is a period of time before the VMware Tools are loaded and the vmmemctl (balloon) driver is operational. During this timeslot the operating system can access a large portion of its configured memory; Windows systems are notorious for this as they tend to touch every page up to the end of their configured memory. Unfortunately, page sharing due to Transparent Page Sharing (TPS) is also at a minimum: redundant memory pages are not collapsed immediately when a virtual machine is started. TPS is a VMkernel background process and uses a cycle of 60 minutes (Mem.ShareScanTime) to scan a virtual machine for page-sharing opportunities.

During a bootstorm, many virtual machines are powered on at the same time, all claiming lots of memory or even their maximum configured memory (Windows). This behavior leads to a spike in memory usage and, without the help of the balloon driver and TPS, the ESX host needs to resort to swapping out memory. Referring back to the Windows startup, Windows touches every page and this forces ESX to back all of it with machine memory (physical memory). These pages are filled with useless information and chances are they will never be accessed by the virtual machine again. ESX will not proactively swap memory back into physical memory when the memory pressure disappears; the pages remain swapped out until they are accessed by the virtual machine, at which point ESX swaps them back in. Swapping during the bootstorm will delay the boot process, but these swapped-out pages will not cause any performance problems during normal operation.

As mentioned in Duncan's "Swapping" article, there are a few metrics that indicate whether a virtual machine is swapping or has swapped before. When encountering swapped memory, check the metrics SWCUR (swap current) and SWTGT (swap target). If a bootstorm occurred, it is likely that SWCUR shows a higher value than SWTGT. SWTGT indicates the desired amount of memory to be swapped out, which ESX determines from the resource entitlement calculation of the virtual machine. If there is no memory pressure, the swap target will be equal to 0, but because pages remain in the swap file until accessed, SWCUR will indicate the remaining swapped-out pages. If memory contention does occur, ESX will attempt to make SWCUR equal to SWTGT.
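The rule of thumb above can be summarized in a small sketch. The function is purely illustrative, not an esxtop or vCenter API, and the interpretation strings are my paraphrase of the article.

```python
# Sketch of how to read the SWCUR / SWTGT metrics, as described above.
def interpret_swap_metrics(swcur_mb: float, swtgt_mb: float) -> str:
    if swcur_mb == 0:
        return "No swapped memory: the VM has not been swapped."
    if swtgt_mb == 0:
        return ("No current memory pressure (swap target is 0), but pages swapped out earlier, "
                "e.g. during a bootstorm, are still in the swap file and will only be swapped "
                "back in when the VM touches them.")
    if swcur_mb < swtgt_mb:
        return "Memory contention: ESX is actively swapping out toward the swap target."
    return "Swap current is at or above the target; ESX works toward SWCUR == SWTGT."

print(interpret_swap_metrics(swcur_mb=512, swtgt_mb=0))   # the typical post-bootstorm picture
```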

RESOURCE POOLS MEMORY RESERVATIONS

After publishing the article "Impact of memory reservations" I received a lot of questions about setting memory reservations at the resource pool level. It seems there are several aspects of resource pools and memory reservations that are often misunderstood. Because reservations are used by the VMkernel/DRS resource schedulers and (HA) admission control, the behavior of reservations can be very confusing. Before memory reservations on resource pools are addressed, let's look at which mechanisms use reservations and when reservations are used.

When are reservations actually used besides admission control?
If a cluster is under-committed, the VM resource entitlement will be the same as its demand; in other words, the VM will be allocated whatever it wants to consume within its configured limit. When a cluster is overcommitted, the cluster experiences more resource demand than its current capacity; at this point DRS and the VMkernel will allocate resources based on the resource entitlement of the virtual machine. Resource entitlement is covered later in this article.

Is there any difference between resource pool level and virtual machine level memory reservation?
To keep it short, VM-level reservations can be rather evil: a VM-level reservation hoards memory once it has been used by the virtual machine. Even if the virtual machine becomes idle, the VMkernel will not reclaim this memory and return it to the free memory set. This means that ESX can start swapping and ballooning if no free memory is available for other virtual machines while the owning VMs aren't using their claimed reserved memory. It also influences the High Availability slot size; for more information about HA slot sizes, please visit the HA deep dive page at yellow-bricks.com. For more information about virtual machine level memory reservation, please read the article "Impact of memory reservations".

Behavior of resource pool memory reservation
Setting a memory reservation at the resource pool level has its own weaknesses, but it is much fairer and more in line with the whole idea of consolidation and sharing than virtual machine memory reservations. RP-level reservations are immediately active, but are not claimed; they only subtract the specified amount of memory from the unreserved capacity of the cluster. RP reservations are used when children of the resource pool use memory and the system is under contention. Reservations are not wasted and the resources can be used by other virtual machines. Be aware, using and reserving are two distinct concepts: virtual machines can use a resource, but they cannot also reserve it if it is already reserved by another item. Resource pool memory reservations appear to work almost like CPU reservations: they won't let any resource go to waste. And to top it off, resource pool reservations don't flow down to the virtual machines, so they do not influence HA slot sizes. Unfortunately this can lead to (temporary) performance loss if a host failover occurs: when a virtual machine is restarted by HA, it is not restarted in its correct resource pool but in the root resource pool, which can lead to starvation. Until DRS is invoked, the virtual machine needs to do without any memory reservation.

How to use resource pool memory reservation?
Two popular strategies exist when it comes to setting memory reservations at the resource pool level:
1. CPU and memory reservations within the resource pool are never overcommitted, i.e. the configured memory of all VMs (40GB) equals the reservation (40GB).
2. A percentage of cluster resources is reserved, i.e. the memory reservation of the resource pool (20GB) is less than the configured memory of the virtual machines inside the RP (40GB).
The process of divvying is rather straightforward if the memory reservation equals the configured memory of the virtual machines inside the resource pool: all pages touched by the virtual machines are backed by machine pages, and the resource entitlement of each virtual machine is at least as large as its memory reservation. What I find more interesting is what happens when the resource pool is configured with a memory reservation that is less than the combined configured memory of the virtual machines. In that case DRS divvies up the memory reservation based on the virtual machines' resource entitlement (see the sketch below).

So how is resource entitlement calculated? A virtual machine's resource entitlement is based on various statistics and some estimation techniques. DRS computes a resource entitlement for each virtual machine, based on virtual machine and resource pool configured shares, reservations and limits settings, as well as the current demands of the virtual machines and resource pools, the memory size, the working set and the degree of current resource contention. By setting a reservation at the resource pool level, the virtual machines that are actively using memory profit the most from this mechanism: if no reservation is set at the VM level, the "RP" reservation is granted to the virtual machines inside the resource pool that are actively using memory. DRS and the VMkernel calculate the resource pool and virtual machine share levels (please read the article "The resource pool priority-pie paradox" for more information about share levels) and use these to determine the virtual machines' priority. Besides the share level, the active utilization (working set) and the configured memory size are both taken into account when calculating the resource entitlement. Virtual machines that are idling aren't competing for resources, so they won't get any new resources. If the memory is also idle, the allocation gets adjusted by the idle memory tax. Idle memory tax uses a progressive tax rate: the more idle memory a VM has, the more tax it generates, which is why the configured memory size is also taken into account. (Nice ammo if your customer wants to configure the DHCP server with 64GB of memory!)

When we create a "Diva" VM (a term coined by Craig Risinger), that is, a VM with VM-level reservations, this allocation setting is passed to the VMkernel. It subtracts the specified amount from the reservation pool of the RP and will not share it with others; the Diva VM is a special creature. As stated above, RP memory reservations flow more freely than VM-level reservations; they will not claim or hoard memory. So basically, when setting a resource pool reservation, the reservation is just a part of the computation of the virtual machines' resource entitlement. When the host is overcommitted, the memory usage of a virtual machine is either above or below its resource entitlement. If the memory usage exceeds its resource entitlement, memory is ballooned or swapped from the virtual machine until it is at or below its entitlement.
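The sketch below is a deliberately crude illustration of the divvying behavior described above: the resource pool reservation is distributed over the VMs in proportion to a simple entitlement proxy (active memory weighted by shares). The real DRS computation also factors in limits, idle memory tax, the degree of contention and estimation techniques, so treat this strictly as a mental model; all names and figures are invented.

```python
# Simplified illustration of divvying an RP reservation by a crude entitlement proxy.
def divvy_rp_reservation(rp_reservation_mb, vms):
    """vms: {name: {'active_mb': ..., 'shares': ...}}. Returns reservation per VM in MB."""
    weights = {n: v["active_mb"] * v["shares"] for n, v in vms.items()}
    total = sum(weights.values()) or 1
    return {n: rp_reservation_mb * w / total for n, w in weights.items()}

vms = {
    "db01":   {"active_mb": 6000, "shares": 2000},   # busy VM, normal shares
    "app01":  {"active_mb": 2000, "shares": 2000},   # moderately busy
    "dhcp01": {"active_mb":  200, "shares": 2000},   # mostly idle, gets little of the pool
}
print(divvy_rp_reservation(20480, vms))
# The active VMs receive the bulk of the 20GB pool reservation; idle VMs barely profit,
# which is exactly why RP-level reservations favor the VMs that actually use memory.
```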
Disclosure
Before you think I fabricated this article all by myself, I am happy to admit that I am in the lucky position to work for VMware and to call some of the world's brightest minds my colleagues. Kit Colbert, Carl Waldspurger and Chirag Bhatt took the time to explain this theory very thoroughly to me. Luckily my colleagues and good friends Duncan Epping and Craig Risinger helped me decipher some out-of-this-world emails from the crew above and participated in some excellent discussions.