RE: SWAPPING

Recently we had a discussion about swapping, as Duncan mentioned in his article “Swapping” Swapped memory might not have impact on the performance of the virtual machine. There are scenarios when pages can be swapped out without experiencing performance problems. One common scenario is a bootstorm, i.e startup of many virtual machines at once. Bootstorms can happen when a host failure occurs and High Availability powers on the virtual machines on other host, but are also frequently encountered in windows shops after Patch Tuesday, when the operations team need to obey a limited maintenance window timeslot. When a virtual machine guest OS starts, there will be a period of time before the VMware tools is loaded and the vmmemctl (balloon driver) is operational. During this timeslot the operating system can access a large portion of its configured memory. Windows systems are notorious for this as they tend to touch every page until it reaches the end of their configured memory. Unfortunately page sharing due to Transparent Page Sharing (TPS) is also at a minimum. Redundant memory pages are not collapsed immediately when a virtual machine is started. TPS is a VMkernel background process and uses a cycle of 60 minutes (Mem.ShareScanTime) to scan a virtual machine for page sharing opportunities. During these bootstorms many virtual machines are powered on at the same time, all claiming lots of memory or even their maximum configured memory (windows). This behavior leads to a spike in memory usage and without the help of the balloon driver and TPS, the ESX host needs to resort to swapping out memory. When referring to windows startup, windows will touch every page and this forces ESX to back all in machine memory (physical memory). These pages are filled with useless information and chances are that this might never be accessed by the virtual machine again. Now ESX will not proactively swap memory back in to physical memory when the memory pressure disappears. These pages will remain swapped until it is accessed by the virtual machine, at that point ESX will swap it into memory. Swapping during the bootstorm will delay the boot process, but these swapped out pages will not cause any performance problems during normal operation. As mentioned in Duncan’s “Swapping article”, there are a few metrics that indicates that a virtual machine is swapping or has swapped before. When encountering swapped memory, check the metrics SWCUR (Swap current) and SWTGT (Swap target). If a bootstorm occurred it is likely to have a higher value at SWCUR than at the SWTGT. The SWTGT indicates the desired amount of memory to be swapped out, this is determined by ESX by the resource entitlement calculation of the virtual machine. If there is no memory pressure, the swaptarget will be equal to 0, but because pages remain in the swap file until accessed, the SWCUR will indicate the remaining swapped out pages. If memory contention does occur, ESX will attempt to make the SWCUR equal to the SWTGT (swap target).

RESOURCE POOLS MEMORY RESERVATIONS

After publishing the article “impact of memory reservations” I received a lot of questions about setting memory reservation at resource pool level. It seems there are several common facts about resource pools and memory reservations that are often misunderstood. Because reservations are used by the VMkernel\DRS resource schedulers and (HA) admission control, the behavior of reservation can be very confusing. Before memory reservation on resource pool is addressed, let look at which mechanisms uses reservations and when reservations are used. When are reservations actually used besides admission control? If a cluster is under-committed the VM resource entitlement will be the same as its demand, in other words, the VM will be allocated whatever it wants to consume within its configured limit. When a cluster is overcommitted, the cluster experiences more resource demand than its current capacity, at this point DRS and the VMkernel will allocate resources based on the resource entitlement of the virtual machine. Resource entitlement is covered later in the article. Is there any difference between resource pool level and virtual machine level memory reservation? To keep it short, VM level reservation can be rather evil, it will hoard memory if it has been used by the virtual machine once. Even if the virtual machine becomes idle, the VMkernel will not reclaim this memory and return it to the free memory set. This means that ESX can start swapping and ballooning if no free memory is available for other virtual machines while the owning VM’s aren’t using their claimed reserved memory. It also has influence on the slot size of High availability, for more information about HA slot sizes, please visit the HA deep dive page at yellow-bricks.com. For more information about virtual machine level memory reservation, please read the article “impact of memory management”. Behavior of resource pool memory reservation Now setting a memory reservation on a resource pool level has its own weaknesses, but it is much fairer and more along the whole idea of consolidation and sharing than virtual machine memory reservations. RP level reservations are immediately active, but are not claimed. This means it will only subtract the specified amount of memory from the unreserved capacity of the cluster. RP reservations are used when children of the resource pool uses memory and the system is under contention. Reservations are not wasted and the resources can be used by other virtual machines. Be aware, using and reserving are two distinct concepts! Virtual machines can use the resource, but they cannot reserve this as well if it is already reserved by another item. It appears that resource pool memory reservations work almost similar to CPU reservations, they won’t let any resource go to waste. And to top it off, resource pool reservations don’t flow to virtual machines, they will not influence HA slot sizes. Which unfortunately can lead to (temporary) performance loss if a host failover occurs. When a virtual machine is restarted by HA they are not restarted in the correct resource pool but in the root resource pool, which can lead to starvation. Until DRS is invoked, the virtual machine need to do it without any memory reservations. How to use resource pool memory reservation? Ok so two popular strategies exist when it comes to setting memory reservation on resource pool levels: 1. CPU and Memory reservations within the resource pool is never overcommitted i.e configured memory all VM’s (40Gb) equals reservation (40GB) 2. Percentage of Cluster resources reserved i.e. memory reservation resource pool (20GB) less than configured memory virtual machines inside RP (40GB) The process of divvying is rather straightforward if the memory reservation equals the configured memory of the virtual machines inside the resource pool. All pages by the virtual machines are backed by machine pages, the resource entitlement is at least as large as its memory reservation. What I find more interesting is what happens if the resource pool is configured with a memory reservation that is less than all virtual machine configured memory? DRS will divvy memory reservations based on the virtual machine resource entitlement. So how is resource entitlement calculated? A virtual machines resource entitlement is based on various statistics and some estimation techniques. DRS computes a resource entitlement for each virtual machine, based on virtual machine and resource pool configured shares, reservations, and limits settings, as well as the current demands of the virtual machines and resource pools, the memory size, its working set and the degree of current resource contention. Now by setting a reservation on the resource pool level, the virtual machines who are actively using memory profits the most of this mechanism. Basically if no reservation is set on the VM level, the “RP” reservation is granted to all virtual machines inside the resource pool who are actively using memory. DRS and the VMkernel calculates the resource pool and the virtual machine share levels. Please read the article “the resource pool priority-pie paradox” to get more information about share levels. and use this to specify the virtual machines priority. Besides the share level the active utilization (working set) and the configured memory size are both accounted when calculating the resource entitlement. Virtual machines who are idling aren’t competing for resources, so they won’t get any new resources. If the memory is also idle the allocation get adjusted by the idle memory tax. Idle memory tax uses a progressive tax rate, the more idle memory a VM has, the more tax it will generate, this is why the configured memory size is also taken into account. (nice ammo if your customer wants to configure the DHCP server with 64GB memory!) When we create a “Diva” VM (coined by Craig Risinger), that is setting VM level reservations, this allocation setting is passed to the VMkernel. It will subtract the specified amount of the reservation pool of the RP and it will not share it with others, i.e. the Diva VM is a special creature. As stated above, RP memory reservations flow more than VM-level reservation, it will not claim\hoard memory. So basically when setting a resource pool reservation, reservations are just a part of the computation of the virtual machines resource entitlement. When the host is overcommitted, the memory usage of the virtual machine is either above or below the resource entitlement. If the memory usage exceeds its resource entitlement, the memory is ballooned or swapped from the virtual machine until it is at or below its entitlement. Disclosure Now before you think I fabricated this article all by myself I am happily to admit that I’m in the lucky position to work for VMware and to call some of the world brightest minds my colleagues. Kit Colbert, Carl Waldspurger and Chirag Bhatt took the time and explain this theory very thoroughly to me. Luckily my colleagues and good friends Duncan Epping and Craig Risinger helped me decipher some out-of-this-world emails from the crew above and participated in some excellent discussions.

VMWARE TOOLS DISK TIMEOUT VALUE LINUX GOS

After I posted the “VMtools increases TimeOutValue article” I received a lot of questions if the VMware Tools automatically adjust the timeout value for Linux machines as well. Well, VMware Tools of versions ESX 3.5 Update 5 and ESX 4.0 install a udev rule file on Linux operating systems with kernel version equal or greater then 2.6.13. This rule file changes the default timeout value of VMware virtual disks to 180 seconds. This helps the guest operating system to better survive a SAN failure and keep the linux system disk from becoming read only. Because of the requirement of updates related to udev featured in the 2.6.13 kernel, the SCSI timeout value in other Linux kernels is not touched by the installation of VMware tools and the default value remains active. The two major Linux Kernel version each have a different timeout value: Linux 2.4 - 60 seconds Linux 2.6 - 30 seconds You can set the timeout value manually listed in /sys/block/disk/device/timeout. The problem is the distinction VMtools make between certain Linux Kernels, if you do not know this caveat you might end up with an Linux environment which is not configured exactly the same. This can lead to different behaviour during a SAN outage. Standardization is key when managing virtual infrastructure environments and a uniform environment eases troubleshooting A while ago Jason wrote an excellent article about the values and benefit of increasing the guest os timeout

ESX4 ALUA, TPGS AND HP CA

In my blog post: “HP CA and the use of LUN balancing scripts” I tried to cover the possible impact of using HP continuous Access EVA on the LUN path load balancing scheme in ESX 3.x. I received a lot of questions about this and wanted to address some issues again and try to clarify them. Let’s begin with a recap of the HP CA article; The impact of CA on the load-balancing scheme is due to the fact that an EVA is an asymmetric Active-Active array that uses the Asymmetric Logical Unit Access protocol (ALUA). ESX3 is not ALUA aware and does not recognize the different specific access characteristics of the array’s target ports. VMware addressed this shortcoming and added ALUA support in the new storage stack of ESX4. The ALUA support is a great feature of the new storage architecture, it reduces a lot of extra manual steps of creating a proper load-balanced environment. But how exactly does ALUA identifies which path is optimized and will HP Continuous Access still have an impact on ESX4 environments as well?

IDENTIFY STORAGE PERFORMANCE ISSUES

VMware has recently updated the kb article " Using esxtop to identify storage performance issues Details" (KB1008205). The KB article provides information about how to use esxtop to determine the latency statistics across various devices. The article contain easy to follow, step-by-step instructions on how to setup ESXtop to monitor storage performance per HBA, LUN and virtual machine. It also list generic acceptable values to put your measured values in perspective. It’s a great article, bookmark it for future reference. If you want to learn about threshold of certain metrics in ESXtop, please check out the ESXtop metric bible featured on Yellow-bricks.com. ESXtop is a great tool to view and measure certain criteria in real time, but sometimes you want to collect metrics for later reference. If this is the case, the tool vscsiStats might be helpful. vscsiStats is a tool to profile your storage environment and collects info such as outstanding IO, seekdistance and many many more. Check out Duncan’s excellent article on how to use vscsiStats. Because vscsiStats will collect data in a .csv file you can create diagrams, Gabe written an article how to convert the vscsiStats data into excel charts.

VCDX TIP: VMTOOLS INCREASES TIMEOUTVALUE

This is just a small heads-up post for all the VCDX candidates. Almost every VCDX application I read mentions the fact that they needed to increase the Disk TimeOutValue (HKEY_LOCAL_MACHINE/System/CurrentControlSet/Services/Disk) by to 60 seconds on Windows machines. The truth is that the VMware Tools installation (ESX version 3.0.2 and up) will change this registry value automatically. You might want to check your operational procedures documentation and update this! VMware KB 1014

REMOVING ORPHANED NEXUS DVS

During the test of the Cisco Nexus 1000V the customer deleted the VSM first without removing the DVS using commands from within the VSM, ending up with an orphaned DVS. One can directly delete the DVS from the DB, but there are bunch of rows in multiple tables that need to be deleted. This is risky and may render DB in some inconsistent state if an error is made while deleting any rows. Luckily there is a more elegant way to remove an orphaned DVS without hacking and possibly breaking the vCenter DB. A little background first: When installing the Cisco Nexus 1000V VSM, the VSM uses an extension-key for identification. During the configuration process the VSM spawns a DVS and will configure it with the same extension-key. Due to the matching extension keys (extension session) the VSM owns the DVS essentially. And only the VSM with the same extension-key as the DVS can delete the DVS. So to be able to delete a DVS, a VSM must exist registered with the same extension key. If you deleted the VSM and are stuck with an orphaned DVS, the first thing to do is to install and configure a new VSM. Use a different switch name than the first (deleted) VSM. The new VSM will spawn a new DVS matching the switch name configured within the VSM. The first step is to remove the new spawned DVS and do this the proper way using commands from within the VSM virtual machine.

DRS RESOURCE DISTRIBUTION CHART

A customer of mine wanted more information about the new DRS Resource Distribution Chart in vCenter 4.0, so I thought after writing the text for the customer, why not share this? The DRS Resource Distribution Chart was overhauled in vCenter 4.0 and is quite an improvement over the resource distribution chart featured in vCenter 2.5. Not only does it use a better format, the new charts produce more in-depth information.

RESOURCE POOLS AND AVOIDING HA SLOT SIZING

Virtual machines configured with large amounts of memory (16GB+) are not uncommon these days. Most of the time these “heavy hitters” run mission critical applications so it’s not unusual setting memory reservations to guarantee the availability of memory resources. If such a virtual machine is placed in a HA cluster, these significant memory reservations can lead to a very conservative consolidation ratio, due to the impact on HA slot size calculation. (For more information about slot size calculation, please review the HA deep dive page on yellow-bricks.com.) There are options to avoid creation of large slot sizes. Such as not setting reservations, disabling strict admission control, using vSphere new admission control policy “percentage of cluster resources reserved” or creating a custom slot size by altering the advanced settings das.vmMemoryMinMB. But what if you are still using ESX 3.5, must guarantee memory resources for that specific VM, do not want to disable strict admission control or don’t like tinkering with the custom slot size setting? Maybe using the resource pool workaround can be an option. Resource pool workaround During a conversation with my colleague Craig Risinger, author of the very interesting article “The resource pool priority pie paradox”, we discussed the lack of relation between resource pools reservation settings and High Availability. As Craig so eloquently put it:

IMPACT OF HOST LOCAL VM SWAP ON HA AND DRS

On a regular basis I come across NFS based environments where the decision is made to store the virtual machine swap files on local VMFS datastores. Using host-local swap can affect DRS load balancing and HA failover in certain situations. So when designing an environment using host-local swap, some areas must be focused on to guarantee HA and DRS functionality. VM swap file Lets start with some basics, by default a VM swap file is created when a virtual machine starts, the formula to calculate the swap file size is: configured memory – memory reservation = swap file. For example a virtual machine configured with 2GB and a 1GB memory reservation will have a 1GB swap file. Reservations will guarantee that the specified amount of virtual machine memory is (always) backed by ESX machine memory. Swap space must be reserved on the ESX host for the virtual machine memory that is not guaranteed to be backed by ESX machine memory. For more information on memory management of the ESX host, please the article on the impact of memory reservation. During start up of the virtual machine, the VMkernel will pre-allocate the swap file blocks to ensure that all pages can be swapped out safely. A VM swap file is a static file and will not grow or shrink not matter how much memory is paged. If there is not enough disk space to create the swap file, the host admission control will not allow the VM to be powered up. Note: If the local VMFS does not have enough space, the VMkernel tries to store the VM swap file in the working directory of the virtual machine. You need to ensure enough free space is available in the working directory otherwise the VM is still not allowed to be powered up. Let alone ignoring the fact that you initially didn’t want the VM swap stored on the shared storage in the first place. This rule also applies when migrating a VM configured with a host-local VM swap file as the swap file needs to be created on the local VMFS volume of the destination host. Besides creating a new swap file, the swapped out pages must be copied out to the destination host. It’s not uncommon that a VM has pages swapped out, even if there is not memory pressure at that moment. ESX does not proactively return swapped pages back into machine memory. Swapped pages always stays swapped, the VM needs to actively access the page in the swap file to be transferred back to machine memory but this only occurs if the ESX host is not under memory pressure (more than 6% free physical memory). Copying host-swap local pages between source- and destination host is a disk-to-disk copy process, this is one of the reasons why VMotion takes longer when host-local swap is used. Real-life scenario A customer of mine was not aware of this behavior and had discarded the multiple warnings of full local VMFS datastores on some of their ESX hosts. All the virtual machines were up and running and all seemed well. Certain ESX servers seemed to be low on resource utilization and had a few active VMs, while other hosts were highly utilized. DRS was active on all the clusters, fully automated and a default (3 stars) migration threshold. It looked like we had a major DRS problem. DRS If DRS decide to rebalance the cluster, it will migrate virtual machines to low utilized hosts. VMkernel tries to create a new swap file on the destination host during the VMotion process. In my scenario the host did not contain any free space in the VMFS datastore and DRS could not VMotion any virtual machine to that host because the lack of free space. But the host CPU active and host memory active metrics were still monitored by DRS to calculate the load standard deviation used for its recommendations to balance the cluster. (More info about the DRS algorithm can be found on the DRS deepdive page). The lack of disk space on the local VMFS datastores influenced the effectiveness of DRS and limited the options for DRS to balance the cluster. High availability failover The same applies when a HA isolation response occurs, when not enough space is available to create the virtual machine swap files, no virtual machines are started on the host. If a host fails, the virtual machines will only power-up on host containing enough free space on their local VMFS datastores. It might be possible that virtual machines will not power-up at-all if not enough free disk space is available. Failover capacity planning When using host local swap setting to store the VM swap files, the following factors must be considered. • Amount of ESX hosts inside cluster. • HA configured host failover capacity. • Amount of active virtual machines inside cluster. • Consolidation ratio (VM per host). • Average swap file size. • Free disk space local VMFS datastores.