Another new feature of vSphere 4.1 is the DRS-Fault Tolerance integration. vSphere 4.1 allows DRS not only to perform initial placement of Fault Tolerance (FT) virtual machines, but also to migrate the primary and secondary virtual machines during DRS load-balancing operations. In vSphere 4.0, DRS is disabled on the FT primary and secondary virtual machines. When FT is enabled on a virtual machine in 4.0, the existing virtual machine becomes the primary virtual machine and is powered on at its registered host, while the newly spawned virtual machine, called the secondary virtual machine, is automatically placed on another host. DRS refrains from generating load-balancing recommendations for both virtual machines.
The new DRS integration removes both the initial-placement and the load-balancing limitation. DRS is able to select the most suitable host for initial placement and to generate migration recommendations for the FT virtual machines based on the current workload inside the cluster. This results in a better-balanced cluster, which is likely to have a positive effect on the performance of the FT virtual machines. In vSphere 4.0, an anti-affinity rule prevented the FT primary and secondary virtual machines from running on the same ESX host. vSphere 4.1 adds the possibility of creating a VM-Host affinity rule that ensures the FT primary and secondary virtual machines do not run on ESX hosts in the same blade chassis, if the design requires this. For more information about VM-Host affinity rules please visit this article.
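The two placement constraints can be expressed as a simple check. The sketch below is illustrative only: the host names, the `HOST_TO_CHASSIS` mapping and the function name are hypothetical and not part of any vSphere API. It merely tests whether a given primary/secondary placement satisfies the same-host rule and, optionally, the blade-chassis rule.

```python
# Hypothetical inventory: which blade chassis each ESX host lives in.
HOST_TO_CHASSIS = {
    "esx01": "chassis-A",
    "esx02": "chassis-A",
    "esx03": "chassis-B",
}


def ft_placement_ok(primary_host, secondary_host, separate_chassis=False):
    """Return True if the FT pair satisfies the placement rules.

    The pair may never share a host. If separate_chassis is True,
    the two hosts must also sit in different blade chassis.
    """
    if primary_host == secondary_host:
        return False
    if separate_chassis:
        return HOST_TO_CHASSIS[primary_host] != HOST_TO_CHASSIS[secondary_host]
    return True
```

With this mapping, a pair on esx01 and esx02 passes the host rule but fails once the chassis rule is switched on, which is the gap the VM-Host affinity rule is meant to close.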
The DRS-FT integration not only has a positive impact on the performance of the FT-enabled virtual machines, and arguably on all other VMs in the cluster, it also reduces the impact of FT-enabled virtual machines on the virtual infrastructure. For example, DPM is now able to move the FT virtual machines to other hosts if it decides to place the current ESX host in standby mode. In vSphere 4.0, DPM needs to be disabled on at least two ESX hosts because of the DRS disable limitation, which I mentioned in this article.
Because DRS is able to migrate the FT-enabled virtual machines, DRS can evacuate all the virtual machines automatically when the ESX host is placed into maintenance mode. The administrator does not need to manually select an appropriate ESX host and migrate the virtual machines to it; DRS automatically selects a suitable host to run the FT-enabled virtual machines. This reduces the need for both manual operations and very “exciting” operational procedures for dealing with FT-enabled virtual machines during the maintenance window.
DRS-FT integration requires EVC to be enabled on the cluster. Many companies do not enable EVC on their ESX clusters, based either on FUD about performance loss or on the argument that they run homogeneous clusters and do not intend to expand them with new types of hardware. The advantages that DRS-FT integration offers, in both performance and reduced complexity of cluster design and operational procedures, shed new light on the discussion about enabling EVC in a homogeneous cluster. If EVC is not enabled, vCenter reverts to vSphere 4.0 behavior and enables the DRS disable setting on the FT virtual machines.
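The behavior described in this article can be summarized as a small decision function. This is only a sketch of the rules as stated here, not a representation of how vCenter implements them; the function name and the returned keys are made up for illustration, and only versions 4.0 and 4.1 are modeled.

```python
def drs_ft_behavior(vsphere_version, evc_enabled):
    """Summarize what DRS may do with FT virtual machines.

    vsphere_version: "4.0" or "4.1" (the only versions discussed here).
    evc_enabled: whether EVC is enabled on the cluster.
    """
    if vsphere_version not in ("4.0", "4.1"):
        raise ValueError("only vSphere 4.0 and 4.1 are modeled in this sketch")

    # DRS handles FT VMs only on 4.1 with EVC enabled; otherwise the
    # 4.0 behavior applies and DRS is disabled on the FT VMs.
    drs_handles_ft = vsphere_version == "4.1" and evc_enabled
    return {
        "initial_placement": drs_handles_ft,
        "load_balancing": drs_handles_ft,
        "drs_disabled_on_ft_vms": not drs_handles_ft,
    }
```

Calling `drs_ft_behavior("4.1", evc_enabled=False)` yields the same result as a 4.0 cluster, which is the fallback described above.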
Related Posts
DRS 4.1 Adaptive MaxMovesPerHost
DRS Resource Distribution Chart
Impact of host local VM swap on HA and DRS
February 15th, 2010 at 12:09
Great article, it’s all about understanding the impact of your decisions…
February 15th, 2010 at 12:44
One little side note, the formula to calculate the swap file size is: memory limit – memory reservation. Usually the configured memory is the same as the limit, but the limit can be lower.
February 15th, 2010 at 12:55
Sorry, I’m wrong.
February 15th, 2010 at 13:12
An excellent article, it certainly is food for thought.
February 15th, 2010 at 13:19
Eric,
Thanks for the reply, but the correct calculation is configured memory - reservation.
See page 31 of the vSphere resource management guide:
You must reserve swap space for any unreserved virtual machine memory (the difference between the
reservation and the configured memory size) on per-virtual machine swap files.
I understand your point, as a limit will reduce the amount of memory allowed to be backed by machine memory, but you cannot configure a virtual machine with a limit less than its reservation setting. A limit will not have any effect on the size of the swap file. It only restricts the VM's ability to use machine memory.
For example, a virtual machine configured with 2GB of memory and a 1GB memory reservation ends up with 1GB of unreserved memory, so a swap file of 1GB is created. The minimum limit of that VM is 1GB. With the limit at that minimum, all pages above 1GB are not allowed to be backed by physical memory and are paged to the swap file by default.
When no reservation is set, the swap file will be equal to the configured memory of the virtual machine.
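The calculation is easy to put into a few lines of Python. This is just a sketch of the formula quoted from the resource management guide (configured memory minus reservation); the function name and the use of megabytes are my own choices for illustration.

```python
def swap_file_size_mb(configured_mb, reservation_mb=0):
    """Size of the per-VM swap file: configured memory minus reservation.

    A memory limit is deliberately not a parameter, because it does not
    influence the size of the swap file.
    """
    if not 0 <= reservation_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and the configured memory")
    return configured_mb - reservation_mb
```

For the example above, `swap_file_size_mb(2048, 1024)` gives 1024, and without a reservation `swap_file_size_mb(2048)` gives the full 2048.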
February 15th, 2010 at 13:20
Ah just spotted your correction.
But your question was a good exercise!
February 15th, 2010 at 13:23
Nice article Frank! Can you think of any reasons why people want to store swap on local vmfs (beside the reason there is no shared storage)?
February 15th, 2010 at 13:30
Most of the time it’s not the case that shared storage is unavailable, but just a reduction of IO load towards the shared storage environment.
VMware used to recommend placement of swap files on local VMFS datastores when NFS was used as shared storage.
February 15th, 2010 at 14:02
vConsult also made a good point on Twitter: when your shared storage is replicated, you don’t want to replicate the swap files to the disaster recovery site. That’s one reason to place the swap files on a non-replicated LUN or local VMFS.
February 15th, 2010 at 14:10
Frank you’re right, it’s not the limit. I actually tested it in my lab. There’s still an error in the Fast-Track Guide which states (p.435).
“The size of the VMKernel swap file is determined by the difference between how much memory the virtual machine can use (the virtual machine’s maximum configured memory or its memory limit) and how much RAM is reserved for it (its reservation).”
That’s when I decided to test it.
February 15th, 2010 at 14:23
If you don’t want to replicate your swap file why not store the swap file on a separate shared VMFS datastore?
And why even care about replication? If you’re not overcommitting the .vswp file is more or less static and should not increase replication traffic after the initial first replica.
February 15th, 2010 at 14:52
Great article Frank. I don’t have any experience with local swap files, but this is something that could be easily overlooked.
I fully agree with Duncan. When you do a proper sizing of your environment you hardly get any ESX swapping, so replication overhead is minimized.
When you care about storage replication, consider placing the guest’s page/swap file on a separate datastore that is not replicated. The guest’s page/swap file is utilized more than the ESX swap.
Oh and before your host hits swapping it will try ballooning first, which utilizes the guest’s swap/page even more.
February 15th, 2010 at 14:53
But every time you power off or power on a virtual machine the replication will be triggered again. Best practice: Locate the swap file on shared storage but not on replicated storage.
February 15th, 2010 at 15:55
Great Article !
I wrote a PowerCLI one-liner to display the VMs with their average swapped memory and the amount of memory used by memory control (ballooning). I was surprised to see some VMs with swapped memory even when the hosts look fine; this article helps to explain this. Thanks Frank!
How much data in a swap file would you consider a non-worrying amount?
One-liner for those interested:
Get-VM | Where {$_.PowerState -eq "PoweredOn" } | Select Name, Host, @{N="SwapKB";E={(Get-Stat -Entity $_ -Stat mem.swapped.average -Realtime -MaxSamples 1 -ErrorAction SilentlyContinue).Value}}, @{N="MemBalloonKB";E={(Get-Stat -Entity $_ -Stat mem.vmmemctl.average -Realtime -MaxSamples 1 -ErrorAction SilentlyContinue).Value}} | Out-GridView
February 15th, 2010 at 15:57
Good stuff that I’ll keep in mind when I’m thinking about putting swap’s locally.
February 15th, 2010 at 16:00
That’s true Eric, but will that really saturate your link? if that’s the case you will need to do the math.
Question is though, will you also replicate your Windows swap?
February 15th, 2010 at 16:06
Eric,
How often is a vm powered down in a production environment?
Guest OS reboots don’t count.
Although powering off/on a VM will trigger replication, it will not necessarily replicate the complete swap file. Replication is active on the block level of the backend storage. Deleting the swap file only causes a small change in the allocation table. No disk scrubbing is done.
The same holds true for creating the swapfile. The swap file isn’t eagerzeroed, but zeroed out on first write AFAIK. Therefore the impact on replication is minimal.
I’m no storage specialist though.
Interesting topic.
February 15th, 2010 at 16:08
Arnim,
Not entirely true, the VMkernel does not zero-out the swap file, it will reserve only the blocks.
February 15th, 2010 at 16:08
For the case Frank was talking about, we actually looked into not replicating the Windows swap, but decided that would make the setup too complex. Even with sufficient bandwidth, you do not want to replicate data that is useless on the other side, especially when you get a performance penalty by replicating.
February 15th, 2010 at 16:14
Frank,
I mentioned that the swap file is NOT eagerzeroed. But aren’t the reserved blocks zeroed out before they are written to?
February 15th, 2010 at 16:15
That’s also true Arnim. The vswp file isn’t zeroed out. only when a write occurs it’s zeroed.
February 15th, 2010 at 16:17
Oh apologies, I misunderstood your comment.
I think a block is zeroed out before writing, will come back on this.
February 15th, 2010 at 16:24
I can agree with Arnim on this one. In our production environment, the VMs are almost never shut down and booted again. Instead, the OSes inside them are rebooted. However, now that I’m thinking of it, there is an option in Windows to clear the pagefile during boot. That could certainly impact the performance when rebooting 3000 VMs….. I would disable that setting. Great article Frank, it’s going to be a classic!
February 15th, 2010 at 16:46
I know a large NFS shop that places all VMkernel swap on a no-snap volume, while the VMs are typically placed on snap volumes for backup purposes. The logic? No need to snap VMkernel swap wasting Tier 1 disk space. Furthermore, VMkernel swap doesn’t need to roll off to snap vault for the same reasons.
February 15th, 2010 at 18:25
This discussion got me thinking and I provisioned a new thin VMFS from my LeftHand array and parked a new VM out there with 4 GB RAM. I put it into a resource pool that is capped at 1 GB and fired up MemTest x86. This was so I could ensure that the swap file was being used — simply starting the VM did allocate a file on the VMFS, but the array didn’t grab any more storage for the volume.
Once I started swapping, the array allocated a couple GB and I took a snapshot and powered off the VM. When I brought it back up, no additional space was allocated for the VSWP file until I again started using it… then the array grabbed another couple GB.
I guess the moral of the story is that the act of allocating the swap file does not seem to allocate much disk space, but that may depend on how your array handles provisioning. If it looks at changed blocks, you should be fine.
February 17th, 2010 at 12:57
Still not sure if hosting swap files on separate non-replicated storage is such a wise idea.
If you have a setup in which you have two sites, then you must take into account the possibility of a site failure. In that case, VMs should be started up on the working site, and therefore, you also need space for those VM swapfiles. Because of this situation, I think it’s better to store swap on replicated storage.
Because of this, it’s also better not to store swap files on local storage, because you would need more local storage on each ESX host to cope with a site failure.
Please, shoot me on this! xD
ps: Found some more interesting information on local swap (little outdated, but still applies I think): http://www.vmware.com/files/pdf/new_storage_features_3_5_v6.pdf#5
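The site-failure argument comes down to a capacity check: the surviving site needs enough swap datastore space for every VM that would be restarted there, replicated or not. A rough sketch of that sum, assuming the swap file size is configured memory minus reservation as discussed earlier in the thread; the VM figures and function names are made up for illustration.

```python
def required_swap_mb(vms):
    """Total swap space needed for a list of (configured_mb, reservation_mb) VMs."""
    return sum(configured - reservation for configured, reservation in vms)


def site_can_host_swap(vms, free_swap_datastore_mb):
    """True if the surviving site has room for the swap files of all given VMs."""
    return required_swap_mb(vms) <= free_swap_datastore_mb


# Hypothetical VMs that would fail over from the lost site.
failed_site_vms = [
    (4096, 0),     # 4 GB VM, no reservation: 4096 MB of swap
    (2048, 1024),  # 2 GB VM, 1 GB reserved: 1024 MB of swap
    (8192, 8192),  # fully reserved VM: no swap space needed
]
```

The check only covers capacity. Whether that space sits on replicated, non-replicated or local storage is the separate design decision debated in this thread.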
February 18th, 2010 at 05:48
This blog was awesome. This site was very informative.Thank you for this information.
February 18th, 2010 at 19:14
Windows pagefile: Bouke, Windows has an option to clear the pagefile at shutdown, but it’s disabled by default (I’m not aware of any setting that clears the pagefile at bootup). So this probably isn’t a concern in practice.
Replication traffic: there seems to be a soft consensus here that a VM startup reserves disk blocks but no block-level replication is triggered until there’s an actual write — which may well never occur since memory ballooning always precedes use of the VSWP file. It’s essential to verify this on your particular array, though.
Array storage: yeah, it does seem awfully expensive to “waste” Tier 1 storage on mere swap files. The idea Jason mentioned of using non-replicated NAS/NFS storage has a certain appeal but if the storage is too much cheaper, it may not be as well engineered (for example, against single-points-of-failure) as the Tier 1 stuff. Though presumably the NFS shops Frank is referring to have already done their homework.
Host-local may have slight performance benefits vs. SAN, but one could argue that if the host is so loaded that you’re in swapping territory, the location of your VSWP files is probably the least of your worries.
There’s no perfect solution, just tradeoffs, but personally I like the idea of using non-replicated, but well-engineered, NFS/NAS for swap files — you get the benefits of central storage (no performance hit when doing VMotion) at the cost of having to install a NAS in each site (production, D/R). That is, use Tier 1 (snapshots, replication) storage for VMs and Tier 2 (fast and no single-points-of-failure, but no snapshots or replication) for swap.
February 18th, 2010 at 23:37
Richard, why waste often valuable WAN bandwidth replicating state that won’t be used at the other site? Sure you need to have a target for the swap at the other side, but it doesn’t need to be replicated from the first site.
February 19th, 2010 at 21:31
That was a awesome post! I agree with your post. Great job again, and I hope you have a great day!
February 21st, 2010 at 02:49
What I do with swap is to create thin provisioned NFS volumes that are the size of my aggregate (NetApp). This way I get the benefit of fast shared storage for swap and the storage array only stores what swap actually needs. By oversubscribing to the size of the aggregate I get the protection that I’ll always have enough capacity available for swap. Of course I need to monitor the available space in my aggregate, but that’s not a problem.
So, thin provision swap on non-replicated, non snapshotted volumes is what I like to do.
February 21st, 2010 at 17:47
Excellent info and banter, thanks Frank!
February 22nd, 2010 at 20:10
We’ve been running VMware since 2.5 and I didn’t like it when VMware forced us to use SAN storage in 3.0 (since changed, I realize). In my environment SAN storage is expensive, and the amount that would be taken if all my swap files were moved to it would be in the TBs. All my ESX servers have a pair of mirrored 146 GB drives (a few have 300s). On our current servers (soon to change) the most RAM I have is 128 GB. If you subtract the ESX layer, that leaves enough space on the local drive for a VMFS partition to hold a lot of swap space. I load our ESX servers so that we can take an entire site down and still run all of production with very little to no overcommit on RAM. I haven’t, as of yet, seen a reason to move all that useless space to the SAN, mirrored or not.
Now, as we start to eval going to ESXi and thus the removal of need for local storage things change, but at least in our environment to date I can’t justify spending the extra $$ on SAN storage.
As has been said, it goes back to knowing your environment, your needs and ramifications of your design decisions. Like most everything in this world one size does not fit all.
September 9th, 2010 at 17:20