A couple of weeks ago I was fortunate enough to attend a tech preview of PernixData Flash Virtualization Platform (FVP). Today PernixData exited stealth mode, so we can finally talk about FVP. Duncan already posted a lengthy article about PernixData and FVP and I recommend you read it.
At this moment a lot of companies are focusing on flash-based solutions. PernixData distinguishes itself in today’s flash-focused world by providing a new flash-based technology that is neither a storage array based solution nor a server-bound service. I’ll expand on what FVP does in a bit; first, let’s take a look at the aforementioned solutions and their drawbacks. A storage array based flash solution is plagued by simple physics: the distance between the workload and the fast medium (flash) generates higher latency than when the flash device is placed near the workload. Placing the flash inside a server provides the best performance, but it must be shared between the hosts in the cluster to become a true enterprise solution. If the solution breaks important functions such as DRS and vMotion, then the use case of this technology remains limited.
FVP solves these problems by providing a flash-based data tier that becomes a cluster-wide resource. FVP virtualizes server-side flash devices such as SSD drives or PCIe flash devices (or both) and pools these resources into a data tier that is accessible to all the hosts in the cluster. One feature that stands out is remote access. By allowing access to remote devices, FVP allows the cluster to migrate virtual machines around while still offering performance acceleration. Therefore, cluster features such as HA, DRS and Storage DRS are fully supported when using FVP.
Unlike other server-based flash solutions, FVP accelerates both read and write operations, turning the flash pool into a “data-in-motion tier”. All hot data lives in this tier, turning the compute layer into the platform that delivers the IOPS, while data at rest is moved to the storage array, turning that layer into the capacity platform. By keeping the I/O operations as close to the source (the virtual machines) as possible, performance is increased while the traffic load on the storage platform is reduced. By filtering out read I/Os, the traffic pattern to the array changes as well, allowing the array to focus more on writes.
Another great option is the ability to configure multiple protection levels when using write-back; data is synchronously replicated to remote devices. During the tech preview Satyam and Poojan provided some insights on the available protection levels, however I’m not sure if I’m allowed to share these publicly. For more information about FVP visit Pernixdata.com.
The beauty of FVP is that it’s not a virtual appliance and that it does not require any agents installed in the guest OS. FVP is embedded inside the hypervisor. For me, this is the key reason to believe that this ”data-in-motion tier” is only the beginning for PernixData. By having insight into the hypervisor and understanding the data flow of the virtual machines, FVP can become a true platform that accelerates all types of IOPS. I do not see any reason why FVP could not replicate/encrypt/duplicate any type of input and output of a virtual machine. 🙂
As you can see I’m quite excited about this technology. I believe FVP is as revolutionary/disruptive as vMotion. It might not be as “flashy” (forgive the pun) as vMotion, but it sure is exciting to know that the limit on use cases is really the limit of your imagination. I truly believe this technology will revolutionize virtual infrastructure ecosystem design.
Voting for the 2013 top virtualization blogs – A year in review
When Eric Siebert opens up the voting for the top VMware & virtualization blogs, you know another (blogging) year has passed. First of all, I want to thank Eric for organizing this year in, year out. I know he spends an awful lot of time on it. Thanks Eric!
It’s amazing to see that there are more than 200 blogs dedicated to virtualization and that new blogs appear each month. Unfortunately I don’t have the time to read them all, but I do want to show my appreciation for the blog sites that I usually visit. Best newcomer is an easy one: Cormac Hogan. The content is absolutely great and he should be in the top 10. Then we have the usual suspects, my technical marketing colleagues and buddies: Alan Renouf, Rawlinson Rivera and William Lam. I start off the day by making coffee, checking my email and logging into yellow-bricks.com. It’s the de facto standard of the virtualization blogs. Duncan’s blog provides not only technical in-depth articles, but also insights into the industry. Who else? Eric Sloof of course! Always nice to find out that your white paper has been published before you get the official word through company channels. 😉 Two relatively unknown blog sites with quality content: Erik Bussink and Rickard Nobel. These guys create awesome material. One blog that I’m missing in the list is the one from Josh Odgers. Great content. Hope to be able to vote for him next year.
When reviewing content from others, you end up reviewing the stuff you did yourself, and 2012 was a very busy year for me. During the year I published and co-authored a couple of white papers, such as the vSphere Metro Cluster Case Study, the Storage DRS Interoperability Guide and vCloud Director Resource Allocation Models.
I presented at a couple of VMUGs and at both VMworld San Francisco and VMworld Europe. The resource pool best practices session was voted one of the top 10 presentations of VMworld. And of course Duncan and I released the vSphere 5.1 Clustering Deepdive, also known as 50 Shades of Orange. ☺ I believe it’s the best one of the series.

In the meantime I ended up writing for the vSphere blog, appearing on a couple of podcasts and writing a little over 100 blog articles on frankdenneman.nl. I tend to focus on DRS, Storage DRS, SIOC and vMotion, but once in a while I like to write about something that gives a little peek into my life, such as the whiteboard desk or the documentaries I like to watch. It seems you like these articles as well, as they are frequently visited.
In my articles I try to give insight into the behavior of vSphere features, to help you understand their impact. Understanding the behavior allows you to match your design to the requirements and constraints of the project/virtual infrastructure you’re working on. During my years in the field I was always looking for this type of information; by providing this material I hope to help out my fellow architects.
When you publish more than 100 articles, you tend to like some more than others. While it’s very difficult to choose individual articles, I enjoyed spending time on writing a series of articles on the same topic, such as the series Architecture and design of datastore clusters (5 posts) and Designing your (Multi-NIC) vMotion network (5 posts). But I also like these individual posts:
• vSphere 5.1 vMotion Deepdive
• A primer on Network I/O Control
• vSphere 5.1 Storage DRS load balancing and SIOC threshold enhancements
• HA admission control is not a capacity management tool
• Limiting the number of storage vMotions
I hope you can spare a couple of minutes to cast your vote and show your appreciation for the effort these bloggers put into their work. Instead of picking the customary names, please look back and review last year; think about the cool articles you read that helped you or sparked your interest to dive into the technology yourself. Thanks!
I can’t wait to watch the Top 25 countdown show that Eric, John and Simon have put on in previous years.
Implicit anti-affinity rules and DRS placement behavior
Yesterday I had an interesting conversation with a colleague about affinity rules and whether DRS reviews the complete state of the cluster and the affinity rules when placing a virtual machine. The following scenario was used to illustrate the question:
The following affinity rules are defined:
1. VM1 and VM2 must stay on the same host
2. VM3 and VM4 must stay on the same host
3. VM1 and VM3 must NOT stay on the same host
If VM1 and VM3 are deployed first, everything will be fine, because VM1 and VM3 will be placed on two different hosts, and VM2 and VM4 will also be placed accordingly.
However, if VM1 is deployed first and then VM4, there isn’t an explicit rule saying these two need to be on separate hosts; this is only implied by the dependencies between the three rules created above. Would DRS be intelligent enough to recognize this? Or will it place VM1 and VM4 on the same host, so that by the time VM3 needs to be placed, there is a clear deadlock?
The situation where it’s not logical to place VM4 and VM1 on the same host can be seen as an implicit anti-affinity rule. It’s not a real rule, but if all virtual machines are operational, VM4 should not be on the same host as VM1. DRS doesn’t react to these implicit rules. Here’s why:
When provisioning a virtual machine, DRS sorts the available hosts on utilization first. Then it goes through a series of checks, such as the compatibility between the virtual machine and the host: does the host have a connection to the datastore? Is the vNetwork available on the host? It then checks whether placing the virtual machine violates any constraints. A constraint could be a VM-VM affinity/anti-affinity rule or a VM-Host affinity/anti-affinity rule.
In the scenario where VM1 is running, DRS is safe to place VM4 on the same host, as it does not violate any affinity rule. When DRS wants to place VM3, it determines that placing VM3 on the host VM4 is running on would violate the anti-affinity rule between VM1 and VM3. Therefore it will migrate VM4 the moment VM3 is deployed.
During placement, DRS only checks the current affinity rules and determines whether placement violates any of them. If not, then the host with the most connections and the lowest utilization is selected. DRS cannot be aware of any future power-on operations; there is no vCrystal ball. The next power-on operation might be 1 minute away or 4 days away. By allowing DRS to select the best possible placement, the virtual machine is provided an operating environment that has the most resources available at that time. If DRS took all possible placement configurations into account, it could either end up in gridlock or place the virtual machine on a higher utilized host for a long time, just to avoid a vMotion operation of another virtual machine to satisfy an affinity rule. All that time the virtual machine could have performed better had it been placed on a lower utilized host. In the long run, dealing with constraints the moment they occur is far more economical.
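To make that decision flow a bit more concrete, here is a minimal Python sketch of the behavior described above. The class and function names are mine, purely for illustration; this is not the actual DRS implementation, just the gist of “sort on utilization, check compatibility, check the currently active rules only”.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    utilization: float                                  # e.g. 0.0 - 1.0
    datastores: set = field(default_factory=set)
    networks: set = field(default_factory=set)
    running_vms: set = field(default_factory=set)       # names of powered-on VMs

@dataclass
class VM:
    name: str
    datastores: set
    networks: set

def place_vm(vm, hosts, antiaffinity_rules):
    """Pick the least utilized compatible host that violates no *active* rule.

    antiaffinity_rules is a list of VM-name pairs, e.g. [{"VM1", "VM3"}].
    """
    for host in sorted(hosts, key=lambda h: h.utilization):
        # Compatibility checks: datastore connectivity and network availability.
        if not (vm.datastores <= host.datastores and vm.networks <= host.networks):
            continue
        # Constraint check against current placements only; implicit, future
        # conflicts (the VM1/VM4 situation) are deliberately not considered.
        if any({vm.name, other} in antiaffinity_rules for other in host.running_vms):
            continue
        return host
    return None   # every compatible host violates a rule

# Example: with the rule {"VM1", "VM3"} active and VM1 running on a host,
# VM4 may still be placed there; the conflict only surfaces once VM3 powers on.
```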
Similar behavior occurs when creating a rule. DRS will not display a warning when you create a collection of rules that conflicts once all virtual machines are turned on. As DRS is unaware of the intentions of the user, it cannot throw a warning. Maybe the virtual machines will not be powered on in the current cluster state, or maybe the rule set is in preparation for new hosts that will be added to the cluster shortly. Also understand that if a host is in maintenance mode, this host is considered external to the cluster. It does not count as a valid destination and its resources are not used in the equation. However, we as users still see the host as part of the cluster. If those rule sets are created while a host is in maintenance mode, then according to the previous logic DRS would have to throw an error, while the user assumes the rules are correct as the cluster provides enough placement options. As clusters can grow and shrink dynamically, DRS deals with violations only when the rules become active, and that is during power-on operations (DRS placement).
HA Percentage based admission control from a resource management perspective – Part 1
Disclaimer: This article contains references to the words master and slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment.
HA admission control is quite challenging to understand as it interacts with multiple layers of resource management. In the upcoming series of articles I want to focus on HA percentage-based admission control and how it interacts with vCenter and host management. Let’s cover the basics first before diving into percentage-based admission control.
Virtual machine service level agreements
HA provides the service of starting up a virtual machine when the host it’s running on fails. That’s what HA is designed to do. However, HA admission control is tightly interlinked with virtual machine reservations, because a virtual machine reservation is a hard SLA for the entire virtual infrastructure. Let’s focus on the two different “service level agreements” you can define for your virtual machine.
Share-based priority: A share-based priority allows the virtual machine to get all the resources it demands until demand exceeds supply on the ESXi host. Depending on the number of shares and the activity, the resource manager determines the relative priority for resource access. Resources are reclaimed from inactive virtual machines and distributed to high-priority active virtual machines. If all virtual machines are active, the share values determine the distribution of resources. Let’s coin the term “soft SLA” for share-based priority, as the resource manager allows a virtual machine to be powered on even if there are not enough resources available to provide an adequate user experience/performance of the application running inside the virtual machine.
The resource manager just distributes resources based on the share values set by the administrator; it expects that the correct shares were set to provide an adequate performance level at all times.
Reservation-based priority: Reservations can be defined as a “hard SLA”. Under all circumstances, the resources protected by a reservation must be available to that particular virtual machine. Even if every other virtual machine is in need of resources, the resource manager cannot reclaim these resources as it is restricted by this hard SLA. It must provide. In order to meet the SLA, the host checks whether it has enough free unreserved resources available during the power-on operation of the virtual machine. If it doesn’t have the necessary unreserved resources available, the host cannot meet the hard SLA and therefore rejects the virtual machine. Go look somewhere else, buddy 😉
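A simplified way to picture that power-on check is sketched below; the function and the numbers are hypothetical and only illustrate the unreserved-capacity test, not the actual VMkernel admission control code.

```python
# Hypothetical illustration of the host-level check behind a hard SLA: a VM
# with a reservation is only admitted when the host still has enough
# unreserved capacity to back that reservation plus the VM's memory overhead.

def host_admission_check(host_capacity_gb, host_reserved_gb,
                         vm_reservation_gb, vm_overhead_gb):
    """Return True if the host can honor the VM's reservation (hard SLA)."""
    unreserved_gb = host_capacity_gb - host_reserved_gb
    return unreserved_gb >= vm_reservation_gb + vm_overhead_gb

# Example numbers (made up): a 96 GB host with 80 GB already reserved cannot
# admit a VM that needs a 20 GB reservation -> "go look somewhere else, buddy".
print(host_admission_check(96, 80, 20, 0.2))   # False
```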
Percentage based admission control
Percentage-based admission control is my favorite HA admission control policy because it gets rid of the overly conservative slot mechanism used by the “host failures the cluster tolerates” policy. With percentage-based admission control, you set a percentage of the cluster resources that will be reserved for failover capacity.
For example, when you set 25% reserved failover memory capacity, 25% of the cluster resources are reserved. In a 4-host cluster this makes sense, as the 25% embodies the available resources of a single host; thus 25% equals a failover tolerance of one host. If one host fails, the remaining three hosts, which together provide the other 75% of the cluster resources, can restart the virtual machines that were running on the failed host.

For the sake of simplicity, this diagram shows an equal load distribution; however, due to VM reservations and other factors, the distribution of virtual machines might differ.
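For equally sized hosts, the relationship between the percentage and the number of host failures you want to tolerate is simple arithmetic; the snippet below (my own helper, not a vSphere API) just spells it out.

```python
# For N equally sized hosts, the failover capacity percentage that corresponds
# to tolerating F host failures is simply F / N (illustrative helper only).
def failover_percentage(num_hosts: int, host_failures_to_tolerate: int) -> float:
    return 100.0 * host_failures_to_tolerate / num_hosts

print(failover_percentage(4, 1))   # 25.0 -> the 25% used in the example above
```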
Let’s take a closer look at that 25% and that 75%. The 25% of reserved failover memory capacity is not enforced on a per-host basis; therefore the previous diagram is not completely accurate. This failover capacity is tracked and enforced by HA at the vCenter layer, or to be more precise, at the HA cluster level. This is crucial information for understanding the difference between admission control during normal provisioning/power-on operations and admission control during restart operations performed by HA.
25% reserved failover memory capacity
The resource allocation tab of the cluster shows what happens after enabling HA. The first screenshot shows the resource allocation of the cluster before HA is enabled. Notice the 1 GB reservation.


When setting the reserved failover memory capacity to 25%, the following happens:
25% of the cluster capacity (363.19 * 0.25 = 90.79 GB) is added to the reserved capacity, on top of the existing 1.03 GB, bringing the total reserved capacity to 91.83 GB. This means that the cluster has 271.37 GB of available capacity left. What exactly is this available capacity? It is the capacity that’s often referred to as “unreserved capacity”. What will happen with this capacity when we power on a 16 GB virtual machine without a reservation? Will it reduce the available capacity to 255.37 GB? No, it will not. This graph shows only how much of the total capacity is assigned to objects with a hard SLA (reservations).
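The cluster-level bookkeeping can be mimicked in a few lines. The numbers come from the example above; the calculation itself is purely illustrative (HA and vCenter do this internally), and small differences with the UI figures are rounding.

```python
# Cluster-level bookkeeping as described above, using the numbers from the
# screenshots (illustrative only).
cluster_capacity_gb  = 363.19
existing_reserved_gb = 1.03          # the ~1 GB of existing reservations
failover_fraction    = 0.25

ha_reserved_gb    = cluster_capacity_gb * failover_fraction     # ~90.8 GB
total_reserved_gb = ha_reserved_gb + existing_reserved_gb       # ~91.83 GB
available_gb      = cluster_capacity_gb - total_reserved_gb     # ~271.36 GB

# Powering on a 16 GB VM *without* a reservation leaves available_gb untouched:
# only hard SLAs (reservations plus overhead) are subtracted from this number.
print(round(ha_reserved_gb, 2), round(total_reserved_gb, 2), round(available_gb, 2))
```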

Thus, when a virtual machine is powered on or provisioned into the cluster via vCenter by the user, it goes through HA admission control first:

After HA accepts the virtual machine, DRS admission control and host admission control review the virtual machine before it is powered on. The article on the admission control family describes the admission control workflow in depth.
75% unreserved capacity
What happened to that 90 GB? Is it gone? Is a part of this capacity reserved on each host in the cluster and unavailable for virtual machines to use? No, luckily the 90 GB is not gone. HA just reduced the available capacity so that during a placement operation (deployment or power-on of an existing VM) vCenter knows whether the cluster can meet the hard SLA of a reservation. To illustrate this behavior I took a screenshot of the esxtop output of a host:

In this capture you can see that the host is serving 5 virtual machines (W2K8-00 to W2K8-04). Each virtual machine is configured with 16 GB (MEMSZ) and the resource manager has assigned a size target (SZTGT) above the 16 GB. This size target is the amount of memory the resource manager has allocated. The reason it’s higher than the memory size is the overhead memory reservation: the memory the VMkernel needs to run the virtual machine. As you can see, these 5 virtual machines use up 82 GB, which is more than the 67.5 GB the host is supposed to have available if 25% were reserved as failover capacity on each host.
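To show why those numbers add up, here is a small back-of-the-envelope calculation. The per-VM overhead and the host memory size are my own assumptions, chosen to match the figures above; they are not values taken from the screenshot.

```python
# Illustrative arithmetic behind the esxtop observation: the size target
# (SZTGT) is roughly the configured memory size plus the per-VM overhead
# reservation. The 0.4 GB overhead below is a made-up value that happens to
# reproduce the ~82 GB total; real overhead depends on vCPUs, devices, etc.
vm_memsize_gb = 16.0
overhead_gb   = 0.4
num_vms       = 5

total_sztgt_gb = num_vms * (vm_memsize_gb + overhead_gb)
print(total_sztgt_gb)                        # 82.0 GB

host_capacity_gb  = 90.0                     # hypothetical host memory size
per_host_75pct_gb = host_capacity_gb * 0.75  # 67.5 GB if 25% were held per host
print(total_sztgt_gb > per_host_75pct_gb)    # True: the host exceeds that "limit"
```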
Failover process and the role of host admission control
This is the key to understanding why HA “ignores” reserved failover capacity during a failover process. HA consists of FDM agents running on each host, and it is the master FDM agent that reviews the protected list and initiates a power-on operation for a virtual machine that is listed as protected but is not running. The FDM agent ensures that the virtual machines are powered on. As you can see, this all happens at the host level; vCenter is not included in this party. Therefore the virtual machine start-up operation is reviewed by host admission control. If the virtual machine is configured with a soft SLA, host admission control only checks whether it can satisfy the VM overhead reservation. If the VM is protected by a VM reservation, host admission control checks whether it can satisfy both the VM reservation and the VM overhead reservation. If it cannot, it will fail the startup and FDM has to find another host that can run this virtual machine. If all hosts fail to power on the virtual machine, HA will request DRS to “defragment” the cluster by moving virtual machines around to make room on a host and free up some unreserved capacity.
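Roughly sketched in Python (the data structures and function are made up for illustration; the real FDM agent is far more involved), the flow looks like this:

```python
# Hypothetical sketch of the HA restart flow described above (not real FDM code).
# VMs and hosts are plain dicts, purely for illustration.

def restart_protected_vm(vm, hosts, request_drs_defragmentation):
    """Try each host; a hard-SLA VM needs reservation + overhead, a soft-SLA
    VM only needs its overhead reservation to pass host admission control."""
    for host in hosts:
        needed_gb = vm["overhead_gb"] + (vm["reservation_gb"] if vm["hard_sla"] else 0.0)
        if host["unreserved_gb"] >= needed_gb:
            host["unreserved_gb"] -= needed_gb      # power-on succeeds on this host
            return host["name"]
    # No host can satisfy the reservation: ask DRS to defragment the cluster
    # (move VMs around to free unreserved capacity); the restart is retried later.
    request_drs_defragmentation()
    return None

hosts = [{"name": "esx01", "unreserved_gb": 2.0},
         {"name": "esx02", "unreserved_gb": 24.0}]
vm = {"overhead_gb": 0.4, "reservation_gb": 16.0, "hard_sla": True}
print(restart_protected_vm(vm, hosts, request_drs_defragmentation=lambda: None))  # esx02
```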
But remember, if a virtual machine has a soft SLA, HA will restart the virtual machine regardless of whether there is enough capacity to run it with adequate performance for its users. This behavior is covered in depth in the article “HA admission control is not a capacity management tool”. To ensure virtual machine performance during a host failure, one must focus on capacity planning and/or the configuration of resource reservations.
Part 2 of this series will take a closer look at how to configure a proper percentage value that avoids memory overcommitment.
Have you signed up for the Benelux Software Defined Datacenter Roadshow yet?
In less than three weeks’ time, the Benelux Software Defined Datacenter Roadshow starts. Industry-recognized experts from both IBM and VMware share their vision and insights on how to build a unified datacenter platform that provides automation, flexibility and efficiency to transform the way you deliver IT. Not only can you attend their sessions and learn how to abstract, pool and automate your IT services, the SDDC roadshow also provides the opportunity to meet the experts, sit down and discuss technology.
The speakers and their field of expertise:
VMware
Frank Denneman – Resource Management Expert
Cormac Hogan – Storage Expert
Kamau Wanguhu – Software Defined Networking Expert
Mike Laverick – Cloud Infrastructure Expert
Ton Hermes – End User Computing Expert
IBM
Tikiri Wanduragala – IBM PureSystems Expert
Dennis Lauwers – Converged Systems Expert
Geordy Korte – Software Defined Networking Expert
Andreas Groth – End User Computing Expert
The roadshow is held in three different countries:
Netherlands – IBM forum in Amsterdam – March 5th 2013
Belgium – IBM forum in Brussels – March 7th 2013
Luxembourg – March 8th 2013
The Software Defined Datacenter Roadshow is a full-day event and, best of all, it is free!
Sign up now!