
Playing tonight: DRS and the IO controllers

December 1, 2014 by frankdenneman

Ever wondered why the band is always mentioned second? Is the band replaceable? Is the sound of the instruments so ambiguous that you can swap out any musician for another? Apparently the front man is the headliner of the show, and if he does his job well he will never be forgotten. The people who truly recognize talent are the ones that care about the musicians. They understand that the artists backing the singer create the true sound of the song. And I think this is also the case when it comes to DRS and its supporting act, the IO controllers, namely SIOC and NETIOC. If you do it right, the combination creates the music in your virtual datacenter, well at least from a resource management perspective. 😉

Last week Chris Wahl started a discussion about DRS and its inability to balance the number of VMs perfectly amongst hosts. Chris knows that DRS is not a VM distribution mechanism; his argument is more focused on the distribution of load on the backend, the north-south and east-west uplinks. And for this I would recommend SIOC and NETIOC. Let’s do a 10,000-foot flyby over the different mechanisms.

Distributed Resource Scheduler (DRS)
DRS distributes the virtual machines – the consumers – across the ESXi hosts – the producers. Whenever a virtual machine wants to consume more resources, DRS attempts to provide those resources to it, either by moving other virtual machines to different hosts or by moving the virtual machine itself to another host. The goal is to create an environment in which the consumers can consume as much as possible.

As workload patterns differ from hour to hour and from day to day, an equal number of VMs per host does not provide a balanced resource offering. It’s best to have a combination of idle and active virtual machines per host. Now think about the size of virtual machines: most environments do not have a virtual machine landscape with identical configurations. And even if that were the case, think about the applications: some are memory bound, some are CPU bound. To make it worse, think about load correlation and load synchronicity. Load correlation defines the relationship between loads running in different machines, for example when a single event initiates multiple loads: a search query on a front-end webserver resulting in commands in the supporting stack and backend. Load synchronicity is often caused by load correlation but can also exist due to user activity; it’s very common to see spikes in workload at specific hours, for example log-on activity in the morning. And for every action there is an equal and opposite reaction: quite often load correlation and load synchronicity introduce periods of collective non- or low utilization, which reduce the displayed resource utilization. All of this coordination is done by DRS; fixating on an identical number of VMs per host is, in my opinion, lobotomizing DRS.
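
To make that point a bit more concrete, here is a minimal sketch with made-up demand figures (the VM names and numbers are purely illustrative, not taken from any real cluster) showing how two hosts with an identical VM count can carry a very different active load:

```python
# Hypothetical active CPU demand (MHz) per VM; names and values are made up.
host_a = {"web01": 400, "web02": 350, "idle01": 50, "idle02": 30}
host_b = {"db01": 2600, "app01": 1900, "batch01": 1500, "idle03": 40}

def total_demand(vms):
    """Sum the active demand of all VMs on a host."""
    return sum(vms.values())

# Both hosts run exactly four VMs...
print(len(host_a), len(host_b))                     # 4 4
# ...yet the load they actually carry differs wildly.
print(total_demand(host_a), total_demand(host_b))   # 830 6040
```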

But DRS is only focused on CPU and memory. Arguably you can treat network and storage somewhat as CPU consumption as well, but let’s not go that deep. Some applications are storage bound, some applications are network bound. For these, other components are available in your vSphere infrastructure: the forgotten heroes, SIOC and NETIOC.

Storage IO Control (SIOC)
Storage I/O Control (SIOC) provides a method to fairly distribute storage I/O resources during times of contention. SIOC provides datastore-wide scheduling, using virtual disk shares to calculate priority. In a healthy and properly designed environment, every host that is part of the cluster should have a connection to the datastore and all hosts should have an equal number of paths to the datastore. SIOC monitors consumption, and if the latency experienced by the virtual machines exceeds the user-defined threshold, SIOC distributes priority amongst the virtual machines hitting that datastore. By default every virtual machine receives the same priority per VMDK per datastore, but this can be modified if the application requires it from a service level perspective.
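
To illustrate the share math – and this is a simplified sketch of the idea, not the actual SIOC scheduler, with example threshold and share values – once the measured datastore latency exceeds the congestion threshold, each VMDK’s slice of the datastore’s I/O capacity is proportional to its shares:

```python
# Simplified illustration of share-based prioritization; not the actual SIOC algorithm.
vmdk_shares = {"vm1.vmdk": 1000, "vm2.vmdk": 1000, "vm3.vmdk": 2000}  # 1000 = default "Normal"

congestion_threshold_ms = 30   # user-defined latency threshold (example value)
observed_latency_ms = 42       # measured datastore-wide latency (example value)

if observed_latency_ms > congestion_threshold_ms:
    total_shares = sum(vmdk_shares.values())
    # Each VMDK is entitled to a fraction of the datastore I/O capacity
    # proportional to its shares.
    entitlement = {disk: shares / total_shares for disk, shares in vmdk_shares.items()}
    print(entitlement)   # {'vm1.vmdk': 0.25, 'vm2.vmdk': 0.25, 'vm3.vmdk': 0.5}
```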

Network I/O Control (NETIOC)
The east-west equivalent of its north-south brother SIOC. NETIOC provides control for predictable networking performance while different network traffic streams are contending for the same bandwidth. Similar controls are offered, but they are applied to traffic types instead of on a per-virtual-machine basis. Similar architectural design hygiene applies here as well: all hosts across the cluster should have the same connection configuration and amount of bandwidth available to them. The article “A primer on Network I/O Control” provides more info on how NETIOC works. VMware published a NETIOC best practice white paper a while ago; most of it is still accurate.
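
As a rough sketch of the same share-based idea applied to network traffic (the uplink size and share values below are examples, not defaults), bandwidth on a contended uplink is divided amongst the active traffic types in proportion to their shares:

```python
# Rough sketch of NETIOC-style share math on a single uplink; example values only.
uplink_gbit = 10.0
traffic_shares = {"vm_traffic": 100, "vmotion": 50, "nfs": 50, "management": 20}

# Only traffic types that are actively sending compete for the uplink.
active = {"vm_traffic", "vmotion", "nfs"}
active_shares = {t: s for t, s in traffic_shares.items() if t in active}

total = sum(active_shares.values())
for traffic, shares in active_shares.items():
    print(f"{traffic}: {uplink_gbit * shares / total:.1f} Gbit/s during contention")
# vm_traffic: 5.0, vmotion: 2.5, nfs: 2.5
```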

And the bass guitar player of the virtual datacenter, Storage DRS.
Storage DRS provides virtual machine disk placement and load balancing mechanisms based on both space and I/O capacity. Where SIOC reactively throttles hosts and virtual machines to ensure fairness, SDRS proactively generates recommendations to prevent imbalances from both space utilization and latency perspectives. More simply, Storage DRS does for storage what DRS does for compute resources.

These mechanisms, combined with a healthy – well architected – environment, will help you distribute the consumers across the producers with the proper context in mind. Which virtual machines are hot and which are not? Much better than playing the numbers game! Now, one might argue: but what about failure scenarios? If I have an equal number of VMs running on each host, my failover time decreases as well. Well, it depends. HA distributes virtual machines across the cluster, and if DRS is up and running, it moves virtual machines around if it cannot satisfy the resource entitlement of the virtual machines (VM-level reservations). Duncan wrote about DRS and HA behavior a while ago, and of course we touched upon this in our book, the 5.1 clustering deepdive (still fully applicable to 5.5 environments).

In my opinion, trying to outsmart advanced and adaptive computer algorithms with basic math reasoning is really weird. Especially when most people are talking about software-defined datacenters and whether you are managing pets versus cattle. When your environment is healthy and laid out in a homogeneous way, you cannot beat computer algorithms. The thing you should focus on is the alignment of resource priority to business service levels. And that’s what you achieve by applying the correct share levels at the DRS, SIOC and NETIOC levels. Maybe you can devops your way into leveraging various scripting languages. 😉

Filed Under: DRS, HA

Which HA admission control policy do you use?

April 4, 2014 by frankdenneman

Yesterday Duncan and I were discussing the 5.5 update of the vSphere clustering deepdive book and we were debating which HA admission control policy is the most popular. Last week I asked around on Twitter, but hopefully a short poll will give us better insights. Please cast your vote.
[socialpoll id="2195435"]

Filed Under: HA

HA Percentage based admission control from a resource management perspective – Part 1

February 15, 2013 by frankdenneman

Disclaimer: This article contains references to the words master and slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment.

HA admission control is quite challenging to understand as it interacts with multiple layers of resource management. In this upcoming series of articles I want to focus on the HA percentage-based admission control policy and how it interacts with vCenter and host management. Let’s cover the basics first before diving into percentage-based admission control.

Virtual machine service level agreements
HA provides the service to start up a virtual machine when the host it’s running on fails. That’s what HA is designed to do. However, HA admission control is tightly interlinked with virtual machine reservations, and that is because a virtual machine reservation is a hard SLA for the entire virtual infrastructure. Let’s focus on the two different “service level agreements” you can define for your virtual machine.

Share-based priority: A share-based priority allows the virtual machine to get all the resources it demands until demand exceeds supply on the ESXi host. Depending on the number of shares and the activity, the resource manager determines the relative priority of resource access. Resources are reclaimed from inactive virtual machines and distributed to high-priority active virtual machines. If all virtual machines are active, the share values determine the distribution of resources. Let’s coin the term “Soft SLA” for share-based priority, as the resource manager allows a virtual machine to be powered on even if there are not enough resources available to provide an adequate user experience/performance of the application running inside the virtual machine.

The resource manager just distributes resources based on the share values set by the administrator; it expects that the correct shares were set to provide an adequate performance level at all times.

Reservation-based priority: Reservations can be defined as a “Hard SLA”. Under all circumstances, the resources protected by a reservation must be available to that particular virtual machine. Even if every other virtual machine is in need of resources, the resource manager cannot reclaim these resources as it is restricted by this hard SLA. It must provide. In order to meet the SLA, the host checks whether it has enough free unreserved resources available during the power-on operation of the virtual machine. If it doesn’t have the necessary unreserved resources available, the host cannot meet the hard SLA and therefore rejects the virtual machine. Go look somewhere else buddy 😉
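
In pseudo-Python, the host-level check during a user-initiated power-on boils down to something like this (a sketch with made-up numbers, not the VMkernel’s actual code):

```python
def host_admission_check(free_unreserved_mb, vm_reservation_mb, vm_overhead_mb):
    """Sketch of the host power-on check: the hard SLA (reservation) plus the
    VM's overhead reservation must fit in the host's free unreserved capacity."""
    return free_unreserved_mb >= vm_reservation_mb + vm_overhead_mb

# Example: 8 GB free unreserved, VM with a 6 GB reservation and ~200 MB overhead.
print(host_admission_check(8192, 6144, 200))   # True  -> power-on accepted
print(host_admission_check(4096, 6144, 200))   # False -> go look somewhere else
```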

Percentage based admission control
The percentage-based admission control policy is my favorite HA admission control policy because it gets rid of the over-conservative slot mechanism used by the “host failures cluster tolerates” policy. With percentage-based admission control, you set a percentage of the cluster resources that will be reserved for failover capacity.

For example, when you set 25% of reserved failover memory capacity, 25% of the cluster resources are reserved. In a 4-host cluster this makes sense, as the 25% embodies the available resources of a single host; thus 25% equals a failover tolerance of one host. If one host fails, the remaining three hosts – together providing 75% of the cluster resources – can restart the virtual machines that were running on the failed host.
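
A quick worked example with hypothetical per-host numbers shows why 25% lines up with exactly one host in a four-host cluster:

```python
# Worked example of the percentage-based policy; capacities are hypothetical.
hosts = 4
memory_per_host_gb = 128.0
cluster_capacity_gb = hosts * memory_per_host_gb            # 512 GB

reserved_failover_pct = 0.25                                # 25% of cluster resources
failover_capacity_gb = cluster_capacity_gb * reserved_failover_pct

print(failover_capacity_gb)                                 # 128.0 -> exactly one host
print(cluster_capacity_gb - failover_capacity_gb)           # 384.0 left for workloads
```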

[Image: failover capacity]
For the sake of simplicity, this diagram shows an equal load distribution; however, due to VM reservations and other factors, the distribution of virtual machines might differ.

Let’s take a closer look at that 25% and that 75%. The 25% of reserved failover memory capacity is not reserved on a per-host basis; therefore the previous diagram is not completely accurate. This is because the failover capacity is tracked and enforced by HA at the vCenter layer, to be more precise at the HA cluster level. This is crucial information to understand the difference between admission control during normal provisioning/power-on operations and admission control during restart operations done by HA.

25% reserved failover memory capacity
The resource allocation tab of the cluster shows what happens after enabling HA. The first screenshot shows the resource allocation of the cluster before HA is enabled. Notice the 1 GB reservation.

[Image: cluster resource allocation before HA]
[Image: reserved failover memory capacity setting]
When setting the reserved failover memory capacity to 25%, the following happens:
[Image: cluster resource allocation with HA enabled]
25% of the cluster capacity (363.19 * 0.25 = 90.79GB) is added to the reserved capacity, plus the existing 1.03GB, bringing the total reserved capacity to 91.83GB. This means that this cluster has 271.37GB of available capacity left. What exactly is this available capacity? This is the capacity that is often referred to as “unreserved capacity”. What will happen to this capacity when we power on a 16GB virtual machine without a reservation? Will it reduce the available capacity to 255.37GB? No, it will not. This graph only shows how much of the total capacity is assigned to objects with a hard SLA (reservations).
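
The math from the screenshots can be reproduced as follows (the small difference with the 271.37GB shown in the UI is rounding):

```python
# Reproducing the cluster-level math from the screenshots above.
cluster_capacity_gb = 363.19        # total memory capacity of the cluster
existing_reservation_gb = 1.03      # reservation already present before enabling HA

failover_reservation_gb = cluster_capacity_gb * 0.25        # ~90.79 GB
reserved_capacity_gb = failover_reservation_gb + existing_reservation_gb
available_capacity_gb = cluster_capacity_gb - reserved_capacity_gb

print(round(reserved_capacity_gb, 2))      # ~91.83
print(round(available_capacity_gb, 2))     # ~271.36 (the UI shows 271.37)

# Powering on a 16 GB VM *without* a reservation does not change this number:
# the graph only tracks capacity claimed by hard SLAs (reservations).
```
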
[Image: cluster reserved capacity]
Thus, when a virtual machine is powered on or provisioned into the cluster via vCenter by the user, it goes through HA admission control first:
[Image: HA admission control]
After HA accepts the virtual machine, DRS admission control and host admission control review the virtual machine before powering it on. The article “Admission control family” describes the admission control workflow in depth.

75% unreserved capacity
What happened to that 90GB? Is it gone? Is a part of this capacity reserved on each host in the cluster and unavailable for virtual machines to use? No, luckily the 90GB is not gone. HA just reduced the available capacity so that during a placement operation (deployment or power-on of an existing VM) vCenter knows whether the cluster can meet the hard SLA of a reservation. To illustrate this behavior I took a screenshot of the esxtop output on one of the hosts:

[Image: esxtop output]
In this capture you can see that the host is serving 5 virtual machines (W2K8-00 to W2K8-04). Each virtual machine is configured with 16GB (MEMSZ) and the resource manager has assigned a size target above the 16GB (SZTGT). This size target is the amount of resources the resource manager has allocated. The reason it’s higher than the memory size is the overhead memory reservation: the memory needed by the VMkernel to run the virtual machine. As you can see, these 5 virtual machines use up 82GB, which is more than the 67.5GB the host would be allowed to use if 25% were reserved as failover capacity on each host.
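
A back-of-the-envelope version of that observation (the per-VM overhead and per-host capacity below are assumptions for illustration, not values taken from the screenshot):

```python
# Back-of-the-envelope version of the esxtop observation; the overhead and
# host capacity figures are illustrative assumptions.
vm_count = 5
memsize_gb = 16.0
overhead_gb = 0.4                        # assumed per-VM overhead reservation

size_target_total_gb = vm_count * (memsize_gb + overhead_gb)
print(size_target_total_gb)              # 82.0 GB allocated on this single host

host_capacity_gb = 90.0                  # assumed; the 67.5 GB figure implies ~90 GB per host
print(host_capacity_gb * 0.75)           # 67.5 GB -> the host clearly exceeds this,
                                         # so the 25% is not enforced per host
```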

Failover process and the role of host admission control
This is the key to understanding why HA “ignores” reserved failover capacity during a failover process. As HA consists of FDM agents running on each host, it is the master FDM agent that reviews the protected list and initiates a power-on operation for a virtual machine that is listed as protected but is not running. The FDM agent ensures that the virtual machines are powered on. As you can see, this all happens at the host level; vCenter is not included in this party. Therefore the virtual machine start-up operation is reviewed by host admission control. If the virtual machine is configured with a soft SLA, host admission control only checks if it can satisfy the VM overhead reservation. If the VM is protected by a VM reservation, host admission control checks if it can satisfy both the VM reservation and the VM overhead reservation. If it cannot, it will fail the start-up and FDM has to find another host that can run this virtual machine. If all hosts fail to power on the virtual machine, HA will request DRS to “defragment” the cluster by moving virtual machines around to make room on a host and free up some unreserved capacity.
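
A simplified sketch of that restart decision (the host names, free capacities and VM figures are hypothetical, and this is not actual FDM code):

```python
# Simplified sketch of the HA restart placement decision; not actual FDM code.
def can_power_on(host_free_unreserved_mb, vm):
    """Host admission control during an HA restart: a soft-SLA VM only needs its
    overhead reservation to fit; a hard-SLA VM needs reservation + overhead."""
    required = vm["overhead_mb"] + vm.get("reservation_mb", 0)
    return host_free_unreserved_mb >= required

hosts = {"esx01": 300, "esx02": 1500}     # free unreserved memory in MB (hypothetical)
vm = {"name": "protected-vm", "reservation_mb": 1024, "overhead_mb": 150}

target = next((h for h, free in hosts.items() if can_power_on(free, vm)), None)
if target:
    print(f"restart {vm['name']} on {target}")
else:
    print("no host fits: ask DRS to defragment the cluster")
```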

But remember, if a virtual machine has a soft SLA, HA will restart the virtual machine regardless of whether there is enough capacity to run the virtual machine with adequate performance for its users. This behavior is covered in depth in the article “HA admission control is not a capacity management tool”. To ensure virtual machine performance during a host failure, you must focus on capacity planning and/or the configuration of resource reservations.

Part 2 of this series will take a closer look at how to configure a proper percentage value that avoids memory overcommitment.

Filed Under: HA
