In today's world it's quite common to virtualize high-priority, tier-1 applications and services. These applications and services are usually subject to service level agreements that typically include requirements for strong performance guarantees. For the compute resources (CPU and memory) we rely on the virtualization layer to provide that resource allocation by setting reservations, shares, and limits. You might want to ensure that the storage requirements of these virtual machines are also met, and that when contention for storage resources occurs these workloads are not impacted.
Today vSphere offers Storage I/O Control (SIOC) to allocate I/O resources based on virtual machine priority when datastore latency exceeds a threshold. Shares identify priority, while limits restrict the number of IOPS available to a virtual machine. Although these are useful controls, they do not provide a method to define a minimum number of IOPS that is available to the application at all times. Assigning lots of shares to these virtual machines can help meet the SLA, however continuously calculating the correct share value in a highly dynamic virtual datacenter is a cumbersome and complex job.
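The share-based mechanics described above can be sketched as follows. This is an illustrative model only, not the actual SIOC algorithm; the function name and VM parameters are hypothetical. It distributes a datastore's available IOPS in proportion to shares, caps each VM at its limit, and redistributes any freed-up surplus to the remaining VMs:

```python
def share_based_allocation(vms, available_iops):
    """Distribute available IOPS proportionally to shares, honoring
    per-VM limits (illustrative model, not the real SIOC algorithm).

    vms: dict of name -> (shares, limit_in_iops or None)
    """
    alloc = {name: 0.0 for name in vms}
    active = set(vms)                      # VMs that have not hit their limit
    remaining = available_iops
    while active and remaining > 1e-9:
        total_shares = sum(vms[n][0] for n in active)
        round_pool = remaining             # snapshot for this round's ratios
        for n in list(active):
            shares, limit = vms[n]
            grant = round_pool * shares / total_shares
            if limit is not None and alloc[n] + grant >= limit:
                grant = limit - alloc[n]   # cap at the limit, free the surplus
                active.discard(n)
            alloc[n] += grant
            remaining -= grant
    return alloc

# Equal shares, but VM "A" is limited to 500 IOPS; its surplus flows to "B".
print(share_based_allocation({"A": (1000, 500), "B": (1000, None)}, 2000))
# → {'A': 500.0, 'B': 1500.0}
```

Note how the limit only restricts a VM's ceiling: nothing in this model guarantees a floor, which is exactly the gap storage reservations aim to fill.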
Storage level reservations
Therefore we are working on Storage level reservations. A storage reservation allows you to specify a minimum number of IOPS that should be available to the virtual machine at all times. This allows the virtual machine to make minimum progress in order to comply with the service level agreement.
In a relatively closed environment such as the compute layer it's fairly easy to guarantee a minimum level of resource availability, but when it comes to a shared storage platform new challenges arise. The hypervisor owns the compute resources and distributes them to the virtual machines it's hosting. In a shared storage environment we are dealing with multiple layers of infrastructure, each susceptible to congestion and contention. And then there is the possibility of multiple external storage resource consumers, such as non-virtualized workloads using the same array, impacting both the availability of resources and the control over distributing them. These challenges must be taken into account when developing storage reservations, and we must understand how stringent you want the guarantee to be.
One of the questions we are dealing with is whether you would like strict admission control or relaxed admission control. With strict admission control, a virtual machine power-on operation is denied when vSphere cannot guarantee the storage reservation (similar to compute reservations). Relaxed admission control turns storage reservations into a share-like construct, defining relative priority at times when not enough IOPS are available at power-on. For example: the storage reservation on VM1 = 800 and on VM2 = 200. At boot 600 IOPS are available; therefore VM1 gets 80% of 600 = 480 IOPS, while VM2 gets 20%, i.e. 120 IOPS. When the array is able to provide more IOPS, the correct number of IOPS is distributed to the virtual machines in order to satisfy the storage reservations.
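The relaxed admission control behavior above can be expressed as a small sketch. The function name is hypothetical, and it assumes that during a shortfall IOPS are distributed purely in proportion to the reservations, as in the example:

```python
def relaxed_admission(reservations, available_iops):
    """Distribute available IOPS under relaxed admission control:
    full reservations when possible, proportional share-like behavior
    during a shortfall (illustrative sketch, not VMware's implementation)."""
    total = sum(reservations.values())
    if available_iops >= total:
        # The array can satisfy every reservation in full.
        return dict(reservations)
    # Shortfall: reservations act as share-like relative priorities.
    return {vm: available_iops * r / total for vm, r in reservations.items()}

# The example from the article: VM1 = 800, VM2 = 200, 600 IOPS at boot.
print(relaxed_admission({"VM1": 800, "VM2": 200}, 600))
# → {'VM1': 480.0, 'VM2': 120.0}
```

Strict admission control would instead refuse the power-on entirely whenever `available_iops` falls below the sum of the reservations.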
In order to decide which features to include and to define the behavior of storage reservations, we are very interested in your opinion. We have created a short list of questions, and by answering them you can help us define our priorities during the development process. I intentionally kept the questions to a minimum so that completing the survey takes no more than 5 minutes of your time.
Disclaimer
As always, this article provides information about a feature that is currently under development. This means the feature is subject to change, and neither VMware nor I promise in any way to deliver on any features mentioned in this article or survey.
Any other ideas about storage reservations? Please leave a comment below.
The survey is closed, thanks for your interest in participating.
Would you be interested in Storage-level reservations?
Hi Frank,
This sounds great. But please do not forget we also need the feature to “limit” a group of virtual machines to a maximum number of IOPS. This could be very useful in a vCloud environment. Right now there is no option to define how many IOPS a given organisation can use.
I think you should implement both options 🙂
Hi Frank,
Question 7 seems to be the wrong way round; surely answering “yes” to allowing VM power-on would be relaxed admission control?
Q7 – Would you allow power-on of the virtual machine if the storage reservation cannot be guaranteed?
Yes (Strict Admission Control)
No (Relaxed Admission Control)
Cheers!
Sam
Thanks for the heads up, spotted it already so changed it to:
Yes (Relaxed Admission Control)
No (Strict Admission Control)
Sounds interesting.
Although quite a challenge. I think the guarantee is limited to the hypervisor, since it has no control over the amount of IOPS once the I/O leaves the host, especially when taking into account shared resources like SAN infrastructure or hardware like blade architectures. Also, how does it detect the capabilities of the current SAN and attached storage array (to be able to make a reservation, you need to know the limit)? And the total amount of IOPS is very dependent on the read/write mix and the size of the I/Os.
I think it would be cool to see it working, but for now, let’s just make the infrastructure stable, redundant and fast, so it can handle the load you require, without the need for setting a reservation.
Great article.
In addition to Bouke's point that the total amount of IOPS is very dependent on the read/write mix and the size of the I/Os: you can include bandwidth and the type of read and write, like sequential or random, too. How do you deal with “optimized” tiered storage, which depends heavily on its hardware, like flash, SAS, or NL-SAS disks? They are often policy driven and could collide with each other. Or am I wrong here? Does VMware have an application or a storage point of view related to IOPS? Are we talking front-end or back-end IOPS?
The most difficult part is to get an IOPS figure from your organisation or app developer. 🙂
Anyway sounds cool!
I for one am against this… here's my reason why.
Resource Pools, limits, reservations, etc. are a nuisance. It’s gotten to the point where there should be a level of automation or analytics involved.
How about I do this… I have 3 VMs: 1 Exchange Server, 1 AD Server, 1 Print Server. I set high, medium, and low priorities on these VMs. As resources become constrained (which is the only reason we use limits, reservations, etc.), vSphere should take our choice of priority into account and make sure that VM gets the resources it needs while others wait in a queue. Limits and reservations would be created on the fly, without human intervention. Without resource constraint, dealing with resource pools and the breakdown of shares becomes a pain. If there were a mechanism to kick in certain limits and reservations on the fly, by analyzing a workload and triggering the automated setting based on a pre-defined priority, that would be a much better feature.
Adding limits and reservations to storage-side I/O is only going to further complicate things. I like how SIOC already has the “priority” feature. Are there really that many instances where the array is getting strained for I/O? If this is a normal problem, solving it by limiting I/O may not be your best solution; maybe you need more spindles or flash to handle those spikes. Adding limits and reservations to storage-side I/O is only a band-aid for an existing problem.
RUN DRS,
Kendrick
Kendrick: With everything you said, we still need storage resource pools for vCloud Director. vCloud takes care of CPU and memory, but there is no way to control storage IOPS. One organisation could eat up all the IOPS and make everyone else run like s***
Hi Frank,
Great article, and this is something that is “missing” from the vSphere feature list.
However, we need a “good” way to determine the IOPS capability of a LUN. A few things we need to take into account here are:
1. LUNs on the same RAID group – we need to be able to create some sort of LUN grouping to take this into account.
2. Max IOPS per LUN – how will this be determined? Either have a setting that can be set for each LUN (taking into account point 1, and you have asked about it in the survey), or use the vC Ops method of using the max observed IOPS and have it dynamically adjust. Maybe some vC Ops integration would be good, to show the observed IOPS capabilities of the LUN, as this is already collected. This could be a good option if you use a %-based calculation for the VMs on a LUN (linking to Kendrick's comment about automation).
3. MB/sec – maybe use vC Ops and look at the demand of the VMs on a LUN over a period to determine a fair value of usage during different periods of the day. vC Ops can give insight into when VMs have demanded more MB/sec or even IOPS, and could then, based on these observations, dynamically allocate more MB/sec or IOPS during those periods (when there is contention for MB/sec or IOPS).
I think we could use vC Ops more effectively to determine when and how many IOPS are needed, and based on these patterns allocate the correct resources and determine more accurately the capabilities of a LUN.
Hugo
Hey Frank,
Great writeup. We @solidfire are keenly interested in this subject. Storage QoS is core to our architecture, and the resulting SLAs are imperative to our customers. We actually just blogged about the importance of fine-grained QoS controls last week. http://solidfire.com/blog/take-total-control/
Working for a service provider, I would say this would be a brilliant feature which would enable us to provide a relevant SLA for shared storage infrastructure as well as reduce the noisy neighbour impact further, with more granular control.
Frank, great post and interesting concept,
We also need to be sure that users can't set unrealistic values. If the underlying datastore can't support the workload for the reservation, then integration with VASA/VAAI to prevent it would be a good addition too.
I would prefer to see a mechanism with more transparency from the storage to vCenter, advertising the performance capability of datastores, e.g. this datastore supports up to 3000 IOPS, with reservations of X IOPS. A mechanism of this order would be ideal. If no reservations exist, then average usage metrics could be used as a guide to state that a datastore is subscribed to X% of its max performance potential. Taking it one step further, when deploying vApps, if IOPS requirements were input at deployment time, the best datastore placement could be based on these values. As Kendrick has already stated, this needs to be dynamic, so perhaps integration with VASA is the way to go; if more spindles are added to the pool, or more flash is added, then the datastore capabilities will change, and this needs to be flexible.
Anyway, like I said, great post; it got me thinking about the potential of such mechanisms. There is a lot that can be done here, and what happens when vVols come along?
Frank, Kaminario product management would love to see the results of the study as we aim to support VMware customers in any way we can. The results may help drive our roadmap and make it easier to integrate with VMware. Any chance we can get a copy?
Hi John,
Unfortunately the results were captured in a VMware database. As I no longer work for VMware I’m unable to provide any results.