HA ADMISSION CONTROL IS NOT A CAPACITY MANAGEMENT TOOL.
I receive a lot of questions on why HA doesn't work when virtual machines are not configured with VM-level reservations. If no VM-level reservations are used, the cluster will indicate a failover capacity of 99%, ignoring the CPU and memory configuration of the virtual machines. Usually my reply is that HA admission control is not a capacity management tool, and I have noticed I've been using this statement more and more lately. As explaining it on a per-customer basis doesn't scale well, it might be a good idea to write a blog article about it.

The basics
Sometimes it's better to review the basics again and understand where the perception of HA and the actual intended purpose of the product part ways. Let's start off with what HA admission control is designed for. In the availability guide the two following statements can be found:

Quote 1:
WANT TO HAVE A VSPHERE 5.1 CLUSTERING DEEPDIVE BOOK FOR FREE?
Want to have a vSphere 5.1 Clustering Deepdive book for free? CloudPhysics is giving away some vSphere 5.1 Clustering Deepdive books. Do the following if you want to receive a copy:

Action required
1. Email info@cloudphysics.com with a subject of "Book". No message is needed.
2. Register at http://www.cloudphysics.com/ by clicking "SIGN UP".
3. Install the CloudPhysics Observer vApp to activate your dashboard.

Eligibility rules
1. You are a new CloudPhysics user.
2. You fully install the CloudPhysics 'Observer' vApp in your vSphere environment.

The first 150 users get a free book. But what's even better, the CloudPhysics service gives you great insights into your current environment. For more info read the following blog posts: CloudPhysics in a nutshell and VM reservations and limits card - a closer look.
PARTIALLY CONNECTED DATASTORE CLUSTERS - WHERE CAN I FIND THE WARNINGS AND HOW TO SOLVE IT VIA THE WEB CLIENT?
During my Storage DRS presentation at VMworld I talked about datastore cluster architecture and covered the impact of partially connected datastore clusters. In short: when a datastore in a datastore cluster is not connected to all hosts of the connected DRS cluster, the datastore cluster is considered partially connected. This situation can occur when not all hosts are configured identically, or when new ESXi hosts are added to the DRS cluster.

The problem
I/O load balancing does not support partially connected datastores in a datastore cluster, and Storage DRS disables I/O load balancing for the entire datastore cluster. Not only for that single partially connected datastore, but for the entire cluster, effectively degrading a complete feature set of your virtual infrastructure. Therefore, having a homogeneous configuration throughout the cluster is imperative.

Warning messages
An entry is listed in the Storage DRS Faults window. In the vSphere Web Client:
1. Go to Storage.
2. Select the datastore cluster.
3. Select Monitor.
4. Select Storage DRS.
5. Select Faults.

The Connectivity menu option shows the Datastore Connection Status; in the case of a partially connected datastore, the message Datastore Connection Missing is listed. When clicking on the entry, the details are shown in the lower part of the view.

Returning to a fully connected state
To solve the problem, you must connect or mount the datastores to the newly added hosts. In the Web Client this is considered a host operation, therefore select the datacenter view and select the Hosts menu option.
1. Right-click a newly added host.
2. Select New Datastore.
3. Provide the name of the existing datastore.
4. Click Yes when the warning "Duplicate NFS Datastore Name" is displayed.
5. As the UI is using existing information, select Next until Finish.
6. Repeat the steps for the other new hosts.
After connecting all the new hosts to the datastore, check the Connectivity view in the Monitor menu of the datastore cluster.

Get notification of these blog postings and more DRS and Storage DRS information by following me on Twitter: @frankdenneman
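The connectivity check described above can be sketched in plain Python. This is a minimal model of the logic only: the host and datastore names are hypothetical, and real connectivity data would come from vCenter, not from hard-coded dicts.

```python
# Sketch: flag datastores that are not visible to every host in the
# DRS cluster. If any datastore comes back as partially connected,
# Storage DRS disables I/O load balancing for the whole datastore cluster.

def partially_connected(cluster_hosts, datastore_connections):
    """Return {datastore: [hosts missing the connection]} for any
    datastore not connected to all hosts in the cluster."""
    return {
        ds: sorted(set(cluster_hosts) - set(hosts))
        for ds, hosts in datastore_connections.items()
        if not set(cluster_hosts) <= set(hosts)
    }

hosts = ["esx-01", "esx-02", "esx-03"]          # esx-03 was just added
connections = {
    "nfs-f-01": ["esx-01", "esx-02", "esx-03"],
    "nfs-f-02": ["esx-01", "esx-02"],           # not yet mounted on esx-03
}

missing = partially_connected(hosts, connections)
print(missing)  # {'nfs-f-02': ['esx-03']}
```

A non-empty result corresponds to the "Datastore Connection Missing" fault: mounting the listed datastores on the listed hosts returns the cluster to a fully connected state.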
VSPHERE 5.1 DRS ADVANCED OPTION LIMITVMSPERESXHOST
During the Resource Management Group Discussion here at VMworld Barcelona a customer asked me about limiting the number of VMs per host. vSphere 5.1 contains an advanced option on DRS clusters to do this. If the advanced option "LimitVMsPerESXHost" is set, DRS will not admit or migrate more VMs to a host than that number. For example, when setting LimitVMsPerESXHost to 40, each host allows up to 40 virtual machines.

No correction for existing violations
Please note that DRS will not correct any existing violation if the advanced option is set while virtual machines are active in the cluster. This means that if you set LimitVMsPerESXHost to 40 and at that time 45 virtual machines are running on an ESX host, DRS will not migrate virtual machines off that host. However, it does not allow any more virtual machines on the host: DRS will not allow any power-ons or migrations to the host, whether manual (by the administrator) or automatic (by DRS).

High Availability
As this is a DRS cluster setting, HA will not honor the setting during a host failover operation. This means that HA can power on as many virtual machines on a host as it deems necessary. This is to avoid a denial of service caused by not allowing virtual machines to power on if LimitVMsPerESXHost is set too conservatively.

Impact on load balancing
Please be aware that this setting can impact VM happiness, as it can restrict DRS in finding a balance with regard to CPU and memory distribution.

Use cases
This setting is primarily intended to contain the failure domain. A popular analogy to describe this setting would be "limiting the number of eggs in one basket". As virtual infrastructures are generally dynamic, try to find a setting that restricts the impact of a host failure without restricting growth of the virtual machines. I'm really interested in feedback on this advanced setting, especially if you consider implementing it: the use case, and whether you want to see this setting further developed.
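The admission behavior described above can be summarized in a small sketch. This is an illustration of the rules in this post, not DRS code; the function names and data are mine.

```python
# Sketch of the LimitVMsPerESXHost rules: DRS refuses new power-ons and
# migrations to a host at or above the limit, but never evicts VMs from
# a host that already violates it.

LIMIT_VMS_PER_ESX_HOST = 40

def drs_admits_vm(current_vm_count, limit=LIMIT_VMS_PER_ESX_HOST):
    """Allow a power-on or migration only while the host stays under the limit."""
    return current_vm_count < limit

def drs_corrects_violation(current_vm_count, limit=LIMIT_VMS_PER_ESX_HOST):
    """DRS does not migrate VMs away just because the limit is exceeded."""
    return False  # existing violations are left in place

print(drs_admits_vm(39))           # True  -> the 40th VM is still allowed
print(drs_admits_vm(45))           # False -> no more power-ons or migrations
print(drs_corrects_violation(45))  # False -> the 45 running VMs stay put
```

Note that HA ignores this check entirely during a failover, as described in the High Availability paragraph above.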
CHANGING YOUR VCENTER LOGGING LEVEL USING THE WEBCLIENT
In order to monitor some load behavior I needed to increase the logging level of the vCenter server. The logging level is still included in the vCenter Server settings, however it takes a few more clicks to get to the Statistics option compared to the old vSphere client.
1. In the Home screen, click on vCenter.
2. Click on the vCenter Servers link in the Inventory list.
3. Select the vCenter server (probably Localhost).
4. Select the Manage tab in the right pane.
5. Select Settings.
6. Click on the Edit button located on the far right side of the screen.
7. Change the appropriate Statistics setting.
CLOUDPHYSICS VM RESERVATION & LIMITS CARD – A CLOSER LOOK
The VM Reservation and Limits card was released yesterday. CloudPhysics decided to create this card based on the popularity of this topic in the contest. So what does this card do? Let's have a closer look.

This card provides an easy overview of all the virtual machines configured with any reservation or limit for CPU and memory. Reservations are a great tool to guarantee a virtual machine continuous access to physical resources. When running business-critical applications, reservations can provide a constant performance baseline that helps you meet your SLA. However, reservations can impact your environment, as VM reservations impact the resource availability of other virtual machines in your virtual infrastructure. They can lower your consolidation ratio (The Admission Control Family) and can even impact other vSphere features such as vSphere High Availability. The CloudPhysics HA Simulation card can help you understand the impact of reservations on HA.

Besides reservations, virtual machine limits are displayed. A limit restricts the virtual machine's use of physical resources. A limit can be helpful to test an application during various levels of resource availability. However, virtual machine limits are not visible to the guest OS, therefore it cannot scale and size its own memory management (or even worse, the application's memory management) to reflect the availability of physical memory. For more information about memory limits, please read this post by Duncan: Memory limits. As the VMkernel is forced to provide alternative memory resources, limits can lead to the increased use of VM swap files. This can lead to performance problems for the application, but can also impact other virtual machines and subsystems used in the virtual infrastructure. The following article zooms in on one of the many problems when relying on swap files: Impact of host local VM swap on HA and DRS.
Color indicators
As virtual machine-level limits can impact the performance of the entire virtual infrastructure, the CloudPhysics engineers decided to add an additional indicator to help you easily detect limits. When a virtual machine is configured with a memory limit that is still greater than 50% of its configured size, an amber dot is displayed next to the configured limit size. If the limit is smaller than or equal to 50% of its configured size, a red dot is displayed next to the limit size. Similarly for CPU limits: an amber dot is displayed when the limit of a virtual machine is set but is more than 500MHz, while a red dot indicates that the virtual machine is configured with a CPU limit of 500MHz or less.

For example: virtual machine Load06 is configured with 16GB of memory. A limit is set at 8GB (8192MB); this limit is equal to 50% of the configured size. Therefore the VM Reservation and Limits card displays the configured limit in red and presents an additional red dot.

Flow of information
The indicators are also a natural divider between the memory resource controls and the CPU resource controls. As memory resource controls impact the virtual infrastructure more than the CPU resource controls, the card displays the memory resource controls at the left side of the screen. We are very interested in hearing feedback about this card, please leave a comment.
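The dot logic above can be sketched as a pair of small functions. The thresholds (50% of configured memory, 500MHz CPU) come from the text; the function names are mine, not CloudPhysics code.

```python
# Sketch of the card's color indicators: amber for a mild limit,
# red for a severe one, no dot when no limit is set.

def memory_limit_dot(configured_mb, limit_mb):
    """Memory: red at or below 50% of configured size, amber above it."""
    if limit_mb is None:
        return None  # no limit set, no indicator
    return "red" if limit_mb <= configured_mb * 0.5 else "amber"

def cpu_limit_dot(limit_mhz):
    """CPU: red at 500MHz or less, amber for any higher limit."""
    if limit_mhz is None:
        return None
    return "red" if limit_mhz <= 500 else "amber"

# Load06 from the example: 16GB configured, limit at exactly 50%
print(memory_limit_dot(16384, 8192))   # red
print(memory_limit_dot(16384, 12288))  # amber (limit is 75% of configured)
print(cpu_limit_dot(500))              # red
print(cpu_limit_dot(2000))             # amber
```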
FROM THE ARCHIVES - AN OLD ISOMETRIC DIAGRAM
While searching for a diagram I stumbled upon an old diagram I made in 2007. I think this diagram started my whole obsession with diagrams and with adding "cleanness" to my diagrams. This diagram depicts a virtual infrastructure located in two datacenters with replication between them. This infrastructure is no longer in use, but to make absolutely sure, I changed the device names into generic text labels such as ESX host, array, SW switch, etc. Back then I really liked to draw in the isometric style. Now I'm more focused on block diagrams and trying to minimize the number of components in a diagram. In essence I follow the words of Colin Chapman: "Simplify, then add lightness." But then applied to diagrams :) The fact that this diagram is still stored on my system tells me that I'm still very proud of it. So that made me wonder: which diagram did you design that you are proud of?
STORAGE DRS AUTOMATION LEVEL AND INITIAL PLACEMENT BEHAVIOR
Recently I was asked why Storage DRS is missing a partially automated mode. Storage DRS has two automation levels: No Automation (Manual mode) and Fully Automated mode. When comparing this with DRS, we notice that Storage DRS is missing a partially automated mode. But in reality the modes of Storage DRS cannot be compared to those of DRS at all. This article explains the difference in behavior.

DRS automation modes
There are three cluster automation levels:
1. Manual automation level: when a virtual machine is configured with the manual automation level, DRS generates both initial placement and load-balancing migration recommendations, however the user needs to manually approve these recommendations.
2. Partially automated level: DRS automatically places a virtual machine configured with the partially automated level, however it will generate a migration recommendation that requires manual approval.
3. Fully automated level: DRS automatically places a virtual machine on a host, and vCenter automatically applies the migration recommendations generated by DRS.

Storage DRS automation modes
There are two datastore cluster automation levels:
1. No Automation (Manual mode): Storage DRS will make migration recommendations for virtual machine storage, but will not perform automatic migrations.
2. Fully Automated: Storage DRS will make migration recommendations for virtual machine storage, and vCenter automatically confirms the migration recommendations.

No automatic initial placement in Storage DRS
Storage DRS does not provide placement recommendations for vCenter to automatically apply. (Remember that DRS and Storage DRS only generate recommendations; it is vCenter that actually approves these recommendations if set to automatic.) The automation level only applies to migration recommendations for existing virtual machines inside the datastore cluster.
However, Storage DRS does analyze the current state of the datastore cluster and generates initial placement recommendations based on the space utilization and I/O load of the datastores and the disk footprint and affinity rule set of the virtual machine. When provisioning a virtual machine, the summary screen in the user interface displays a datastore recommendation. When clicking on "more recommendations", less optimal recommendations are displayed. This screen provides information about the space utilization percentage before placement, the space utilization percentage after the virtual machine is placed, and the measured I/O latency before placement. Please note that even when I/O load balancing is disabled, Storage DRS uses overall vCenter I/O statistics to determine the best placement for the virtual machine. In this case the I/O latency metric is a secondary metric, which means that Storage DRS applies weighting to the space utilization and overall I/O latency: it will satisfy space utilization first before selecting a datastore with an overall lower I/O latency.

Adding new hard disks to an existing VM in a datastore cluster
As vCenter does not apply initial placement recommendations automatically, adding new disks to an existing virtual machine will also generate an initial placement recommendation. The placement of the disk is determined by the default affinity cluster rule. The datastore recommendation depicted below shows that the new hard disk is placed on datastore nfs-f-01. Why? Because Storage DRS needs to satisfy storage initial placement requests, and in this case that means satisfying the datastore cluster's default affinity rule. If the datastore cluster were configured with a VMDK anti-affinity rule, the datastore recommendation would show any other datastore except datastore nfs-f-01.
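The "space first, latency second" ordering can be illustrated with a toy ranking. Real Storage DRS applies an internal weighting to both metrics; the simple two-key sort below only demonstrates the ordering described above, and the datastore data is made up.

```python
# Sketch: rank candidate datastores for initial placement with space
# utilization after placement as the primary key and measured I/O
# latency as the secondary (tie-breaking) key.

def rank_datastores(datastores, vm_size_gb):
    """Return candidate datastores ordered best-first for a new VM."""
    def utilization_after(ds):
        return (ds["used_gb"] + vm_size_gb) / ds["capacity_gb"]
    return sorted(datastores,
                  key=lambda ds: (utilization_after(ds), ds["latency_ms"]))

candidates = [
    {"name": "nfs-f-01", "capacity_gb": 1000, "used_gb": 400, "latency_ms": 8},
    {"name": "nfs-f-02", "capacity_gb": 1000, "used_gb": 400, "latency_ms": 3},
    {"name": "nfs-f-03", "capacity_gb": 1000, "used_gb": 700, "latency_ms": 1},
]

best = rank_datastores(candidates, vm_size_gb=50)[0]
print(best["name"])  # nfs-f-02: same space as nfs-f-01 but lower latency;
                     # nfs-f-03 loses despite having the lowest latency
```

This mirrors the behavior in the text: a datastore with the lowest latency is still passed over when its space utilization after placement is worse.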
STORAGE DRS DEVICE MODELING BEHAVIOR
During a recent meeting the behavior of Storage DRS device modeling was discussed. When I/O load balancing is enabled, Storage DRS leverages the SIOC injector to determine the device characteristics of the disks backing the datastore. Because the injector stops when activity is detected on the datastore, the customer was afraid that Storage DRS wasn't able to get a proper model of his array due to the high levels of activity seen on the array. Storage DRS was designed to cope with these environments, and as the customer was reassured after I explained the behavior, I thought it might be interesting enough to share it with you too.

The purpose of device modeling
Device modeling is used by Storage DRS to characterize the performance levels of a datastore. This information is used when Storage DRS needs to predict the benefit of a possible migration of a virtual machine. The workload model provides information about the I/O behavior of the VM; Storage DRS uses that as input and mixes it with the device model of the datastore in order to predict the increase of latency after the move. The device modeling of the datastore is done with the SIOC injector.

The workload
To get a proper model, the SIOC injector injects random read I/O to the disk. SIOC uses different amounts of outstanding I/O to measure the latency. The duration of the complete cycle is 30 seconds and it is triggered once a day per datastore. Although it's a short-lived process, this workload does generate some overhead on the array, and Storage DRS is designed to enable storage performance for your virtual machines, not to interfere with them. Therefore this workload will not run when activity is detected on the devices backing the datastore.

Timer
As mentioned, the device modeling process runs for 30 seconds in order to characterize the device. If the I/O injector starts and the datastore is active or becomes active, the I/O injector will wait 1 minute before trying again.
If the datastore is still busy, it will try again in 2 minutes, after that it idles for 4 minutes, then 8 minutes, 16 minutes, 32 minutes, 1 hour and finally 2 hours. When the datastore is still busy two hours after the initial start, it will keep trying to start the device modeling at an interval of 2 hours until the end of the day. If SIOC is not able to characterize the disk during that day, it will use the average value of all the other datastores, in order not to influence the load-balancing operations with false information and provide information that would favor this disk over other datastores that did provide actual data. The next day the SIOC injector will try to model the device again, but uses a skew back and forth of 2 hours from the previous period; this way, during the year, Storage DRS will retrieve info across every period of the day.

Key takeaway
Overall we do not expect the array to be busy 24/7; there is always a window of 30 seconds where the datastore is idling. Having troubleshot many storage-related problems, I know arrays are not stressed all day long, therefore I'm more than confident that Storage DRS will have accurate device models to use for its prediction models.
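The retry schedule above can be laid out as a small sketch: back-off waits of 1, 2, 4, 8, 16, 32 minutes, then 1 and 2 hours, then retries every 2 hours for the rest of the day. This is an illustration of the timing described in this post, not SIOC source code, and it treats the waits as consecutive delays after the first blocked attempt.

```python
# Sketch: cumulative minutes (after the first blocked attempt) at which
# the SIOC injector retries its 30-second device-modeling run.

def retry_offsets_minutes(day_minutes=24 * 60):
    """Return the retry times, in minutes after the first attempt."""
    waits = [1, 2, 4, 8, 16, 32, 60, 120]  # the documented back-off steps
    offsets, elapsed = [], 0
    for w in waits:
        elapsed += w
        offsets.append(elapsed)
    while elapsed + 120 <= day_minutes:    # then every 2 hours until day's end
        elapsed += 120
        offsets.append(elapsed)
    return offsets

schedule = retry_offsets_minutes()
print(schedule[:8])  # [1, 3, 7, 15, 31, 63, 123, 243]
```

Any single quiet 30-second window at one of these points is enough for the injector to characterize the device, which is why a model is almost always obtained on an array that is not literally busy 24/7.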