
Should or Must VM-Host affinity rules?

December 1, 2010 by frankdenneman

VMware vSphere 4.1 introduces a new affinity rule, called “Virtual Machines to Hosts” (VM-Host), which I described in the article “VM to Host affinity rule”. A short recap: VM-Host affinity rules are available in two flavors:

  1. Must run rules (Mandatory)
  2. Should run rules (Preferential)

These two options present the administrator or architect with a new question: when is a mandatory rule needed and when is a preferential rule the better choice?
I think it all depends on the risk and limitations each rule introduces. Let's review the difference between the rules, the behavior of each rule, and the impact they have on cluster services and maintenance mode.
What is the difference between a mandatory and a preferential rule?

  • A mandatory rule limits HA, DRS and the user in such a way that a virtual machine may not be powered on or moved to an ESX host that does not belong to the associated DRS host group.
  • A preferential rule expresses a preference to DRS to run the virtual machine on the hosts specified in the associated DRS host group.

How does HA treat preferential rules?
VMware High Availability respects and obeys mandatory rules when placing virtual machines after a host failover: it can only place virtual machines on the ESX hosts specified in the DRS host group. DRS does not communicate the existence of preferential rules to HA, therefore HA is not aware of these rules. HA cannot prevent placement of a virtual machine on an ESX host that is not part of the DRS host group, thereby violating the affinity rule. DRS will correct this violation during its next invocation.
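To make the difference concrete, here is a minimal Python sketch (a hypothetical data model of my own, not VMware code) of how failover placement is constrained: mandatory rules shrink the set of hosts HA may restart a VM on, while preferential rules are simply not visible to HA.

```python
# Hypothetical model of failover placement; not VMware code.
MANDATORY = "must"
PREFERENTIAL = "should"

def ha_candidate_hosts(vm, rules, all_hosts):
    """Hosts HA may use to restart a VM after a host failure."""
    candidates = set(all_hosts)
    for rule in rules:
        if vm not in rule["vms"]:
            continue
        if rule["type"] == MANDATORY:
            # HA is aware of mandatory rules and honours them.
            candidates &= set(rule["hosts"])
        # Preferential rules are not communicated to HA, so they are
        # ignored here; DRS fixes any violation on its next invocation.
    return candidates

rules = [
    {"type": MANDATORY,    "vms": {"vm1"}, "hosts": {"esx01", "esx02"}},
    {"type": PREFERENTIAL, "vms": {"vm2"}, "hosts": {"esx03"}},
]
hosts = {"esx01", "esx02", "esx03", "esx04"}

print(ha_candidate_hosts("vm1", rules, hosts))  # {'esx01', 'esx02'}
print(ha_candidate_hosts("vm2", rules, hosts))  # all four hosts
```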
How does DRS treat preferential rules?
During an invocation, DRS runs its algorithm with the preferential rules treated as mandatory rules and evaluates the result. If the result violates cluster constraints, such as over-reserving a host or driving a host to 100% CPU or memory utilization, the preferential rules are dropped and the algorithm is run again.
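That two-pass behavior can be pictured with a small sketch. The Python below is a simplified illustration with assumed capacity numbers and a greedy placement of my own, not the actual DRS algorithm; it only shows the "treat preferential as mandatory first, drop on violation" flow.

```python
# Simplified, hypothetical illustration of the two-pass behaviour; not DRS code.
def place(vms, hosts, rules, preferential_as_mandatory):
    """Greedy placement: each VM goes to the least-loaded host it is allowed on."""
    load = {h: 0 for h in hosts}
    for vm, demand in vms.items():
        allowed = set(hosts)
        for rule in rules:
            binding = rule["type"] == "must" or preferential_as_mandatory
            if vm in rule["vms"] and binding:
                allowed &= set(rule["hosts"])
        target = min(allowed, key=lambda h: load[h])
        load[target] += demand
    return load

def drs_invocation(vms, hosts, rules, capacity=100):
    load = place(vms, hosts, rules, preferential_as_mandatory=True)
    if any(used > capacity for used in load.values()):   # constraint violated
        # Drop the preferential rules and run the algorithm again.
        load = place(vms, hosts, rules, preferential_as_mandatory=False)
    return load

vms = {"vm1": 60, "vm2": 60, "vm3": 30}
rules = [{"type": "should", "vms": {"vm1", "vm2"}, "hosts": {"esx01"}}]
print(drs_invocation(vms, ["esx01", "esx02"], rules))
```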
Limitations
In essence, a VM-Host affinity rule restricts the hosts on which virtual machines may be powered on or to which they may migrate. Setting VM-Host affinity rules can therefore limit load balancing and evacuation for maintenance mode.
Load-balancing limitations
Mandatory VM-Host affinity rules introduce a certain level of risk. Because virtual machines are only allowed to start on the ESX hosts in the associated DRS host group, the rules restrict HA's ability to select a compatible ESX host on which to place the virtual machine.
In addition, mandatory VM-Host affinity rules reduce the placement options available to DRS when defragmenting the cluster. When the HA "Percentage based" admission control policy is used, resource fragmentation can occur.
During a failover, HA requests a defragmentation from DRS. DRS tries to migrate virtual machines to regain enough unfragmented resources to fit and start all virtual machines. Because DRS is allowed to use "multi-hop" migrations, its calculations usually create a "chain" of migrations while defragmenting a host. For example: VM-A migrates to host 2 and VM-B migrates from host 2 to host 3. Mandatory rules narrow the playing field by allowing the affected VMs to move only within their associated DRS host group, which reduces the overall options for moving virtual machines around the cluster, even for virtual machines that are not associated with any VM-Host affinity rule.
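The pruning effect of mandatory rules on such migration chains can be illustrated as follows; the hosts, VMs and rule below are hypothetical examples, not a real cluster.

```python
# Hypothetical sketch of how a mandatory host group prunes a migration chain.
def allowed_hosts(vm, rules, all_hosts):
    """Hosts a VM may run on, given mandatory VM-Host rules."""
    allowed = set(all_hosts)
    for rule in rules:
        if rule["type"] == "must" and vm in rule["vms"]:
            allowed &= set(rule["hosts"])
    return allowed

def valid_chain(chain, rules, all_hosts):
    """chain: list of (vm, destination_host) hops in a multi-hop defragmentation."""
    return all(dest in allowed_hosts(vm, rules, all_hosts)
               for vm, dest in chain)

hosts = ["esx01", "esx02", "esx03"]
rules = [{"type": "must", "vms": {"vm-a"}, "hosts": {"esx01", "esx02"}}]

# VM-A moves to host 2, VM-B moves from host 2 to host 3 (the example above).
print(valid_chain([("vm-a", "esx02"), ("vm-b", "esx03")], rules, hosts))  # True
# The same chain routed through esx03 for VM-A is rejected by the mandatory rule.
print(valid_chain([("vm-a", "esx03"), ("vm-b", "esx03")], rules, hosts))  # False
```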
Maintenance mode
DRS will not violate CPU and memory reservations to obey mandatory VM-Host affinity rules, nor will it violate mandatory rules to honor reservations. During placement both requirements must be met, therefore DRS will only place a virtual machine if both its reservation and the mandatory rule can be satisfied. This behavior impacts the ability of DRS to select a suitable, compatible host when placing virtual machines during the automated evacuations of maintenance mode.
Conclusion
Well, it's up to you to decide which rule is appropriate for separating workloads across the ESX hosts in the cluster. Knowing the impact and limitations introduced by mandatory rules should help you make an informed decision.

Filed Under: DRS Tagged With: VM-Host affinity rules, VMware, vSphere 4.1

vSwitch Failback and High Availability

October 22, 2010 by frankdenneman

One setting that catches most admins off-guard is the vSwitch Failback setting in combination with HA. If the management network vSwitch is configured with Active/Standby NICs and the HA isolation response is set to "Shutdown VM" or "Power off VM", it is advised to set the vSwitch Failback mode to No. If it is left at the default (Yes), all the ESX hosts in the cluster, or even the entire virtual infrastructure, might issue an isolation response when one of the management network physical switches is rebooted. Here's why:
Just a quick rehash:
Active/Standby
One NIC (vmnic0) is assigned as active to the management/service console portgroup, and the second NIC (vmnic1) is configured as standby. The vMotion portgroup is configured the other way around: the first NIC (vmnic0) in standby mode and the second NIC (vmnic1) as active.

Active/Standby setup of the management network on vSwitch0

Failback
The Failback setting determines whether the VMkernel returns the uplink (NIC) to active duty after a downed link or failed NIC recovers. If Failback is set to Yes, the NIC returns to active duty; if Failback is set to No, the recovered NIC is assigned the standby role and the administrator must manually reconfigure the NIC to the active state.

Effect of the Failback=Yes setting on the environment

When the default Failback setting is used, unexpected behavior can occur during maintenance of a physical switch. Most switches, like those from Cisco, initialize the port after boot, so-called lights-on: the port link is up but is still unable to receive or transmit data. The transition from lights-on to forwarding mode can take up to 50 seconds. Unfortunately ESX cannot distinguish between lights-on status and forwarding mode; it treats the link as usable and returns the NIC to active status again.
High Availability will keep transmitting heartbeats and expecting to receive them. After missing heartbeats for 13 seconds, HA tries to ping its isolation address and, because of the specified isolation response, shuts down or powers off the virtual machines two seconds later to allow other ESX hosts to power them up. But because it is common, recommended even, to configure each host in the cluster identically, the active NIC used by the management network of every ESX host connects to the same physical switch. Due to this design, once the switch is rebooted, a cluster-wide isolation response occurs, resulting in a cluster-wide outage.
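A quick back-of-the-envelope calculation shows why the timing works against you; the HA timings are the defaults mentioned above, and the 50-second figure is switch-dependent.

```python
# Back-of-the-envelope timing; the port recovery figure varies per switch model.
missed_heartbeats_s = 13          # HA declares isolation after 13 s without heartbeats
isolation_response_delay_s = 2    # isolation response fires ~2 s later
port_to_forwarding_s = 50         # lights-on to forwarding can take up to ~50 s

isolation_response_at = missed_heartbeats_s + isolation_response_delay_s
print(isolation_response_at)                          # 15 s after the link "comes up"
print(isolation_response_at < port_to_forwarding_s)   # True: VMs are shut down
                                                      # before the port forwards traffic
```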
To allow switch maintenance, it's better to set the vSwitch Failback mode to No. This setting requires more manual operations after a failure or certain maintenance operations, but it reduces the chance of "false positives" and cluster-wide isolation responses.

Filed Under: Networking Tagged With: Failover, HA, VMware, vswitch

NUMA, Hyperthreading and NUMA.PreferHT

October 7, 2010 by frankdenneman

I received a lot of questions about Hyperthreading and NUMA in ESX 4.1 after writing the ESX 4.1 NUMA scheduling article.

A common misconception is that Hyperthreading is ignored and therefore not used on a NUMA system. This is not entirely true; thanks to the improved Hyperthreading code on Nehalem, the CPU scheduler is programmed to use the HT feature more aggressively than in previous releases of ESX. The main reason this misconception exists, I think, is the way the NUMA load balancer handles vCPU placement of vSMP virtual machines. Before continuing, let's align our CPU nomenclature; I've created a diagram showing all the elements:

NUMA and CPU elements
The Nehalem Hyperthreading feature is officially called Simultaneous MultiThreading (SMT); the terms HT and SMT are interchangeable.

1. An Intel Nehalem processor is often called a CPU or package.
2. An Intel Nehalem processor contains 4 cores in one package.
3. Each core contains 2 threads if Hyperthreading is enabled.
4. An SMT thread equals a logical processor.
5. A logical processor shows up in esxtop as a PCPU.
6. A vCPU is scheduled on a PCPU.
7. NUMA = Non-Uniform Memory Access (each processor has its own local memory assigned).
8. LLC = Last Level Cache: the last on-die cache level, shared by the cores, before turning to local memory.

NUMA load balancer virtual machine placement
During placement of a vSMP virtual machine, the NUMA load balancer assigns a single vCPU per CPU core and "ignores" the availability of SMT threads. As a result a 4-way vSMP virtual machine will be placed on four cores. In ESX 4.1 this virtual machine can be placed on one processor or on two processors, depending on the number of cores per processor or, if set, on the advanced option numa.vcpu.maxPerMachineNode.

When a virtual machine contains more vCPUs than the number of cores in the processor, the virtual machine spans multiple processors (a wide-VM). The default policy is to span the virtual machine across as few processors (NUMA nodes) as possible, but this can be overridden by the advanced option numa.vcpu.maxPerMachineNode, which defines the maximum number of vCPUs of a virtual machine per NUMA client. As always, only use advanced options if you know the full impact of the setting on your environment. But I digress; let's go back to NUMA and Hyperthreading.
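As a rough illustration of the arithmetic (my own sketch, not VMkernel code): the number of NUMA clients a virtual machine is split into follows from its vCPU count, the cores per NUMA node and, if set, numa.vcpu.maxPerMachineNode.

```python
import math

def numa_clients(vcpus, cores_per_node, max_per_machine_node=None):
    """How many NUMA nodes a VM is split across when SMT threads are
    ignored during placement (one vCPU per core)."""
    per_node = (cores_per_node if max_per_machine_node is None
                else min(cores_per_node, max_per_machine_node))
    return math.ceil(vcpus / per_node)

print(numa_clients(4, cores_per_node=4))                          # 1: fits one node
print(numa_clients(8, cores_per_node=4))                          # 2: wide-VM
print(numa_clients(8, cores_per_node=4, max_per_machine_node=2))  # 4 nodes
```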

Now the key thing to understand is that the SMT threads are ignored by the NUMA load balancer only during placement. It is then up to the CPU scheduler to decide how it schedules the vCPUs within the core. It can allow a vCPU to use the full core or schedule it on an SMT thread, depending on the workload, resource entitlement, the number of active vCPUs and the available PCPUs in the system.

Because SMT threads share resources within a core, running a vCPU on an SMT thread results in less performance than running it on a dedicated core. The ESX scheduler is therefore designed to try to spread the load across all the cores in the NUMA node or in the server.
Basically, if the load is low enough it tries to schedule the vCPU on a complete core; if that's not possible, it schedules the vCPU on an SMT thread.

As mentioned before, running a vCPU on an SMT thread does not offer the same progress as running on a complete core; therefore a different charging scheme is used for each scenario. The charging scheme keeps track of the delivered resources and checks whether the VM receives its entitled resources; more on this topic can be found in the article "Reservations and CPU scheduling".

NUMA.preferHT=One NUMA node to rule them all?
Although the CPU scheduler can decide how to schedule a vCPU within the core, it will only schedule one vCPU of a vSMP virtual machine onto each core. Scott Drummonds' article about numa.preferHT might offer a solution. Setting the advanced parameter numa.preferHT=1 allows the NUMA load balancer to assign vCPUs to SMT threads and, if possible, "contain" a vSMP VM within a single NUMA node. However, the number of vCPUs must be less than or equal to the number of PCPUs within the NUMA node.
By placing all vCPUs within one processor, a virtual machine with a cache-intensive workload can benefit from a "warmed-up" cache. The vCPUs can fetch memory from the Last Level Cache instead of turning to local memory, resulting in lower latency. And this is exactly why this setting might not be beneficial to most environments.
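The placement difference can be sketched as a simple capacity check; the function below is a hypothetical illustration of my own, not the scheduler's actual logic.

```python
# Hypothetical illustration of the placement difference, not scheduler code.
def fits_in_one_node(vcpus, cores_per_node, smt=True, prefer_ht=False):
    """With numa.preferHT the load balancer may count SMT threads (2 per core)
    as placement slots; by default it counts only the cores."""
    slots = cores_per_node * (2 if (prefer_ht and smt) else 1)
    return vcpus <= slots

# An 8-vCPU VM on quad-core Nehalem nodes (8 logical processors per node):
print(fits_in_one_node(8, cores_per_node=4))                  # False: becomes a wide-VM
print(fits_in_one_node(8, cores_per_node=4, prefer_ht=True))  # True: one node, shared LLC
```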

The numa.preferHT setting is a scheduler-wide setting, which means the NUMA load balancer will place every vSMP virtual machine inside a single processor, both cache-intensive workloads and workloads with a small cache footprint. The ESX 4.1 CPU scheduler currently does not detect different workload types, so it cannot distinguish virtual machines from each other and select the appropriate placement method for each, i.e. place the virtual machine within one processor and use SMT threads, or use wide-VM NUMA placement and "isolate" a vCPU per core.

It is crucial to know that placing all vCPUs on one processor does not guarantee that the virtual machine has all its memory in local memory. The main goal is to use the LLC as much as possible, but on a cache miss (memory not available in cache) the data is fetched from local memory. The VMkernel tries to keep memory as local as possible, but if there is not enough room in local memory, it places the memory in remote memory. Storing memory in remote memory is still faster than swapping it out to disk, but inter-socket communication is noticeably slower than intra-socket communication.

This brings me to the migration of virtual machines between NUMA nodes: if a virtual machine's home node is more heavily loaded than the other NUMA nodes, the virtual machine will be migrated to a less loaded NUMA node. During the migration phase, local memory turns into remote memory. This newly remote memory is migrated back gradually, because moving memory has a high overhead.
Using the numa.preferHT option therefore forces you to watch the maximum amount of memory assigned to a virtual machine and the consolidation ratio. Having multiple virtual machines traverse the QuickPath Interconnect to fetch memory stored in remote memory defeats the purpose of containing the virtual machines inside a processor.

Filed Under: NUMA Tagged With: NUMA, Prefer.HT, VMware

VM settings: Prefer Partially Automated over Disabled

October 1, 2010 by frankdenneman

Due to requirements or constraints it might be necessary to exclude a virtual machine from automatic migration and stop DRS from moving it around. Use the "Partially automated" setting instead of "Disabled" at the individual virtual machine automation level. Partially automated blocks automated migration by DRS, but keeps the initial placement function: during startup, DRS is still able to select the most optimal host for the virtual machine. With the "Disabled" setting, the virtual machine is started on the ESX server on which it is registered, and the chances of getting an optimal placement are low(er).
An exception to this recommendation might be a virtualized vCenter server; most admins like to keep track of the vCenter server in case a disaster happens. After a disaster occurs, for example a datacenter-wide power outage, they only need to power up the ESX host on which the vCenter VM is registered and manually power up the vCenter VM. An alternative is to keep track of the datastore vCenter is placed on and, after a disaster, register and power on the VM on a (random) ESX host. This is slightly more work than disabling DRS for vCenter, but probably offers better performance of the vCenter virtual machine during normal operations.
Due to expanding virtual infrastructures and new features, vCenter is becoming more and more important for day-to-day operational management. Assuring good performance outweighs the additional effort needed after a (hopefully) rare event, but both methods have their merits.

Filed Under: DRS Tagged With: DRS, Partially automated, VMware

Resource pools and simultaneous vMotions

September 20, 2010 by frankdenneman

Many organizations have the bad habit of using resource pools to create a folder structure in the Hosts and Clusters view of vCenter. Virtual machines are placed inside a resource pool to show some kind of relation or sorting order, like operating system or type of application. This is not the reason why VMware invented resource pools. Resource pools are meant to prioritize virtual machine workloads and to guarantee and/or limit the amount of resources available to a group of virtual machines.
During design workshops I always try to convince the customer that resource pools should not be used to create a folder structure. My main objection is the sibling share level of resource pools and virtual machines.
VMware VM and Resource Pool Sibling Share Level
Shares specify the priority of a virtual machine or resource pool relative to the other resource pools and/or virtual machines with the same parent in the resource hierarchy. The key point is that share values can be compared directly only among siblings: the ratio of the shares of VM6:VM7 tells which VM has higher priority, but the shares of VM4:VM6 do not.
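A worked example makes the sibling comparison concrete. The numbers below are hypothetical (a 20 GHz cluster and default-style share values) and the calculation is my own simplification, but it reproduces the VM4 versus VM6/VM7 comparison: a VM directly under the cluster root can end up with a much larger slice than a VM inside a resource pool, even though both have "Normal" shares.

```python
# Hypothetical numbers: divide each parent's capacity by the children's share ratio.
def split(capacity, children):
    """children: {name: shares}; returns each child's slice of the parent capacity."""
    total = sum(children.values())
    return {name: capacity * shares / total for name, shares in children.items()}

cluster_mhz = 20000
# Root-level siblings: one resource pool and one VM, both with "Normal" shares.
root = split(cluster_mhz, {"pool-prod": 4000, "vm4": 1000})
# Inside the pool, eight VMs (vm5..vm12) compete for the pool's slice.
pool_vms = split(root["pool-prod"], {f"vm{i}": 1000 for i in range(5, 13)})

print(round(root["vm4"]))      # 4000 MHz for the VM at the root
print(round(pool_vms["vm6"]))  # 2000 MHz for each VM inside the pool (vm6 == vm7)
```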
Many articles have been written about this, such as: “The resource pool priority-pie paradox”, (Craig Risinger) “Resource pools and shares” (Duncan Epping), “Don’t add resource pools for fun” (Eric Sloof) and “Resource pools caveats” (Bouke Groenescheij).
Another reason not to use resource pools as a folder structure is the limitation resource pools inflict on vMotion operations. Depending on the network speed, vSphere 4.1 allows 8 simultaneous vMotion operations; however, simultaneous migrations with vMotion can only occur if the virtual machine is moving between hosts in the same cluster and is not changing its resource pool. This was recently confirmed in Knowledge Base article 1026102.
Fortunately, simultaneous cross-resource-pool vMotions can occur if the virtual machines are migrating to different resource pools, but still only one vMotion operation per target resource pool. Because a cluster is actually an implicit resource pool (the root resource pool), migrations between clusters are also limited to a single concurrent vMotion operation.
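A small sketch of the resulting scheduling behavior; the limits come from the paragraph above and KB 1026102, while the grouping logic itself is my own simplification.

```python
# Simplified sketch; the real limits also depend on network speed and host version.
from collections import defaultdict

def max_concurrency(migrations, per_host_limit=8):
    """migrations: list of dicts with 'vm', 'src_pool', 'dst_pool'.
    Same-pool migrations run up to the host/network limit in parallel;
    cross-pool migrations are serialized per target resource pool."""
    same_pool = [m for m in migrations if m["src_pool"] == m["dst_pool"]]
    cross_pool = defaultdict(list)
    for m in migrations:
        if m["src_pool"] != m["dst_pool"]:
            cross_pool[m["dst_pool"]].append(m)
    # One slot per distinct target pool for cross-pool moves.
    return min(len(same_pool), per_host_limit) + len(cross_pool)

batch = [
    {"vm": "vm1", "src_pool": "gold", "dst_pool": "gold"},
    {"vm": "vm2", "src_pool": "gold", "dst_pool": "gold"},
    {"vm": "vm3", "src_pool": "gold", "dst_pool": "silver"},
    {"vm": "vm4", "src_pool": "gold", "dst_pool": "silver"},  # waits for vm3
    {"vm": "vm5", "src_pool": "gold", "dst_pool": "bronze"},
]
print(max_concurrency(batch))  # 4: two same-pool + one per target pool (silver, bronze)
```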
Resource Pool migrations
Using resource pools to create a folder structure can not only impact the availability of resources for the virtual machines, but can also hinder your daily (maintenance) operations if batches of virtual machines are being migrated to other resource pools.

Filed Under: DRS Tagged With: resource pools, vMotion, VMware

