How to set up Multi-NIC vMotion on a distributed vSwitch

This article provides an overview of the steps required to set up a Multi-NIC vMotion configuration on an existing distributed switch with the vSphere 5.1 Web Client. It was created to act as reference material for the Designing your vMotion network series.

Configuring Multi-NIC vMotion is done at two layers. The first is the distributed switch layer, where we create two distributed port groups; the second is the host layer, where we configure two VMkernel NICs and connect them to the appropriate distributed port groups.

00-distributed-switch-and-host-levels

Before you start, have two IP addresses ready for the VMkernel NICs, along with their respective subnet masks and VLAN IDs.
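It can help to capture these values up front. The sketch below is a simple Python summary of the configuration built in this article; the VLAN IDs, subnets, uplink names and IP addresses are placeholders, not recommendations.

```python
# Placeholder plan for a Multi-NIC vMotion setup on one host; substitute your
# own VLAN IDs, subnets, uplink names and VMkernel IP addresses.
vmotion_plan = {
    "vMotion-01": {
        "vlan_id": 201,
        "subnet": "10.0.201.0/24",
        "active_uplink": "dvUplink1",
        "standby_uplink": "dvUplink2",
        "vmk_ip": "10.0.201.11",
    },
    "vMotion-02": {
        "vlan_id": 202,
        "subnet": "10.0.202.0/24",
        "active_uplink": "dvUplink2",   # alternate failover order
        "standby_uplink": "dvUplink1",
        "vmk_ip": "10.0.202.11",
    },
}
```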

Distributed switch level
The first two steps are done at the distributed switch level. Click the Networking icon on the home screen and select the distributed switch.

Step 1: Create the vMotion distributed port groups on the distributed switch
The initial configuration is basic: just provide a name and use the defaults:

 

1: Select the distributed switch, right-click and select "New Distributed Port Group".
01-New-Distributed-Port-Group

2: Provide a name, call it “vMotion-01” and confirm it’s the correct distributed switch.
02-Name-and-Location

3: Keep the defaults at Configure settings and click next.
03-dPortgroup-settings

4: Review the settings and click finish.

Do the same for the second distributed port group and name it "vMotion-02".
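If you prefer scripting this step, the sketch below shows one way to create the two port groups with pyVmomi. It assumes you already hold a reference to the distributed switch object (dvs) from a connected vCenter session; the names and port count are just examples.

```python
from pyVmomi import vim

def create_vmotion_portgroups(dvs, names=("vMotion-01", "vMotion-02")):
    """Create the vMotion distributed port groups with default settings."""
    specs = []
    for name in names:
        spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
        spec.name = name
        spec.type = vim.dvs.DistributedVirtualPortgroup.PortgroupType.earlyBinding
        spec.numPorts = 8  # example value; the wizard default works as well
        specs.append(spec)
    # AddDVPortgroup_Task accepts a list of port group config specs
    return dvs.AddDVPortgroup_Task(specs)
```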

Step 2: Configuring the vMotion distributed port groups
Configuring the vMotion distributed port groups consists of two changes: entering the VLAN ID and setting the correct failover order.

1: Select distributed port group vMotion-01 on the left side of the screen, right-click and select Edit Settings.
2: Go to VLAN, select VLAN as the VLAN type and enter the VLAN ID used by the first VMkernel NIC.
04-VLAN-ID

3: Select "Teaming and failover", move the second dvUplink down to mark it as a "Standby uplink". Verify that load balancing is set to "Route based on originating virtual port".
05-teaming-and-failover

4: Click OK

Repeat the instructions of step 2 for distributed port group vMotion-02, but use the VLAN ID that corresponds with the IP address of the second VMkernel NIC.
06-VLAN-ID-2

Go to Teaming and failover and configure the uplinks in the alternate order, ensuring that the second vMotion VMkernel NIC uses dvUplink2.
07-teaming-and-failover
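The same VLAN and failover settings can also be applied through the API. Below is a hedged pyVmomi sketch, assuming pg01 and pg02 are the vim.dvs.DistributedVirtualPortgroup objects and that the VLAN IDs and uplink names match your environment.

```python
from pyVmomi import vim

def configure_vmotion_portgroup(pg, vlan_id, active_uplink, standby_uplink):
    """Set the VLAN ID and the active/standby failover order on a port group."""
    port_config = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy()

    # VLAN ID used by the VMkernel NIC connected to this port group
    port_config.vlan = vim.dvs.VmwareDistributedVirtualSwitch.VlanIdSpec(
        inherited=False, vlanId=vlan_id)

    # "Route based on originating virtual port" with one active and one standby uplink
    teaming = vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortTeamingPolicy()
    teaming.inherited = False
    teaming.policy = vim.StringPolicy(inherited=False, value="loadbalance_srcid")
    order = vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortOrderPolicy()
    order.inherited = False
    order.activeUplinkPort = [active_uplink]
    order.standbyUplinkPort = [standby_uplink]
    teaming.uplinkPortOrder = order
    port_config.uplinkTeamingPolicy = teaming

    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
    spec.configVersion = pg.config.configVersion
    spec.defaultPortConfig = port_config
    return pg.ReconfigureDVPortgroup_Task(spec)

# vMotion-01 is active on dvUplink1, vMotion-02 uses the alternate order
# configure_vmotion_portgroup(pg01, 201, "dvUplink1", "dvUplink2")
# configure_vmotion_portgroup(pg02, 202, "dvUplink2", "dvUplink1")
```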

Host level
We are done at the distributed switch level; the distributed switch now updates all connected hosts and each host has access to the distributed port groups. At the host level, two vMotion-enabled VMkernel NICs are configured. Go to the Hosts and Clusters view.

Step 3: Create vMotion enabled VMkernel NICs
1: Select the first host in the cluster, go to Manage, Networking and select "Add host networking".
08-add-host-networking

2: Select VMkernel Network Adapter.
09-VMkernel-Network-Adapter

3: Select an existing distributed port group, click Browse and select distributed port group "vMotion-01". Click OK and click Next.
10-select-network

11-select-target-device

4: Select vMotion traffic and click on Next.
12-Port-Properties

5: Select static IPv4 settings and enter the IP address of the first VMkernel NIC, corresponding with the VLAN ID set on distributed port group vMotion-01.
13-ip-address

6: Click Next and review the settings.

Create the second vMotion-enabled VMkernel NIC. Configure it identically, except:
1: Select the vMotion-02 port group.
2: Enter the IP address corresponding with the VLAN ID on distributed port group vMotion-02.

The setup of a Multi-NIC vMotion configuration on a single host is now complete. Repeat Step 3 on each host in the cluster.
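For those automating the host-level step, the sketch below outlines the equivalent pyVmomi calls; it assumes host is the vim.HostSystem object, dvs_uuid is the distributed switch UUID and pg_key the key of the vMotion port group, all taken from your own environment.

```python
from pyVmomi import vim

def add_vmotion_vmknic(host, dvs_uuid, pg_key, ip, netmask):
    """Create a VMkernel NIC on a distributed port group and enable vMotion on it."""
    nic_spec = vim.host.VirtualNic.Specification()
    nic_spec.ip = vim.host.IpConfig(dhcp=False, ipAddress=ip, subnetMask=netmask)
    nic_spec.distributedVirtualPort = vim.dvs.PortConnection(
        switchUuid=dvs_uuid, portgroupKey=pg_key)

    # The portgroup argument is empty because the vmknic connects to a
    # distributed port group instead of a standard vSwitch portgroup.
    vmk = host.configManager.networkSystem.AddVirtualNic("", nic_spec)

    # Tag the new VMkernel NIC for vMotion traffic
    host.configManager.virtualNicManager.SelectVnicForNicType("vmotion", vmk)
    return vmk
```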

Direct IP-storage and using NetIOC User-defined network resource pools for QoS

Some customers use iSCSI initiators inside the guest OS to connect directly to a datastore on the array, or use an NFS client inside the guest OS to access remote NFS storage directly, thereby circumventing the VMkernel storage stack of the ESX host. The virtual machine connects to the remote storage system via a VM network portgroup, and the VMkernel therefore classifies this network traffic as virtual machine traffic. This indifference, or non-discriminating behavior, of the VMkernel might not suit you or might not help you maintain service level agreements.

Isolate traffic
In the 1GbE adapter world, having redundant and isolated uplinks assigned to different sorts of traffic is a simple way to avoid worrying about traffic congestion. However, when using a small number of 10GbE adapters you need to be able to partition network bandwidth among the different types of network traffic flows. This is where NetIOC comes into play. Please read the "Primer on Network I/O Control" article to quickly brush up on your knowledge of NetIOC.

System network resource pools
By default NetIOC provides seven different system network resource pools. Six network pools are used to bind VMkernel traffic, such as NFS and iSCSI. One system network resource pool is used for virtual machine network traffic.

03-system network resource pools

The network adapters you use to connect to your IP storage from within the guest OS connect to a virtual machine network portgroup. Therefore NetIOC binds this traffic to the virtual machine network resource pool. As a result, this traffic shares bandwidth and prioritization with "common" virtual machine network traffic.

01-vm-portgroup-mapping-to-virtual-machine-portgroup

User-defined network resource pool
Most customers tend to prioritize IP storage traffic over network traffic generated by applications and the guest OS. To ensure the IP storage traffic created by the NFS client or iSCSI initiator inside the guest OS is prioritized appropriately, create a user-defined network resource pool. User-defined network resource pools are available from vSphere 5.0 and upwards; make sure your distributed switch is at least version 5.0.

Shares: User-defined network resource pools are available to isolate and prioritize virtual machine network traffic. Configure the user-defined network resource pool with an appropriate number of shares. The number of shares reflects the relative priority of this network pool compared to the other traffic streams using the same dvUplink.
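As an illustration of how shares behave under contention, the small sketch below divides the bandwidth of a saturated uplink proportionally to the configured shares; the 10GbE link speed and the share values are assumptions for the example only.

```python
def bandwidth_under_contention(shares, link_gbit=10.0):
    """Divide an uplink's bandwidth proportionally to the configured shares.

    Shares only come into play when the uplink is saturated; otherwise each
    pool can consume whatever bandwidth is available.
    """
    total = sum(shares.values())
    return {pool: round(link_gbit * value / total, 2) for pool, value in shares.items()}

# Example: a user-defined pool for in-guest IP storage (100 shares) competing
# with the virtual machine pool (50) and vMotion (50) on the same dvUplink.
print(bandwidth_under_contention({"dNFS": 100, "VM traffic": 50, "vMotion": 50}))
# -> {'dNFS': 5.0, 'VM traffic': 2.5, 'vMotion': 2.5}
```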

QoS tag: Another benefit of creating a separate User-defined network resource pool is the ability to set a QoS tag specifically for this traffic stream. If you are using IEEE 802.1p tagging end-to-end throughout your virtual infrastructure ecosystem, setting the QoS tag on the User-defined network resource pool helps you to maintain the service level for your storage traffic.

011-user-defined-network-resource-pool

Setup
In a greenfield scenario, set up the user-defined network resource pool first; this allows you to select the correct network pool during the creation of the dvPortgroups. If you have already created the dvPortgroups, you can assign the correct network resource pool once it exists.

Create a user defined network resource pool:
1. Open your vSphere web client and go to networking.
2. Select the dvSwitch
3. Go to Manage
4. Select Resource Allocation tab
5. Click on the new icon.
6. Configure the network resource pool and click on OK

04-New-Network-Resource-Pool

I have already created a user-defined network resource pool called dNFS; the overview of available network resource pools on the dvSwitch looks like this:

02-network-pools-overview

To map the network resource pool to the distributed port group, create a new distributed port group, or edit an existing one, and select the appropriate network resource pool:

03-new-distributed-portgroup
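Scripting that mapping is possible as well. The sketch below is a pyVmomi example under the assumption that dvs and pg are the distributed switch and port group objects and that the pool name matches an existing user-defined network resource pool (dNFS in this case).

```python
from pyVmomi import vim

def assign_pool_to_portgroup(dvs, pg, pool_name="dNFS"):
    """Map an existing distributed port group to a user-defined network resource pool."""
    # Look up the key of the user-defined pool by its name
    pool_key = next(p.key for p in dvs.networkResourcePool if p.name == pool_name)

    port_config = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy()
    port_config.networkResourcePoolKey = vim.StringPolicy(inherited=False, value=pool_key)

    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
    spec.configVersion = pg.config.configVersion
    spec.defaultPortConfig = port_config
    return pg.ReconfigureDVPortgroup_Task(spec)
```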

IP-Hash versus LBT

vSwitch configuration and load-balancing policy selection are major parts of a virtual infrastructure design. Selecting a load-balancing policy can impact the performance of the virtual machines and can introduce additional requirements at the physical network layer. Not only do I spend lots of time discussing the various options during design sessions, it is also a frequently discussed topic during VCDX defense panels.

More and more companies seem to use IP-hash as their load-balancing policy. The main arguments seem to be increased bandwidth and better redundancy. Even when the distributed vSwitch is used, most organizations still choose IP-hash over the newer load-balancing policy "Route based on physical NIC load". This article compares the two policies and lists the characteristics, requirements and constraints of each.

IP-Hash
The main reason for selecting IP-hash seems to be the increased bandwidth gained by aggregating multiple uplinks; unfortunately, adding more uplinks does not proportionally increase the available bandwidth for the virtual machines.

How IP-Hash works
The VMkernel distributes the load across the available NICs in the vSwitch based on the source and destination IP addresses of each connection. The calculation of the outbound NIC selection is described in KB article 1007371. To calculate the IP-hash yourself, convert both the source and destination IP addresses to a hex value, XOR them and compute the modulo over the number of available uplinks in the team. For example:

Virtual Machine 1 opens two connections: one connection to a backup server and one connection to an application server.

Virtual Machine       IP-Address      Hex Value
VM1                   164.18.1.84     A4120154
Backup Server         164.18.1.160    A41201A0
Application Server    164.18.1.195    A41201C3

The vSwitch is configured with two uplinks.

Connection 1: VM1 > Backup Server (A4120154 Xor A41201A0 = F4) % 2 = 0
Connection 2: VM1 > Application Server (A4120154 Xor A41201C3 = 97) % 2 = 1
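The uplink selection in the example above can be reproduced with a few lines of Python. This is an illustrative sketch of the calculation described here; see KB article 1007371 for the authoritative description.

```python
def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Return the uplink index selected for a source/destination IP pair."""
    def to_int(ip):
        # Convert a dotted-quad address to its 32-bit (hex) value
        a, b, c, d = (int(octet) for octet in ip.split("."))
        return (a << 24) | (b << 16) | (c << 8) | d
    # XOR the two addresses and take the modulo over the number of uplinks
    return (to_int(src_ip) ^ to_int(dst_ip)) % uplink_count

print(ip_hash_uplink("164.18.1.84", "164.18.1.160", 2))  # VM1 -> backup server: 0
print(ip_hash_uplink("164.18.1.84", "164.18.1.195", 2))  # VM1 -> application server: 1
```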

IP-hash treats each connection between a source and destination IP address as a unique route and the vSwitch distributes these connections across the available uplinks. However, due to the pNIC-to-vNIC affiliation, traffic is balanced on a per-flow basis. A flow cannot overflow to another uplink, which means that a single connection is still limited to the speed of a single physical NIC. A real-world use case for IP-hash would be a backup server that requires a lot of bandwidth across multiple connections; other than that, there are very few workloads with bandwidth requirements that cannot be satisfied by a single adapter.

Complexity – In order for IP-hash to function correctly, additional configuration at the physical network layer is required:

EtherChannel: IP-hash needs to be configured on the vSwitch if EtherChannel technology is used at the physical switch layer. With EtherChannel the switch load balances connections over multiple ports in the EtherChannel. Without IP-hash, the VMkernel only expects to receive traffic for a specific MAC address on a single vNIC, resulting in some sessions going through to the virtual machine while other sessions are dropped. When IP-hash is selected, the VMkernel accepts inbound traffic for that MAC address on both active NICs.

EtherChannel configuration: As vSphere does not support dynamic link aggregation (LACP), none of the members can be set up to auto-negotiate membership; therefore the physical switches have to be configured with a static EtherChannel.

Switch configuration: vSphere supports EtherChannel from one switch to the vSwitch. This switch can be a single switch or a stack of individual switches that act as one, but vSphere does not support EtherChannel from two separate, non-stacked switches when the EtherChannels connect to the same vSwitch.

Additional overhead – For each connection the VMkernel needs to select the appropriate uplink. If a virtual machine runs a front-end application and spends 95% of its time communicating with the back-end database, the IP-hash calculation is almost pointless: the VMkernel needs to perform the math for every connection, yet 95% of the connections will use the same uplink because the algorithm will always result in the same hash.

Utilization-unaware – It is possible that a second virtual machine is assigned to the same uplink as a virtual machine that is already saturating the link. Let's use the first example and introduce a new virtual machine, VM3. Due to the backup window, VM3 connects to the backup server.

Virtual Machine       IP-Address      Hex Value
VM3                   164.18.1.86     A4120156

Connection 3: VM3> Backup Server (A4120156 Xor A41201A0 = F6) % 2 = 0

Because the IP-hash load-balancing policy is unaware of utilization, it will not rebalance if an uplink is saturated or if virtual machines are added or removed due to power-on operations or (DRS) migrations. DRS is unaware of network utilization and does not initiate a rebalance if a virtual machine cannot send or receive packets due to physical NIC saturation. In a worst-case scenario DRS can migrate virtual machines to other ESX servers, leaving behind the virtual machines that are saturating a NIC while migrating the virtual machines that utilize the other NICs. Admittedly it's a bit of a stretch, but being aware of this behavior allows you to see the true beauty of the Load-Based Teaming policy.

Possible denial of service – Due to the pNIC-to-vNIC affiliation per connection, a misbehaving virtual machine that generates many connections can cause a form of denial of service on all uplinks of the vSwitch. If this application connected to a vSwitch with "Port-ID" or "Route based on physical NIC load", only one uplink would be affected.

Network failover detection: Beacon probing – Beacon probing does not work correctly if EtherChannel is used. ESX broadcasts beacon packets out of all uplinks in a team and expects the physical switch to forward them to the other ports. In EtherChannel mode the physical switch will not forward the packets because the team is considered one link. No beacon packets are received, which can interrupt network connections, and Cisco switches will report flapping errors. See KB article 1012819.

Route based on physical NIC Load

VMware vSphere 4.1 introduced a new load-balancing policy available on distributed vSwitches. Route based on physical NIC load, also known as Load Based Teaming (LBT) takes the virtual machine network I/O load into account and tries to avoid congestion by dynamically reassigning and balancing the virtual switch port to physical NIC mappings.

How LBT works
Load Based Teaming maps vNICs to pNICs and remaps the vNIC-to-pNIC affiliation if the load on an uplink exceeds specific thresholds.
LBT uses the same initial port assignment as the "originating port ID" load-balancing policy, resulting in the first vNIC being affiliated with the first pNIC, the second vNIC with the second pNIC, and so on. After initial placement, LBT examines both the ingress and egress load of each uplink in the team and adjusts the vNIC-to-pNIC mapping if an uplink is congested. The NIC team load balancer flags a congestion condition if an uplink experiences a mean utilization of 75% or more over a 30-second period.
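The congestion check can be modeled roughly as follows. This is an illustration of the 75%-over-30-seconds rule, not the actual VMkernel implementation, and the utilization samples are made up.

```python
def congested_uplinks(samples_per_uplink, threshold=0.75):
    """Flag uplinks whose mean utilization over the 30-second window is 75% or more.

    samples_per_uplink maps an uplink name to a list of utilization samples
    (0.0 - 1.0) collected during the evaluation window.
    """
    return [
        uplink
        for uplink, samples in samples_per_uplink.items()
        if samples and sum(samples) / len(samples) >= threshold
    ]

# dvUplink1 averages 82% over the window and is flagged as congested; LBT would
# then remap one or more virtual switch ports to a less utilized uplink.
print(congested_uplinks({
    "dvUplink1": [0.80, 0.85, 0.81],
    "dvUplink2": [0.20, 0.25, 0.22],
}))
# -> ['dvUplink1']
```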

Complexity – LBT requires standard access or trunk ports; LBT does not support EtherChannel. Because LBT moves flows among the available uplinks of the vSwitch, it may cause packet reordering. Even though the reshuffling process does not happen often (worst case every 30 seconds), it is recommended to enable PortFast or TrunkFast on the switch ports.

Additional overhead – The VMkernel examines the congestion condition after each time window; this calculation creates a minor overhead compared to the static load-balancing policy "originating port ID".

Utilization-aware – vNIC-to-pNIC mappings are adjusted if the VMkernel detects congestion on an uplink. In the previous example, VM1 and VM3 shared the same uplink due to the IP-hash calculation. With LBT, both connections can share the same physical NIC as long as the utilization stays below the threshold, although it is likely that the two vNICs are mapped to separate physical NICs to begin with.
In the next example a third virtual machine is powered on and is mapped to NIC1. Utilization of NIC1 exceeds the mean utilization of 75% over a period of more than 30 seconds. After identifying the congestion, LBT remaps VM2 to NIC2 to decrease the utilization of NIC1.

Although LBT is not integrated with DRS, it can be viewed as a complementary technology to DRS. When DRS migrates virtual machines onto a host, it is possible that congestion is introduced on a particular physical NIC. Because the vNIC-to-pNIC mapping is based on actual load, LBT actively tries to avoid congestion at the physical NIC level and attempts to reallocate virtual machine traffic. By remapping vNICs to pNICs it attempts to make as much bandwidth as possible available to the virtual machines, which ultimately benefits their overall performance.

Recommendations
When using distributed virtual switches it is recommended to use Load-Based Teaming instead of IP-hash. LBT has no additional requirements at the physical network layer, reduces complexity and is able to adjust to fluctuating workloads. By remapping vNICs to pNICs based on actual load, LBT attempts to allocate as much bandwidth as possible, whereas IP-hash simply distributes connections across the available physical NICs.


vSwitch Failback and High Availability

One setting that catches most admins off-guard is the vSwitch Failback setting in combination with HA. If the management network vSwitch is configured with active/standby NICs and the HA isolation response is set to "Shut down VM" or "Power off VM", it is advised to set the vSwitch Failback mode to No. If it is left at the default (Yes), all the ESX hosts in the cluster, or even the entire virtual infrastructure, might trigger an isolation response if one of the management network physical switches is rebooted. Here's why:

Just a quick rehash:

Active/Standby
One NIC (vmnic0) is assigned as active to the management/service console portgroup; the second NIC (vmnic1) is configured as standby. The vMotion portgroup is configured with the first NIC (vmnic0) in standby mode and the second NIC (vmnic1) as active.

Active Standby setup management network vSwitch0

Failback
The Failback setting determines whether the VMkernel will return the uplink (NIC) to active duty after recovery of a downed link or failed NIC. If Failback is set to Yes, the NIC returns to active duty; if Failback is set to No, the recovered NIC is assigned the standby role and the administrator must manually reconfigure the NIC to the active state.

Effect of the Failback Yes setting on the environment

When using the default Failback setting, unexpected behavior can occur during maintenance of a physical switch. Most switches, like those from Cisco, initialize their ports right after boot, so-called lights-on: the port is active but is still unable to receive or transmit data. The process from lights-on to forwarding mode can take up to 50 seconds. Unfortunately ESX is not able to distinguish between the lights-on state and forwarding mode, therefore treating the link as usable and returning the NIC to active status again.

High Availability will continue to transmit heartbeats and expect to receive heartbeats. After missing heartbeats for 13 seconds, HA will try to ping its isolation address, and due to the specified isolation response it will shut down or power off the virtual machines two seconds later to allow other ESX hosts to power them up. Because it is common, and even recommended, to configure each host in the cluster identically, the active NIC used by the management network of every ESX host connects to the same physical switch. Due to this design, once the switch is rebooted, a cluster-wide isolation response occurs, resulting in a cluster-wide outage.
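To make the timing concrete, the small sketch below uses the figures from this scenario: up to 50 seconds from lights-on to forwarding mode versus the 13 seconds of missed heartbeats plus the 2-second delay before the isolation response.

```python
# Rough timeline of why a rebooting management switch can trigger the
# isolation response when Failback is set to Yes (values from the scenario above).
SWITCH_LIGHTS_ON_TO_FORWARDING = 50  # seconds the port may stay non-forwarding
HA_MISSED_HEARTBEAT_WINDOW = 13      # seconds of missed heartbeats before the isolation check
ISOLATION_RESPONSE_DELAY = 2         # seconds until the isolation response executes

time_to_isolation_response = HA_MISSED_HEARTBEAT_WINDOW + ISOLATION_RESPONSE_DELAY
if time_to_isolation_response < SWITCH_LIGHTS_ON_TO_FORWARDING:
    print(f"Isolation response fires after {time_to_isolation_response}s, "
          f"while the uplink can stay unusable for up to {SWITCH_LIGHTS_ON_TO_FORWARDING}s.")
```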

To allow switch maintenance, it is better to set the vSwitch Failback mode to No. Selecting this setting introduces additional manual operations after a failure or certain maintenance operations, but it reduces the chance of "false positives" and cluster-wide isolation responses.

Load Based Teaming

In vSphere 4.1 a new load-balancing algorithm, Load Based Teaming (LBT), is available on distributed virtual switch dvPort groups. The option "Route based on physical NIC load" takes the virtual machine network I/O load into account and tries to avoid congestion by dynamically reassigning and balancing the virtual switch port to physical NIC mappings.

The three existing load-balancing policies, Port-ID, MAC-based and IP-hash, use a static mapping between virtual switch ports and the connected uplinks. The VMkernel assigns a virtual switch port during the power-on of a virtual machine, and this virtual switch port is assigned to a physical NIC based on either a round-robin or a hashing algorithm, but none of these algorithms take the overall utilization of the pNIC into account. This can lead to a scenario where several virtual machines mapped to the same physical adapter saturate the physical NIC and fight for bandwidth while the other adapters are underutilized. LBT solves this by remapping the virtual switch ports to a physical NIC when congestion is detected.
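For comparison, the static behavior of the originating port-ID policy can be sketched as a simple round-robin over the uplinks; the uplink names are examples, and the point is only that the mapping never changes with load.

```python
def portid_uplink(virtual_port_id, uplinks):
    """Static port-ID style mapping: the chosen uplink never changes with load."""
    return uplinks[virtual_port_id % len(uplinks)]

uplinks = ["dvUplink1", "dvUplink2"]
for port_id in range(4):
    print(port_id, portid_uplink(port_id, uplinks))
# Ports 0 and 2 always use dvUplink1, ports 1 and 3 always use dvUplink2,
# regardless of how saturated either uplink is.
```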

After the initial virtual switch port to physical port assignment is completed, Load Based Teaming checks the load on the dvUplinks at a 30-second interval and dynamically reassigns port bindings based on the current network load and the level of saturation of the dvUplinks. The VMkernel flags the network I/O load as congested if transmit (Tx) or receive (Rx) network traffic exceeds a 75% mean over a 30-second period. (The mean is the sum of the observations divided by the number of observations.)

An interval of 30 seconds is used to avoid MAC address flapping issues with the physical switches. Although a 30-second interval is used, it is recommended to enable PortFast (TrunkFast) on the physical switches, and all switches must be part of the same layer 2 domain.