In vSphere 4.1, a new network load-balancing algorithm, Load Based Teaming (LBT), is available on distributed virtual switch dvPort groups. The option “Route based on physical NIC load” takes the virtual machine network I/O load into account and tries to avoid congestion by dynamically reassigning and balancing the virtual switch port to physical NIC mappings.
The three existing load-balancing policies, Port-ID, MAC-based and IP-hash, use a static mapping between virtual switch ports and the connected uplinks. The VMkernel assigns a virtual switch port during the power-on of a virtual machine; this virtual switch port is then assigned to a physical NIC based on either a round-robin or a hashing algorithm, and none of these algorithms take the overall utilization of the pNIC into account. This can lead to a scenario where several virtual machines mapped to the same physical adapter saturate the physical NIC and fight for bandwidth while the other adapters are underutilized. LBT solves this by remapping the virtual switch ports to a different physical NIC when congestion is detected, as the sketch below illustrates.
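To make the static behavior concrete, here is a minimal Python sketch of a Port-ID style assignment. It is purely illustrative, not VMware's actual implementation; the point is that the uplink choice is fixed at power-on and utilization is never consulted.

```python
# Hypothetical sketch (the real VMkernel logic is not public) of a
# static, Port-ID style assignment: the uplink is chosen once at
# power-on and utilization is never consulted.

UPLINKS = ["vmnic0", "vmnic1"]

def assign_uplink(virtual_port_id: int) -> str:
    # Round-robin over the uplinks by virtual switch port number.
    return UPLINKS[virtual_port_id % len(UPLINKS)]

# Ports 0 and 2 both land on vmnic0; if both VMs are heavy talkers
# they saturate vmnic0 while vmnic1 sits idle, and this static
# policy will never move them.
for port in range(4):
    print(port, "->", assign_uplink(port))
```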
After the initial virtual switch port to physical port assignment is completed, Load Based Teaming checks the load on the dvUplinks at a 30-second interval and dynamically reassigns port bindings based on the current network load and the level of saturation of the dvUplinks. The VMkernel marks the network I/O load as congested if transmit (Tx) or receive (Rx) network traffic exceeds a 75% mean over a 30-second period. (The mean is the sum of the observations divided by the number of observations.)
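The exact sampling rate and counters are not documented, but the congestion test itself is simple to express. A minimal Python sketch, assuming one utilization sample per second over the 30-second window:

```python
# Minimal sketch of the congestion test described above; the real
# sampling rate and counters are not public, so this assumes one
# utilization sample per second over the 30-second window.

def mean(samples: list[float]) -> float:
    # The mean: sum of the observations divided by their number.
    return sum(samples) / len(samples)

def is_congested(tx_util: list[float], rx_util: list[float],
                 threshold: float = 0.75) -> bool:
    # A dvUplink is flagged congested if mean Tx *or* mean Rx
    # utilization over the window exceeds the threshold.
    return mean(tx_util) > threshold or mean(rx_util) > threshold

# Example: a link averaging 80% transmit utilization is congested
# even though receive traffic is nearly idle.
tx = [0.80] * 30  # thirty one-second samples
rx = [0.05] * 30
assert is_congested(tx, rx)
```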
An interval period of 30 seconds is used to avoid MAC address flapping issues with the physical switches. Even with this interval, it is recommended to enable PortFast (trunk fast) on the physical switches, and all switches must be part of the same layer 2 domain.
I don't think “IP-hash use(s) a static mapping between virtual switch ports and the connected uplinks” is true; traffic to different target IPs may go out over different pNICs (it's src-dst, right?)
Hi Tommy,
In an absolute sense you are correct, but IP-hash uses a static mapping depending on the hash result of the source-destination (src-dst) calculation.
The VMkernel does not dynamically remap the connection based on network I/O load, so the connection can continue to suffer a performance impact due to other network traffic flowing through the same pNIC.
Although IP-hash creates an outbound load-balancing pattern for the virtual machine itself, it ignores the overall utilization of the pNIC. From a utilization point of view, it remains static.
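A small illustration of that point (the hash below is made up; the real algorithm is internal to the vSwitch): each source/destination pair is pinned to one uplink, no matter how busy that uplink gets.

```python
# Illustrative only; the real hash is internal to the vSwitch. The
# point: each src/dst IP pair lands on a fixed uplink, and load on
# that uplink never changes the outcome.

UPLINKS = ["vmnic0", "vmnic1"]

def ip_to_int(ip: str) -> int:
    a, b, c, d = (int(octet) for octet in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def ip_hash_uplink(src: str, dst: str) -> str:
    return UPLINKS[(ip_to_int(src) ^ ip_to_int(dst)) % len(UPLINKS)]

# One VM talking to two targets may use both uplinks...
print(ip_hash_uplink("10.0.0.5", "10.0.0.10"))  # vmnic1
print(ip_hash_uplink("10.0.0.5", "10.0.0.11"))  # vmnic0
# ...but each conversation is pinned: repeated calls always return
# the same uplink, regardless of how busy it is.
```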
Hi Frank, I enjoy your articles!
I am not sure how this threshold is calculated … “The VMkernel indicates the network I/O load as congested if transmit (Tx) or receive (Rx) network traffic is exceeding a 75% mean over a 30 second period.”
What is the sampling rate?
What counters? Throughput (KB/MB) or IOPS or an either/combo?
Does this mean that the threshold is reached when a sample is in the 75th percentile over the last 30 seconds, or is it calculated in some other way?
A little confused,
Rob
Hi Rob,
Thanks for the compliment!
Unfortunately, the information about this new feature is very sparse.
I think it does not really matter which unit (KB/MB or IOPS) is used; it's still 75% of a total number.
The exact calculation is unknown to me, but the information I received is that if transmit or receive traffic exceeds a 75% mean over a 30-second period, the VMkernel will signal the link as congested. The mean equals the sum of the observations divided by the number of observations. How this relates to a total number is unknown to me. I have already requested more information; if I receive it and I'm allowed to post it, I certainly will.
Hi Frank,
Is this feature available with all the ESX licenses or is it from a certain level up?
Thanks!
Dan, it's only available on distributed vSwitches. I'm not sure, but I think you need an Enterprise Plus license for dvSwitches.
Thanks for the response Frank
Rob
Hi Frank, great post as always.
I have updated my dvSwitch post, see http://www.lucd.info/2009/10/12/dvswitch-scripting-part-2-dvportgroup/.
The script now supports teaming, including load based teaming as described in your post.
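For anyone doing the same from Python instead of PowerCLI, below is a rough pyVmomi sketch of the change. The host name, credentials and the find_dvportgroup() helper are placeholders; 'loadbalance_loadbased' is the policy value behind “Route based on physical NIC load”.

```python
# Rough pyVmomi sketch of enabling LBT on an existing dvPortgroup.
# Host, credentials and find_dvportgroup() are placeholders.

from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator", pwd="secret")

pg = find_dvportgroup(si, "dvPortGroup-VMs")  # hypothetical lookup helper

spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
spec.configVersion = pg.config.configVersion
port_config = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy()
port_config.uplinkTeamingPolicy = \
    vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortTeamingPolicy(
        policy=vim.StringPolicy(value="loadbalance_loadbased"))
spec.defaultPortConfig = port_config

task = pg.ReconfigureDVPortgroup_Task(spec)  # apply the change
```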
Thanks a lot Frank, it looks like you read my mind: I have been thinking for a few days about building a setup of some vESXi hosts to test the efficiency of this new feature. Now I know how it works 🙂 I'll give you the results if you're interested.
BTW, have you heard about any plans to implement this in standard vSwitches?
Thanks again for sharing your knowledge !
From a consulting perspective, I’m both extremely happy to see this…and also extremely annoyed.
Simply put, much of my customer base (I work in the channel) does not have or need Enterprise Plus (nor can they necessarily justify affording it when we’re up against Hyper-V).
So….I’m really happy to see this and plan to work it into my best practices setup….for maybe 20% of my customers that have Enterprise Plus. 🙁
“all switches must be a part of the same layer 2 domain”
This means that you cannot connect the ESX NICs to separate switches, only to switches that are stacked?
Thank you!
Dan,
a layer 2 domain effectively relates to a VLAN; if that VLAN spans multiple switches, then the domain spans those switches.
Hi Frank,
Can you please elaborate on where you found the resource that supports your sentence: “An interval period of 30 seconds is used to avoid MAC address flapping issues with the physical switches.”
I don’t doubt your accuracy but I just cannot find anywhere that says “30 seconds”. The closest I can find is at this blog http://routing-bits.com/2012/10/24/detecting-layer2-loops/ where the author says “When NX-OS detect a series of MAC flap events that exceeds an Cisco defined limit”. Unfortunately I cannot find what the “Cisco defined limit” is! 🙁
The reason why this came up is that we were having a discussion at work about how misconfigured EtherChannels on a vSwitch can cause (very) frequent MAC flaps. But then so can LBT, yet this is a VMware “recommended” setting. Obviously, if we never hit the 75% threshold for a period longer than 30 seconds then no MAC flaps will occur and all will be good 🙂
Many thanks,
KFM
The 30-second interval period stems from research by our engineers. As VMware we need to make sure it works with all the supported products. Engineering conducted research and made sure that moving flows among uplinks would not cause any problems on the physical switch side. LBT is designed to avoid MAC flapping: we send an RARP packet to update the physical switch when a vNIC to pNIC mapping changes, so the MAC address is only seen on a single port.
Changing mappings at a more frequent pace might be beneficial to the vSphere layer but might cause problems at the physical layer; 30 seconds turned out to be a suitable interval.
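To make that concrete, here is a toy Python model (not VMware code): rebalance passes are gated to the 30-second interval, and every vNIC that actually moves triggers a single RARP-style announcement, so the physical switch learns the MAC on its new port without flapping. The class and function names are invented for illustration.

```python
# Toy model, not VMware code: rebalance passes are gated to the
# 30-second interval, and each vNIC that actually moves triggers a
# single RARP-style announcement.

import time

INTERVAL = 30.0  # seconds between rebalance passes

def send_rarp(mac: str, pnic: str) -> None:
    # Stand-in for the VMkernel's reverse-ARP notification.
    print(f"RARP: {mac} now reachable via {pnic}")

class LbtTeaming:
    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping              # vNIC MAC -> pNIC
        self.last_pass = float("-inf")

    def rebalance(self, moves: dict[str, str]) -> None:
        now = time.monotonic()
        if now - self.last_pass < INTERVAL:
            return                          # too soon; avoid MAC flapping
        self.last_pass = now
        for mac, pnic in moves.items():
            if self.mapping.get(mac) != pnic:
                self.mapping[mac] = pnic
                send_rarp(mac, pnic)        # announce the new location once
```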