Future direction of disabling TPS by default and its impact on capacity planning
Eric Sloof's tweet alerted me to the following announcement: TPS will be disabled by default in upcoming vSphere releases.
Upcoming ESXi Update releases will no longer enable TPS - http://t.co/UaMLK2mUIj
— Eric Sloof (@esloof) October 17, 2014
In short, TPS will no longer be enabled by default, due to security concerns, starting with the following releases:
ESXi 5.5 Update release - Q1 2015
ESXi 5.1 Update release - Q4 2014
ESXi 5.0 Update release - Q1 2015
The next major version of ESXi
After reading this announcement, I hope architects review the commonly (mis)used over-commitment ratios during capacity planning exercises. It was always one of my favorite topics to discuss at VCDX defense sessions.
It’s common to see a 20 to 30% over-commitment ratio in a vSphere design attributed to TPS. But in reality these ratios never show up in the monitoring processes of IT organizations. Why? Because TPS is no longer used as frequently as it was in the older pre-vSphere infrastructures (ESX 2.x and 3.x). In reality, vSphere has eroded the out-of-the-box over-commitment ratios: it only leverages TPS when certain memory usage thresholds are exceeded, and architects typically do not design their environments to reach a memory usage threshold of 96%.
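To make the capacity planning point concrete, here is a minimal Python sketch of the arithmetic behind a TPS-based over-commitment ratio. The cluster size (8 hosts with 256 GB each), the 25% ratio, and the function names are purely illustrative assumptions of mine, not numbers from the announcement or from any real design.

```python
import math


def promised_vm_memory_gb(hosts: int, host_memory_gb: float,
                          tps_overcommit: float = 0.0) -> float:
    """Memory a design promises to VMs, given an assumed TPS saving.

    tps_overcommit is the fraction of physical memory the design
    assumes TPS will reclaim (0.25 means a 25% over-commitment ratio).
    """
    if hosts < 1 or host_memory_gb <= 0 or tps_overcommit < 0:
        raise ValueError("invalid cluster parameters")
    return hosts * host_memory_gb * (1.0 + tps_overcommit)


def hosts_required(vm_memory_gb: float, host_memory_gb: float,
                   tps_overcommit: float = 0.0) -> int:
    """Hosts needed to back vm_memory_gb under the same assumption."""
    per_host = host_memory_gb * (1.0 + tps_overcommit)
    return math.ceil(vm_memory_gb / per_host)


# Hypothetical cluster: 8 hosts of 256 GB, designed with 25% TPS savings.
designed = promised_vm_memory_gb(8, 256, tps_overcommit=0.25)  # 2560.0 GB
physical = promised_vm_memory_gb(8, 256)                       # 2048.0 GB

print(f"Memory that only exists on paper: {designed - physical:.0f} GB")
print("Hosts needed with the assumed savings:", hosts_required(designed, 256, 0.25))
print("Hosts needed without TPS savings:", hosts_required(designed, 256))
```

In this made-up example the design leans on 512 GB that TPS would have to reclaim, which is the equivalent of two extra hosts if the savings never materialize.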
Large pages and processor architectures
When AMD and Intel introduced hardware-assisted memory virtualization features (RVI and EPT), VMware engineers quickly discovered that these features increased virtual machine performance while reducing the memory consumption of the kernel. There was some overhead involved, but this could be solved by using large pages. A normal memory page is 4 KB in size; a large page is 2 MB.
However, large pages could not be combined with TPS because of the overhead introduced by scanning these 2 MB regions. The low probability of finding identical large pages made the engineers realize that the overhead was not worth the small potential memory saving. The performance increase was calculated at around 30%, while the impact of the sharing loss was perceived as minimal, as memory footprints in physical machines tend to increase every year. Therefore virtual machines provisioned on vSphere use a hardware MMU, leveraging the CPU's hardware-assisted memory virtualization features.
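A rough way to see why identical large pages are so unlikely: a 2 MB page holds 512 pages of 4 KB, and all of them have to match at once. The sketch below assumes, purely for illustration, that each 4 KB page matches independently with the same probability. Real memory contents are not independent, so treat this as intuition rather than a measurement.

```python
SMALL_PAGE_BYTES = 4 * 1024
LARGE_PAGE_BYTES = 2 * 1024 * 1024
SMALL_PAGES_PER_LARGE_PAGE = LARGE_PAGE_BYTES // SMALL_PAGE_BYTES  # 512


def large_page_match_probability(small_page_match: float) -> float:
    """Chance that two large pages are identical, under the simplifying
    (and hypothetical) assumption that each 4 KB page matches
    independently with probability small_page_match."""
    if not 0.0 <= small_page_match <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return small_page_match ** SMALL_PAGES_PER_LARGE_PAGE


print("4 KB pages per 2 MB page:", SMALL_PAGES_PER_LARGE_PAGE)
print("Large page match chance at 99% per small page:",
      large_page_match_probability(0.99))
```

Even with a 99% match chance per small page, the chance of a full 2 MB match in this toy model ends up well under 1%.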
Although vSphere uses large pages, TPS is still active. It scans and hashes all small pages inside a large page so that it can decrease memory pressure when a memory threshold is reached. During my time at VMware I wrote an article on the VMkernel memory thresholds in vSphere 5.x. Another interesting thing about large pages is that the kernel tends to favor them because they provide the best performance. The kernel will split up large pages and share pages during memory pressure, but when no memory pressure is present, new incoming pages will be stored in large pages, potentially creating a cyclical process of constructing and deconstructing large pages.
Another impact on the memory sharing potential is the NUMA processor architecture. NUMA provides the best memory performance by storing memory pages as close to a CPU as possible. TPS could reduce performance when pages are shared between two separate CPUs (NUMA nodes). For more info about NUMA and TPS, please read the article “Sizing VMs and NUMA nodes”.
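The idea of hashing the 4 KB pages inside a large page can be sketched in a few lines of Python. This is a toy model of content-based page sharing and not the VMkernel implementation: the function names, the use of SHA-1, and the in-memory dictionary are my own assumptions for illustration. The hash is only used to find candidates; a full byte comparison decides whether two pages are really identical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

SMALL_PAGE = 4 * 1024
LARGE_PAGE = 2 * 1024 * 1024


def split_large_page(large_page: bytes) -> List[bytes]:
    """Break one 2 MB large page into its 512 small pages of 4 KB."""
    if len(large_page) != LARGE_PAGE:
        raise ValueError("expected exactly 2 MB")
    return [large_page[i:i + SMALL_PAGE]
            for i in range(0, LARGE_PAGE, SMALL_PAGE)]


def count_after_sharing(large_pages: Iterable[bytes]) -> Tuple[int, int]:
    """Return (small pages scanned, small pages stored after sharing)."""
    candidates: Dict[bytes, List[bytes]] = {}
    total = stored = 0
    for large_page in large_pages:
        for page in split_large_page(large_page):
            total += 1
            # The hash narrows the search to a bucket of candidates.
            bucket = candidates.setdefault(hashlib.sha1(page).digest(), [])
            # Byte-for-byte comparison guards against hash collisions.
            if page not in bucket:
                bucket.append(page)
                stored += 1
    return total, stored


# Two large pages that differ in a single 4 KB page.
zero_filled = bytes(LARGE_PAGE)
mostly_zero = b"\x01" * SMALL_PAGE + bytes(LARGE_PAGE - SMALL_PAGE)

total, stored = count_after_sharing([zero_filled, mostly_zero])
print(f"Scanned {total} small pages, {stored} remain after sharing")
```

The two large pages in this example are not identical, so comparing them as whole 2 MB blocks would share nothing, while splitting them collapses 1024 small pages into 2.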
Capacity planning impact
Therefore the impact of disabling TPS by default will not be as big as some might expect. What I do find interesting is the attention to security. I absolutely agree that security out of the box is crucial, but in terms of probability I would rather do a man-in-the-middle attack on the vMotion network, reading clear-text memory across the network, than wait for TPS to collapse memory. Which leads me to wonder when to expect encryption for vMotion traffic.