Future direction of disabling TPS by default and its impact on capacity planning

Eric Sloof's tweet alerted me to the announcement that TPS will be disabled by default in upcoming vSphere releases.

In short, TPS will no longer be enabled by default, due to security concerns, starting with the following releases:

ESXi 5.5 Update release – Q1 2015
ESXi 5.1 Update release – Q4 2014
ESXi 5.0 Update release – Q1 2015
The next major version of ESXi

More information here: Security considerations and disallowing inter-Virtual Machine Transparent Page Sharing (2080735)

After reading this announcement, I hope architects review the commonly (mis)used over-commitment ratios during capacity planning exercises. It was always one of my favorite topics to discuss at VCDX defense sessions.

It’s common to see a 20 to 30% over-commitment ratio in a vSphere design attributed to TPS. In reality, these ratios are rarely observed by the IT organization's monitoring processes. Why? Because TPS is no longer used with the same frequency as in the older pre-vSphere infrastructures (ESX 2.x and 3.x). vSphere no longer delivers these over-commitment ratios out of the box: it only leverages TPS when certain memory usage thresholds are exceeded, and architects typically do not design their environments to reach the 96% memory usage threshold.
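To make the planning impact concrete, here is a minimal capacity-math sketch. All figures (VM count, VM memory size, host memory size, assumed TPS savings) are hypothetical and only illustrate how removing a sharing assumption changes the required host count.

```python
# Hypothetical capacity-planning math: how an assumed TPS savings ratio
# changes the number of hosts required. All figures are illustrative.
import math

def hosts_needed(vm_count, vm_mem_gb, host_mem_gb, tps_savings=0.0):
    """Hosts required to back the memory demand, after an assumed TPS saving."""
    demand_gb = vm_count * vm_mem_gb * (1.0 - tps_savings)
    return math.ceil(demand_gb / host_mem_gb)

# 500 VMs of 8 GB each on hosts with 256 GB of RAM
print(hosts_needed(500, 8, 256, tps_savings=0.25))  # design assumes 25% TPS savings -> 12 hosts
print(hosts_needed(500, 8, 256, tps_savings=0.0))   # no TPS savings assumed         -> 16 hosts
```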

Large pages and processor architectures
When AMD and Intel introduced hardware-assisted memory virtualization features (RVI and EPT), VMware engineers quickly discovered that these features led to increased virtual machine performance while reducing the memory consumption of the kernel. There was some overhead involved, however, which could be addressed by using large pages. A normal memory page is 4KB; a large page is 2MB in size.

However, large pages could not be combined with TPS because of the overhead introduced by scanning these 2MB regions. The low probability of finding identical large pages made the engineers realize that the overhead was not worth the small potential memory saving. The performance increase was calculated at around 30%, while the impact of losing page sharing was perceived as minimal, as memory footprints in physical machines tend to increase every year. Therefore, virtual machines provisioned on vSphere use a hardware MMU, leveraging the CPU's hardware-assisted memory virtualization features.

Although vSphere uses large pages, TPS is still active. It scans and hashes all pages inside a large page in order to decrease memory pressure when a memory threshold is reached. During my time at VMware I wrote an article on the VMkernel memory thresholds in vSphere 5.x. Another interesting thing about large pages is their tendency to provide the best performance: the kernel will break up large pages and share pages during memory pressure, but when no memory pressure is present, new incoming pages are stored in large pages again, potentially creating a cyclical process of constructing and deconstructing large pages.
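As a conceptual illustration of the scan-and-hash behavior described above (a toy model, not VMkernel code), the sketch below breaks a 2MB large page into 4KB pages, hashes each one, and counts how many have an identical twin that could be collapsed under memory pressure.

```python
# Toy model of the TPS concept: break a 2 MB large page into 4 KB small pages,
# hash each one, and count share candidates. Conceptual illustration only.
import hashlib

SMALL_PAGE = 4 * 1024
LARGE_PAGE = 2 * 1024 * 1024

def share_candidates(large_page: bytes) -> int:
    """Return how many 4 KB pages inside a large page have an identical twin."""
    assert len(large_page) == LARGE_PAGE
    seen = {}
    for offset in range(0, LARGE_PAGE, SMALL_PAGE):
        digest = hashlib.sha1(large_page[offset:offset + SMALL_PAGE]).digest()
        seen[digest] = seen.get(digest, 0) + 1
    return sum(count for count in seen.values() if count > 1)

# A zero-filled large page: all 512 small pages are identical and shareable.
print(share_candidates(bytes(LARGE_PAGE)))  # 512
```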

NUMA
Another factor that impacts the memory sharing potential is the NUMA processor architecture. NUMA provides the best memory performance by storing memory pages as close to a CPU as possible. TPS memory sharing could reduce performance when pages are shared between two separate NUMA nodes. For more info about NUMA and TPS, please read the article “Sizing VMs and NUMA nodes”.

Capacity planning impact
Therefore, the impact of disabling TPS by default will not be as big as some might expect. What I do find interesting is the attention to security. I absolutely agree that security out of the box is crucial, but looking at probability, I would rather run a man-in-the-middle attack on the vMotion network, reading clear-text memory across the network, than wait for TPS to collapse memory. Which leads me to wonder when to expect encryption for vMotion traffic.

99 cents Promo to celebrate a major milestone of the vSphere Clustering Deepdive series

This week Duncan was looking at the sales numbers of the vSphere Clustering Deep Dive series and noticed that we hit a major milestone: in September 2014 we passed 45,000 copies distributed of the vSphere Clustering Deep Dive. Duncan and I never expected this or even dared to dream of hitting this milestone.

When we first started writing the 4.1 book we had discussions about what to expect from a sales point of view, and we placed a bet: I would be happy if we sold 100 books, Duncan was more ambitious with 400 books. Needless to say, we have reset our expectations many times since then… We didn't really follow it closely in the last 12-18 months, and as we were discussing a potential update of the book today, we figured it was time to look at the numbers again just to get an idea. 45,000 copies distributed (ebook + printed) is just remarkable.

We’ve noticed that the ebook is still very popular, and decided to do a promo. As of Monday the 13th of October, the 5.1 e-book will be available for only $0.99 for 72 hours. After that, the price will go up to $3.99 for another 72 hours, and then it will return to the normal price. So make sure to get it while the price is low!

Pick it up here on Amazon.com! The only other Kindle store we could open the promotion up for was amazon.co.uk, so that is also an option!

Multi-FVP cluster design – using RAM and FLASH in the same vSphere cluster

A frequently asked question is whether RAM and Flash resources can be mixed in the same FVP cluster. In FVP 2.0 we allow hosts to provide both RAM and Flash to FVP, so it is time to provide some design considerations for FVP clusters.

One host resource per cluster
An FVP cluster accepts only a single type of acceleration resource from a host. If a host contains both RAM and Flash, you can decide which resource is assigned to that particular cluster. When you select one type of resource, FVP automatically removes the option of selecting the other resource available in the host.

[Figure: Add acceleration resource]

RAM and Flash in a single FVP cluster
An FVP cluster can be composed of different acceleration resources. You can have an FVP cluster configuration in which one host provides RAM as an acceleration resource and another host provides Flash to the same FVP cluster.

[Figure: Multiple acceleration resources in an FVP cluster]

Symmetry equals predictability
A common architectural best practice is symmetry in resource design. Identical host, component and software configurations reduce management operations, simplify troubleshooting and, above all, provide consistent and predictable performance. Although an FVP cluster can contain RAM and Flash resources from multiple hosts, I would recommend a mixed configuration only as a transition state while migrating to a new acceleration resource standard in the FVP cluster (moving from Flash to RAM or vice versa).

One vSphere cluster, multiple FVP clusters
To leverage multiple acceleration resources, FVP allows you to create multiple FVP clusters within the same vSphere cluster. This allows you to create multiple acceleration tiers: assign the memory resources to a separate FVP cluster, for example an “FVP Memory Cluster”, and assign the Flash resources to an “FVP Flash Cluster”. As the atomic level of acceleration is the virtual machine, a virtual machine can only be part of a single FVP cluster; the sketch below models these membership rules.
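Here is a minimal sketch (hypothetical names and structure, not PernixData code) that models the two constraints discussed above: a host contributes a single type of acceleration resource to a given FVP cluster, and a virtual machine belongs to at most one FVP cluster.

```python
# Toy model of the FVP cluster membership rules described above; names are
# illustrative only and do not reflect any PernixData API.

class FVPCluster:
    def __init__(self, name):
        self.name = name
        self.host_resource = {}          # host -> "RAM" or "Flash"

    def add_host_resource(self, host, resource_type):
        # A host contributes only a single resource type to this FVP cluster.
        if host in self.host_resource:
            raise ValueError(f"{host} already provides {self.host_resource[host]}")
        self.host_resource[host] = resource_type

vm_membership = {}                       # vm -> FVPCluster (a VM joins one cluster only)

def add_vm(vm, cluster):
    if vm in vm_membership:
        raise ValueError(f"{vm} already belongs to {vm_membership[vm].name}")
    vm_membership[vm] = cluster

mem_cluster = FVPCluster("FVP Memory Cluster")
flash_cluster = FVPCluster("FVP Flash Cluster")
mem_cluster.add_host_resource("esx01", "RAM")
flash_cluster.add_host_resource("esx02", "Flash")
add_vm("sql01", mem_cluster)
# add_vm("sql01", flash_cluster)        # would raise: one FVP cluster per VM
```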

[Figure: Multiple FVP clusters within one vSphere cluster]

Per VM-level Stats
One cool thing about FVP is the retention of stats. FVP collects stats on a per-VM basis and retains them regardless of FVP cluster membership. This means that if you create a multi-FVP-cluster design, you can easily track the difference in performance. As FVP's primary goal is to provide non-disruptive services, you can move virtual machines between different FVP clusters without having to reboot them. Everything can be done on the fly without impacting service uptime.

[Figure: Add VM to FVP cluster]

One great use case is to set up a monitor FVP cluster, which contains no acceleration resources, and let FVP monitor the I/O operations of a particular application.

[Figure: Before and after]

Once you decide which acceleration resource provides the best performance, you can easily move this virtual machine to the appropriate FVP cluster. To learn more about Monitor mode, please read the article: “Investigate your application performance by using FVP monitor capabilities”.

What’s new in PernixData FVP 2.0 – Adaptive network compression

Although FVP relies on the crossbar architecture of the network, with its fast point-to-point connections, we noticed that in some scenarios network performance could be detrimental to storage performance. This was especially true when a lot of virtual machines ran in write back with two replicas. 1Gb networks in particular are susceptible to this when a lot of replication traffic is introduced.

[Figure: IOPS for a 32KB workload]

The diagram provided by Pete Koehler (vmpete.com) shows this behavior: on the left, the virtual machine is leveraging the power of the flash device. The application generates a workload of more than 5000 IOPS and the flash device (orange curve) happily provides that storage performance to the application. Once fault tolerance is enabled, the network sometimes becomes the bottleneck when data is written to the remote flash device. The green curve, the network performance, overlays the orange (flash) curve and dictates the I/O performance.

Introducing Adaptive Network Compression

What we wanted to do was make sure that the bottleneck simply did not move from storage to network. Therefore we developed network compression, providing the ability to compress redundant write data before sending it over the network to remote flash devices. This allows FVP to consume the least amount of bandwidth possible; a byproduct of compression is a reduction in latency compared to sending uncompressed, bigger blocks.

The thing that immediately comes to mind when thinking about compression is CPU cost. There is a CPU cost associated with compressing data and a CPU cost associated with decompressing it at the peer end. You then have to figure out which cost is more detrimental to performance: CPU cost or network bandwidth cost. The graph below shows the CPU cost introduced by FVP adaptive network compression; the cost incurred is minimal. And that is awesome, because you now get savings on network bandwidth without paying for them in CPU.

[Figure: Adaptive network compression CPU cost]

The blue curve is the CPU cost the virtual machine incurs when using write back without replication. The orange curve is the CPU cost seen by the virtual machine with compression enabled. As you can see, the blue and orange curves are very close together, indicating that the CPU cost is minor.

To keep the CPU cost on the source host as low as possible, FVP uses an adaptive algorithm that measures cost and benefit. Adaptive network compression makes sure that data can be compressed and that the cost of compression does not exceed the benefit of the bandwidth saved. The funny thing is that small blocks sometimes increase in size when compressed; therefore FVP reviews every piece of data and makes sure compression actually provides a benefit to the virtualized datacenter.
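The sketch below illustrates this trade-off in its simplest form: compress the replica write with a cheap compressor and only ship the compressed payload when it actually comes out smaller. This is a conceptual illustration of the idea, not PernixData's actual algorithm.

```python
# Conceptual illustration of adaptive compression for replica writes: ship the
# compressed payload only when it is actually smaller than the original block.
import os
import zlib

def prepare_replica(block):
    """Return (payload, is_compressed) for a write block sent to a peer host."""
    candidate = zlib.compress(block, 1)      # cheap compression level keeps CPU cost low
    if len(candidate) < len(block):
        return candidate, True               # bandwidth saved, worth the CPU spent
    return block, False                      # small/incompressible blocks can grow

repetitive = b"A" * 32 * 1024                # highly compressible 32 KB block
random_data = os.urandom(32 * 1024)          # incompressible data grows when compressed

print(prepare_replica(repetitive)[1])        # True  -> send compressed
print(prepare_replica(random_data)[1])       # False -> send as-is
```

In the same spirit, a receiving peer could simply store whichever payload arrives and only decompress in the rare case it has to destage the data itself, which matches the behavior described below.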

An interesting fact is that we do not decompress the data when it is written to the remote flash device. This provides two benefits: no CPU overhead and a reduced remote flash footprint. The reason we keep it compressed is simple. Environments typically perform well, meaning that the majority of the time your environment should be up and running and outages should be very infrequent. Only when a failure occurs and the task of writing data to the storage array falls on the peer host do we incur the CPU cost of decompression. During normal operations, redundant write data lands on the remote flash device in a compressed state and is removed from the flash device once the data is written to the storage array. This process should impact the peer host as little as possible, and keeping the data compressed accomplishes that requirement.

[Figure: Default replication traffic]

The chart above illustrates the before and after state of write back with replication using a 1GbE connection. The performance on the left shows an average of 2700 IOPS; once replication is enabled, the performance drops to 1700 writes per second. The flat line clearly shows the bottleneck introduced by the bandwidth constraint. When the same workload runs on a 10GbE network, the network provides comparable speed and bandwidth.
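A rough back-of-the-envelope estimate (my own arithmetic, assuming the ~32KB writes suggested by the chart and two replicas per write, as mentioned earlier for write back with replication) shows why the write rate flattens out around that level on a 1GbE link:

```python
# Back-of-the-envelope estimate of the 1GbE replication ceiling. The 32 KB
# block size and two replicas are assumptions based on the charts above.
block_size = 32 * 1024                  # bytes per write
replicas = 2                            # copies shipped over the replication network
link_bytes_per_sec = 1_000_000_000 / 8  # ~125 MB/s raw for 1GbE

ceiling = link_bytes_per_sec / (block_size * replicas)
print(round(ceiling))                   # ~1907 writes/s before the wire is saturated
```

Protocol and replication overhead eat into that raw figure, which lines up with the roughly 1700 writes per second observed once replication was enabled.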

[Figure: Write back with replication on 10GbE]

The same workload test was run on a 1GbE network with network compression enabled. With compression, performance came close to native flash-only performance.

[Figure: Adaptive network compression on 1GbE]

To illustrate the benefit of adaptive network compression, the engineers disabled and enabled the feature. In FVP 2.0 adaptive network compression is enabled automatically when a 1GbE network connection is detected. No user intervention is required, and for the curious minds: we do not offer a UI or PowerShell option to enable or disable it. Its adaptive nature will provide you the best performance with the lowest amount of overhead.

Hero Numbers

One of the cool things on the FVP cluster summary are the hero numbers: they show the data saved from the storage array, the bandwidth saved from the storage area network and, in FVP 2.0, the amount of network bandwidth saved by adaptive network compression. In order to take the screenshot I had to change my FVP network configuration from a 10GbE to a 1GbE network. With FVP you can do this on the fly without impacting the uptime of hosts or virtual machines. I ran a quick test to generate some numbers, hence the underwhelming total. In reality, when using a 1GbE network with a real application workload, you will see quite impressive numbers.

What’s new in PernixData FVP 2.0 – Distributed Fault Tolerant Memory

PernixData FVP 2.0 allows the use of multiple acceleration resources. In FVP 1.x, various types of flash devices could be leveraged to accelerate virtual machine I/O operations. FVP 2.0 introduces support for server-side RAM.

RAM bound world

With recent chipset advances and software developments it is now possible to support terabytes of memory in a vSphere host. At VMworld, VMware announced 6TB memory support for vSphere 6.0 and recently announced the same support for vSphere 5.5 Update 2. Intel's newest processors support up to 1536GB of memory per CPU, allowing a four-way server (4 x 1.5TB = 6TB) to easily reach the maximum memory supported by vSphere.

But what do you do with all this memory? As of now, you can use memory provided by the virtual infrastructure to accelerate virtual machine I/O. Other application vendors and Independent Software Vendors (ISVs) are leveraging these massive amounts of memory as well, although their requirements impact IT operations and services.

[Figure: Memory pyramid]

It starts at the top: applications can leverage vast amounts of memory to accelerate data, but the user needs to change the application, and implementing this is typically not considered a walk in the park. ISVs caught on to this trend and did the heavy lifting for their user base; however, you still need to run these specific apps to operationalize memory performance for storage. Distributed Fault Tolerant Memory (DFTM) allows every application in the virtualized datacenter to benefit from incredible storage performance with no operational or management overhead. Think of the introduction of DFTM as similar to the introduction of vSphere High Availability. Before HA, you either had application-level HA capabilities or clustering services such as Veritas or Microsoft Clustering Services. HA provided failover capabilities to every virtual machine and every application the moment you configured a simple vSphere cluster.

Scaling capabilities
DFTM rests on the pillars that FVP provides: a clustered, fault-tolerant platform that scales out performance independently from storage capacity. DFTM allows you to seamlessly hot-add and hot-shrink the RAM resources in the FVP cluster. When more acceleration resources are required, just add more RAM to the FVP cluster. If the RAM is needed for compute memory or you have other plans for it, just shrink the amount of host RAM provided to the FVP cluster. Host memory thus becomes a multipurpose resource, providing either virtual machine compute memory or I/O acceleration for virtual machines; it's up to you to decide what role it performs. When the virtual datacenter needs to run new virtual machines, add new hosts to the cluster and assign a portion of host memory to the FVP cluster to scale out storage performance as well.

Fault Tolerant write acceleration
FVP provides the same fault tolerance and data integrity guarantees for RAM as for Flash. FVP can store replicas of write data on the flash or RAM acceleration resources of other hosts in the cluster. FVP 2.0 also lets you align your FVP cluster configuration with your datacenter fault domain topology. For more information, please read “What’s new in PernixData FVP 2.0 – User Defined Fault Domains”.

[Figure: Triple fault domain design]

Clustered solution
FVP provides fault-tolerant write acceleration based on clustered technology, including failure handling. If a component, host or network failure occurs, FVP seamlessly transitions write policies to ensure data availability for new incoming data. It automatically writes the uncommitted data present in the FVP cluster to the storage array; the source host does this, or one of the peer hosts does if the source host experiences problems. If a peer host fails, FVP automatically selects a new peer host to resume write acceleration services while safeguarding new incoming data, all without any user intervention. For more information, please read “Fault Tolerant Write Acceleration”.

A clustered platform is also necessary to support the vSphere clustering services that virtualized datacenters have leveraged for many years now. Both DRS and HA are fully supported. FVP remote access allows virtual machine mobility: data is accessible to a virtual machine regardless of the host it resides on. For more information, please read “PernixData FVP Remote Flash Access”. During a host failure, FVP ensures all uncommitted data is written to the storage array before allowing HA to restart the virtual machine.

Ease of configuration
Besides the incredible performance benefits, ease of configuration is a very strong point when deciding between flash and RAM as an acceleration resource. Memory is as close to the CPU as possible: no moving parts, no third-party storage controller driver, no specific configuration such as RAID or cache structures. Just install FVP, assign the amount of memory and you are in business. This reduction of moving parts and the close proximity of RAM to the CPU allows for extremely consistent and predictable performance, which results in incredible amounts of bandwidth, low latency and high I/O performance. The following screenshot is of a SQL DB server; notice the green flat line at the bottom, that is the network and VM observed latency.

[Screenshot: SQL DB server latency]

The I/O latency of RAM was 20 microseconds; the network latency of 270 microseconds was clearly the element that “slowed it down”. With some overhead incurred by the kernel, the application experienced a stable and predictable latency of 320 microseconds. I zoomed in to investigate possible fluctuations, but the VM observed latency remained constant.

[Screenshot: Zoomed-in latency view]

Blue line: VM observed latency
Green line: Network latency
Yellow line: RAM latency

The network latency is incurred by writing the data safely to another host in the cluster. Writes are done in a synchronous manner, meaning that the source host needs to receive acknowledgements from both resources before completing the I/O to the application.
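Using the figures quoted above, a quick decomposition (my own arithmetic, assuming the local and remote acknowledgements are awaited in parallel) shows where the 320 microseconds go:

```python
# Decomposition of the quoted latencies (microseconds). The write completes
# only after both acknowledgements arrive, so the slower path dominates.
ram_ack = 20                       # local RAM acknowledgement
network_ack = 270                  # acknowledgement from the peer host
vm_observed = 320                  # latency reported at the VM level

slowest_path = max(ram_ack, network_ack)
print(vm_observed - slowest_path)  # ~50 us left for kernel/FVP processing overhead
```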

This means that with DFTM you can now virtualize the most resource-intensive applications, with RAM providing fault-tolerant storage performance. A great example is SAP-HANA. Recently I wrote an article on the storage requirements of SAP-HANA environments. Although SAP-HANA is an in-memory database platform, it is recommended to use fast storage resources, such as flash, to provide performance for log operations. Logs have to be written outside the volatile memory structure to provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees for the database. By using FVP DFTM, all data (DB and logs) resides in memory at identical performance levels, while fault-tolerant write acceleration guarantees the ACID requirements. And thanks to the support for mobility, SAP-HANA or similar application landscapes are now free to roam the vSphere cluster, breaking down the last silos in your virtualized datacenter.

The next big thing

Channeling the wise words of Satyam Vaghani: the net effect of this development is that you are able to get predictable and persistent microsecond storage performance. With new developments popping up in the industry every day, it is not strange to wonder when we will hit nanosecond latencies. When the industry is faced with the possibility of these kinds of speeds, we at PernixData believe we can absolutely and fundamentally change what applications expect from storage infrastructure. Applications used to expect storage platforms to deliver performance at millisecond levels, and developers used to give up improving their code because storage platforms were the bottleneck. For the first time ever, storage performance is not the bottleneck, and for the first time ever, extremely fast storage is affordable with FVP and server-side acceleration resources. Even an SMB-class platform can now have a million IOPS and super-low latency if they want to. Now the real question for the next step becomes: if you can give a virtualized datacenter millions of IOPS at microsecond latency levels, what would you do with that power? What new types of applications will you develop; what new use cases would be possible with all that power?

We at PernixData believe that if we can change the core assumptions about the storage system and the way it performs, we could see a new revolution in application development and the way applications actually use infrastructure. And we think that revolution is going to be very, very exciting.

Article in Japanese: PernixData 2.0の新機能 ー 分散耐障害性メモリ(Distributed Fault Tolerant Memory – DFTM)