What’s new in PernixData FVP 2.0 – Adaptive network compression

Although FVP relies on the crossbar architecture of the network and its fast point-to-point connections, we noticed that in some scenarios network performance could become detrimental to storage performance. This was especially true when a lot of virtual machines ran in write back with two replicas. 1Gb networks in particular are susceptible to this when a lot of replication traffic is introduced.

32kb-iops

The diagram provided by Pete Koehler (vmpete.com) shows this behavior. On the left, the virtual machine is leveraging the power of the flash device: the application generates a workload of more than 5000 IOPS and the flash device (orange curve) happily provides that storage performance to the application. Once fault tolerance is enabled, the network sometimes becomes the bottleneck when the data is written to the remote flash device. The green curve, the network performance, overlays the orange (flash) curve and dictates the I/O performance.

Introducing Adaptive Network Compression

What we wanted to do was make sure that the bottleneck did not simply move from storage to network. Therefore we developed network compression, providing the ability to compress redundant write data before sending it over the network to remote flash devices. This allows FVP to consume the least amount of bandwidth possible; a byproduct of compression is a reduction in latency compared to sending the bigger uncompressed blocks.

The thing that immediately comes to mind when thinking about compression is CPU cost. There is a CPU cost associated with compressing data and there is a CPU cost associated with decompressing data at the peer end. And then you have to figure out which cost is more detrimental to performance: CPU cost or network bandwidth cost? The graph shown below is the CPU compression cost introduced by FVP adaptive network compression. It shows that the cost incurred is minimal. And that is awesome, because now you are able to get savings on network bandwidth without paying for it in terms of CPU.

ANC-cpu

The blue curve is the CPU cost the virtual machine incurs when using write back without replication. The orange curve is the CPU cost seen by the virtual machine with compression enabled. As you can see, the blue and orange curves are very close together, indicating that the CPU costs are minor.

The way to keep the CPU cost on the source host as low as possible is to use an adaptive algorithm that measures cost and benefit. Adaptive network compression makes sure that data can be compressed and that the cost of the compression does not exceed the benefit of bandwidth saved. The funny thing is that small blocks sometimes increase in size when compressed, therefore FVP reviews every piece of data and makes sure compression actually provides a benefit to the virtualized datacenter.
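To make that cost/benefit check concrete, here is a minimal sketch in Python of what such a per-write decision could look like. This is purely illustrative and not FVP's actual implementation; the 4KB threshold and the use of zlib are my own assumptions.

```python
import zlib

# Purely illustrative sketch of an adaptive compress-or-not decision per write.
# The 4 KB threshold and zlib are assumptions, not FVP internals.
MIN_COMPRESS_SIZE = 4096  # small blocks often grow when compressed

def prepare_replication_payload(write_data: bytes) -> tuple[bytes, bool]:
    """Return (payload, is_compressed) for sending to the peer host."""
    if len(write_data) < MIN_COMPRESS_SIZE:
        return write_data, False          # CPU cost likely exceeds the bandwidth benefit
    compressed = zlib.compress(write_data, level=1)  # cheap, fast compression level
    if len(compressed) >= len(write_data):
        return write_data, False          # no benefit: ship the original bytes
    return compressed, True               # benefit exceeds cost: ship compressed


# Example: a highly compressible 32 KB write shrinks, incompressible data would not.
payload, compressed = prepare_replication_payload(b"A" * 32768)
print(compressed, len(payload))
```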

An interesting fact is that we do not decompress the data when it is written to the remote flash device. This provides two benefits: no CPU overhead involved and a reduction of the remote flash footprint. The reason why we keep it in a compressed state is very simple. Typically environments perform well, meaning that the majority of the time your environment should be up and running and outages should be very infrequent. When a failure occurs and the task of writing data to the storage array falls on the peer host, we will incur the CPU cost of decompression. During normal operations, redundant write data lands on the flash device in a compressed state and is moved out of the flash device once the data is written to the storage array. This process should impact the peer host as little as possible, and keeping the data in a compressed state accomplishes that requirement.
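The peer-side behavior can be sketched along the same lines: keep the replica compressed on the acceleration resource and only pay the decompression cost in the rare destage-after-failure path. Again a hypothetical sketch, not FVP code; the class and method names are made up.

```python
import zlib

class PeerReplicaStore:
    """Hypothetical sketch: replicas stay compressed until a failure forces destaging."""

    def __init__(self):
        self._replicas = {}  # write_id -> (payload, is_compressed)

    def receive(self, write_id: str, payload: bytes, is_compressed: bool) -> None:
        # Normal operation: store as received, no decompression, no extra CPU cost,
        # and a smaller footprint on the remote acceleration resource.
        self._replicas[write_id] = (payload, is_compressed)

    def destage_after_source_failure(self, write_id: str) -> bytes:
        # Rare path: the peer must write the data to the storage array itself,
        # so only now is the decompression cost incurred.
        payload, is_compressed = self._replicas.pop(write_id)
        return zlib.decompress(payload) if is_compressed else payload
```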

Default-replication-traffic

The chart above illustrates the before and after state of write back with replication using a 1GbE connection. The performance on the left shows an average of 2700 IOPS; once replication is enabled the performance drops to 1700 writes per second. The flat line perfectly shows the bottleneck introduced by bandwidth constraints. When running the same workload on a 10GbE network, the network provides comparable speed and bandwidth.
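As a rough back-of-the-envelope check (with illustrative assumptions of my own, not numbers from the charts): a 1GbE link tops out around 125 MB/s, so at 1700 replicated writes per second the implied average write size is roughly 70 KB, consistent with the link being the limiting factor.

```python
# Back-of-the-envelope check of the 1GbE ceiling (illustrative assumptions).
link_bytes_per_s = 1_000_000_000 / 8      # ~125 MB/s usable at best on 1GbE
observed_writes_per_s = 1700              # from the chart above
implied_write_size_kb = link_bytes_per_s / observed_writes_per_s / 1024
print(f"~{implied_write_size_kb:.0f} KB per replicated write saturates the link")
```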

WB1-10GbE

The same workload test was done on a 1GbE network with network compression enabled. With compression, performance came close to native flash-only results.

ANC-on-1GbE

To illustrate the benefit of adaptive network compression, the engineers disabled and re-enabled the feature. In FVP 2.0 adaptive network compression is enabled automatically when a 1GbE network connection is detected. No user intervention is required and, for the curious minds, we do not offer a UI or PowerShell option to enable or disable it. The adaptive nature will provide you the best performance with the lowest amount of overhead.

Hero Numbers

One of the cool things on the FVP cluster summary is the set of hero numbers. These show the data saved from the storage array, the bandwidth saved from the storage area network and, in FVP 2.0, the amount of network bandwidth saved by Adaptive Network Compression. In order to make the screenshot I had to change my FVP network configuration from a 10GbE to a 1GbE network. With FVP you can do this on the fly without impacting the uptime of hosts or virtual machines. I ran a very quick test to generate some numbers, hence the underwhelming total. In reality, when using a 1GbE network with real application workloads you will see quite impressive numbers.

What’s new in PernixData FVP 2.0 – Distributed Fault Tolerant Memory

PernixData FVP 2.0 allows the use of multiple acceleration resources. In FVP 1.x various types of flash devices could be leveraged to accelerate virtual machine I/O operations. FVP 2.0 introduces support for server-side RAM.

RAM bound world

With recent chipset advances and software developments it is now possible to support terabytes of memory in a vSphere host. At VMworld VMware announced 6TB memory support for vSphere 6.0 and recently announced the same support for vSphere 5.5 Update 2. Intel’s newest processors support up to 1536GB of memory per CPU, allowing a four-way server (4 x 1536GB = 6TB) to easily reach the maximum memory supported by vSphere.

But what do you do with all this memory? As of now, you can use memory provided by the virtual infrastructure to accelerate virtual machine I/O. Other application vendors and Independent Software Vendors (ISVs) are leveraging these massive amounts of memory as well, although their requirements impact IT operations and services.

Figurexxx-memory-pyramid

It starts at the top: applications can leverage vast amounts of memory to accelerate data, however the user needs to change the application, and implementing this is not typically considered a walk in the park. ISVs caught on to this trend and did the heavy lifting for their user base, however you still need to run these specific apps to operationalize memory performance for storage. Distributed Fault Tolerant Memory (DFTM) allows every application in the virtualized datacenter to benefit from incredible storage performance with no operational or management overhead. Think of the introduction of DFTM as similar to the introduction of vSphere High Availability. Before HA you either had application-level HA capabilities or clustering services such as Veritas or Microsoft Cluster Service. HA provided failover capabilities to every virtual machine and every application the moment you configured a simple vSphere cluster service.

Scaling capabilities
DFTM rests on the pillars that FVP provides: a clustered, fault tolerant platform that scales out performance independent of storage capacity. DFTM allows you to seamlessly hot-add and hot-shrink RAM resources in the FVP cluster. When more acceleration resources are required, just add more RAM to the FVP cluster. If RAM is needed for virtual machine memory or you have other plans with that memory, just shrink the amount of host RAM provided to the FVP cluster. Host memory now becomes a multipurpose resource, providing virtual machine compute memory or I/O acceleration for virtual machines; it's up to you to decide what role it performs. When the virtual datacenter needs to run new virtual machines, add new hosts to the cluster and assign a portion of host memory to the FVP cluster to scale out storage performance as well.
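One way to picture host memory as a multipurpose resource is a simple pool you can grow and shrink at runtime. The sketch below is conceptual only, with made-up names, and has nothing to do with how FVP manages RAM internally.

```python
class HostRamPool:
    """Conceptual sketch: carve a slice of host RAM out for I/O acceleration."""

    def __init__(self, total_gb: int, reserved_for_vms_gb: int):
        self.total_gb = total_gb
        self.reserved_for_vms_gb = reserved_for_vms_gb
        self.acceleration_gb = 0

    def hot_add(self, gb: int) -> None:
        # Grow the acceleration tier, but never touch memory reserved for VM compute.
        free = self.total_gb - self.reserved_for_vms_gb - self.acceleration_gb
        if gb > free:
            raise ValueError(f"only {free} GB free for acceleration")
        self.acceleration_gb += gb

    def hot_shrink(self, gb: int) -> None:
        # Give memory back to the host for VM compute (or other plans you have with it).
        self.acceleration_gb = max(0, self.acceleration_gb - gb)


pool = HostRamPool(total_gb=512, reserved_for_vms_gb=384)
pool.hot_add(64)       # contribute 64 GB of host RAM to the acceleration tier
pool.hot_shrink(32)    # reclaim half of it for VM compute memory
```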

Fault Tolerant write acceleration
FVP provides the same fault tolerance and data integrity guarantees for RAM as for flash. It provides the ability to store replicas of write data on the flash or RAM acceleration resources of other hosts in the cluster. FVP 2.0 adds the ability to align your FVP cluster configuration with your datacenter fault domain topology. For more information please read "What's new in PernixData FVP 2.0 – User Defined Fault Domains".

Figurexxx-triple-FD-design

Clustered solution
FVP provides fault tolerant write acceleration based on clustered technology and provides failure handling. If a component, host or network failure occurs, FVP seamlessly transitions write policies to ensure data availability for new incoming data. It automatically writes uncommitted data that is present in the FVP cluster to the storage array; either the source host or one of the peer hosts does this if the source host experiences problems. If a failure occurs with a peer host, FVP automatically selects a new peer host in order to resume write acceleration services while safeguarding new incoming data. All of this happens without any user intervention required. For more information, please read "Fault Tolerant Write Acceleration".
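The failure handling described above can be summarized as two transitions: on a peer problem, pick a new peer and keep accelerating; on a source problem, have a surviving replica holder destage the uncommitted data. The sketch below is my own simplification under those assumptions, not FVP's implementation.

```python
from dataclasses import dataclass, field

# Simplified sketch of the failure transitions described above (not FVP code).

@dataclass
class AcceleratedVM:
    name: str
    source_host: str
    peer_host: str
    uncommitted_writes: list = field(default_factory=list)

def on_peer_failure(vm: AcceleratedVM, healthy_hosts: list) -> None:
    # Peer (or its network link) is gone: select a fresh peer so new incoming
    # writes keep getting a redundant copy, without user intervention.
    vm.peer_host = next(h for h in healthy_hosts
                        if h not in (vm.source_host, vm.peer_host))

def on_source_failure(vm: AcceleratedVM, storage_array: list) -> None:
    # Source host is gone: the peer destages all uncommitted data to the
    # storage array before acceleration (and an HA restart) can resume.
    storage_array.extend(vm.uncommitted_writes)
    vm.uncommitted_writes.clear()


vm = AcceleratedVM("sql01", source_host="esx1", peer_host="esx2",
                   uncommitted_writes=[b"block-42"])
on_peer_failure(vm, healthy_hosts=["esx1", "esx3", "esx4"])   # esx3 becomes the new peer
array: list = []
on_source_failure(vm, array)                                  # peer flushes block-42
```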

A clustered platform is also necessary to support the vSphere clustering services that virtualized datacenters have leveraged for many years now. Both DRS and HA are fully supported. FVP remote access allows virtual machine mobility: data is accessible to virtual machines regardless of the host they reside on. For more information please read "PernixData FVP Remote Flash Access". During a host failure, FVP ensures all uncommitted data is written to the storage array before allowing HA to restart the virtual machine.

Ease of configuration
Besides the incredible performance benefits, the ease of configuration is a very strong point when deciding between flash or RAM as an acceleration resource. Memory is as close to the CPU as possible. No moving parts, no third party storage controller driver, no specific configuration such as RAID or cache structures. Just install FVP, assign the amount of memory and you are in business. This reduction of moving parts and the close proximity of RAM to the CPU allow for extremely consistent and predictable performance. This results in incredible amounts of bandwidth, low latency and high I/O performance. The following screenshots are of a SQL DB server; notice the green flat line at the bottom, that's the network and VM observed latency.

Screen Shot 2014-10-02 at 12.33.14

The I/O latency of RAM was 20 microseconds; the network latency of 270 microseconds was clearly the element that "slowed it down". With some overhead incurred by the kernel, the application experienced a stable and predictable latency of 320 microseconds. I zoomed in to investigate any possible fluctuations but the VM observed latency remained constant.

Screen Shot 2014-10-02 at 12.35.10

Blue line: VM observed latency
Green line: Network latency
Yellow line: RAM latency

The network latency is incurred by writing the data safely to another host in the cluster. Writes are done in a synchronous manner, meaning that the source host needs to receive acknowledgements from both resources before completing the I/O to the application.
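That synchronous ordering is what stacks the roughly 270 microsecond network round trip on top of the roughly 20 microsecond RAM write. Below is a minimal sketch of the idea, with made-up helper names and threads used purely for illustration; it is not FVP's write path.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of a synchronous replicated write (helper names are made up).

def write_local_ram(data: bytes) -> bool:
    return True   # ~20 microseconds in the example above

def replicate_to_peer(data: bytes) -> bool:
    return True   # ~270 microseconds: network round trip to the peer host

def accelerated_write(data: bytes) -> None:
    # The local copy and the peer replica are written in parallel, but the I/O is
    # only acknowledged to the application once BOTH acknowledgements are in,
    # so the slower of the two (the network) dictates the observed latency.
    with ThreadPoolExecutor(max_workers=2) as pool:
        local = pool.submit(write_local_ram, data)
        remote = pool.submit(replicate_to_peer, data)
        assert local.result() and remote.result()
    # only now does the application see the write complete

accelerated_write(b"log record")
```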

This means that with DFTM you can now virtualize the most resource-intensive applications with RAM providing fault tolerant storage performance. A great example is SAP HANA. Recently I wrote an article on the storage requirements of SAP HANA environments. Although SAP HANA is an in-memory database platform, it is recommended to use fast storage resources, such as flash, to provide performance for log operations. Logs have to be written outside the volatile memory structure to provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees for the database. By using FVP DFTM, all data (DB & logs) resides in memory and has identical performance levels while leveraging fault tolerant write acceleration to guarantee the ACID requirements. And due to the support of mobility, SAP HANA or similar application landscapes are now free to roam the vSphere cluster, breaking down the last silos in your virtualized datacenter.

The next big thing

Channeling the wise words of Satyam Vaghani: the net effect of this development is that you are able to get predictable and persistent microsecond storage performance. With new developments popping up in the industry every day, it is not strange to wonder when we will hit nanosecond latencies. When the industry is faced with the possibility of these types of speeds, we at PernixData believe that we can absolutely and fundamentally change what applications expect out of storage infrastructure. Applications used to expect storage platforms to provide performance at millisecond levels, and used to give up improving their code because the storage platform was the bottleneck. For the first time ever storage performance is not the bottleneck, and for the first time ever extremely fast storage is affordable with FVP and server-side acceleration resources. Even an SMB-class platform can now have a million IOPS and super low latency if they want to. Now the real question for the next step becomes: if you can make a virtualized datacenter deliver millions of IOPS at microsecond latency levels, what would you do with that power? What new type of application will you develop; what new use cases would be possible with all that power?

We at PernixData believe that if we can change the core assumptions around the storage system and the way it performs, then we could see a new revolution in terms of application development and the way applications actually use infrastructure. And we think that revolution is going to be very, very exciting.

Article in Japanese: What’s new in PernixData 2.0 – Distributed Fault Tolerant Memory (DFTM)

What’s new in PernixData FVP 2.0 – User Defined Fault Domains

With the announcement of FVP 2.0 a lot of buzz will be around distributed fault tolerant memory and the support of NFS. This all makes sense of course, since for the first time in history compute memory becomes a part of the storage system and you are now able to accelerate file-based storage systems. But one of the new features I’m really excited about is Fault Domains.

In FVP 2.0 you group hosts to reflect your datacenter fault domain topology within FVP and ensure redundant write data is stored in external fault domains. Let’s take a closer look at this technology, starting with a review of fault tolerant write acceleration in the current version.

FVP 1.5 Fault Tolerant Write Acceleration

When accelerating a datastore or virtual machine in FVP 1.x you can select 0, 1 or 2 replicas of redundant write data. When the option "local plus two network flash devices" is selected, FVP automatically selects two hosts in the cluster that have access to the datastore and have a network connection with the source host. If a failure occurs, such as the source host disconnecting, crashing or its flash device failing, one of the hosts containing the replica write data will take over and send all uncommitted data from the acceleration layer to the storage system. For more detailed information, please read "Fault Tolerant Write Acceleration".

Let’s use the example of a four-host vSphere cluster. All four hosts have FVP installed and participate in a FVP cluster. VM A is configured with a write policy of a local copy plus 1 network flash device. In this scenario FVP selected ESXi Host 1 as the peer host, so Host 2, the source host, sends the redundant write data (RWD) to Host 1.

Figurexxx-four-host-cluster

But what if Host 1 and Host 2 are part of the same fault domain? A fault domain is a set of components that share a single point of failure, such as a blade system or a server rack with a single power source. Many organizations treat blade systems and their enclosures as a fault domain. If the backplane of the blade system fails, all servers in it can be disconnected from the network, unable to send data to the storage array or to other connected systems.

blade system2

In this scenario neither copy of the uncommitted data can be written to the storage array if the network connection goes down. Or even worse, what if the whole blade system goes down while RAM is used as an acceleration resource?

FVP 2.0 User Defined Fault Domains

Fault Domains allow you to reflect your datacenter topology within FVP. This topology can be used to control where data gets replicated to when running in Write Back mode.

Default Behavior
All hosts in the vSphere cluster are initially placed in the default fault domain. The default fault domain cannot be renamed, removed or given explicit associations. Newly added hosts will automatically be placed into this default fault domain. A host can be a member of only one fault domain, which means that all FVP clusters in the vSphere cluster share the same fault domains.
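Put differently, fault domain membership is a simple host-to-domain mapping with a catch-all default domain and exactly one domain per host. The sketch below merely restates those rules in code; it is not an FVP API and the names are made up.

```python
class FaultDomainConfig:
    """Sketch of the membership rules: one domain per host, shared by all FVP clusters."""

    DEFAULT = "Default Fault Domain"   # cannot be renamed or removed

    def __init__(self):
        self.domain_of = {}            # host -> fault domain name

    def add_host(self, host: str) -> None:
        # Newly added hosts land in the default fault domain automatically.
        self.domain_of.setdefault(host, self.DEFAULT)

    def move_host(self, host: str, domain: str) -> None:
        # A host can be a member of only one fault domain at a time.
        self.domain_of[host] = domain


cfg = FaultDomainConfig()
for h in ("esx1", "esx2", "esx3", "esx4"):
    cfg.add_host(h)
cfg.move_host("esx1", "Blade Center 1")
cfg.move_host("esx2", "Blade Center 1")
cfg.move_host("esx3", "Blade Center 2")
cfg.move_host("esx4", "Blade Center 2")
```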

03-Fault-domain-hover-over

After following the steps mentioned in the article Configuring PernixData FVP 2.0 Fault Domains, two additional fault domains exist: Blade Center 1 and Blade Center 2.

08-Fault Domains overview

Replica placement
When configuring acceleration of a datastore or virtual machines, you are now able to control where the data is replicated to when using write back acceleration. You do not have to select a specific host or a specific fault domain; just provide the number of replicas and whether they should be placed in the same fault domain or in an external fault domain. FVP load balances the workload across the different vSphere hosts in the cluster, ensuring distribution of network traffic and acceleration resource consumption while still safeguarding compliance with the fault domain policies.

07-Add VMs - Commit Writes to

Be aware that FVP only selects fault domains that belong to the same FVP and vSphere cluster. FVP will not select any fault domains that belong to a different vSphere cluster. By default the FVP Write Back write policy selects 1 peer host in the same fault domain, but this can easily be adjusted to any other configuration. Just select the required number of replica copies in the appropriate fault domain. Please note that the maximum number of peer hosts can never exceed two. For example, if two peer hosts in different fault domains are selected, no peer hosts can be selected in the same fault domain.

For the extremely risk-averse designs: if more than two fault domains are configured, FVP will distribute the replicas across two external fault domains, thus having the data in three different fault domains (local + fault domain 1 + fault domain 2).
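Tying the placement rules together (at most two peer hosts, same versus external fault domain, and spreading replicas across fault domains when more than two exist), here is a hedged sketch of what such a selection could look like. It is my own interpretation of the rules above, not PernixData's algorithm.

```python
# Hedged sketch of peer selection under the placement rules above (not FVP's algorithm).

def select_peers(source: str, domain_of: dict,
                 same_domain_peers: int, external_domain_peers: int) -> list:
    """Pick peer hosts for the write replicas; the total never exceeds two."""
    if same_domain_peers + external_domain_peers > 2:
        raise ValueError("the maximum number of peer hosts is two")

    src_domain = domain_of[source]
    same = [h for h in domain_of if h != source and domain_of[h] == src_domain]
    external = [h for h in domain_of if domain_of[h] != src_domain]

    peers = same[:same_domain_peers]
    # Spread external replicas across distinct fault domains when possible, so the
    # data can end up in three different fault domains (local + FD1 + FD2).
    used_domains = set()
    for host in external:
        if len(used_domains) == external_domain_peers:
            break
        if domain_of[host] not in used_domains:
            peers.append(host)
            used_domains.add(domain_of[host])
    return peers


domains = {"esx1": "BC1", "esx2": "BC1", "esx3": "BC2", "esx4": "BC3"}
print(select_peers("esx1", domains, same_domain_peers=0, external_domain_peers=2))
# -> one peer in BC2 and one in BC3: local plus two external fault domains
```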

Figurexxx-triple-FD-design

Error correction

In the scenario where the source host fails, the peer host in the designated fault domain will write the uncommitted data to the storage system. In case of a network connection failure or a peer host failure of any kind, the PernixData Management Server will select a new peer host within the fault domain. This is all done transparently and no user interaction is required.

Figurexxx-dynamic-peer-host-selection

Topology alignment with fault domains

Fault domains build upon the strong fault tolerant feature present in FVP 1.x and are an excellent way to make your environment more resilient against component, network or host failure. By aligning FVP fault domains to your datacenter topology you can leverage the deterministic placement of redundant write data to either improve resiliency or take advantage of the availability of internal network bandwidth in blade systems.

Configuring PernixData FVP 2.0 Fault Domains

This article covers the configuration of PernixData FVP 2.0 Fault Domains using the scenario of a four-host vSphere cluster stretched across two blade centers:

blade system2

Default Behavior
All hosts in the vSphere cluster are initially placed in the default fault domain. The default fault domain cannot be renamed, removed or given explicit associations. Newly added hosts will automatically be placed into this default fault domain. A host can be a member of only one fault domain, which means that all FVP clusters in the vSphere cluster share the same fault domains.

03-Fault-domain-hover-over

Let’s create two additional fault domains to reflect the blade system topology. I’m using the web client; when using the vSphere client, navigate to the PernixData tab in your vSphere cluster.

1. In the vCenter Inventory, navigate to the PernixData FVP inventory item and select FVP Clusters
2. Click on a FVP cluster in the designated vSphere Cluster
3. Go to Manage, select Fault Domains, click on “add Fault Domains”

04-Add-Fault-Domain-Blade-center-1

4. Provide a name and click OK when finished
5. Click on the option “Add Host…” and select the hosts, click OK when finished.

05-Add hosts - Blade Center 1

6. Repeat steps 3 to 5 to create additional Fault Domains. Please note that the user interface displays the current Fault Domain, allowing you to easily determine which hosts should be moved to their own fault domain.

06-Current Fault Domain

The overview of Fault Domains now shows 3 fault domains: the Default Fault Domain and the Fault Domains Blade Center 1 and 2.

08-Fault Domains overview

During the configuration of datastore or VM acceleration, you are now able to control where the data is replicated to when using write back acceleration.

07-Add VMs - Commit Writes to

The FVP Write Back write policy defaults to 1 peer host in the same fault domain, but this can easily be adjusted to any other configuration. Just select the required number of replica copies in the appropriate fault domain. Please note that the maximum number of peer hosts can never exceed two. For example, if two peer hosts in different fault domains are selected, no peer hosts can be selected in the same fault domain. In this scenario, the VM is configured with one peer host in the same fault domain and one peer host in a different fault domain.

Figurexxx-1 local and remote peer host

For the extremely risk-averse designs: if more than two fault domains are configured, FVP will distribute the replicas across two external fault domains, thus having the data in three different fault domains (local + fault domain 1 + fault domain 2).

Figurexxx-triple-FD-design

For in-depth information about PernixData FVP 2.0, please continue to read the article "What’s new in PernixData FVP 2.0 – User Defined Fault Domains".

PernixData FVP 2.0 released

It gives me great pleasure to announce that PernixData released FVP 2.0 today. Building upon the industry-leading acceleration platform, FVP 2.0 contains the following new features:

Distributed Fault Tolerant Memory: FVP fault tolerance makes volatile server memory part of storage for the first time ever.

NFS support: With FVP 2.0 you can now accelerate application workloads while any type of storage system provides storage capacity, whether it’s block-based or file-based (NFS).

Adaptive Network Compression: FVP provides its own lightweight network protocol to send redundant write data between the source and peer hosts in the FVP cluster. In 2.0, adaptive network compression analyzes the write data in real time and, if the benefit exceeds the cost, the data is compressed to reduce latency and consumed bandwidth.

User defined fault domains: Fault Domains allow you to reflect your datacenter topology within FVP. This topology can be used to control where data gets replicated to when running in Write Back mode.

For the official GA release notes please follow this link.

Starting today, I will cover the new features in depth in the what’s new in PernixData FVP 2.0 series.

Part 1: User Defined Fault Domains
Part 2: Distributed Fault Tolerant Memory
Part 3: Adaptive network compression