Fault Tolerant Write Acceleration

To get the best performance it is crucial to accelerate both read and write operations. However, there is a fundamental difference in platform design between offering both read and write acceleration and offering read acceleration alone.

Accelerating read operations is a rather “straightforward” process: copy the data read from the datastore into the flash layer and serve subsequent read operations from the flash device. There is no direct need for fault tolerance, as the data on the flash device is a copy of the data on the datastore. Changes to the data are always made to the data residing on the datastore layer. Therefore, other than a performance reduction, the application will not be impacted when the flash device fails. This is not the case when providing write acceleration. When accelerating a write operation, the data is committed to the flash device first and then copied to the storage system in the background. During this time window there is a chance that the uncommitted data is lost in the event of a permanent host or component failure.
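To make the difference concrete, here is a minimal Python sketch of the two behaviors. The FlashCache class and its members are purely illustrative assumptions of mine, not FVP code.

    # A minimal sketch (not FVP code) contrasting read acceleration with
    # write-back acceleration; FlashCache and its members are hypothetical.

    class FlashCache:
        def __init__(self, flash, datastore):
            self.flash = flash              # stands in for the local flash device
            self.datastore = datastore      # stands in for the shared datastore
            self.pending_destage = set()    # writes not yet copied to the datastore

        def read(self, block):
            # Read acceleration: copy the block into flash on first access and
            # serve it from flash afterwards. Losing the flash device only costs
            # performance; the datastore still holds the authoritative copy.
            if block not in self.flash:
                self.flash[block] = self.datastore[block]
            return self.flash[block]

        def write_back(self, block, data):
            # Write acceleration: acknowledge as soon as the data is on flash.
            # Until it is destaged, the flash layer holds the only committed
            # copy, which is why FVP replicates it to neighboring hosts.
            self.flash[block] = data
            self.pending_destage.add(block)

        def destage(self):
            # Background copy of the uncommitted writes to the datastore.
            for block in list(self.pending_destage):
                self.datastore[block] = self.flash[block]
                self.pending_destage.discard(block)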

Data loss needs to be avoided at all times; therefore, the FVP platform is designed from the ground up to provide data consistency and availability. By replicating write data to neighboring flash devices, data loss caused by host or component failure is prevented. Due to the clustered nature of the platform, FVP is capable of keeping the state of the write data on the source and replica hosts consistent, while reducing the required space to a minimum and without taxing the network connection too much. Let’s take a closer look at how FVP provides fault tolerant write acceleration.

FVP integration by extending the hypervisor
The foundation of fault tolerant write acceleration is provided by extending the VMkernel. Due to the tight integration of FVP with the extremely stable ESXi core, FVP is guaranteed proper scheduling and resource availability. The FVP kernel module extends the kernel code, which means there are no “services” running in the VMs or a virtual appliance that can be accidentally stopped or powered down. Similar to the VMkernel, FVP can run on any CPU, therefore it scales with the workload, and as long as the kernel is up, FVP is up. More info about kernel modules versus virtual appliances can be found here: Basic elements of the flash virtualization platform – Part 1

Write redundancy
When adding a virtual machine to a Flash Cluster, you select the appropriate write policy. For more info about the difference between write-through and write-back, please read the article “Write-Back and Write-Through policies in FVP”. The write-back policy offers three options: “Local flash only”, “Local flash and 1 network flash device” and “Local flash with 2 network flash devices”.

[Image: Write-back policy selection]
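Purely as an illustration, the write policies could be modeled as a configuration value as in the sketch below. The enum names and the replica-count mapping are my own summary of the options above, not the FVP API.

    # A hypothetical summary of the write policies as a configuration value.
    from enum import Enum

    class WritePolicy(Enum):
        WRITE_THROUGH = "Write-Through"
        WB_LOCAL_ONLY = "Write-Back, local flash only"
        WB_ONE_PEER = "Write-Back, local flash + 1 network flash device"
        WB_TWO_PEERS = "Write-Back, local flash + 2 network flash devices"

    # Number of replica hosts required for each policy.
    REPLICAS_REQUIRED = {
        WritePolicy.WRITE_THROUGH: 0,
        WritePolicy.WB_LOCAL_ONLY: 0,
        WritePolicy.WB_ONE_PEER: 1,
        WritePolicy.WB_TWO_PEERS: 2,
    }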

Local flash only
The “Local flash only” write-back policy is designed to provide the best performance for applications that do not require any additional fault domain resiliency, such as non-persistent VDI desktops or kiosk applications. Please note that because flash is a non-volatile resource, the data is still available as long as the device itself has not failed. For example, if the host fails and reboots, the data is still present on the flash device. If a permanent hardware failure occurs at the host level, the flash device can be placed into another host that is part of the flash cluster; FVP will detect the flash device and offer to destage the data to the datastore if necessary.

Local flash with network flash devices
When selecting a write-back policy with one or two network flash devices, all write data is sent to the appropriate number of neighboring flash devices. The process of writing data follows a synchronous replication pattern, meaning that both the source and the remote flash device(s) need to acknowledge the write before the write operation is acknowledged back to the virtual machine; a simplified sketch of this flow follows the numbered steps below.

[Image: Synchronous write acknowledgment]

1. The virtual machine generates a write operation.
2. FVP sends the I/O to the local flash device and remote flash device simultaneously.
3. The remote flash device (and local flash device) acknowledges the write completion.
4. FVP acknowledges the write completion to the virtual machine.

At this stage the I/O is complete and the application can continue. FVP asynchronously writes the I/O to the storage system. This process is completely transparent to the application.

5. FVP destages it to the storage system.
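A minimal sketch of steps 1 through 5, assuming hypothetical local_flash and replica objects with a write() method and a simple destage queue. It illustrates the ordering of the acknowledgments, not PernixData's actual implementation.

    # Sketch of the accelerated write path; all names are assumptions.
    from concurrent.futures import ThreadPoolExecutor

    def accelerated_write(io, local_flash, replica_hosts, destage_queue):
        # Step 1: `io` is the write operation generated by the virtual machine.
        # Step 2: send the I/O to the local and remote flash devices in parallel.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(local_flash.write, io)]
            futures += [pool.submit(replica.write, io) for replica in replica_hosts]
            # Step 3: wait until every flash device has acknowledged the write.
            for future in futures:
                future.result()

        # Step 4: only now is the write acknowledged back to the virtual machine.
        # Step 5 happens later: the I/O is queued for asynchronous destaging.
        destage_queue.append(io)
        return "ACK"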

Destage frequency
Although the data is safely stored on multiple devices, FVP aims to destage the data to the storage system as soon as possible. After acknowledging the write to the flash device, FVP destages the data to the storage system as fast as it can without overwhelming the storage system. FVP keeps the source and replica hosts in sync about the status of the data; after the data is safely stored on the storage system, FVP informs the replica host and allows it to discard the data at its leisure. This way the replica host can reuse the space without aggressively burning through the program and erase cycles of the flash device, and the overhead of the extra flash footprint is kept to an absolute minimum.
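A rough sketch of that destaging flow, again with assumed object names rather than FVP internals:

    # Destage write data to the array as fast as it will accept it, then
    # tell the replica host it may lazily discard its copy.
    def destage_loop(destage_queue, storage, replica_hosts):
        while destage_queue:
            io = destage_queue.pop(0)
            storage.write(io)             # destage at the pace the array can sustain
            for replica in replica_hosts:
                # The replica may free the space for this write whenever it is
                # convenient, avoiding unnecessary program/erase cycles.
                replica.mark_discardable(io)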

[Image: Local versus remote flash footprint]

The UI displays the flash footprint of the virtual machine on each device. In this scenario, the virtual machine Machine1 is configured with a write-back policy of “Local flash and 1 network flash device”. The local flash footprint of Machine1 is currently 208.5MB, while the replicated write data takes up 512KB of flash space.

Failure handling
But what if a component in the architecture fails, what happens then? Components can fail in the host that runs the virtual machine, but also in a host containing the replica write data. FVP is designed to handle both source and replica host failures.

Source host failure handling
The moment the source host experiences problems, FVP ensures that one of the replica hosts containing the write data becomes responsible for writing the data to the array. These problems could be a flash device failure or a complete host failure. If the flash device fails on the source host, the virtual machine remains active and becomes unaccelerated. This means that a flash device failure impacts performance, not the availability of the application itself. The host containing the replica data becomes responsible for destaging the remaining write data.

The behavior is similar when the source host itself fails: one of the replica hosts will destage the remaining write data to the storage system. HA restarts the virtual machine on a running host and the application can continue to function without any data loss.
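Put as a sketch, the source-side failure handling boils down to the following; the function and object names are hypothetical.

    # Sketch of source host failure handling; not PernixData code.
    def handle_source_failure(failure, vm, replica_hosts):
        # One replica host takes over destaging the remaining write data.
        destaging_host = replica_hosts[0]
        destaging_host.destage_remaining_writes(vm)

        if failure == "flash_device":
            # The VM keeps running on the source host, just without acceleration:
            # a flash device failure costs performance, not availability.
            vm.mark_unaccelerated()
        elif failure == "host":
            # vSphere HA restarts the VM on a surviving host; no data is lost
            # because the replica already holds the uncommitted writes.
            pass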

Replica host failure handling
What if the flash device or the entire host that stores the replica write data fails? In that scenario FVP triggers two different processes. First of all, the virtual machine write policy is dynamically set to write-through mode. As mentioned before, the primary goal of FVP is to ensure data availability, and due to a failure on the replica host the current set of hosts cannot conform to the selected write redundancy requirement.

[Image: Dynamic state of write-back policies]

During the transition from write-back to write-through, the source host immediately destages all remaining writes.

The PernixData distributed resource manager will select a different host in the Flash Cluster as a replacement replication host. Once this selection is complete, the virtual machine write policy is returned to write-back mode. This process is fully automatic and requires no input from the administrator. FVP protects data first and foremost and switches back to optimal write acceleration once the environment is healthy enough to do so.

A datastore connection failure on the replica host will trigger the PernixData distributed resource manager to remove the host from the list of replica hosts and to transition the write-back policies to write-through mode.
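The replica-side handling could be summarized with the following sketch; resource_manager, vm and flash_cluster are assumed objects, not the PernixData API.

    # Sketch of replica host failure handling; all names are assumptions.
    def handle_replica_failure(vm, failed_replica, flash_cluster, resource_manager):
        # 1. Protect the data first: fall back to write-through and flush
        #    everything that has not been destaged yet.
        vm.set_policy("write-through")
        vm.source_host.destage_remaining_writes(vm)

        # 2. Let the distributed resource manager pick a replacement replica
        #    from the remaining hosts in the flash cluster.
        candidates = [h for h in flash_cluster.hosts
                      if h not in (vm.source_host, failed_replica)]
        new_replica = resource_manager.select_replica(candidates)

        # 3. Once the replacement is in place, return to the requested
        #    write-back policy automatically.
        vm.add_replica(new_replica)
        vm.set_policy(vm.requested_policy)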

vMotion network failure handling
If the vMotion network between the source host and the replica host fails, FVP follows the same process as for a replica host failure.

[Image: Write-back policy denied]

The virtual machine write policy is set to write-through and the PernixData distributed resource manager selects and pushes a new replication host to the source host. Once the new replication host is assigned, the write policy is returned to the requested write policy:

[Image: Write-back policy returned]

Fault tolerant write acceleration
vSphere has a proven track record of being a very stable hypervisor, and by tightly integrating FVP with the ESXi core a resilient foundation is created to service write acceleration. With the use of clustered technology, FVP replicates write data to neighboring devices to avoid single points of failure. A key focus of the platform is to keep the overhead to a minimum and avoid risk at all times; the speed of destaging and the replica data management are two prime examples that show this technology provides true enterprise-class storage performance functionality.
