During my recent trip to the headquarters of PernixData I had the opportunity to pick the brains of Satyam and some of the key engineers. Throughout the discussions and lectures I learned a lot about the design choices engineering faced when creating the Flash Virtualization Platform. When dealing with very low latency (microsecond) devices such as flash, the last thing you want to do is add latency with your own software. Making the right choices is crucial in creating a platform that is both efficient and effective in providing performance to your applications. The following series of articles expands on the elements that make up the Flash Virtualization Platform and shares the motivation for choosing certain elements over others. Let’s start off the series by zooming in on why a kernel module was chosen over a virtual appliance.
Kernel module versus virtual appliance
One of the first problems to solve is keeping the I/O path as short as possible in order to leverage the performance characteristics of flash. A common solution is to use a virtual appliance that consumes the local flash resource; however, when analyzing the I/O path of the virtual machine it becomes clear that this runs counter to the goal of the shortest possible I/O path. As the diagram in the next paragraph illustrates, there are multiple distinct steps in the I/O path, and each step lengthens the data path and introduces a potential delay at various levels.
Resource management challenges of a virtual appliance
Besides extending the I/O path, resource management of a virtual appliance is a challenge. Virtual appliances are bound by the ESX scheduler. From a kernel perspective the virtual appliance is just another virtual machine, meaning that the virtual appliance has to wait its turn in the CPU queue alongside all the other virtual machines it is serving.
The virtual appliance might not get scheduled often enough, leaving the I/O to sit and wait inside the virtual appliance (#3) or in the VMkernel (#2) before finding its way to the kernel storage stack, where it is affected by queues once again. Similar issues arise on the I/O completion path. The flash device completes the I/O (#5) to the virtual appliance, the appliance has to be scheduled to process the completion, then the appliance issues a completion (#6) to the VM (#7) via the VMkernel, and the VM has to be scheduled to receive the completion (#8). On a lightly utilized system this problem will not become apparent right away, but as the environment grows and more and more virtual machines are added, the I/O acceleration will rapidly diminish.
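To make the scheduling argument concrete, here is a back-of-the-envelope sketch in Python. The hop counts and wait times are assumptions picked purely for illustration, not FVP or vSphere measurements, but they show how a handful of extra scheduling points can dwarf the latency of the flash device itself.

```python
# Illustrative only: how per-hop scheduling waits on a virtual appliance
# data path stack up against the raw latency of the flash device.
# All numbers below are assumptions, not measured FVP or vSphere data.

FLASH_LATENCY_US = 100   # assumed raw flash read latency in microseconds
SCHED_WAIT_US = 50       # assumed average wait for CPU time under contention

# Appliance path: the appliance must be scheduled to receive the I/O (#3),
# scheduled again to process and issue the completion (#5/#6), and the VM
# must be scheduled to receive the final completion (#8).
APPLIANCE_SCHED_POINTS = 3

# In-kernel path: only the VM itself needs to be scheduled to receive the
# completion; the module runs inside the VMkernel I/O path.
KERNEL_SCHED_POINTS = 1

def end_to_end_us(sched_points: int) -> int:
    return FLASH_LATENCY_US + sched_points * SCHED_WAIT_US

print(f"virtual appliance path: ~{end_to_end_us(APPLIANCE_SCHED_POINTS)} us")
print(f"in-kernel module path:  ~{end_to_end_us(KERNEL_SCHED_POINTS)} us")
```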
One might say: apply CPU reservations! But per-VM reservations introduce a whole new set of challenges, all the way up to impacting cluster-wide HA failover policy configuration. Furthermore, CPU reservations do not guarantee the virtual appliance instant access to the CPU; if another vCPU is currently scheduled, it is allowed to finish its operation, causing delay in the I/O operations within the virtual appliance.
Another problem is that the virtual appliance is not scheduled proportionally to the amount of traffic coming in, traffic that is generated by the virtual machines it is serving, making it a challenge to right-size the virtual appliance appropriately. You have to revisit the sizing of the virtual appliance whenever the virtual infrastructure grows or shrinks. As we know, oversized virtual machines don’t behave very well in a virtual infrastructure, while an undersized virtual appliance performs badly.
It isn’t only a CPU problem! The memory copies involved in the virtual appliance I/O path typically incur additional overhead, especially when doing I/O to a high-throughput device like flash, because of the sheer number of commands flying around. VMware provides ways to establish lower-overhead communication channels among VMs (see VMCI: http://www.vmware.com/support/developer/vmci-sdk/), but this technique is not available between the core ESXi hypervisor kernel and VMs.
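As a rough illustration of why copies matter at flash speeds, consider the sketch below. The block size, copy count and IOPS figures are assumptions, not measurements; the point is only that per-command overhead scales with the command rate, and flash drives that rate orders of magnitude higher than spindles.

```python
# Assumption-based estimate of the extra memory traffic caused by buffer
# copies in an appliance-style I/O path at flash-class command rates.

IO_SIZE_BYTES = 4 * 1024      # assumed block size
EXTRA_COPIES_PER_IO = 2       # assumed extra copies versus an in-kernel path
FLASH_IOPS = 100_000          # the kind of rate a flash device can sustain
SPINDLE_IOPS = 200            # the kind of rate a single spindle can sustain

for label, iops in (("flash", FLASH_IOPS), ("spindle", SPINDLE_IOPS)):
    extra_bytes_per_sec = IO_SIZE_BYTES * EXTRA_COPIES_PER_IO * iops
    print(f"{label:>7}: {extra_bytes_per_sec / 1e6:,.0f} MB/s of extra memory traffic")
```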
Taking all these factors into account, it became clear that a virtual appliance would not provide the most efficient and effective way of leveraging the low latency performance a flash device provides.
FVP uses a kernel module
Instead, the engineers created a kernel module. The Flash Virtualization Platform (FVP) extends the hypervisor by installing a kernel module into the VMkernel. The kernel module “hijacks” the I/O as it comes out of the virtual machine into the hypervisor. Before the I/O goes out to the storage system, FVP determines whether it should be served from the flash device or from the external storage system. As the module is part of the hypervisor, it is as fast as it gets in terms of processing I/O on behalf of the virtual machine. And as it’s a kernel module, there is no need to tweak memory and CPU configurations, and on top of that there is no danger of mistakenly powering down a virtual appliance.
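Conceptually, the module behaves like an I/O filter that decides per request whether flash or the array serves it. The sketch below is not PernixData code; the class name, the dict-backed “devices” and the write-through policy are assumptions used purely to illustrate the idea.

```python
# Conceptual sketch only: the kind of per-request decision an in-kernel
# I/O filter makes. Names and policy are illustrative assumptions.

class FlashAcceleratedDatastore:
    """Intercepts guest I/O and decides whether flash or the array serves it."""

    def __init__(self, flash_cache, backing_array):
        self.flash = flash_cache    # local flash device (dict-like stand-in)
        self.array = backing_array  # external storage system (dict-like stand-in)

    def read(self, lba):
        # Serve from flash when the block is cached; fall back to the array
        # and populate the flash cache on a miss.
        if lba in self.flash:
            return self.flash[lba]
        data = self.array[lba]
        self.flash[lba] = data
        return data

    def write(self, lba, data):
        # Write-through policy for simplicity: land the write on flash and
        # on the array before acknowledging it to the VM.
        self.flash[lba] = data
        self.array[lba] = data


# Minimal usage example with plain dicts standing in for real devices.
ds = FlashAcceleratedDatastore(flash_cache={}, backing_array={0: b"cold block"})
ds.write(1, b"hot block")   # lands on flash and on the array
print(ds.read(0))           # miss: fetched from the array, then cached
print(ds.read(0))           # hit: served from flash
```

In the real platform this decision happens inside the VMkernel data path, which is exactly why the scheduling and sizing concerns described above do not apply.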
Virtual appliance = bad?
Not per se. But when you deal with server-side flash (and memory) in the data path, the problem is that every bit of overhead you introduce in the data path shows up very quickly. Memory operates at nanosecond latencies, flash latency is measured in microseconds, while spindles and virtual appliances deal in milliseconds. Therefore, if your goal is to separate the performance tier from the capacity tier, it makes sense to use an architecture that is suitable for each tier. A great architecture is to marry both solutions: use FVP for the performance tier, while using a virtual appliance for the capacity tier.
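A quick, assumption-based calculation makes the tier argument tangible: a fixed half-millisecond of data-path overhead (an arbitrary figure, chosen only for illustration) is noise for a spindle, painful for flash, and catastrophic for memory.

```python
# Illustrative arithmetic only: the same fixed data-path overhead measured
# against the assumed latency of each tier.

OVERHEAD_US = 500   # assumed extra latency of an appliance-style data path

tiers_us = {"memory": 0.1, "flash": 100.0, "spindle": 10_000.0}

for name, device_us in tiers_us.items():
    total = device_us + OVERHEAD_US
    print(f"{name:>7}: device {device_us:>8.1f} us -> with overhead {total:>8.1f} us "
          f"(+{OVERHEAD_US / device_us * 100:,.0f}%)")
```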