Not all performance tests are created equal

This is the second installment of the series “Evaluating an acceleration platform”. In the previous article I expanded upon the difference between testing application performance on an acceleration platform and testing it on a traditional persistent storage layer. This article is about expectation management. Test results published by consumer-focused online computer magazines and blogs may differ from the results achieved in a virtual infrastructure, and the results of a synthetic test may skew your expectations of the results obtained by the real application workload.

Not all SSD disks are created equal
At PernixData (and VMware will agree when talking about vFRC and vSAN) we recommend flash devices that provide consistent performance, reliability and endurance. Frequently I receive requests to suggest which SSD to buy for PernixData FVP. Although there are many other suitable SSDs, we are really fond of the Intel DC S3700 series.

[Image: 02-00-tweet]

When responding, it’s not uncommon to receive replies that favor other SSDs with a better spec sheet. Although factory specs are nice to glance over, most of them use unrealistic tests to get those out-of-this-world numbers, for example using 512-byte I/O sizes to get an extreme IOPS count. Would it be better to take your favorite tech blog or online magazine’s test for granted? While those are much better than the vendors’ spec sheets, we urge you to consider the following before expecting the same behavior in your virtual infrastructure:

vSphere Storage Stack versus Windows & Linux OS on bare metal
One thing to be careful about is expecting your vSphere environment to hit the same number of IOPS listed on a spec sheet or in a consumer-focused tech blog. The primary audience of these blogs is consumers; therefore they test with a popular guest OS, for example Windows 7 or 8. The ESXi storage stack is completely different.

The storage scheduler stack has to consider many different streams of I/O originating from different sources and going to different destinations. ESXi has advanced queue management and path selection to cope with this requirement. From a hardware perspective, the RAID controller in the ESXi host is typically of a different kind than the one in the average consumer-grade PC used in these consumer-focused tests. And it becomes more complex if you are using a virtual appliance to accelerate your I/O. As an I/O traverses the storage stack it goes through different layers. When entering a new layer, the I/O can be handled by the same control medium or by a different one. When a new control medium takes over, a context switch is generated, and it is common knowledge that context switches are expensive from a performance perspective. Both the storage stack and the CPU scheduler incur overhead because of this.

In essence, reviewing consumer performance tests and expecting the same performance in a vSphere environment is comparing apples with oranges: the device drivers are different, the disk and CPU scheduler algorithms are different, and the command queuing and asynchronous command issue models used by Windows and ESXi are different.

Now I must note that Anandtech sometimes publishes test results with a 32 outstanding I/O workload to simulate a “typical” server workload. When reviewing online published test results, please verify whether they used a queue depth of 1 or a queue depth of 32. Another problem is that these articles gravitate towards the number of IOPS. Is IOPS not important? It certainly is, but the key lies in the I/O size.
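Before moving on to I/O sizes: to give a feel for how much the tested queue depth matters, here is a minimal sketch based on Little’s Law (outstanding I/Os = IOPS x average latency). The 100 µs service time is an illustrative assumption, not a figure taken from any published review.

```python
# Minimal sketch: queue depth vs. achievable IOPS via Little's Law
# (outstanding I/Os = IOPS x average latency). The latency figure is
# an illustrative assumption, not a number from any specific device.

AVG_LATENCY_S = 0.0001  # assume ~100 microseconds per I/O

for queue_depth in (1, 32):
    iops = queue_depth / AVG_LATENCY_S
    print(f"QD={queue_depth:>2}: ~{iops:>9,.0f} IOPS at 100 us per I/O")

# QD= 1:  ~10,000 IOPS
# QD=32: ~320,000 IOPS (assuming latency stays flat, which it rarely does)
```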

I/O sizes
However, these server-based test patterns are no match for a real-life test, where you test your application with a workload that is common to your organization, in your own infrastructure. Besides the mismatch of equipment and software versions, synthetic tests tend to use a single I/O size. Although it is possible to create a synthetic test with multiple I/O sizes and different read and write patterns, it is extremely complex and time consuming to mimic the workloads that are common in your environment. To be honest, have you mapped out the average workload of your applications? And is the observed workload pattern based on hardware limitations or on actual workload demand? Please note that if you use vFRC, understanding I/O size behavior is crucial. vFRC requires you to specify the block size for the VM cache; to ensure the best performance, select the most common block size of the application (a simple way to find it is sketched below). For more info on vFRC, read the vFRC FAQ by Duncan.
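As a starting point for that mapping exercise, here is a minimal sketch of how you could summarize observed I/O sizes and find the most common block size. The sample list is hypothetical; in a real environment you would feed it from a trace, for example a vscsiStats I/O length histogram.

```python
# Minimal sketch: find the most common I/O size in an observed workload,
# for example to pick a vFRC block size. The observed_io_sizes list is
# hypothetical; in practice you would extract it from a trace such as
# vscsiStats output.

from collections import Counter

observed_io_sizes = [4096, 4096, 8192, 4096, 65536, 8192, 4096, 512, 16384, 4096]

histogram = Counter(observed_io_sizes)
most_common_size, count = histogram.most_common(1)[0]

print("I/O size histogram:", dict(histogram))
print(f"Most common I/O size: {most_common_size} bytes "
      f"({count} of {len(observed_io_sizes)} I/Os)")
```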

To be able to easily compare test results, synthetic tests use a single, common block size. That is perfect to demonstrate IOPS. Why the focus on I/O size, when most people talk about IOPS and latency? Well, in my opinion latency is the most useful metric to measure when testing acceleration platforms. In the designs I was involved with, it was all about reducing the time for business reports to generate, reducing the time to log in and retrieve information, or reducing the overall response time of that six-figure mailbox environment.

But let’s get back to the theoretical part. IOPS, throughput, I/O size and latency are closely tied together. In a more mathematical way: Throughput = IOPS x I/O size. Latency indicates how long it takes for a single I/O to complete from an application point of view. The fun thing is that applications use different I/O sizes for their operations, typically between 4KB and 64KB, but the I/O can also be very small (512 bytes) due to a metadata update. Then there are factors such as random versus sequential access, as well as combinations of different block sizes and access patterns in one or multiple threads. And I just highlighted the behavior of a single application; think about how the VMkernel mixes I/O streams due to fairness and locality. (For more information read the extensive post on Disk.SchedNumReqOutstanding.) By using a workload that is common in your environment you get a better understanding of how the acceleration platform benefits your organization.
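To make that relationship concrete, here is a minimal sketch with purely illustrative numbers: the same IOPS figure translates into wildly different amounts of data moved, depending on the I/O size.

```python
# Minimal sketch of Throughput = IOPS x I/O size.
# The IOPS figure and I/O sizes are illustrative, not measured values.

def throughput_mb_s(iops: float, io_size_bytes: int) -> float:
    """Throughput in MB/s for a given IOPS rate and I/O size."""
    return iops * io_size_bytes / (1024 * 1024)

IOPS = 10_000
for io_size in (512, 4096, 65536):
    print(f"{IOPS:,} IOPS at {io_size:>6} B per I/O "
          f"= {throughput_mb_s(IOPS, io_size):>6.1f} MB/s")

# The same 10,000 IOPS can mean ~5 MB/s of metadata updates
# or ~625 MB/s of large writes.
```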

Want proof? A perspective of a PernixData customer using a common workload
The company Pete Koehler works for is PernixData’s first GA customer, and Pete has published some excellent articles about the use of PernixData FVP. In his latest post, “Observations of PernixData FVP in a production environment”, Pete demonstrates the improvements he has seen in one of his most challenging workloads. To illustrate the IOPS, throughput and latency relationship, Pete shows three phases of a code compile run:

[Image: 02-1-IOPS]

In the first phase the application issues a lot of write IOPS (green line), compared to an average of sub-100 IOPS in the second and third phases.

[Image: 02-2-Throughput]

When reviewing the throughput, it shows a modest amount in the first phase, followed by a much higher throughput in the second and third phases. This means that the application has switched from a small I/O size in the first phase to a larger I/O size in the two following phases (Throughput = IOPS x I/O size). How will this impact latency?
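Before looking at the latency chart, here is a rough sketch of that inference. The phase numbers below are purely illustrative, not Pete’s measurements; the point is only that the average I/O size falls out of the two metrics you already have.

```python
# Rough sketch: if IOPS drops while throughput rises, the average I/O size
# must have grown (avg I/O size = throughput / IOPS). Illustrative numbers only.

phases = {
    "phase 1": {"iops": 4000, "throughput_mb_s": 60},
    "phase 2": {"iops": 90, "throughput_mb_s": 45},
}

for name, p in phases.items():
    avg_io_size_kb = p["throughput_mb_s"] * 1024 / p["iops"]
    print(f"{name}: {p['iops']:>5} IOPS, {p['throughput_mb_s']:>3} MB/s "
          f"-> avg I/O size ~{avg_io_size_kb:.0f} KB")

# phase 1: many small I/Os (~15 KB average)
# phase 2: far fewer, much larger I/Os (~512 KB average)
```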

[Image: 02-3-Latency]

The green line shows a low latency: although a lot of IOPS were pushed by the application, the acceleration platform handled them like a charm. When the application switched to a larger I/O size, the system was processing more data, resulting in a higher latency. Probably because the array couldn’t handle all the data or the bandwidth between the host and the array was saturated, but I will let Pete disclose this in his next article.

Map these real-life results to a synthetic test with a uniform I/O size. If we used a small I/O size, the test results would have shown very high IOPS and ultra-low latency, which is useful for the first phase of this workload. But what if you accelerate the workload and check the performance during the second or final phase of the run? The results differ from the initial test due to the difference in I/O size, and you might be disappointed to see the device not living up to its claim of ultra-low latency and high IOPS. As a matter of fact, the acceleration platform is performing just fine and helping you get the best performance the application requires.

Key takeaway
This exercise gave Pete, and hopefully you, a much more adequate view of what to expect when running applications on an acceleration platform. Please read Pete’s complete article, as it contains boatloads of useful information. Therefore we say: test with the applications you use, to understand how much you benefit from your acceleration platform. Use the applications you run in your environment, talk to the application owners and identify the bottlenecks. Benchmark the application before accelerating it and compare the differences. PernixData FVP includes some great performance graphs that help you get insight into the throughput of your system, the IOPS issued and the latency experienced by the applications when accelerated by FVP.
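As a rough illustration of that before/after comparison, here is a minimal sketch. The latency samples are hypothetical; in practice you would export them from your monitoring tooling or from the FVP performance graphs.

```python
# Minimal sketch: comparing application latency before and after acceleration.
# The sample lists below are hypothetical, not measurements from any environment.

import statistics

latency_before_ms = [8.2, 9.1, 7.5, 12.0, 25.3, 8.8, 9.9, 30.1, 7.9, 10.4]
latency_after_ms = [1.1, 0.9, 1.3, 1.0, 2.4, 1.2, 0.8, 3.0, 1.1, 1.0]

def summarize(label, samples):
    p95 = statistics.quantiles(samples, n=20)[18]  # 95th percentile cut point
    print(f"{label}: avg {statistics.mean(samples):.1f} ms, p95 {p95:.1f} ms")

summarize("before acceleration", latency_before_ms)
summarize("after acceleration ", latency_after_ms)
```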

Does this mean that synthetic tests are bad? No, synthetic tests can be very helpful to provide insight into the behavior and characteristics of the flash device: is the device read- or write-biased, and will it provide consistent, predictable and repeatable performance? All this will be covered in the next article of this series.

Previous articles in this series:
