frankdenneman Frank Denneman is the Machine Learning Chief Technologist at VMware. He is an author of the vSphere host and clustering deep dive series, as well as podcast host for the Unexplored Territory podcast. You can follow him on Twitter @frankdenneman

Database workload characteristics and their impact on storage architecture design – part 3 – Ancillary structures for tuning databases

3 min read

Welcome to part 3 of the Database workload characteristics series. Databases are considered to be one of the biggest I/O consumers in the virtual infrastructure. Database operations and database design are a study upon themselves, but I thought it might be interested to take a small peak underneath the surface of database design land. I turned to our resident Database expert Bala Narasimhan, PernixData’s director of products to provide some insights about the database designs and their I/O preferences.
Previous instalments of the series:

  1. Part 1 – Database Structures
  2. Part 2 – Data pipelines
Question 3: You’ve talked about ancillary structures for tuning databases, what are they and what role does FVP play here?

It goes without saying that database performance, whether OLTP or data warehousing, is critical for the business. As a result, DBA use ancillary structures to enhance database performance. Examples of such ancillary structures include indexes and Materialized Views (MV). MV are called Indexed Views on SQL Server.
Index
An index is an ancillary structure that allows a table to be sorted in multiple ways. This helps with quick lookups and operations, such as Merge Joins, that require that the data be sorted. Imagine a table with many columns in it. This table can be sorted in only one way on disk. For example, consider the Customer table shown below
CREATE TABLE Customer (
CustID int,
Name Char(20),
Zipcode int,
PRIMARY KEY (CustID));

The customer ID column, CustID, is the primary key in this table. This means that all customers can be uniquely identified by their CustID value. The table will most probably be sorted on this column.
Now imagine you ran a query that wanted to find out the number of customers in ZIP code 95051. Since the table is not sorted on ZIP code you will need to scan every single row in the table, see whether its ZIP code value is 95051 and add up the number of such rows. This can be extremely expensive. Instead what you can do is build an index on the ZIP code column. This index will be sorted on ZIP code and you can do a potentially faster lookup because of this.
Materialized View
A Materialized View (MV) is different from an index because an MV is a database object that contains the results of a query. If you know that a query will be run repeatedly then you can simply cache the results of that query in an MV and return the results as opposed to running the query itself each time. Example syntax to create an MV is as follows:
CREATE MATERIALIZED VIEW FOO AS SELECT * FROM BAZ WHERE BAZ.id = ‘11’;
In the SQL statement above the materialized view FOO stores the results of the query ‘SELECT * FROM BAZ WHERE BAZ.id = 11’. So, when someone executes the query ‘SELECT * FROM BAZ WHERE BAZ.id = 11’ you can simply return the rows in FOO instead because the results of the query are already saved in FOO. Now, this example is very simplistic but you can imagine that the query can be arbitrarily complex and storing its results in an MV can therefore be hugely beneficial.
Based on this explanation, one thing is apparent about both indexes and MV. Both indexes and MV are not ephemeral structures. This means that both of them need to be persisted on disk just like the tables in a database are. This means they consume disk space but more importantly it means that accessing them requires one to potentially do a lot of I/O. Good storage performance is therefore key to making these ancillary structures do their job.
These ancillary structures also come with a number of limitations. Firstly, they consume a lot of disk space. Sometimes they consume as much space as the underlying table and so it becomes more of an overhead than a benefit. Secondly, especially in the case of the MV, refresh rates make a huge difference. What do I mean by this?
Consider my example MV above. Let’s say that everyday I load new rows into the table BAZ and some of these rows have the value ‘11’ for the column id. In other words, there are new rows being added to BAZ every day where BAZ.id = 11. Once these new rows are added the MV foo has become stale because it is no longer storing the correct rows anymore. So, each time a new row is inserted into BAZ where BAZ.id = 11 not only must we do a write into the base table BAZ but we must also refresh the MV foo that sits on it. One I/O therefore ends up becoming multiple I/O! And, now if someone tries to query the MV foo when it is being refreshed you have all sorts of storage performance problems.
Note that both of these ancillary structures are great if you know what queries you are going to run. If so, you can both create the required indexes and MV. If, however, you run a query that cannot leverage these structures you get severe performance problems. And the truth of the matter is that it is seldom that case that you know all the queries you will run up front. So, ancillary structures can take you only so far sometimes.

How can FVP help?
  • When using server-side resources such as flash and RAM not only will writes to the underlying base table go faster in Write Back mode, but, refreshes on the MV sitting on top of those base tables will go much faster too. This means better query performance, higher concurrency and better scalability.
  • FVP will allow you to run ad-hoc queries at high speed. Even if you cannot leverage the existing indexes or MV for your query, accesses to the underlying base tables will be much faster owing to the fact that FVP will use server side flash and RAM for those base tables.
  • The above point means you do not need to create as many indexes or MV as you used to. This results in both tremendous savings from a storage capacity perspective and from an operational perspective of managing and running the database.

Part 4 coming soon!

frankdenneman Frank Denneman is the Machine Learning Chief Technologist at VMware. He is an author of the vSphere host and clustering deep dive series, as well as podcast host for the Unexplored Territory podcast. You can follow him on Twitter @frankdenneman

Exploring the Core Motivation of Writing a Book

More than a week ago Niels and I released the VMware vSphere 6.5 Host Resources Deep Dive and the community has welcomed it with...
frankdenneman
4 min read