
Database workload characteristics and their impact on storage architecture design – part 2 – Data pipelines

September 24, 2014 by frankdenneman

Welcome to part 2 of the Database workload characteristics series. Databases are considered to be one of the biggest I/O consumers in the virtual infrastructure. Database operations and database design are a study unto themselves, but I thought it might be interesting to take a small peek underneath the surface of database design land. I turned to our resident database expert Bala Narasimhan, PernixData's Director of Products, to provide some insights into database design and its I/O preferences.

Question 2: You mentioned data pipelines in your previous podcast, what do you mean by this?

What I meant by data pipeline is the process by which data flows through the enterprise. Data is not a static entity in the enterprise; it flows through it continuously and at various points is used for different things. As mentioned in part 1 of this series, data usually enters the pipeline via OLTP databases, and this can be from numerous sources. For example, retailers may have Point of Sale (POS) databases that record all transactions (purchases, returns etc.). Similarly, manufacturers may have sensors that continuously send data about the health of their machines to an OLTP database. It is very important that this data enters the system as fast as possible. In addition, these databases must be highly available, support high concurrency and deliver consistent performance. Low latency transactions are the name of the game in this part of the pipeline.
At some point, the business may be interested in analyzing this data to make better decisions. For example, a product manager at the retailer may want to analyze the Point of Sale data to better understand what products are selling at each store and why. In order to do this, he will need to run reports and analytics on the data. But as we discussed earlier, these reports and analytics are usually throughput bound and ad hoc in nature. If we run them on the same OLTP database that is ingesting the low-latency Point of Sale transactions, we will impact the performance of the OLTP database. Since OLTP databases are usually customer facing and interactive, a performance impact can have severe negative outcomes for the business.
As a result, what enterprises usually do is Extract the data from the OLTP database, Transform the data into a new shape and Load it into another database, usually a data warehouse. This is known as the ETL process. To do the ETL, customers use a solution such as Informatica or Hadoop between the OLTP database and the data warehouse. Sometimes customers will simply pull in all the data of the OLTP database (read intensive, larger block size, throughput-sensitive query) and then do the ETL inside the data warehouse itself. Transforming the data into a different shape requires reading the data, modifying it, and writing the data into new tables. You have most probably heard of nightly loads into the data warehouse; that process is what is being referred to.
As we discussed before, OLTP databases may have a normalized schema and the data warehouse may have a more denormalized schema such as a Star schema. As a result, you can’t simply do a nightly load of the data directly from the OLTP database into the data warehouse as is. Instead you have to Extract the data from the OLTP database, Transform it from a normalized schema to a Star schema and then Load it into the data warehouse. This is the data pipeline. Here is an image that explains this:
[Image: ETL data pipeline: Extract from the OLTP database, Transform, and Load into the data warehouse]
In addition, there can also be continuous small feeds of data into the data warehouse by trickle loading small subsets of data, such as the most recent or freshest data. By using the freshest data in your data warehouse you make sure that the reports you run and the analytics you do are not stale, which enables the most accurate decisions.
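To make the Extract-Transform-Load step a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (store, product, pos_transaction, fact_sales) are made up for illustration; a real pipeline reads from the OLTP database and writes into a separate data warehouse, but the read-join-write pattern is the same.

```python
# Minimal ETL sketch using Python's built-in sqlite3 module.
# Table and column names are hypothetical; a real pipeline would read from
# the OLTP database and load into a separate data warehouse.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# "OLTP side": normalized tables fed by the Point of Sale application.
cur.executescript("""
CREATE TABLE store   (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE pos_transaction (
    tx_id      INTEGER PRIMARY KEY,
    store_id   INTEGER REFERENCES store(store_id),
    product_id INTEGER REFERENCES product(product_id),
    tx_date    TEXT,
    amount     REAL
);

-- "Warehouse side": a denormalized fact table shaped for reporting.
CREATE TABLE fact_sales (tx_date TEXT, region TEXT, category TEXT, amount REAL);
""")

cur.execute("INSERT INTO store VALUES (1, 'EMEA'), (2, 'US')")
cur.execute("INSERT INTO product VALUES (10, 'laptop'), (11, 'phone')")
cur.execute("""INSERT INTO pos_transaction VALUES
              (100, 1, 10, '2014-09-23', 899.0),
              (101, 2, 11, '2014-09-23', 499.0)""")

# Extract + Transform + Load in one statement: read the normalized rows,
# join them into the denormalized shape, and load them into the fact table.
# This is the read-intensive, throughput-bound part of the pipeline.
cur.execute("""
INSERT INTO fact_sales (tx_date, region, category, amount)
SELECT t.tx_date, s.region, p.category, t.amount
FROM pos_transaction t
JOIN store   s ON s.store_id   = t.store_id
JOIN product p ON p.product_id = t.product_id
""")
con.commit()
print(cur.execute("SELECT * FROM fact_sales").fetchall())
```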
As mentioned earlier, the ETL process and the data warehouse are typically throughput bound. Server side flash and RAM can play a huge role here because the ETL process and the data warehouse can now leverage the throughput capabilities of these server side resources.

Using PernixData FVP

Some specific, key benefits of using FVP with the data pipeline include:

  • OLTP databases can leverage the low latency characteristics of server side flash & RAM. This means more transactions per second and higher levels of concurrency all while providing protection against data loss via FVP’s write back replication capabilities.
  • Trickle loads of data into the data warehouse will get tremendously faster in Write Back mode because new rows are added to the table as soon as they touch the server side flash or RAM.
  • The reports and analytics may execute joins, aggregations, sorts etc. These require rapid access to large volumes of data and can also generate large intermediate results. High read and write throughput are therefore beneficial, and serving this I/O on the server right next to the database helps performance tremendously. Again, Write Back is a huge win.
  • Analytics can be ad hoc, and any tuning that the DBA has done may not help. Having the base tables on flash via FVP can help performance tremendously for ad-hoc queries.
  • Analytics workloads tend to create and leverage temporary tables within the database. Using server-side resources enhances both read and write performance on these temporary tables.
  • In addition, there is also a huge operational benefit. We can now virtualize the entire data pipeline (OLTP databases, ETL, data warehouse, data marts etc.) because we can provide high and consistent performance via server side resources and FVP. This brings together the best of both worlds: leverage the operational benefits of a virtualization platform, such as vSphere HA, DRS and vMotion, and standardize the entire data pipeline on it without sacrificing performance at all.

Other parts of this series:

  1. Part 1 – Database Structures
  2. Part 3 – Ancillary structures for tuning databases

Filed Under: Miscellaneous

Database workload characteristics and their impact on storage architecture design – part 1

September 23, 2014 by frankdenneman

Frequently PernixData FVP is used to accelerate databases. For many, databases are a black box solution. Sure, we all know they consume resources like there is no tomorrow, but can we make some general statements about database resource consumption from a storage technology perspective? I asked Bala Narasimhan, our Director of Products, a couple of questions to get a better understanding of database operations and how FVP can help to provide the performance the business needs.
The reason I asked Bala about databases is his rich background in database technology. After spending some time at HP writing kernel memory management software, he moved to Oracle, where he was responsible for the SGA and PGA memory management. One of his proudest achievements was building the automatic memory management in Oracle 10g. He then went on to work at a startup where he rewrote the open source database Postgres to be a scale-out, columnar relational database for data warehousing and analytics. Bala recently recorded a webinar on eliminating performance bottlenecks in virtualized databases. Bala's Twitter account can be found here. As databases are an extensive topic, the article is split up into a series of smaller articles, making it more digestible.

Question 1: What are the various databases use cases one typically sees?

There is a spectrum of use cases, with OLTP, Reporting, OLAP and Analytics being the common ones. Reporting, OLAP (online analytical processing) and Analytics can be seen as part of the data warehousing family. OLTP (online transaction processing) databases are typically aligned with a single application and act as an input source for data warehouses. Therefore a data warehouse can be seen as a layer on top of the OLTP database, optimized for reporting and analytics.
When you set up an architecture for databases you have to ask yourself: what are you trying to solve? What is the technical requirement of the workload? Is the application latency sensitive, or is it throughput bound, i.e. do you want to read a lot of data as fast as possible? Going from left to right in the table below, the average block size grows. Hint: a larger average block size generally means you are dealing with a more throughput-bound workload rather than a latency-sensitive one. From left to right, the database design also goes from normalized to denormalized.

OLTP Reporting OLAP Analytics
Database Schema Design

OLTP is an excellent example of a normalized schema. A database schema can be seen as a container of objects; it allows you to logically group objects such as tables, views and stored procedures. When using a normalized schema you start to split a table into smaller tables. For example, let's assume a bank database has only one table that logs all activities by all its customers. This means that there are multiple rows in this table for each customer. Now if a customer updates her address you need to update many rows in the database for the database to be consistent. This can have an impact on the performance and concurrency of the database. Instead of this, you could build out a schema for the database such that there are multiple tables and only one table that holds customer details. This way, when the customer changes her address you only need to update one row in this table, which improves concurrency and performance. If you normalize your database enough, every insert, delete and update statement will only hit a single table: very small updates that require fast responses, therefore small blocks, very latency sensitive.
While OLTP databases tend to be normalized, data warehouses tend to be denormalized and therefore have fewer tables. For example, when querying the database to find out who owns account 1234, it needs to join two tables: the Account table with the Customer table. In this example it is a two-way join, but data warehousing systems may do many-way joins (that is, joining multiple tables at once), and these are generally throughput bound.
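As a rough illustration of the point above, here is a small sqlite3 sketch of the bank example; the schema and values are invented. The address change is a single-row update in the normalized customer table, while answering the ownership question requires a join.

```python
# Sketch of the bank example: a normalized customer/account schema.
# Names and values are made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized: customer details live in exactly one table, so an address
# change is a single-row update (small write, latency sensitive).
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE account  (account_id  INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       balance     REAL);
""")
cur.execute("INSERT INTO customer VALUES (1, 'Alice', 'Old Street 1')")
cur.execute("INSERT INTO account  VALUES (1234, 1, 250.0)")

# The address change touches one row in one table.
cur.execute("UPDATE customer SET address = 'New Street 2' WHERE customer_id = 1")

# Answering "who owns account 1234" requires a two-way join; warehouse
# queries often join many more tables and become throughput bound.
print(cur.execute("""
    SELECT c.name
    FROM account a JOIN customer c ON c.customer_id = a.customer_id
    WHERE a.account_id = 1234
""").fetchone())
```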

Business Processes

An interesting way to look at a database is its place in a business process. This provides you insight into the availability, concurrency and response requirements of the database. Typically OLTP databases are at the front of the process, the customer-facing part; dramatically put, they are in the line of fire. You want fast responses, you want to read, insert and update data as fast as possible, and therefore these databases are heavily normalized for the reasons described above. When the OLTP database is performing slowly or is unavailable, it will typically impact revenue-generating processes. Data warehousing operations generally occur away from customer-facing operations. Data is typically loaded into the data warehouse from multiple sources to provide the business insights into its day-to-day operations. For example, a business may want to understand from its data how it can drive quality and cost improvements. While we talk about a data warehouse as a single entity, this is seldom the case. Many times you will find that a business has one large data warehouse and many so-called 'data marts' that hang off it. Database proliferation is a real problem in the enterprise, and managing all these databases and providing them the storage performance they need can be challenging.
Let’s dive into the four database types to understand their requirements and the impact on architecture design:

OLTP

OLTP workloads have a good mix of read and write operations. They are latency sensitive and require support for high levels of concurrency. A good example of concurrency is ATMs. Each customer at an ATM generates a connection doing a few simple instructions, but a bank typically has a lot of ATMs servicing its many customers concurrently. If a customer wants to withdraw money, the process needs to read the records of the customer in the database, confirm that he or she is allowed to withdraw the money, and then record (write) the transaction. In DBA jargon that is a SQL SELECT statement followed by an UPDATE statement. A proper OLTP database should be able to handle a lot of users at the same time, preferably with low latency. It is interactive in nature, meaning that latency impacts user experience. You cannot keep the customer waiting for a long time at the ATM or at a bank teller. From an availability perspective you cannot afford to have the database go down; the connections cannot be lost, it just needs to be up and running all the time (24×7).
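A minimal sketch of the ATM example, again using sqlite3 as a stand-in for the OLTP database (names and amounts are illustrative): a SELECT to check the balance followed by an UPDATE to record the withdrawal, wrapped in one short transaction.

```python
# ATM-style OLTP sketch: SELECT followed by UPDATE in one short transaction.
# sqlite3 stands in for the OLTP database; names and amounts are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO account VALUES (1234, 250.0)")
con.commit()

def withdraw(conn, account_id, amount):
    # One small transaction per customer interaction: latency matters more
    # than throughput, and many of these run concurrently.
    with conn:
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE account_id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE account_id = ?",
            (amount, account_id),
        )

withdraw(con, 1234, 100.0)
print(con.execute("SELECT balance FROM account WHERE account_id = 1234").fetchone())
```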

                      OLTP      Reporting        OLAP             Analytics
Availability          +++
Concurrency           +++
Latency sensitivity   +++
Throughput oriented   +
Ad hoc                +
I/O Operations        Mix R/W
Reporting

Reporting databases experience predominantly read-intensive operations and require more throughput than anything else. Concurrency and availability are not as important for reporting databases as they are for OLTP. Characteristically, the workload consists of repeated reads of data. Reporting is usually done when users want to understand the performance of the business, for example how many accounts were opened this week, how many accounts were closed, is the private banking account team hitting its quota of acquiring new customers? Think of reporting as predictable requests: the user knows what data he wants to see and has a specific report design that structures the data in the order needed to understand these numbers. This means the report is repetitive, which allows the DBA to design and optimize the database and schema so that the query gets executed predictably and efficiently. The database design can be optimized for this report. Typical database schema designs for reporting include the Star schema and the Snowflake schema.
As it serves back-office processes, availability and concurrency are not strict requirements for this kind of database, as long as the database is available when the report is required. Enhanced throughput helps tremendously.
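A sketch of what such a predictable report query might look like, using sqlite3 and a made-up fact_accounts table: a fixed, read-intensive aggregation that the DBA can lay the schema out for in advance.

```python
# Repetitive reporting query against a hypothetical star-schema fact table:
# large sequential reads aggregated into a fixed, known report shape.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_accounts (open_date TEXT, branch TEXT, team TEXT, opened INTEGER)")
con.executemany("INSERT INTO fact_accounts VALUES (?, ?, ?, ?)", [
    ("2014-09-15", "Amsterdam", "private banking", 3),
    ("2014-09-16", "Utrecht",   "private banking", 1),
    ("2014-09-16", "Amsterdam", "retail",          7),
])

# The report shape is known in advance, so the schema and indexes can be
# optimized for it: a predictable, read-intensive, throughput-bound scan.
report = con.execute("""
    SELECT team, SUM(opened) AS accounts_opened
    FROM fact_accounts
    WHERE open_date BETWEEN '2014-09-15' AND '2014-09-21'
    GROUP BY team
""").fetchall()
print(report)
```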

                      OLTP      Reporting        OLAP             Analytics
Availability          +++       +
Concurrency           +++       +
Latency sensitivity   +++       +
Throughput oriented   +         +++
Ad hoc                +         +
I/O Operations        Mix R/W   Read Intensive
OLAP

OLAP can be seen as the analytical counterpart of OLTP. Where OLTP is the original source of data, OLAP is the consolidation of data, typically originating from various OLTP databases. A common remark made in the database world is that OLAP provides a multi-dimensional view, meaning that you drill down into the data coming from various sources and then analyze it across different attributes. This workload is more ad hoc in nature than reporting, as you slice and dice the data in different ways depending on the nature of the query. The workload is primarily read intensive and can run complex queries involving aggregations of multiple databases, therefore it is throughput oriented. An example of an OLAP query would be the number of additional insurance services gold credit card customers signed up for during the summer months.
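A sketch of an ad-hoc, multi-dimensional OLAP-style query along the lines of that example; the signups table and its contents are invented for illustration.

```python
# Ad-hoc OLAP-style query: slice the (hypothetical) sign-up data by month and
# card type, then aggregate. Read intensive and throughput bound.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE signups (signup_date TEXT, card_type TEXT, service TEXT)")
con.executemany("INSERT INTO signups VALUES (?, ?, ?)", [
    ("2014-06-10", "gold",   "travel insurance"),
    ("2014-07-02", "gold",   "purchase insurance"),
    ("2014-07-15", "silver", "travel insurance"),
])

# "How many additional insurance services did gold card customers sign up for
# during the summer months?": sliced by month, diced by card type.
print(con.execute("""
    SELECT strftime('%m', signup_date) AS month, card_type, COUNT(*) AS signups
    FROM signups
    WHERE signup_date BETWEEN '2014-06-01' AND '2014-08-31'
    GROUP BY month, card_type
""").fetchall())
```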

                      OLTP      Reporting        OLAP             Analytics
Availability          +++       +                +
Concurrency           +++       +                +
Latency sensitivity   +++       +                ++
Throughput oriented   +         +++              +++
Ad hoc                +         +                ++
I/O Operations        Mix R/W   Read Intensive   Read Intensive
Analytics

Analytics workloads are truly ad hoc in nature. Whereas reporting aims to provide perspective on the numbers being presented, analytics provides insight into why the numbers are what they are. Reporting shows how many new accounts were acquired by the private banking account team; analytics aims to explain why the private banking account team did not hit its quota in the last quarter. Analytics can query multiple databases and can involve multi-step processes. Typically, analytics queries write out large temporary results, potentially generating large intermediate results before slicing and dicing the temporary data again. This means the data needs to be stored as fast as possible, and because it is read again for the next query, read performance is crucial as well. The output of one query is the input of the next, and this can happen multiple times, requiring both fast read and write performance, otherwise your query will slow down dramatically.
Another problem is the sort process: for example, you retrieve data that needs to be sorted, but the dataset is so large that you can't hold everything in memory during the sort, resulting in spilling data to disk.
Because analytics queries can be truly ad hoc in nature, it is difficult to design an efficient schema for them upfront. This makes analytics an especially difficult use case from a performance perspective.
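Below is a rough sqlite3 sketch of such a multi-step flow with a temporary table; table names and figures are made up. Step 1 writes an intermediate result, step 2 reads it back, which is why both write and read performance matter.

```python
# Multi-step analytics sketch: materialize an intermediate result in a
# temporary table, then consume it in the next step. Names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_accounts (team TEXT, quarter TEXT, opened INTEGER, target INTEGER)")
con.executemany("INSERT INTO fact_accounts VALUES (?, ?, ?, ?)", [
    ("private banking", "2014Q3", 40, 60),
    ("retail",          "2014Q3", 95, 80),
])

# Step 1: write out the intermediate result (throughput-bound write).
con.execute("""
    CREATE TEMP TABLE quarter_summary AS
    SELECT team, SUM(opened) AS opened, SUM(target) AS target
    FROM fact_accounts
    WHERE quarter = '2014Q3'
    GROUP BY team
""")

# Step 2: the next query reads the temp table back (read performance matters).
print(con.execute("""
    SELECT team, opened, target
    FROM quarter_summary
    WHERE opened < target
""").fetchall())
```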

                      OLTP      Reporting        OLAP             Analytics
Availability          +++       +                +                +
Concurrency           +++       +                +                +
Latency sensitivity   +++       +                ++               +++
Throughput oriented   +         +++              +++              +++
Ad hoc                +         +                ++               +++
I/O Operations        Mix R/W   Read Intensive   Read Intensive   Mix R/W
Designing and testing your storage architecture in line with the DB workload

By having a better grasp of the storage performance requirements of each specific database, you can now design your environment to suit its needs. Understanding these requirements helps you test the infrastructure in a way that is more focused on the expected workload.
Instead of running "your average DB workload" in Iometer, this allows you to test more towards latency-oriented or throughput-oriented workloads once you understand what type of database will be used; a rough example of such profiles is sketched below. The next article in this series dives into understanding whether tuning databases or storage architectures can solve performance problems.
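As an illustration (not part of the original article), the sketch below turns the workload table into synthetic test profiles and renders them as fio job sections. The option names (bs, rw, rwmixread, iodepth, ioengine) are standard fio options; the values are assumed starting points, not tuned recommendations for any specific database or array.

```python
# Illustrative mapping from the workload table above to synthetic I/O test
# profiles, rendered as fio job sections. Values are starting points only.
PROFILES = {
    "oltp":      {"bs": "8k",   "rw": "randrw",   "rwmixread": 70, "iodepth": 8},   # latency sensitive, mixed R/W
    "reporting": {"bs": "64k",  "rw": "read",     "iodepth": 32},                   # read intensive, sequential
    "olap":      {"bs": "64k",  "rw": "randread", "iodepth": 32},                   # read intensive, throughput bound
    "analytics": {"bs": "128k", "rw": "randrw",   "rwmixread": 60, "iodepth": 32},  # large mixed R/W, temp results
}

def fio_job(name: str) -> str:
    """Render one profile as a fio job section."""
    lines = [f"[{name}]", "ioengine=libaio", "direct=1", "runtime=300", "time_based=1"]
    lines += [f"{key}={value}" for key, value in PROFILES[name].items()]
    return "\n".join(lines)

if __name__ == "__main__":
    for profile in PROFILES:
        print(fio_job(profile), end="\n\n")
```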

Other parts of this series

Part 2 – Data pipelines
Part 3 – Ancillary structures for tuning databases

Filed Under: Miscellaneous

Improve public speaking by reading a book?

September 8, 2014 by frankdenneman

Although it sounds like an oxymoron, I do have the feeling that books about this topic can help you become a better public speaker, or, as a matter of fact, more skillful at driving home your message.

After our talk at VMworld a lot of friends complimented me not only on the talk itself but also on the improvements I've made when it comes to public speaking. My first public speaking engagement was VMworld 2010 in Vegas, 8 o'clock Monday morning, for 1200 people. Talk about a challenge! Since then I have been slowly improving my skills. Last year I did more talks than in the previous three years combined. Although Malcolm Gladwell's 10,000-hour rule is heavily debated nowadays, I do believe that practice is by far the best way to improve your skill. By itself, getting 10,000 hours of public speaking time is rather a challenge, and just going through the motions alone would be very inefficient. To maximize efficiency I started to dive into the theory behind public speaking, or even more broadly, the theory of communicating. Over the past year I read a decent stack of books, but these four stood out the most.

1: Confessions of a public speaker by Scott Berkun
Funny and highly practical. If you want to buy only one book, this one should be it. The book helps you with the act of public speaking: how to deal with stage fright, how to work a tough room, and what things to take care of to make your talk go smoothly.

2: Made to Stick by Chip and Dan Heath
This book helps you structure the message you want to convey. It helps you dive into the core of your message and communicate it in a memorable way. It's a great book to read, with lots of interesting stories, and it's one of those books that you should read multiple times to keep refining your skillset.

3: Talk Like TED by Carmine Gallo
To some extent a combination of the first two books. The interesting part is the focus on the listener experience and the listener's capability to stay focused for 18 minutes. In addition, it gives you insights into some of the greatest TED talks.

4: Pitch Perfect by Bill McGowan and Alisa Bowman
This book helps you enhance your communication skills. It dives deeper into verbal and non-verbal language. It helps you become cognizant of some of the mistakes everyone makes yet can avoid quite easily. The book helps you drive your point home in a more confident, persuasive and certain manner.

The beauty of these books is that you can use them and learn from them even if you are not a public speaker. In everyday life we all need to communicate; we all want our ideas to be heard and possibly get buy-in from others. I believe these books will help you achieve this. If you have found other books useful and interesting, please leave a comment.

Filed Under: Miscellaneous

Virtual machines versus Containers who will win?

August 21, 2014 by frankdenneman

Ah, round X in the battle of who will win, which technology will prevail, and when the displacement of technology will happen. Can we stop with this nonsense, with this everlasting tug-of-war mimicking the characteristics of a schoolyard battle? And I can't wait to hear these conversations at VMworld.
In reality there aren’t that many technologies that completely displaced a prevailing technology. We all remember the birth of the CD and the message of revolutionising music carriers. And in a large way it did, yet still there are many people who prefer to listen to vinyl. Experience the subtle sounds of the medium, giving it more warmth and character. The only solution I can think of that displaced the dominant technology was video disc (DVD & Blue Ray) rendering video tape completely obsolete (VHS/Betamax). There isn’t anybody (well let’s only use the subset Sane people) that prefers a good old VHS tape above a Blue ray tape. The dialog of “Nah let’s leave the blue-ray for what it is, and pop in the VHS tape, cause I like to have that blocky grainy experience” will not happen very often I expect. So in reality most technologies coexist in life.
Fast forward to today. Docker's popularity put Linux containers on the map for the majority of the IT population. A lot of people are talking about it and see the merits of leveraging a container instead of using a virtual machine. To me the choice seems to stem from the layer at which you present and manage your services. If your application is designed to provide high availability and scalability, then a container may be the best fit. If your application isn't, then place it in a virtual machine and leverage the services provided by the virtual infrastructure. Sure, there are many other requirements and constraints to incorporate in your decision tree, but I believe the service availability argument should be one of the first steps.
The next step is: where do you want to run your container environment? If you are a VMware shop, are you going to invest time and money to expand your IT services with containers, or are you going to leverage an online PaaS provider? Introducing an apps-centric solution into an organization that has years of experience in managing infrastructure-centric platforms might require a shift of perspective.
Just my two cents.

Filed Under: Miscellaneous

Disable vMotion for a single VM

August 18, 2014 by frankdenneman

This question pops up regularly on the VMTN forums and Reddit. It's a valid question, but the admins who request this feature usually don't want to break Maintenance Mode or any other feature that helps them manage large-scale environments. When you drill down, you discover that they only want to limit the option of a manual vMotion triggered by an administrator.
Instead of configuring complex DRS rules, connecting the VM to a unique portgroup or using bus-sharing configurations, you just have to add an extra permission to the VM.
The key is all about context and permission structures. When executing Maintenance Mode, the move of a virtual machine is done under a different context (System) than when the VM is manually migrated by the administrator. As vCenter honors the most restrictive rule, you can still execute a Maintenance Mode operation on a host while being unable to migrate a specific VM.
Here is how you disable vMotion for a single VM via the Web Client:
Step 1: Add another role; let's call it No-vMotion

  1. Log in as a vCenter administrator
  2. Go to the home screen
  3. Select Roles in the Administration screen
  4. Select Create Role Action (Green plus icon)
  5. Add Role name (No-vMotion)
  6. Select All Privileges
  7. Scroll down to Resource
  8. Deselect the following Privileges:
  • Migrate powered off virtual machine
  • Migrate powered on virtual machine
  • Query vMotion

Edit role No-vMotion
Step 2: Restrict User privilege on VM.

  1. Select “Host and Clusters” or “VMs and Templates” view, the one you feel comfortable with.
  2. Select the VM and click on the Manage tab
  3. Select Permissions
  4. Click on “Add Permissions” (Green plus icon)
  5. Click on Add and select the User or Group who you want to restrict.
  6. In my example I selected the user FrankD and clicked Add and then OK
  7. On the right side of the screen in the pulldown menu select the role “No-vMotion” and click on OK.

2-Add-Permission
Ensure that the role is applied to This object.
3-This-Object
FrankD is a member of the vCenterAdmins group which has Administrator privileges propagated through the virtual datacenter and all its children.
However, FrankD has an additional role on this object: No-vMotion. Let's check if it works. Log in with the user ID you restricted and right-click the VM. As shown, the option Migrate is greyed out. The VM is running on host ESX01.
4-No-Migrate
The option Maintenance Mode is still available for host ESX01.
5-Enter-Maintenance Mode
Click on "More Tasks" in the Recent Tasks window; here you can verify that FrankD initiated the Maintenance Mode operation and that System migrated the virtual machine.
6-Context
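For larger environments you may want to script these two steps instead of clicking through the Web Client. Below is a rough sketch using pyVmomi (the vSphere Python SDK); the vCenter address, credentials, VM name and user are placeholders, certificate handling is left out, and the three privilege IDs (Resource.ColdMigrate, Resource.HotMigrate, Resource.QueryVMotion) correspond to the privileges deselected above. Treat it as a starting point, not a drop-in script.

```python
# Rough pyVmomi sketch of the steps above: create a role that holds every
# privilege except the three vMotion-related ones, then assign it to one user
# on one VM without propagation. Connection details, VM name and user are
# placeholders; SSL certificate handling is omitted for brevity.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local",
                  user="administrator@vsphere.local",
                  pwd="********")
content = si.RetrieveContent()
auth = content.authorizationManager

# Privileges to withhold: Migrate powered off/on virtual machine, Query vMotion.
excluded = {"Resource.ColdMigrate", "Resource.HotMigrate", "Resource.QueryVMotion"}
priv_ids = [p.privId for p in auth.privilegeList if p.privId not in excluded]

# Step 1: add the No-vMotion role with everything except the excluded privileges.
role_id = auth.AddAuthorizationRole(name="No-vMotion", privIds=priv_ids)

# Step 2: apply the role to the user on this VM only (propagate=False).
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "VM01")

perm = vim.AuthorizationManager.Permission(principal="DOMAIN\\FrankD",
                                           group=False,
                                           roleId=role_id,
                                           propagate=False)
auth.SetEntityPermissions(entity=vm, permission=[perm])
Disconnect(si)
```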

Filed Under: vMotion
