WHEN YOUR HOME LAB TURNS INTO A HOME DC
A little bit over a year ago I decide to update my lab and build two servers. My old lab had plenty of compute power, however they were lacking bandwidth, 3 Gbit/s SATA and 1 Gb network bandwidth. I turned to one of the masters of building a home lab, Erik Bussink, and we thought that the following configuration was sufficient to handle my needs. Overview Component Type Cost CPU Intel Xeon E5 1650 v2 540 EUR CPU Cooler Noctua NH-U9DX i4 67 EUR Motherboard SuperMicro X9SRH-7TF 482 EUR Memory Kingston ValueRAM KVR16R11D4/16HA 569 EUR SSD Intel DC 3700 100GB 203 EUR Power Supply Corsair RM550 90 EUR Case Fractal Design Define R4 95 EUR Price per Server (without disks) 1843 EUR The systems are great, but really quickly I started to hit some limitations. Limitations that I have addressed in the last year, and that are interesting enough to share. Adding a third host As FVP is a scale out clustered platform, having two hosts to test with simply just don’t cut it. For big scale out testing I use nested ESXi but to do simple tests I just needed one more host. The challenged I faced was the dilemma of investing in “old” tech or going with new hardware. Intel updated their Xeon line to version 3, the Intel Xeon v3 has more cache (from 12MB to 15MB) more memory bandwidth, increased max memory support up to 768GB and uses DDR4 memory. (Intel ark comparison) New shiny hardware might be better, but the main goal is to expand my cluster and one of the things I believe in is a uniform host configuration within a cluster. Time is a precious resource and the last thing I want to spend time on is to troubleshoot behavior that is caused by using non-uniform hardware. You might win some time by having a little bit more cache, more memory bandwidth but once you need to troubleshoot weird behavior you lose a lot more. A dilemma is not a dilemma if you go back and forth between the options, thus I researched if it was possible. The most predominant one is the change in CPU microarchitecture. The v3 is part of the new Intel Haswell microarchitecture. The Xeon v2 is build upon the Ivy Bridge. That means that the cluster has to run in EVC mode Ivy Bridge. The EVC dialog box of the cluster indicates that Haswell chips are supported in this EVC mode, thus DRS functionality remains available if I go for the Haswell chip The Haswell chip uses DDR4 memory and that means different memory timings and different memory bandwidth. FVP can use memory as a storage I/O acceleration resource and a lot of testing will be done with memory. That means that applications can behave differently when FVP decides to replicate fault tolerant writes to the DDR4 host or vice versa. In itself it’s a very interesting test, thus again another dilemma is faced. However, these tests are quite unique and I rather have uniform performance across the cluster and avoid any troubleshooting behavior due to hardware disparity. Due to the difference in memory type, a new Motherboard is required too. That meant that I have to find a motherboard that contains the same chipsets and network configuration. The SuperMicro X9SRH-7TF rocks. Onboard 10 GbE is excellent. Some other users in the community have reported overheating problems, Erik Bussink was hit hard by the overheat problem and bought another board just to get rid of weird errors caused by the overheating. That by itself made me wonder if I would buy another X9SRH-7TF or go for a new Supermicro board and buy a separate Intel X540-T2 dual port 10GbE NIC to get the same connectivity levels. After weighing the pros and cons I decided to go for the uniform cluster configuration. Primarily because testing and understanding software behavior is hard enough. Second-guessing whether behavior is caused by the hardware disparity is a time sucking beast and even worse, it typically kills a lot of joy in your work. Contrary to popular belief, prices of older hardware does not decline forever, due to availability of newer hardware and remaining stock, prices go up. The third host was almost 500 Euro’s, more expensive than the previous price I had to pay. Networking Networking is interesting as I changed a lot during the last year. The hosts are now equipped with an Intel PRO 1000 PT dual ports with the 82571 chip. Contrary to my initial post these are supported by vSphere 5.5. However network behavior is a large part of understanding scale out architectures, thus more NIC ports are needed. An additional HP NC365T Quad-port Ethernet Server Adapter was placed in each server. The HP NIC is based on the Intel 82580 chipset but is a lot cheaper than buying Intel branded cards. Each host has one NIC dedicated to IPMI, 2 10GbE ports and 6 1GbE. In hindsight, I would rather go for two Quad NIC cards as it allows me to setup different network configurations without having to tear them down each time. With the introduction of the third host I had to buy a 10GbE switch. The two host were directly connected to each other, however this configuration is not possible with three hosts. Thus I had to look for a nice cheap 10GbE switch that doesn’t break the bank and is quiet. Most 10GbE switches are made for the data center where noise isn’t really a big issue. My home lab is located in my home office, spending most of my day with something that sounds like a jet plane is not my idea of fun. The NETGEAR ProSafe Plus XS708E 8-port 10-Gigabit fit most of my needs. 8 ports for less than 900 euro’s, it’s kind of a steal compared to the alternatives. However I wasn’t really impressed by the noise levels (and spending 900 euro’s but that’s a different story). Again my main go-to-guy for all hardware related questions Erik Bussink provided the solution, the Noctua NF-A4-x10 FLX coolers. Designed to fit into 1U boxes they were perfect. But as you can see the design of the Netgear is a bit weird. The coolers are positioned at the far end of the PCB with all the heatsinks. When the switch is properly loaded, the thing emits a lot of heat. Regardless of what type of internal fan is used. To avoid heat buildup in the switch I used simple physics, but I will come to that later. Now having three hosts with seven 1 GbE connections, two storage systems eating up 3 ports and an uplink to the rest of the network I needed a proper switch. Lessons learned in that area, research thoroughly before pressing the buy button. I started of with buying an HP 1810-24G v2 switch. Silent, 24 ports, VLAN support. Awesome! No not awesome because it couldn’t route VLANs. And to the observant reader, 25 ports required, 24 ports offered. A VCDX’esque-like constraint. To work around the 24 ports limitation I changed my network design and wrote some scripts to build and tear down different network configurations. Not optimal, but dealing with home labs is almost like the real world. While testing network behavior and hitting the VMkernel network stack routing problem I decided it was time to upgrade my network with some proper equipment. I asked around in the community and a lot where using the Cisco SG300 series switch. Craig Kilborn on twitter blogged about his HP v1910 24G and told me that it was quite noisy. A Noctua hack might do the trick, but I actually wanted some more ports than 24. Erik pointed out the Cisco SG500-28-K9-G5 switches that are stackable and fanless. Perfect! I could finally use all the NICs in my servers and have room for some expansion. Time for a new rack So from this point on I have three 19” sized switches, the IKEA lack hack table was nice, but these babies deserved better. The third host didn’t fit the table therefor new furniture had to be bought anyways. After spending countless of hours looking at 19” racks I came across a 6U Patch case. This case had a lockable glass door (kids) and removable side panels, perfect for my little physics experiment. Just place the case in an upright position, remove the side panels and let the heat escape from the top. The fans will suck in “cold” air from the bottom. The dimensions of the patch case were perfect as it fitted exactly in my setup. The case is an Alfaco 19-6406. But with this networking equipment I’m feeling that my home lab is slowly turning into a #HomeDC. With all this compute and network power I wanted to see what you can do when you have enterprise grade flash devices. I’m already using the Intel DC S3700 SSD’s and I’m very impressed by their consistent high performance. However Intel has released the Intel SSD DC p3700 PCIe card that use NVMe. I turned to Intel and they were so generous of loaning me three of these beasts for a couple of months. The results are extremely impressive, soon I will post some cool test results, but imagine seeing more than 500.000 IOPS in your homeDC on a daily basis. Management server To keep the power bill as low as possible, all three hosts are shutdown after testing, but I would like to have the basic management VMs running. In order to do this, I used a Mac Mini. William wrote extensively about how to install ESXi on a Mac, if you are interested I would recommend to check out his work: http://www.virtuallyghetto.com/apple. Unfortunately 16Gb is quite limited when you are running three windows VMs with SQL DB’s, therefor I might expand my management cluster by adding another Mac Mini. Time to find me some additional sponsors. :)
BALLOONING, QUEUE DEPTHS AND OTHER BACK PRESSURE FEATURES REVISITED
Recently I’ve been involved in a couple conversations about ballooning, QoS and queue depths. Remarks like ballooning is bad, increase the queue depths and use QoS are just the sound bits that spark the conversation. What I learn from these conversations is that it seems we might have lost track of the original intention of these features. Hypervisor resource management Features such as ballooning and queue depths are invented to solve the gap between resource demand and resource availability. When a system experiences a state where resource demand exceeds the resources it controls the system has a problem. This is especially true in systems such as a hypervisor where you cannot control the demand directly. A guest operating system or an application in a virtual machine can demand a lot of resources. The resource management schedulers inside the hypervisor are tasked to fulfilling the demand of that particular machine while at the same time satisfy the resource demand of other virtual machines. Typically the guest OS resource schedulers are not integrated with the hypervisor resource schedulers, this can lead to a situation in which the administrator typically resorts to taking draconian measures. Measures such as disabling ballooning, increasing the queue-depth to become the digital equivalent of the Mariana trench. Sometimes it is taken for granted, but resource management inside the hypervisor is actually a though challenge to solve. Much research is done on solving this problem; a lot of research papers trying to find an answer to this challenge are published on a monthly basis. Let’s step back and take a look from an engineer perspective (or should I say developer?) and see what the problem is and how to solve it in the most elegant way. It’s an architect job to understand that this functionality is not a replacement for a proper design. Let’s start by understanding the problem. Load-shedding or back pressure When dealing with the situation where resource demand exceeds resource availability you can do two things. Well if you don’t do anything, it’s likely to encounter a system failure that can affect a lot more than only that particular resource or virtual machines. Overall you don’t design for system failure, you want to avoid it and to do so you can either drop the load or apply some form of back pressure. I think we all agree that dropping load, sometimes referred to as load-shedding is not the most elegant way of dealing with temporary overload, that’s why a lot of effort is going into back pressure features. A back pressure feature that everyone is familiar with is the memory balloon driver. Guest OS memory schedulers deal with used and free memory in such a way that this is transparent to the hypervisor. When the hypervisor is running out of physical machine memory it needs to figure out a way to retrieve memory. By using the balloon driver, the hypervisor asks the guest OS memory scheduler to provide a list of pages that it doesn’t use or doesn’t deem as important. After getting the info, the hypervisor proceeds to free up the physical memory pages to be able to satisfy incoming memory requests. Instead of dropping the new incoming workload it applies a back pressure system in the most elegant way. I don’t know why people are still talking about ballooning as bad. The feature is awesome, it’s the architect / sys admin job to come up with a plan to avoid back pressure in the system. Again the back pressure feature is not substitute for proper design and management. But the most misunderstood back pressure feature could be queue-depths. Sometimes I hear people refer to queue depths as a performance enhancement. And this is not true. It just allows you to temporarily deal with I/O overload. The best way to have a clear understanding of queue depths is to use the bathroom sink analogy. The drain is the equivalent of the data path leading to the storage array, the sink itself is the queue sitting on top of the data path / drain. The faucet represents the virtual machine workloads. Typically you open up the faucet to a level that allows the drain to cope with the flow of water. Same applies to virtual machine workloads and the underlying storage system. You run an x amount of workload that is suitable for your storage system. The moment you open up the faucet more your sink will fill up and at one point your sink will overflow. Thus you have to do some back pressure mechanism. In the bathroom sink world this typically is done by flowing the water back into the second sink. In the I/O scheduler world this typically resolves in a queue full statement. This typically bogs down the performance of the virtual machine so much that many admins/architect resolve by increasing the queue depth. Because this allows them to avoid the queue full state (temporarily) But in essence you just replaced your bathroom sink by a bigger sink, or something when people go overboard the increase the queue depth to the digital equivalent of a full size bathtub. This bathtub impacts a lot of other workloads as many workload now end up at the top of the queue instead of the deeper part, waiting their turn to go through the drain to the storage system. Result: latency increases in all applications due to improper designed systems. And remember when the bathtub overflows you typically have a bigger mess to deal with. Back pressure features are not a substitute for proper design, therefor think about implementing a bigger drain of even better multiple drains. More bandwidth or just more data paths to the same storage system lead to a short delay of seeing the same back pressure problem again, it just occurs on a different level. Typically when a storage controller fills up its cache, it sends a queue full to all the connected systems, so the problem has now evolved from a system wide problem to a cluster wide problem. This is one of the big reasons why scale out storage systems are a great fit in the virtual datacenter. You create drains to different sewer system typically in a more plan-able manner. If you are looking for more information about this topic, I published a short series on the challenge of traditional storage architectures in virtual datacenters. Quality of Service faces the same predicament. QoS is a great feature dealing with temporary overload, but again, it is not a substitute for a proper design. Back pressure features are great, it allows the system to deal with resource contentions while avoiding system failures. These features are unmissable in the dynamic virtual datacenter of today. When detecting that these features are activated on a frequent basis, one must review the virtual datacenter architecture, the current workloads and future workloads. I think overall it all boils down to understand the workload in your system and have an accurate view of the capabilities of your systems. Proper monitoring and analytics tools are therefore indispensable. Not only for daily operations but also for architects dealing with maintaining a proper service level for their current workloads while architecting an environment that can deal with unknown future workloads.
VMWARE TOOLS IS OUT OF DATE ON THIS VIRTUAL MACHINE WHILE SUMMARY STATES CURRENT
For some apparent reason all my virtual machines show an alert that the VMware tools is out of date. While the summary states that its running the current version When trying to upgrade the VMware tools, all options are grayed out: It appears to be a cosmetic error but I stil wanted to know why it shows the alert on my virtual machines. As it turns out I created my templates on my management cluster host (lab) and that host runs a newer version of vSphere 5.5 (2068190) My workload cluster host run a slightly older version of vSphere 5.5 and it uses a different version of VMtools. The VMtools version list helps you to identify which version of VMtools installed in your virtual machine belongs to which ESXi version: https://packages.vmware.com/tools/versions Hope this clarifies this weird behavior of the UI for some.
HAMMER, MAGICBANDS AND CHALLENGING STATUS QUO
One of my favorite business books is the famous book of Michael Hammer “Reengineering the Corporation”. The theme is the book is to take a hard look at business processes and radically change these old and existing processes. Hammer states that typically companies sped up their processes by implementing a newer iteration of existing technology. Many processes are dated before the advent of the computers and just by automating the process it can only optimizes performance marginally. Embedding computers in the archaic processes cannot address their fundamental performance challenges. By understanding what the process is trying to achieve one can break away from the existing design principles of the process. One great example is one of my favorite technologies is the Disney’s MagicBand and the way it’s used to radically change processes. [caption id=“attachment_5292” align=“aligncenter” width=“480”] Photo by Adam Voorhes / wired.com[/caption] Disney’s MagicBand The MagicBand is what is commonly referred to as a wearable. Inside the bracelet are a RFID chip and a radio. The parks have long range and short-range scanners along with sensors to interact with the MagicBand bracelet. One of the perks of my job is to speak to people who deliver cutting edge technology and as you can imagine I was very thrilled to speak to some of the team members who worked on the MagicBand platform. In essence the Disney MagicBand replaces every transaction between the customer and cast members or Disney parks and resorts. The MagicBand becomes your key to anything. It allows access to the park, access to the resorts and automatic payment. Its goal is to create a frictionless experience for the customer, increasing the satisfaction, which of course will increase spending. Instead of tinkering with existing processes Disney overhauled a lot of processes and many more are to follow. A great example is the overhaul of the check-in process and how it’s completely inline with the Hammer doctrine. Instead of buying newer faster desktop computers to speed up the check-in process, customers can go directly to their hotel room, bypassing the check in process all together. Disney’s sends the MagicBand to your home and prior to your visit the resort informs you via email which room is yours for your stay. Just walk up to your room and unlock the door by taping your MagicBand against the sensor on the door. MagicBands allows Disney to get rid of the ancient turnstiles that is the first port of anxiety of new parents and their strollers, instead of being funnelled into narrow cramped isles the entrance is now in the shape of a inviting V shape form with incredible process speeds. Just hold your MagicBand to the access point and wait to be greeted by an welcoming green glow, from then on its off to your favorite experience. The MagicBand platform allows for new experiences as well. What about having a more personalized interaction with cast members? What if Cinderella greets your daughter by name and tells her that she knows she is her favorite princess or that she wishes her a happy birthday? Just mind-blowing and an experience she will never forget. Range scanners, sensors, WIFI, smart phones apps, user profiles and an insane amount of data crunching make this happen. Customers use the smart phone app to access their schedules and their user profiles. Countless short and long-range sensors scattered across the park pick up the signals of the bracelets. These systems are connected to each other, they collect data, and they use the captured data to optimize the experience of the customers. Data on visitors traffic flow, food orders and waiting times can be used to realign internal resources. Ever heard about Internet of things? This is the poster child of Internet of things. Stop! Hammer Time Circling back to the opening statement, using newer iteration of existing technology only provide marginally performance increases. Advances like the MagicBand require new technologies and new ways to operationalize these technologies in datacenters. Not every company is of the same size as Disney, but one thing is certain, most companies face the same challenge Disney has. How to reduce cost, increase efficiency and provide a new experience that makes them unique in a highly competitive market? A lot of brilliant people are trying to solve these problems by creating technical solutions and it’s up to the IT team to understand if these suit their operation models. How can you reengineer your corporation and create a new service offering while your IT is stuck in the past? Stuck using systems that are designed to work in infrastructures dating back to the early ‘70’s? Where they just found out that the market was bigger than 5 computers? Everything has to align, when the business changes it models, the IT team should not take anything for granted, they too need to aim for quantum leaps of performance. Markets shift rapidly; IT needs to be able to respond almost in a way that anticipates their needs. Maybe even in a way that it doesn’t feel remarkable at all, high service standards are the norm! Within the realm of virtualized datacenters two technology advancements can create an experience that might provide a ubiquitous experience to the business but provide the magic to the IT team; Scale out storage and object based storage. Scale out storage and object based storage Proper Scale out storage systems allows you to operationalize new advancement in storage technology as soon as they are available. They allow virtual datacenters to cater to any performance requirement possible any time. While, and this is very important, without impact current workloads that are using the same platform. Many application vendors move away from a monolithic application architecture, why keep holding on to the relics of the past by using a monolithic storage architecture for performance requirements? Object based storage, such as VMware VVOLs, allows you to fundamentally change how to provide data services to systems (virtual machines). Instead of creating management constructs and aligning data services to these logical layers (LUNs and datastores), data services can be directly applied to the specific machine. Virtual machines become first class citizens on storage systems, allowing IT teams to cater to requirements that both affect the business as well as the IT team requirements. Challenge status quo Hammer stated don’t ask, “How can we do what we do faster?” But ask; Why do we do it the way we do?” In essence, challenge status quo if you want to keep on moving forward in a time that introduces new technologies and application landscapes on a daily basis!
DB DEEPDIVE PART 6: QUERY PLANS, INTERMEDIATE RESULTS, TEMPDB AND STORAGE PERFORMANCE
Welcome to part 6 of the Database workload characteristics series. Databases are considered to be one of the biggest I/O consumers in the virtual infrastructure. Database operations and database design are a study upon themselves, but I thought it might be interested to take a small peak underneath the surface of database design land. I turned to our resident Database expert Bala Narasimhan, PernixData’s VP of products to provide some insights about the database designs and their I/O preferences. Previous instalments of the series: Part 1 - Database Structures Part 2 – Data pipelines Part 3 - Ancillary structures for tuning databases Part 4 - NoSQL platforms Part 5 - Query Execution Plans In a previous article I introduced the database query optimizer and described how it works. I then used a TPC-H like query and the SQL Server database to explain how to understand the storage requirements of a query via the query optimizer. In today’s article we will deep dive into a specific aspect of query execution that severely impacts storage performance; namely intermediate results processing. For today’s discussion I will use the query optimizer within the PostgreSQL database. The reason I do this is because I want to show you that these problems are not database specific. Instead, they are storage performance problems that all databases run into. In the process I hope to make the point that these storage performance problems are best solved at the infrastructure level as opposed to doing proprietary infrastructure tweaks or rewrites within the database. After a tour of the PostgreSQL optimizer we will go back to SQL Server and talk about a persistent problem regarding intermediate results processing in SQL Server; namely tempdb. We’ll discuss how users have tried to overcome tempdb performance problems to date and introduce a better way. What are intermediate results? Databases perform many different operations such as sorts, aggregations and joins. To the extent possible a database will perform these operations in RAM. Many times the data sets are large enough and the amount of RAM available is limited enough that these operations won’t fully fit in RAM. When this happens these operations will be forced to spill to disk. The data sets that are written to and subsequently read from disk as part of executing operations such as sorts, joins and aggregations are called intermediate results. In today’s article we will use sorting as an example to drive home the point that storage performance is a key requirement for managing intermediate results processing. The use case For today’s example we will use a table called BANK that has two columns ACCTNUM and BALANCE. This table tracks the account numbers in a bank and the balance within each account. The table is created as shown below: Create Table BANK (AcctNum int, Balance int); The query we are going to analyze is one that computes the number of accounts that have a given balance and then provides this information in ascending order by balance. This query is written in SQL as follows: Select count(AcctNum), Balance from BANK GROUP BY Balance ORDER BY Balance; The ORDER BY clause is what will force a sort operation in this query. Specifically we will be sorting on the Balance column. I used the PostgreSQL database to run this query. I loaded approximately 230 million rows into the BANK table. I made sure that the cardinality of the Balance column is very high. Below I have a screenshot from the PostgreSQL optimizer for this query. Note that the query will do a disk based merge sort and will consume approximately 4 GB of disk space to do this sort. A good chunk of the query execution time was spent in the sort operation. A disk-based sort, and other database operations that generate intermediate results, is characterized by large writes of intermediate results followed by reads of those results for further processing. IOPS is therefore a key requirement. What is especially excruciating about the sort operation is that it is a materialization point. What this means is that the query cannot make progress until the sort is finished. You’ve essentially bottlenecked the entire query on the sort and the intermediate results it is processing. There is no better validation of the fact that storage performance is a huge impediment for good query times. What is tempdb? tempdb is a system database within SQL Server that is used for a number of reasons including the processing of intermediate results. This means that if we run the query above against SQL Server the sorting operation will spill intermediate results into tempdb as part of processing. It is no surprise then that tempdb performance is a serious consideration in SQL Server environments. You can read more about tempdb here. How do users manage storage performance for intermediate results including tempdb? Over the last couple of years I’ve talked to a number of SQL Server users about tempdb performance. This is a sore point as far as SQL Server performance goes. One thing I’ve seen customers do to remediate the tempdb performance problem is to host tempdb alone in arrays that have fast media, such as flash, in them in the form of either hybrid arrays or All Flash Arrays (AFA). The thought process is that while the ‘fast array’ is too expensive to standardize on, it makes sense to carve out tempdb alone from it. In this manner, customers look at the ‘fast array’ as a performance band aid for tempdb issues. On the surface this makes sense since an AFA or a hybrid array can provide a performance boost for tempdb. Yet it comes with several challenges. Here are a few:
DATABASE WORKLOAD CHARACTERISTICS AND THEIR IMPACT ON STORAGE ARCHITECTURE DESIGN – PART 5 - QUERY EXECUTION PLANS
Welcome to part 5 of the Database workload characteristics series. Databases are considered to be one of the biggest I/O consumers in the virtual infrastructure. Database operations and database design are a study upon themselves, but I thought it might be interested to take a small peak underneath the surface of database design land. I turned to our resident Database expert Bala Narasimhan, PernixData’s VP of products to provide some insights about the database designs and their I/O preferences. Previous instalments of the series: Part 1 - Database Structures Part 2 – Data pipelines Part 3 - Ancillary structures for tuning databases Part 4 - NoSQL platforms Databases are a critical application for the enterprise and usually have demanding storage performance requirements. In this blog post I will describe how to understand the storage performance requirements of a database at the query level using database tools. I’ll then explain why PernixData FVP helps not only to solve the database storage performance problem but also the database manageability problem that manifests itself when storage performance becomes a bottleneck. Throughout the discussion I will use SQL Server as an example database although the principles apply across the board. Query Execution Plans When writing code in a language such as C++ one describes the algorithm one wants to execute. For example, implementing a sorting algorithm in C++ means describing the control flow involved in that particular implementation of sorting. This will be different in a bubble sort implementation versus a merge sort implementation and the onus is on the programmer to implement the control flow for each sort algorithm correctly. In contrast, SQL is a declarative language. SQL statements simply describe what the end user wants to do. The control flow is something the database decides. For example, when joining two tables the database decides whether to execute a hash join, a merge join or a nested loop join. The user doesn’t decide this. The user simply executes a SQL statement that performs a join of two tables without any mention of the actual join algorithm to use. The component within the database that comes up with the plan on how to execute the SQL statement is usually called the query optimizer. The query optimizer searches the entire space of possible execution plans for a given SQL statement and tries to pick the optimal one. As you can imagine this problem of picking the most optimal plan out of all possible plans can be computationally intensive. SQL’s declarative nature can be sub-optimal for query performance because the query optimizer might not always pick the best possible query plan. This is usually because it doesn’t have full information regarding a number of critical components such as the kind of infrastructure in place, the load on the system when the SQL statement is run or the properties of the data. . One example of where this can manifest is called Join Ordering. Suppose you run a SQL query that joins three tables T1, T2, and T3. What order will you join these tables in? Will you join T1 and T2 first or will you join T1 and T3 first? Maybe you should join T2 and T3 first instead. Picking the wrong order can be hugely detrimental for query performance. This means that database users and DBAs usually end up tuning databases extensively. In turn this adds both an operational and a cost overhead. Query Optimization in Action Let’s take a concrete example to better understand query optimization. Below is a SQL statement from a TPC-H like benchmark. select top 20 c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment from customer, orders, lineitem, nation where c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= ':1' and o_orderdate < dateadd(mm,3,cast(':1'as datetime)) and l_returnflag = 'R' and c_nationkey = n_nationkey group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment order by revenue; The SQL statement finds the top 20 customers, in terms of their effect on lost revenue for a given quarter, who have returned parts they bought. Before you run this query against your database you can find out what query plan the optimizer is going to choose and how much it is going to cost you. Figure 1 depicts the query plan for this SQL statement from SQL Server 2014 [You can learn how to generate a query plan for any SQL statement on SQL Server at https://msdn.microsoft.com/en-us/library/ms191194.aspx. You should read the query plan from right to left. The direction of the arrow depicts the flow of control as the query executes. Each node in the plan is an operation that the database will perform in order to execute the query. You’ll notice how this query starts off with two Scans. These are I/O operations (scans) from the tables involved in the query. These scans are I/O intensive and are usually throughput bound. In data warehousing environments block sizes could be pretty large as well. A SAN will have serious performance problems with these scans. If the data is not laid out properly on disk, you may end up with a large number of random I/O. You will also get inconsistent performance depending on what else is going on in the SAN when these scans are happening. The controller will also limit overall performance. The query begins by performing scans on the lineitem table and the orders table. Note that the database is telling what percentage of time it thinks it will spend in each operation within the statement. In our example, the database thinks that it will spend about 84% of the total execution time on the Clustered Index Scan on lineitem and 5% on the other. In other words, 89% of the execution time of this SQL statement is spent in I/O operations! It is no wonder then that users are wary of virtualizing databases such as these. You can get even more granular information from the query optimizer. In SQL Server Management Studio, if you hover your mouse over a particular operation a yellow pop up box will appear showing very interesting statistics. Below is an example of data I got from SQL Server 2014 when I hovered over the Clustered Index Scan on the lineitem able that is highlighted in Figure 1. Notice how Estimated I/O cost dominates over Estimated CPU cost. This again is an indication of how I/O bound this SQL statement is. You can learn more about the fields in the figure above here. An Operational Overhead There is a lot one can learn about one’s infrastructure needs by understanding the query execution plans that a database generates. A typical next step after understanding the query execution plans is to tune the query or database for better performance. For example, one may build new indexes or completely rewrite a query for better performance. One may decide that certain tables are frequently hit and should be stored on faster storage or pinned in RAM. Or, one may decide to simply do a complete infrastructure rehaul. All of these result in operational overheads for the enterprise. For starters, this model assumes someone is constantly evaluating queries, tuning the database and making sure performance isn’t impacted. Secondly, this model assumes a static environment. It assumes that the database schema is fixed, it assumes that all the queries that will be run are known before hand and that someone is always at hand to study the query and tune the database. That’s a lot of rigidity in this day and age where flexibility and agility are key requirements for the business to stay ahead. A solution to database performance needs without the operational overhead What if we could build out a storage performance platform that satisfies the performance requirements of the database irrespective of whether query plans are optimal, whether the schema design is appropriate or whether queries are ad-hoc or not? One imagines such a storage performance platform will completely take away the sometimes excessive tuning required to achieve acceptable query performance. The platform results in an environment where SQL is executed as needed by the business and the storage performance platform provides the required performance to meet the business SLA irrespective of query plans. This is exactly what PernixData FVP is designed to do. PernixData FVP decouples storage performance from storage capacity by building a server side performance tier using server side flash or RAM. What this means is that all the active I/O coming from the database, both reads and writes, whether sequential or random, and irrespective of block size is satisfied at the server layer by FVP right next to the database. You are longer limited by how data is laid out on the SAN, or the controller within the SAN or what else is running on the SAN when the SQL is executed. This means that even if the query optimizer generates a sub optimal query plan resulting in excessive I/O we are still okay because all of that I/O will be served from server side RAM or flash instead of network attached storage. In a future blog post we will look at a query that generates large intermediate results and explain why a server side performance platform such as FVP can make a huge difference. Post originally appeared on ToddMace.io
DON’T BACKUP. GO FORWARD WITH RUBRIK
Rubrik has set out to build a time machine for cloud infrastructures. I like the message as it shows that they are focused on bringing simplicity to the enterprise backup world. Last week I had the opportunity to catch up with them and they had some great news to share as they were planning to come out of stealth this week. And that day is today. Rubrik platform This time machine is delivered on a 2U commodity appliance that runs the Rubrik software. By installing this appliance you greatly reduce the number of machines that are necessary to provide backup and restore services today. Reducing the number of machines simplifies the infrastructure for architects and support, while the UI of Rubrik simplifies the day-to-day operations of the administrators. User Interface No agents are needed in the virtual datacenters to discover the workload and the user interface is centered on policy driven SLAs. Unfortunately I can’t show the user-interface, but trust me this is something you longed for a long time. Due to the pedigree of the co-founders it comes as no surprise that the Rubrik platform is fully programmable with REST API’s. Typically moving to a new backup system introduces risk and cost. Learning curves are high, misconfigured backup configurations possibly risking data loss. Policy driven and the ability to use REST APIs ensure that the platform easily integrates in every environment. The policies are so easy to use that no training is necessary; this reduces the impact of transition to a new backup system. The low learning curve means that no countless hours are lost by figuring out how to safely backup your data, while the REST APIs allow advanced tech crews to integrate Rubrik in their highly automated service offerings. Architecture One thing that made me very happy to see is the Rubriks’ ability to “cloud-out” your data. Rubrik provides a gateway to AWS allowing you to send “aged data” to the cloud in a very secure way. This feature benefits the complexity reduction of local architecture. Instead of having to incorporate a tape library, you now only need an Internet connection. Having worked with a big tape libraries myself for years I know this will not only bring a lot of datacenter space back and reduce your energy bill, you WILL have way less heat to cool. As the team understand the concept of distributed architectures thoroughly (more about that in the next paragraph) it doesn’t come as a surprise that it scales very well. The architecture can scale to 1000s of nodes. What’s interesting is that it can mount the snapshots directly on the Rubrik platform allowing virtual machines to run directly on the appliance. Think about the possibilities for development. Snapshot your current production workload and test your new code instantly without any impact on active services. Rubrik starts off by supporting VMware vSphere and it makes sense to focus on the biggest market out there as a startup. But support for other hypervisors and cloud infrastructures (to cloud-out data) will follow. I expect Rubrik to become a success, the product aligns with the todays enterprise datacenter requirements and the pedigree of the team is amazing. As mentioned before the co-founders have a very rich background in distributed systems, Arvind Jain was the founding Engineer of Riverbed and was a Distinguished Engineer at Google before co-founding the other three members. Interesting enough (for me at least) there are stong ties with PernixData. Prior to Rubrik, Bipul Sinha was a partner at LightSpeed before founding Rubrik. I had the great pleasure of talking to Bipul often, as he is the initial investor of PernixData. Funny enough I can recall a conversation where Bipul asked me on my view of the backup world. I believe boring and totally not sexy was my initial reply. Guess he is setting out to change that fast! The CTO of Rubrik, Arvind Nithrakashyap, worked at Oracle where he co-founded Oracle Exadata. The other Co-founder of Exadata is PernixData CEO Poojan Kumar. Last but certainly not least Soham Mazumdar, who worked at Google on the search engine and founded Tagtile. As of today you can sign-up for the early access program. Go visit the website to read more and follow them on twitter. Exciting times ahead for Rubrik! Don’t Backup. Go Forward!
PART 3 - DATA PATH IS NOT MANAGED AS A CLUSTERED RESOURCE
Welcome to part 3 of the Virtual Datacenter scaling problems with traditional shared storage series. Last week I published an article about the FAST presentation “A Practical Implementation of Clustered Fault Tolerant Write Acceleration in a Virtualized Environment”. Ian Forbes followed up with the question about the advantages of throughput and latency of host-to-host network versus a traditional SAN when both have similar network speeds. Part 1: Intro Part 2: Storage Area Network topology IOPS distribution amongst ESXi hosts In the previous part of this series the IOPS provided by the array were equally divided amongst the ESXi hosts. In reality given the nature of applications and their variance it’s the application demand that drives the I/O demand. And due to this I/O demand will not be equally balanced across all the host in the cluster. The virtual datacenter is comprised of different resource layers each with components that introduce their own set of load balancing algorithms. Back in 2009 Chad published nice diagram depicting all the queues and buffers of a typical storage environment. Go read the excellent article “VMware I/O queues, micro bursting and multipathing”. How can you ensure that the available paths to the array are load-balanced based on virtual machine demand and importance? Unfortunately for us today’s virtual datacenter lacks load balance functionality that clusters these different layers, reducing hotspots and optimally distributes workloads. Let’s focus on the existing algorithms currently available, and possibly present, in virtual datacenters around the world. Clustered load balancers The only cluster-wide load balancing tools are DRS and Storage DRS. Both cluster resources into seamless pools and distribute workload according to their demand and their priority. When the current host cannot provide the entitled resources a virtual machine demands, the virtual machine is migrated to another host or datastore. DRS aggregates CPU and memory resources, Storage DRS tries to mix and match the VM I/O & capacity demand with the datastore I/O and capacity availability. The layers between compute and datastores are equally important yet network bandwidth and data paths are not managed as a clustered resource. Load balancing occurs within the boundaries of the host; specifically they focus on outgoing data streams. Data Path load balancing With IP-based storage networks, multiple options exist to balance the workload across the outgoing ports. With iSCSI, binding of multiple VMkernel NICs can be used to distribute workload, some storage vendors prefer a configuration using multiple VLANS to load balance across storage ports. When using NFS Load Based Teaming (LBT) can be used to load balance data across multiple NICs. Unfortunately all these solutions don’t take the path behind the first switch port in consideration. Although the existing workload is distributed across the available uplinks as efficiently as possible, no solution exists that pools the connected paths of the hosts in the cluster and distribute the workloads across the hosts accordingly. A solution that distributes virtual machines across host with less congested data paths in a well-informed and automatic manner simply does not exist. Distributed I/O control Storage I/O Control (SIOC) is a datastore-wide scheduler, allowing distributing of queue priority amongst the virtual machines located on various hosts that are connected to that datastore. SIOC is designed to deal with situations where contention occurs. If necessary it divides the available queue slots across the hosts to satisfy the I/O requirements based on the virtual machine priority. SIOC measures the latency from the (datastore) device inside the kernel to disk inside the array. It is not designed to migrate virtual machines to other hosts into the cluster to reduce latency or bandwidth limitations incurred by the data path. Network IO Control (NetIOC) is based on similar framework. It allocates and distributes bandwidth across the virtual machines that are using the NICs of that particular host. It has no ability to migrate virtual machines by taking lower utilized links of other hosts in the cluster into account. Multipathing software VMware Pluggable Storage Architecture (PSA) is interesting. The PSA allows third party vendors to provide their own native multipathing software (NMP). Within the PSA, Path Selection Plugins (PSPs) are active that are responsible for choosing a physical path for I/O requests. The VMware Native Multipathing Plugin framework supports three types of PSPs; Most Recently Used (MRU), Fixed and Round Robin. The Storage Array Type Plugins (SATP) run in conjunction with NMP and manages array specific operations. SATPs are aware of storage array specifics, such as whether it’s an active/active array or active/passive array. For example, when the array uses ALUA (Asymmetric LUN Unit Access) it determines which paths lead to the ports of the managing controllers. The Round Robin PSP distributes I/O for a datastore down all active paths to the managing controller and uses a single path for a given number of I/O operations. Although it distributes workload across all (optimized) paths, it does not guarantee that throughput will be constant. There is no optimization on the I/O profile of the host. Its load balance algorithm is based purely on equal numbers of I/O down a given path, without regard to what the block size of I/O type is, it will not be balanced on application workload characteristics or the current bandwidth utilization of the particular path. Similar to SIOC and NetIOC, NMP is not designed to treat data paths as clustered resource and has no ability to distribute workloads across all available uplinks in the cluster. EMC PowerPath is a third party NMP and has multiple algorithms that consider current bandwidth consumption of the paths and the pending types of I/O. It also integrates certain storage controller statistics to avoid negative affects by continuously switching paths. PowerPath squeezes as much performance (and resilience) out of their storage paths as possible because it can probe the link from host all the way to the back-end of a supported array and make decisions about active links accordingly. However PowerPath hosts do not communicate with each other and balances the I/O load on a host-by-host basis. This paragraph focuses on EMC solution; other storage vendors are releasing their NMP software with similar functionality. However not all vendors are providing their own software, and EMC PowerPath is only supported on a short list of storage vendors other than EMC own products. Quality of Services on data paths Quality of Services on data paths (QoS) is an interesting solution if it provides end-to-end QoS; from virtual machine to datastore. The hypervisor is context-rich environment, allowing kernel services to understand which I/O belongs to which virtual machine. However when the I/O exits the host and hits the network, the remaining identification is the address of the transmitting device of the host. There is no differentiation of priority possible other then at host level. Not all applications are equally important to the business, therefor end-to-end QoS is necessary to guarantee that business critical application get the resources they deserve. Scalability limitation on storage controller ports influences the overall impact of QoS. Similar to most algorithms, it does not aim to provide a balanced utilization of all available data paths; it deals with priority control during resource contention. Storage Array layer Storage DRS is able to migrate virtual machine files based on their resource demand. Storage DRS monitors the VMobserved latency, this includes the kernel and data path latency. Storage DRS incorporates the latency to calculate the benefit a migration has on the overall change in latency at source and destination datastore. It does not use the different latencies of kernel and data path to initiate a migration at compute level. Storage DRS initiates a self-vMotion to load the new VMX file as it has a different location after the storage vMotion, the virtual machine remains on the same hosts. In other words Storage DRS is not designed to migrate virtual machine at the compute layer or datastore layer to solve bandwidth imbalance. Most popular arrays provide asymmetric LUN unit access. All ports on the Storage controllers accept incoming read and write operations, however the controller owning the LUN always manages read operations. Distributing LUNs across controllers are crucial as imbalance of CPU utilization or port utilization of the storage controllers can be easily introduced. LUNs can be manually transferred to improve CPU utilization, however this is not done dynamically unfortunately. Manual detection and management is as good as people watching it and not many organizations watch the environment at that scrutiny level all the time. Some might argue that arrays transfer the LUNs automatically, but that’s when a certain amount of “proxy reads” are detected. This means that the I/O’s are transmitted across the non-optimized paths, and likely either the PSP is not doing a great job, or all your active optimized paths are dead. Both not hallmarks of an healthy – or –properly architected environment. Is oversizing a solution? Oversizing bandwidth can help you so far, as its difficult to predict workload increase and intensity variation. Introduction of radically new application landscape impact current designs tremendously. When looking at the industry developments, its almost certain that most datacenters will be forced to absorb these new application landscapes, can current solutions applied in a traditional storage stack provide and guarantee the services they require and are they able to scale to provide the resources necessary? Non-holistic load balancer available In essence, the data-path between the compute layer and datastore layer is not treated as a clustered resource. Virtual machine host placement is based on compute resource availability and entitlement, disregarding the data path towards storage layer. This can potentially lead to hotspots inside the cluster, where some hosts saturate their data paths, while data paths of other hosts are underutilized. Data path saturation impacts application performance. Unfortunately there is no mechanism available today that takes the various resource demands of a virtual machine into account. No solution at this time has the ability to intelligently manage these resources without creating other bottlenecks or impracticalities. This article is not a stab at the current solutions. It is a very difficult problem to solve, especially for an industry that relies on various components of various vendors, expecting everything to integrate and perform optimally aligning. And think about moving forward and attempt to incorporate new technology advancements from all different vendors when they are available. And don’t forget about backward compatibility, world peace might be easier to solve. With this problem in mind, the existence of uncontrollable data paths, oversubscribed inter-switch links and its inability to be application aware, many solutions nowadays move away from the traditional storage architecture paradigm. Deterministic performance delivery and Policy-driven management are the future and when no centralized control plane is available that stitching these disparate components together a different architecture arises. PernixData FVP, VMware VSAN and Hyper-converged systems rely on point-to-point network architectures of the crossbar switch architecture to provide consistent and non-blocking network performance to cater their storage performance and storage service resiliency needs. Leveraging point-to-point connections and being able to leverage the context-aware hypervisor allows you not only to scale easily, it allows you to create environments that provide consistent and deterministic performance levels. The future datacenter is one step closer! Part 4 focusses on the storage controller architecture and why leveraging host-to-host communication and host resource availability remove scalability issues.
VIRTUAL DATACENTER SCALING PROBLEMS WITH TRADITIONAL SHARED STORAGE - PART 2
Oversubscription Ratios The most common virtual datacenter architecture consists of a group of ESXi hosts connected via a network to a centralized storage array. The storage area network design typically follows the core-to-edge topology. In this topology a few high-capacity core switches are placed in the middle this topology. ESXi hosts are sometimes connected directly to this core switch but usually to a switch at the edge. An inter-switch link connects the edge switch to the core switch. Same design applies to the storage array; it’s either connected directly to the core switch or to an edge switch. This network topology allows scaling out easily from a network perspective. Edge switches (sometimes called distribution switches) are used to extend the port count. Although core switches are beefy, there is however a limitation on the number of ports. And each layer in the network topology has its inherent roles in the design. One of the drawbacks is that Edge-to-Core topologies introduce oversubscription ratios, where a number of edge ports use a single connection to connect to a core port. Placement of systems begins to matter in this design as multiple hops increase latency. To reduce latency the number of hops should be reduced, but this impacts port count (and thus number of connected hosts) which impacts bandwidth availability, as there are a finite amounts of ports per switch. Adding switches to increase port count brings you back to device placement problems again. As latency play a key role in application performance, most storage area network aim to be as flat as possible. Some use a single switch layer to connect the host layer to the storage layer. Let’s take a closer look on how scale out compute and this network topology impacts storage performance: Storage Area Network topology This architecture starts of with two hosts, connected with 2 x 10GB to a storage array that can deliver 33K IOPS. The storage area network is 10GB and each storage controller has two 10GB ports. To reduce latency as much as possible, a single redundant switch layer is used that connects the ESXi hosts to the storage controller ports. It looks like this: In this scenario the oversubscription ratio of links between the switch and the storage controller is 1:1. The ratio of consumer network connectivity to resource network connectivity is equal. New workload is introduced which requires more compute resources. The storage team increases the spindle count to increase more capacity and performance at the storage array level. Although both the compute resources and the storage resources are increased, no additional links between the storage controllers are added. The oversubscription ratio is increased and now a single 10Gb link of an ESXi host has to share this link with potentially 5 other hosts. The answer is obviously increasing the number of links between the switch and the storage controllers. However most storage controllers don’t allow scaling the network ports. This design stems from the era where storage arrays where connected to a small number of hosts running a single application. On top of that, it took the non-concurrent activity into account. Not every application is active at the same time and with the same intensity. The premise of grouping intermitted workloads led to virtualization, allowing multiple applications using powerful server hardware. Consolidation ratios are ever expanding, normalizing intermittent workload into a steady stream of I/O operations. Workloads have changed, more and more data is processed every day, pushing these IO’s of all these applications through a single pipe. Bandwidth requirements are shooting through the roof, however many storage area network designs are based of best practices prior to the virtualization era. And although many vendors stress to aim for a low oversubscription ratio, the limitation of storage controller ports prevents removing this constraint. In the scenario above I only used 6 ESXi hosts, typically you will see a lot more ESXi hosts connected to the same-shared storage array, stressing the oversubscription ratio. In essence you have too squeeze more IO through a smaller funnel, this will impact latency and bandwidth performance. Frequently scale-out problems with traditional storage architecture are explained by calculating the average number of IOPS per host by dividing the number of host by the total number of IOPS provided by the array. In my scenario, the average number of IOPS is 16.5K IOPS remained the same due to the expansion of storage resources at the same time the compute resources were added (33/2 or 100/6). Due to the way storage is procured (mentioned in part 1) storage arrays are configured for expected peak performance at the end of its life cycle. When the first hosts are connected, bandwidth and performance are (hopefully) not a problem. New workloads lead to higher consolidation ratio’s, which typically result in expansion of compute cycles to keep the consolidation ratio at a certain level to satisfy performance and availability requirements. This generally leads to reduction of bandwidth and IOPS per hosts. Arguably this should not pose a problem if sizing was done correctly. Problem is, workload increase and workload behavior typically do not align with expectations, catching the architects off-guard or simply new application landscape turn up unexpectedly when business needs change. A lack of proper analytics impacts consumption of the storage resources and avoiding hitting limits. It’s not unusually for organizations to experience performance problems due to lack of proper visibility in workload behavior. To counteract this, more capacity is added to the storage array to satisfy capacity and performance requirements. However this does not solve the problem that exists right in the middle of these two layers. It ignores the funnel created by the oversubscription ratio of the links connected to the storage controller ports. The storage controller port count impact the ability to solve this problem, another problem is that the way total bandwidth is consumed. The activity of the applications and the distribution of the virtual machines across the compute layer affect the storage performance, as workload might not be distributed equally across the links to the storage controllers. Part 3 of this series will focus on this problem.
MEMORY DEEP DIVE SUMMARY
This is last part of the memory deep dive. As the total series count 7667 words, I thought it would be a good idea to create a summary of the previous 6 parts. The memory deep dive series: Part 1: Memory Deep Dive Intro Part 2: Memory subsystem Organisation Part 3: Memory Subsystem Bandwidth Part 4: Optimizing for Performance Part 5: DDR4 Memory Part 6: NUMA Architecture and Data Locality Part 7: Memory Deep Dive Summary The reason why I started this deep dive is to understand the scalability of server memory configurations and constrains certain memory type introduce. Having unpopulated DIMM slot does not always translate into future expandability of memory capacity. A great example is the DIMM layout of today’s most popular server hardware. The server boards of the Cisco UCS B200 M4, HP Proliant DL380 Gen 9, and Dell PowerEdge 730 Gen 13 come equipped with 2 CPU’s and 24 DIMM slots. Processors used in the aforementioned systems are part of the new Intel Xeon 26xx v3 micro-architecture. It uses multiple onboard memory controllers to provide a multi-channel memory architecture. Multi-channel configurations, DIMM ranking and DIMM types must be considered when designing your new server platform. If these are not taken into account, future scalability might not be possible or the memory will not perform as advertised. Multi-channel memory architecture Modern CPU microarchitectures support triple or quadruple memory channels. This allows the memory controller to access the multiple DIMMs simultaneously. The key to high bandwidth and low latency is interleaving. Data is distributed in small chunks across multiple DIMMs. Smaller bits of data are retrieved from each DIMM across independent channels instead of accessing a single DIMM for the entire chunk of data across one channel. For in-depth information, please go to part 2. The Intel Xeon 26xx v3 micro-architecture offers a quad-channel memory architecture. To leverage all the available bandwidth each channel should be populated with at least one DIMM. This configuration has the largest impact on performance and especially on throughput. Part 4 dives into multi-channel configuration in depth. The configuration depicted above leverages all four channels and allows the CPU to interleave memory operations across all four channels. The memory controller groups memory across the channels in a region. When creating a 1 DIMM per Channel configuration, the CPU creates one region (Region 0) and interleaves the memory access. The CPU will use the available bandwidth if less than 4 DIMMs are used, for example when 3 DIMMs are used, the memory controller uses three channels to interleave memory. 2 DIMMs result in two usable channels, and one DIMM will use a single channel, disabling interleaving across channels. Populating four channels provide the best performance, however sometimes extra capacity is required, but less than DIMMS populating four channels provide. For example if 384 GB is required and 32 GB DIMMs are used, 6 DIMMS are used. The CPU will create two Regions. Region 0 will run in quad channel mode, while region 1 runs in dual channel mode: This creates an unbalanced memory channel configuration, resulting in inconsistent performance. With Quad Channel configurations its recommended to add memory in groups of 4 DIMMS. Therefor use 4, 8 or 12 DIMMs per CPU to achieve the required memory capacity. Memory Ranking A DIMM groups the chips together in ranks. The memory controller can access ranks simultaneously and that allows interleaving to continue from channel to rank interleaving. Rank interleaving provides performance benefits as it provides the memory controller to parallelize the memory request. Typically it results in a better improvement of latency. DIMMs come in three rank configurations; single-rank, dual-rank or quad-rank configuration, ranks are denoted as (xR). To increase capacity, combine the ranks with the largest DRAM chips. A quad-ranked DIMM with 4Gb chips equals 32GB DIMM (4Gb x 8bits x 4 ranks). As server boards have a finite amount of DIMM slots, quad-ranked DIMMs are the most effective way to achieve the highest memory capacity. Unfortunately current systems allow up to 8 ranks per channel. Therefor limiting the total capacity and future expandability of the system. 3 DIMMs of QR provide the most capacity however it is not a supported configuration as 12 ranks exceeds the allowable 8 ranks. Ranking impacts the maximum number of DIMMs used per channel. If current memory capacity of your servers needs to be increased, verify the ranking configuration of the current memory modules. Although there might be enough unpopulated DIMM slots, quad rank memory modules might prevent you from utilizing these empty DIMM slots. LRDIMMs allow large capacity configurations by using a memory buffer to obscure the number of ranks on the memory module. Although LRDIMMs are quad ranked, the memory controller only communicates to the memory buffer reducing the electrical load on the memory controller. DIMMs per Channel A maximum of 3 DIMMs per channel are allowed. If one DIMM is used per channel, this configuration is commonly referred to as 1 DIMM Per Channel (1 DPC). 2 DIMMs per channel (2 DPC) and if 3 DIMMs are used per channel, this configuration is referred to as 3 DPC. When multiple DIMMs are used per channel they operate at a slower frequency.