NUMA DEEP DIVE PART 4: LOCAL MEMORY OPTIMIZATION
If a cache miss occurs, the memory controller responsible for that memory line retrieves the data from RAM. Fetching data from local memory could take 190 cycles, while it could take the CPU a whopping 310 cycles to load the data from remote memory. Creating a NUMA architecture that provides enough capacity per CPU is a challenge, considering the impact memory configuration has on bandwidth and latency. Part 2 of the NUMA Deep Dive covered QPI bandwidth configurations; with those QPI bandwidth ‘restrictions’ in mind, optimizing the memory configuration contributes the most to local access performance. Similar to the CPU, memory is a very complex subject and I cannot cover all the intricate details in one post. Last year I published the Memory Deep Dive series and I recommend reviewing that series as well to get a better understanding of the characteristics of memory.
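To put those cycle counts in perspective, a back-of-the-envelope sketch (using the 190/310 cycle figures quoted above; real latencies vary per platform and memory configuration) shows how quickly the average memory latency climbs as memory locality drops:

```python
def effective_latency(local_fraction, local_cycles=190, remote_cycles=310):
    """Average memory access latency in CPU cycles, given the fraction
    of accesses served from local memory. Cycle counts are the example
    figures from the text, not measurements of a specific CPU."""
    return local_fraction * local_cycles + (1 - local_fraction) * remote_cycles

print(effective_latency(1.0))  # 190.0 - every access is local
print(effective_latency(0.5))  # 250.0 - half the accesses go remote
print(effective_latency(0.0))  # 310.0 - worst case, all remote
```

Even at 50% locality the average access is roughly 30% slower than the all-local case, which is why the scheduler works so hard to keep a VM's memory on its home node.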
NUMA DEEP DIVE PART 3: CACHE COHERENCY
When people talk about NUMA, most talk about the RAM and the core count of the physical CPU. Unfortunately, the importance of cache coherency in this architecture is mostly ignored. Locating memory close to CPUs increases scalability and reduces latency if data locality occurs. However, a great deal of the efficiency of a NUMA system depends on the scalability and efficiency of the cache coherence protocol! When researching older NUMA material, you will notice that today’s architecture is primarily labeled as ccNUMA: Cache Coherent NUMA.
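To illustrate what a cache coherence protocol actually does, here is a toy Python model of MESI state transitions for a single cache line as seen by one cache. This is a deliberate simplification for intuition only: modern Intel CPUs use MESIF and AMD uses MOESI, and real implementations involve snoop filters and directories not modeled here.

```python
# Toy MESI model: (current_state, event) -> next_state for one cache line.
# M = Modified, E = Exclusive, S = Shared, I = Invalid.
MESI = {
    ("I", "local_read"):   "S",  # read miss; assume another cache shares it
    ("I", "local_write"):  "M",  # write miss; gain exclusive ownership
    ("S", "local_write"):  "M",  # upgrade; other copies get invalidated
    ("S", "remote_write"): "I",  # another CPU wrote; our copy is stale
    ("E", "local_write"):  "M",  # silent upgrade, no coherence traffic
    ("E", "remote_read"):  "S",  # another CPU now shares the line
    ("M", "remote_read"):  "S",  # write back dirty data, then share
    ("M", "remote_write"): "I",  # write back, then invalidate
}

def next_state(state, event):
    # Events not in the table leave the state unchanged
    # (e.g. a local_read while already in M, E, or S is a plain hit).
    return MESI.get((state, event), state)

state = "I"
for event in ("local_read", "remote_write", "local_write"):
    state = next_state(state, event)
print(state)  # "M": read-miss to S, invalidated back to I, write-miss to M
```

The point of the sketch is the cost model: every transition triggered by a *remote* event is coherence traffic crossing the interconnect, which is exactly the overhead the protocol's scalability determines.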
NUMA DEEP DIVE PART 2: SYSTEM ARCHITECTURE
Reviewing the physical layers helps to understand the behavior of the CPU scheduler of the VMkernel, which in turn helps to select a physical configuration that is optimized for performance. This part covers the Intel Xeon microarchitecture and zooms in on the Uncore, primarily focusing on Uncore frequency management and QPI design decisions. Terminology: There are a lot of different names used for what is apparently the same thing. Let’s review the terminology of the physical CPU and the NUMA architecture. The CPU package is the device you hold in your hand; it contains the CPU die and is installed in the CPU socket on the motherboard. The CPU die contains the CPU cores and the system agent. A core is an independent execution unit and can present two virtual cores to run simultaneous multithreading (SMT). Intel’s proprietary SMT implementation is called Hyper-Threading (HT). Both SMT threads share components such as the cache layers and access to the scalable ring on-die interconnect for I/O operations. Interesting etymology: the word “die” is the singular of dice. Elements such as processing units are produced on a large round silicon wafer. The wafer is cut, or “diced”, into many pieces. Each of these pieces is called a die.
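The package/core/SMT-thread relationship is visible on a Linux host through sysfs: each logical CPU lists its SMT siblings in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The sketch below parses that format against a made-up 6-core Hyper-Threading layout (the numbering scheme is an assumption for illustration; real systems enumerate logical CPUs differently per BIOS and kernel):

```python
# Hypothetical sysfs contents for a 6-core CPU with Hyper-Threading:
# logical CPUs 0 and 6 are the two hardware threads of core 0, and so on.
sample = {
    0: "0,6", 1: "1,7", 2: "2,8", 3: "3,9", 4: "4,10", 5: "5,11",
    6: "0,6", 7: "1,7", 8: "2,8", 9: "3,9", 10: "4,10", 11: "5,11",
}

def physical_cores(siblings_by_cpu):
    """Group logical CPUs into physical cores by their sibling sets."""
    cores = set()
    for siblings in siblings_by_cpu.values():
        cores.add(frozenset(int(c) for c in siblings.split(",")))
    return cores

cores = physical_cores(sample)
print(len(cores))                 # 6 physical cores
print(len(sample) // len(cores))  # 2 SMT threads per core
```

On a real host you would read each file from sysfs instead of the sample dict; the grouping logic stays the same.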
NUMA DEEP DIVE PART 1: FROM UMA TO NUMA
Non-uniform memory access (NUMA) is a shared memory architecture used in today’s multiprocessing systems. Each CPU is assigned its own local memory and can access memory from other CPUs in the system. Local memory access provides low latency and high bandwidth, while accessing memory owned by another CPU incurs higher latency and lower bandwidth. Modern applications and operating systems such as ESXi support NUMA by default, yet to provide the best performance, virtual machine configuration should be done with the NUMA architecture in mind. If incorrectly designed, inconsistent behavior or overall performance degradation occurs for that particular virtual machine, or in the worst-case scenario for all VMs running on that ESXi host. This series aims to provide insight into the CPU architecture, the memory subsystem, and the ESXi CPU and memory scheduler, allowing you to create a high-performing platform that lays the foundation for the higher services and increased consolidation ratios. Before we arrive at modern compute architectures, it’s helpful to review the history of shared-memory multiprocessor architectures to understand why we are using NUMA systems today.
INTRODUCTION 2016 NUMA DEEP DIVE SERIES
Recently I’ve been analyzing traffic to my site and it appears that a lot of CPU and memory articles are still very popular. Even my first article about NUMA, published in February 2010, is still in high demand. And although you see a lot of talk about the upper levels and overlay technology today, the focus on proper host design and management remains. After all, it’s the correct selection and configuration of these physical components that produces a consistently high-performing platform. And it’s this platform that lays the foundation for the higher services and increased consolidation ratios.
TOP 5 VBLOG AGAIN, THANKS!!!!
Yesterday the top 25 vBlogs were announced and once again I’m in the top 5. I would like to thank all who have voted for me! It’s great to see that the content is appreciated. The broadcast: https://youtu.be/-Kb6gAys0jc Looking forward, there is a lot of content getting ready to be published and I hope to release my 5th book this year, the vSphere 6.x host resource deep dive. I’m excited about the content I’m working on and I hope you guys will be too! Thanks! Frank
NEW HOME LAB HARDWARE - DUAL SOCKET XEON V4
A new challenge - a new system for the home lab About one year ago my home lab was expanded with a third server and a fresh coat of networking. During this upgrade, which you can read about in “When your Home Lab turns into a Home DC”, I faced the dilemma of adding a new generation of CPU (Haswell) or expanding the hosts with another Ivy Bridge system. This year I’ve proceeded to expand my home lab with a dual Xeon system, and decided to invest in the latest and greatest hardware available. Like most good tools, you buy them for the next upcoming job, but in the end, you will use them for countless other projects. I expect the same thing with this year’s home lab ‘augmentation’. Initially, the dual socket system will be used to test and verify the theory published in the upcoming book “vSphere 6 Host resource deep dive” and the accompanying VMworld presentation (session id 8430), but I have a feeling that it’s going to become my default test platform. Besides the dual socket system, the Intel Xeon 1650 v2 servers are expanded with more memory and a Micron 9100 PCIe NVMe SSD 1.2 TB flash device. Listed below is the bill of materials of the dual socket system:
VSPHERE 6.X HOST RESOURCE DEEP DIVE SESSION (8430) ACCEPTED FOR VMWORLD US AND EUROPE
Yesterday both Niels and I received the congratulatory message from the VMworld team, informing us that our session is accepted for both VMworld US and Europe. We are both very excited that our session was selected and we are looking forward to presenting to the VMworld audience. Our session is called the vSphere 6.x host resource deep dive (session ID 8430) and is an abstract of our similarly titled book (publish date will be disclosed soon). Session Outline Today’s focus is on upper levels/overlays (SDDC stack, NSX, Cloud), but proper host design and management still remains the foundation of success. With the introduction of these new ‘overlay’ services, we are presented with a new consumer of host resources. Ironically, it’s the attention to these abstraction layers that returns us to focusing on individual host components. Correct selection and configuration of these physical components leads to a stable, high-performing platform that lays the foundation for the higher services and increased consolidation ratios. Topics we will address in this presentation are: The introduction of NUMA (Non-Uniform Memory Access) required changes in memory management. Host physical memory is now split into local and remote memory structures for CPUs, which can impact virtual machine performance. We will discuss how to right-size your VM’s CPU and memory configuration with regard to NUMA and vNUMA VMkernel CPU scheduler characteristics. Processor speed and core counts are important factors when designing a new server platform. However, with virtualization platforms the memory subsystem can have an equal or sometimes even greater impact on application performance than the processor speed. In this talk we focus on physical memory configurations. Providing consistent performance is key to predictable application behavior. It benefits day-to-day customer satisfaction and helps reduce application performance troubleshooting.
This talk covers flash architecture and highlights the differences between the predominant types of local storage technologies. We look closer into recurring questions about virtual networking. For example, how many resources does the VMkernel claim for networking, and what impact does a vNIC type have on resource consumption? Such info allows you to get a better grip on sizing your virtual datacenter for NFV workloads. Key Takeaway 1: Identifying how proper NUMA and physical memory configuration allows for increased VM performance. Key Takeaway 2: What is the impact of virtual network services on consumption of host compute resources? Key Takeaway 3: How next-gen storage components lead to lower latency, higher bandwidth, and increased scalability. Key dates: VMworld US takes place at Mandalay Bay Hotel & Convention Center in Las Vegas, NV from August 28 - September 1, 2016 VMworld Europe takes place at Fira Barcelona Gran Via in Barcelona, Spain from 17 - 20 October, 2016 Repeat the feat Five years ago Duncan and I got the room completely full with our vSphere Clustering Deepdive Q&A; I would love to repeat that feat with a Host deep dive session. I hope to see you all in our session!
HOME LAB FUNDAMENTALS: DNS REVERSE LOOKUP ZONES
When starting your home lab, all hints and tips are welcome. The community is full of wisdom, yet sometimes certain topics are taken for granted or are perceived as common knowledge. The Home Lab fundamentals series focuses on these subjects, helping you avoid common pitfalls that provide headaches and waste incredible amounts of time. One thing we always keep learning about vSphere is that both time and DNS need to be correct. DNS resolution is important to many vSphere components. You can go a long way without DNS and use IP addresses within your lab, but at one point you will experience weird behavior or installs just stop without any clear explanation. In reality, vSphere is built for professional environments where it’s expected that a proper networking structure is in place, both physical and logical. When reviewing a lot of community questions, blog posts and tweets, it appears that DNS is often only partially set up, i.e. only forward lookup zones are configured. And although that appears to be ‘just enough’ DNS to get things going, many have experienced that their labs start to behave differently when no reverse lookup zones are present. Time-outs or delays are more frequent, and the whole environment isn’t snappy anymore. Ill-configured DNS might give you the idea that the software is crap, but in reality it’s the environment that is configured poorly. When using DNS, use the four golden rules: forward, reverse, short and full. DNS in a lab environment isn’t difficult to set up, and if you want to simulate a proper working vSphere environment then invest time in setting up a DNS structure. It’s worth it! Besides expanding your knowledge, your systems will feel more robust and, believe me, you will wait a lot less on systems to respond. vCenter and DNS vCenter inventory and search rely heavily on DNS. And since the introduction of the vCenter Single Sign-On service (SSO) as a part of the vCenter Server management infrastructure, DNS has become a crucial element.
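For the curious, reverse lookups work by querying PTR records in the special in-addr.arpa zone. The query name is derived mechanically from the IPv4 address, as this small sketch shows (the address is a made-up lab example):

```python
def reverse_zone_name(ipv4):
    """Return the PTR query name for an IPv4 address: the octets are
    reversed and '.in-addr.arpa' is appended."""
    octets = ipv4.split(".")
    assert len(octets) == 4, "expected a dotted-quad IPv4 address"
    return ".".join(reversed(octets)) + ".in-addr.arpa"

print(reverse_zone_name("192.168.1.10"))  # 10.1.168.192.in-addr.arpa
```

Python’s standard library offers the same derivation via ipaddress.ip_address("192.168.1.10").reverse_pointer, and socket.gethostbyaddr() performs the actual PTR lookup against your DNS server, so it is an easy way to verify that your reverse zones answer correctly.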
SSO is an authentication broker and security token exchange infrastructure. As described in the KB article Upgrading to vCenter Server 5.5 best practices (2053132):
HOME LAB FUNDAMENTALS: TIME SYNC
First rule of Home Lab club, don’t talk about time sync! Or so it seems. When starting your home lab, all hints and tips are welcome. The community is full of wisdom, however sometimes certain topics are taken for granted or are perceived as common knowledge. The Home Lab fundamentals series focuses on these subjects, helping you avoid the most common pitfalls that provide headaches and waste incredible amounts of time. A ’time-consuming’ pitfall is dealing with improper time synchronization between the various components in your lab environment. Most often, the need for time synchronization is seen as an Enterprise requirement but not really necessary for lab environments. Maybe because most think time synchronization is solely necessary for troubleshooting purposes. In some cases, this is true, as ensuring correct time notation allows for proper correlation of events. Interestingly enough, this alone should be enough reason to maintain synchronized clocks throughout your lab, but most home labs are just rebuilt when troubleshooting becomes too time-consuming. However, time sync does much more than expedite troubleshooting, and ignoring time drift is a straight path into the rabbit hole. Time synchronization utilities such as NTP are necessary to correct time drift introduced by hardware clock drift and guest operating system timekeeping imprecision. When time differs between systems too much, it can lead to installation and authentication errors. Unfortunately, time issues are not always easily identifiable. To provide a great example:
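As background on how NTP corrects drift: it estimates the local clock's offset from four timestamps exchanged between client and server. The sketch below uses made-up timestamps to demonstrate the standard offset and delay formulas:

```python
# NTP's four timestamps: t0 = client transmit, t1 = server receive,
# t2 = server transmit, t3 = client receive (values in seconds,
# made up for illustration).

def ntp_offset(t0, t1, t2, t3):
    """Estimated offset of the local clock relative to the server."""
    return ((t1 - t0) + (t2 - t3)) / 2

def ntp_delay(t0, t1, t2, t3):
    """Round-trip network delay, excluding server processing time."""
    return (t3 - t0) - (t2 - t1)

# Scenario: the client's clock runs 5 seconds behind the server,
# and the round trip takes 1 second in total.
t0, t1, t2, t3 = 100.0, 105.5, 105.5, 101.0
print(ntp_offset(t0, t1, t2, t3))  # 5.0: local clock is 5s behind
print(ntp_delay(t0, t1, t2, t3))   # 1.0: total round-trip delay
```

Averaging the two one-way measurements is what lets NTP cancel out symmetric network delay, which is why it can keep clocks within milliseconds even over a slow link; this is exactly the mechanism ESXi hosts and guests rely on when you point them at an NTP source.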