
vSphere 6.5+ DRS Pairwise Balancing

October 30, 2019 by frankdenneman

Or maybe I should have called this blog post, “I’m seeing an excessive number of DRS-initiated vMotions in my newly upgraded 6.5 environment”. Recently I was part of a few conversations about the nature of DRS load balancing in systems running vSphere 6.5 and newer. It was noticed that more vMotion operations were occurring since running 6.5, and it’s highly likely that these operations occur due to the new DRS pairwise balancing functionality. Pairwise balancing was introduced in vSphere 6.5 and is focused on keeping the host resource utilization disparity within a certain threshold. As a result, DRS performs load-balancing operations if the difference between the lowest-utilized host and the highest-utilized host exceeds a certain percentage. That percentage depends on your migration threshold. The default migration threshold uses a 20% tolerable difference in utilization.

Migration Threshold Level | Tolerable CPU/Memory usage difference between any two hosts in the cluster
1                         | Not available (only affinity violations and maintenance mode migrations allowed)
2                         | 30%
3 (default)               | 20%
4                         | 10%
5                         | 5%

This new feature is needed as clusters keep growing larger and larger. To determine if load-balancing operations are necessary, DRS calculates two metrics: the current host load standard deviation (CHLSTD) and the target host load standard deviation (THLSTD). Each host reports its load, and DRS calculates the standard deviation of the host load metric across all the hosts in the cluster. DRS also calculates a target host load for the cluster, and as long as the current host load standard deviation is less than or equal to the target value, DRS considers the cluster balanced. The migration threshold determines how far apart the CHLSTD and THLSTD can drift before DRS triggers load-balancing operations. The more aggressive the migration threshold, the smaller the tolerated difference between the CHLSTD and THLSTD.
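As a toy illustration of this balance check (not the actual DRS code; the load values and the target value are made up), the standard deviation comparison looks like this in PowerShell:

$hostLoads = 0.35, 0.42, 0.38, 0.81                  # hypothetical normalized host loads
$mean = ($hostLoads | Measure-Object -Average).Average
$sumSq = ($hostLoads | ForEach-Object { [math]::Pow($_ - $mean, 2) } | Measure-Object -Sum).Sum
$chlstd = [math]::Sqrt($sumSq / $hostLoads.Count)    # current host load standard deviation
$thlstd = 0.2                                        # hypothetical target derived from the migration threshold
if ($chlstd -le $thlstd) { "Cluster considered balanced" } else { "Load balancing considered" }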

A situation can occur where a few hosts in a large cluster experience high resource utilization while the majority of hosts do not. Due to the size of the cluster, the few highly loaded hosts become statistical outliers that simply disappear as noise among the vast number of hosts experiencing (far) lower utilization. As a result, these outliers are missed, as the calculated CHLSTD stays below the threshold required to trigger load balancing.

By adding pairwise balancing, and “simply” comparing the highest reported utilization with the lowest utilization, these outliers might be a thing of the past. That means that in certain cases the DRS UI might report that the cluster is in a balanced state, yet load-balancing operations still occur. This behavior can be attributed to pairwise balancing, as sketched below.
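A toy sketch of that pairwise check (again, not the actual DRS implementation; the utilization numbers are made up): flag an imbalance when the gap between the busiest and the least busy host exceeds the tolerable difference of the migration threshold.

$hostCpuUtil = 78, 35, 41, 38                        # hypothetical CPU utilization per host (%)
$tolerableDiff = 20                                  # default migration threshold (level 3)
$stats = $hostCpuUtil | Measure-Object -Maximum -Minimum
$gap = $stats.Maximum - $stats.Minimum
if ($gap -gt $tolerableDiff) { "Pairwise imbalance detected: gap of $gap percentage points" }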

Please keep in mind that if you are using a migration threshold that is more aggressive than the default setting, the tolerable difference between hosts is reduced and more migrations are likely to occur.

So what happens when the tolerable difference is detected in the cluster? Does this mean that VMs are migrated from the highest-utilized host to the lowest-utilized host? Not necessarily. VMs can be migrated to any other host in the cluster. DRS still takes many different requirements into account when selecting a virtual machine migration for load-balancing purposes. Anti-affinity and affinity rules cannot be violated to obtain a better cluster load balance, so such moves are not considered. Compatibility of hosts and VM configuration also limits migration options (a missing datastore or network portgroup is a common reason why particular hosts are overloaded while others sit at lower utilization), and the “cost-benefit” of a VM migration is still taken into account. It still needs to make sense for the cluster balance to incur the infrastructure cost and risk of moving a particular VM.

If you recently updated your vCenter to 6.5/6.7 and are curious to see whether the vMotions are triggered by pairwise imbalance operations, you can use the online version of the DRS Dump Insight tool available at https://www.drsdumpinsight.vmware.com/. You can also run the DRS Dump Insight tool on-prem by installing one of the flings available here: https://flings.vmware.com/?utf8=%E2%9C%93&q=DRS+Dump+Insight&button=. Grep for “Pairwise Imbalance”.
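If you prefer PowerShell over grep, a rough equivalent (the path is purely illustrative; extract the dump files first, and their location and naming differ per deployment):

Select-String -Path .\drmdump\*.txt -Pattern "Pairwise Imbalance"    # illustrative path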

If this behavior is not appreciated, and you do not want to alter the migration threshold, you can switch back to the old behavior by turning off pairwise balancing: set the cluster advanced option “CheckPairWiseImbalance” to 0 (case-sensitive). Although this functionality was introduced in vSphere 6.5 and is active by default in all newer releases, we have backported it to vSphere 6.0 U3.
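A minimal PowerCLI sketch of setting that option (“Cluster01” is a placeholder name; verify against your environment before applying):

$cluster = Get-Cluster -Name "Cluster01"             # placeholder cluster name
$spec = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DrsConfig = New-Object VMware.Vim.ClusterDrsConfigInfo
$option = New-Object VMware.Vim.OptionValue
$option.Key = "CheckPairWiseImbalance"               # case-sensitive
$option.Value = "0"
$spec.DrsConfig.Option = @($option)
($cluster | Get-View).ReconfigureComputeResource_Task($spec, $true)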

One thing I would like to ask: if you want to disable it, what are the reasons? I expect “too many vMotions”, but I would like to understand why a vMotion, or a collection of vMotions, is considered undesirable. The main goal is to get the VMs to a place where they have access to enough resources, so why is that still a bad thing?

Filed Under: DRS, VMware

AMD EPYC Naples vs Rome and vSphere CPU Scheduler Updates

October 14, 2019 by frankdenneman

Recently AMD announced the 2nd generation of the AMD EPYC CPU architecture, the EPYC 7002 series. Most refer to the new CPU architecture by its internal codename, Rome. When AMD introduced the 1st generation EPYC (Naples), it set a new record for core count and memory capacity per socket. However, due to the multi-chip-module (MCM) architecture, it is not an apples-to-apples comparison with an Intel Xeon architecture. As each chip module contains a memory controller, each module presents a standalone NUMA domain. This impacts OS scheduling decisions and, thus, virtual machine sizing. A detailed look can be found here in English or here translated by Grigory Pryalukhin into Russian. Rome is different: the new CPU architecture is more aligned with the single-NUMA-per-socket paradigm, and this helps with obtaining workload performance consistency. Still, there are some differences between Xeons and Rome, and we made some adjustments to the CPU scheduler to deal with this new architecture. Let’s take a closer look at the difference between Naples and Rome.

7 Nanometer (7 nm) Lithography Process Forcing a New Architecture

Rome uses the new 7nm Zen 2 microarchitecture. A smaller lithography process (7nm vs. 14nm) allows CPU manufacturers to cram more CPU cores into a CPU package. However, there are more elements on a CPU chip than CPU cores alone, such as I/O and memory controllers. The scalability of I/O interfaces is limited, and therefore AMD decided to use a separate, larger 14nm die that contains the memory and I/O controllers. This die is typically referred to as the server I/O die (sIOD). In the picture below, you see a side-by-side comparison of an unlidded Naples (left) and an unlidded Rome, exposing the core chiplet dies and the sIOD.

AMD EPYC Naples vs. EPYC Rome

Naples Zeppelin vs. Rome Chiplet

The photo above provides a clear overview of the structure of the CPU package. The Naples CPU package contains four Zeppelin dies (black rectangles). A Zeppelin die provides a maximum of eight Zen cores. The cores are divided across two core complexes (CCX). A Zeppelin of a 32-core EPYC contains four cores per CCX. When Simultaneous Multi-Threading (SMT) is enabled, a CCX offers eight threads. Each CCX is connected to the Scalable Data Fabric (SDF) through the Cache-Coherent Master (CCM), which is responsible for sending traffic across CCXes. The SDF contains two Unified Memory Controllers (UMC) connecting the DRAM memory modules. Each UMC provides a memory channel to two DIMMs, providing the memory capacity of four DIMMs in total. Due to the combination of cores, cache, and memory controllers, a Zeppelin is a NUMA domain. To access a “remote” on-package memory controller, the Infinity Fabric On-Package Controller (IFOP) sets up and coordinates the data communication.

Naples Zeppelin

The Rome CPU package contains one 14nm I/O die (the center black rectangle) and eight chiplet dies (the smaller black rectangles). A Rome chiplet contains two CCXes, each containing four cores and L3 cache, but no I/O components or memory controllers. There is a small Infinity Fabric “controller” on each CCX that connects the CCX to the sIOD. As a result, every memory read beyond the local CCX L3 cache has to go to the sIOD, even for a cache line (data from memory stored in the cache) that is stored in the L3 cache of the CCX sharing the same Rome chiplet. A chiplet is thus only a part of the NUMA domain.

Rome Chiplet

NUMA Domain per Socket

As mentioned before, a NUMA domain, typically called a NUMA node, is a combination of CPU cores, cache, and memory capacity connected to a local memory controller. The Intel architecture uses a single NUMA domain per socket (NPS); AMD Naples offered four NUMA domains per socket, while Rome is back to a single one. A single NUMA domain per socket simplifies VM and application sizing while providing the best and most consistent performance.

NUMA per Socket Overview

The bandwidth to local memory differs between the CPU architectures. The Intel Xeon Scalable family provides a maximum of six channels of memory, supporting the DDR4-2933 memory type. Naples provides two memory channels to its locally connected memory, supporting the DDR4-2666 memory type. The Rome architecture provides eight memory channels to its locally connected memory, supporting the DDR4-3200 memory type. Please note that the memory controllers in the Rome architecture are located on the centralized die, which handles all types of I/O and memory traffic, while the Intel memory controllers are constructs isolated from any other traffic. Real-life application testing must be used to determine whether this architecture impacts memory bandwidth performance.

CPU Architecture     | Local Channels | Mem Type  | Peak Transfer
Intel Xeon Scalable  | 6              | DDR4-2933 | 140.8 GB/s
AMD EPYC v1 (Naples) | 2              | DDR4-2666 | 42.7 GB/s
AMD EPYC v2 (Rome)   | 8              | DDR4-3200 | 204.8 GB/s
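For reference, the peak transfer rate is the per-channel rate times the channel count, where a DDR4 channel moves 8 bytes per transfer: DDR4-3200 delivers 3,200 MT/s × 8 B = 25.6 GB/s per channel, and 8 channels × 25.6 GB/s = 204.8 GB/s.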

With a dual-socket system, there are typically two different distances with regard to memory access: accessing memory connected to the local memory controller and accessing memory connected to the memory controller located on the other socket. With Naples, there are three different distances. The IFOP is used for intra-socket communication, while the Infinity Fabric Inter-Socket (IFIS) controller takes care of routing traffic across sockets. As there are eight Zeppelins in a dual-socket system, not every Zeppelin is connected directly to every other, and thus memory access is sometimes routed through the IFIS first before hitting an IFOP to get to the appropriate Zeppelin.

Naples Memory Access                                                                          | Hops
Local memory access within a Zeppelin                                                         | 0
Intra-socket memory access between Zeppelins                                                  | 1
Inter-socket memory access between Zeppelins with a direct IFIS connection                    | 1
Inter-socket memory access between Zeppelins with an indirect connection (IFIS + remote IFOP) | 2

AMD Rome provides equidistant memory access within the die and a single-hop connection between sockets. Every memory access and every cache line load within the socket has to go through the I/O die, and every remote memory and cache access goes across the Infinity Fabric between sockets. This is somewhat similar to the Intel architecture that we have been familiar with since Nehalem, which launched in 2008. Why somewhat? Because there is a difference in cache domain design.

The Importance of Cache in CPU Scheduling

Getting memory capacity as close to the CPU as possible improves performance tremendously. That’s the reason why each CPU package contains multiple levels of cache. Each core has a small but extremely fast cache for instructions and data (L1) and a slightly larger but relatively slower (L2) cache. A third and larger cache (L3) is shared amongst the cores in the socket (Intel paradigm). Every time a core requests data to be loaded, it makes sense to retrieve it from the closest source possible, and typically this is cache. To get an idea of how fast cache is relative to local and remote memory, look at the following table:

System Event                | Actual Latency | Human-Scaled Latency
One CPU cycle (2.3 GHz)     | 0.4 ns         | 1 second
Level 1 cache access        | 1.6 ns         | 4 seconds
Level 2 cache access        | 4.8 ns         | 12 seconds
Level 3 cache access        | 15.2 ns        | 38 seconds
Remote level 3 cache access | 63 ns          | 157 seconds
Local memory access         | 75 ns          | 188 seconds (3 min)
Remote memory access        | 130 ns         | 325 seconds (5 min)
Optane PMem access          | 350 ns         | 875 seconds (15 min)
Optane SSD I/O              | 10 µs          | 7 hours
NVMe SSD I/O                | 25 µs          | 17 hours
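The human-scaled column simply multiplies each latency by 2.5 billion, the factor that turns a single 0.4 ns CPU cycle into one second.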

Back in the day, when you could still disable the cache of the CPU, someone tested the effect of cache on loading Windows 95. With cache, it took almost five minutes; without it, it took over an hour. Cache performance is crucial to getting the best performance. Because of this, the vSphere NUMA scheduler and the CPU scheduler work together to optimize workloads that communicate with each other often. As they are communicating, they typically use the same data sources. Therefore, if vSphere can run these workloads on cores that share a cache, performance can improve tremendously. The challenge is that AMD uses a different cache domain design than Intel.

Last Level Cache Domains

As depicted in the diagram above, Intel uses a 1:1:1 relationship model: one socket equals one NUMA domain and contains one last-level cache (LLC) domain. As Intel is used in more than 98% of dual-socket systems (based on internal telemetry reports), our scheduling team obviously focused most of their efforts on this model. EPYC Naples introduced a 1:4:2 model: one socket contains four NUMA domains, and each NUMA domain contains two LLC domains. Rome provides a NUMA model similar to the Xeon, with a single socket and a single NUMA domain. However, each chiplet contains two separate LLC domains. A Rome CPU package contains eight chiplets, and thus 16 different LLC domains exist within a single socket and NUMA domain.

Relational Scheduling

vSphere uses the LLC domain as a target for its relational scheduling functionality. Relational scheduling is better known as Action-Affinity, and its actions have made more than a few customers think that the NUMA scheduler was broken. As the scheduler is optimized for cache sharing, it can happen that the majority of vCPUs run on a single socket while the cores of the other socket are idling. When reviewing ESXTOP, you might see an unbalanced number of VMs running on the same NUMA Home Node (NHN). As a result, the VMs running in this NUMA domain (or, in ESX terminology, NHN) might compete for CPU resources and thus experience increased %Ready time.

Side note: in my opinion, you should test the effect of relational scheduling on the performance of the application itself; do not test this with synthetic test software. Although %Ready time is something to avoid, some applications benefit more from low-latency and highly consistent memory access than they are hurt by an increase in CPU scheduling latency.

Action-Affinity can lead to ready time on an Intel CPU architecture where more than eight cores share the same cache domain; imagine what impact it can have on AMD EPYC systems where the maximum number of cores per cache domain is four. In lower-core-count AMD EPYC systems, cores are disabled per CCX, shrinking the scheduling domain even further.

As the majority of data centers run on Intel, vSphere is optimized for a CPU topology where the NUMA and LLC domains are of consistent scope, i.e., the same size. With AMD, the scopes are different, and thus the current CPU scheduler can make “sub-optimal” decisions that impact performance. The NUMA scheduler dictates the client size, the number of vCPUs to run on a NUMA Home Node, but it is at the CPU scheduler’s discretion which vCPU runs on which physical core. As there are multiple cache domains within a NUMA client, an extraordinary number of vCPU migrations can occur between the cache domains within the NUMA domain. And that means cold cache access and a very crowded group of cores.

Therefore, the CPU team worked very hard to introduce optimizations for the AMD architecture, and these optimizations were released in ESXi 6.5 Update 3 and ESXi 6.7 Update 2.

The fix informs the CPU scheduler about the presence of multiple cache domains within the NUMA node, allowing it to schedule the vCPUs more intelligently. The fix also introduces an automatic virtual NUMA client sizer. By default, a virtual NUMA architecture is exposed to the guest OS when the vCPU count exceeds the physical core count of the physical NUMA domain and the vCPU count is no less than the numa.vcpu.min setting, which defaults to 9. A physical NUMA domain in Naples counts eight cores, and thus no virtual NUMA topology was exposed. With the patch, this is solved. What is crucial to note is that, by default, the virtual NUMA topology is determined at first boot. Therefore, existing VMs need to have their virtual NUMA topology reset to leverage this new functionality. This involves powering down the VM to remove the NUMA settings from the VMX.
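As a hedged sketch of that reset (the VM name and the setting name numa.autosize.vcpu.maxPerVirtualNode are assumptions on my part; verify which numa.autosize entries are present in your own VMX before removing anything):

Get-VM -Name "VM01" |                                # placeholder VM name; VM must be powered off
    Get-AdvancedSetting -Name "numa.autosize.vcpu.maxPerVirtualNode" |
    Remove-AdvancedSetting -Confirm:$false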

When introducing Naples- or Rome-based systems into your virtual data center, it is strongly recommended to deploy the latest update of your preferred vSphere platform version. This allows you to extract as much performance as possible from your recent investment.

Filed Under: NUMA, VMware

60 Minutes of NUMA VMworld Session Commands

August 27, 2019 by frankdenneman

Verify Distribution of Memory Modules with PowerCLI

Get-CimInstance -CimSession $Session CIM_PhysicalMemory | Select-Object BankLabel, Description, @{n='Capacity in GB';e={$_.Capacity/1GB}}
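The command above assumes an existing CIM session. A minimal sketch to build one against an ESXi host over WS-Management (the hostname and credentials are placeholders; it assumes the host’s CIM service is enabled and reachable over HTTPS):

$HostName = "esxi01.lab.local"                       # placeholder hostname
$Credential = Get-Credential root
$CIMOpt = New-CimSessionOption -UseSsl -SkipCACheck -SkipCNCheck -SkipRevocationCheck
$Session = New-CimSession -Authentication Basic -Credential $Credential -ComputerName $HostName -Port 443 -SessionOption $CIMOpt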

PowerCLI Script to Detect Node Interleaving

Get-VMHost | Select-Object @{Name="Host Name";Expression={$_.Name}}, @{Name="CPU Sockets";Expression={$_.ExtensionData.Hardware.CpuInfo.NumCpuPackages}}, @{Name="NUMA Nodes";Expression={$_.ExtensionData.Hardware.NumaInfo.NumNodes}}
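If a host reports fewer NUMA nodes than CPU sockets, node interleaving is likely enabled in the BIOS, presenting the system as a single uniform memory domain; for ESXi, it is recommended to disable it.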

Action-Affinity Monitoring

sched-stats -t numa-migration

Disable Action Affinity

numa.LocalityWeightActionAffinity = 0  
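A hedged PowerCLI sketch of applying this as a host advanced setting (the host name is a placeholder, and as a host option the key is typically capitalized as Numa.LocalityWeightActionAffinity; verify on your build):

Get-VMHost -Name "esxi01" |                          # placeholder host name
    Get-AdvancedSetting -Name "Numa.LocalityWeightActionAffinity" |
    Set-AdvancedSetting -Value 0 -Confirm:$false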

numa.PreferHT

For more information on how to enable PreferHT: KB article 2003582

Host Setting:  numa.PreferHT=1  
VM Setting:  numa.vcpu.PreferHT = TRUE 

Filed Under: NUMA

5 Things to Know About Project Pacific

August 26, 2019 by frankdenneman

During the keynote of the first day of VMworld 2019, Pat unveiled Project Pacific. In short, project Pacific transforms vSphere into a unified application platform. By deeply integrating Kubernetes into the vSphere platform, developers can deploy and operate their applications through a well-known control plane. Additionally, containers are now first-class citizens enjoying all the operations generally available to virtual machines.

Although it might seem that the acquisitions of Heptio and Pivotal kickstarted project Pacific, VMware has been working on it for nearly three years! Jared Rosoff, the initiator of the project and overall product manager, told me that over 200 engineers are involved, as it affects almost every component of the vSphere platform.

Lengthy technical articles are going to be published in the following days. With this article, I want to highlight the five key takeaways from project Pacific.

1: One Control Plane to Rule Them All

By integrating Kubernetes into the vSphere platform, we can expose the Kubernetes control plane to allow both developers and operations teams to interact with the platform. Instead of going through the hassle of installing, configuring, and maintaining Kubernetes clusters, each ESXi host acts as a Kubernetes worker node. Every cluster runs a Kubernetes control plane that is lifecycle-managed by vCenter. We call this Kubernetes cluster the supervisor cluster, and it runs natively inside the vSphere cluster. This means that Kubernetes functionality, just like DRS and HA, is just a toggle switch away.

2: Unified Platform = Simplified Operational Effort

As containers are first-class citizens, multiple teams can now interact with them. Being able to run them natively on vSphere means they are visible to all your monitoring, log analytics, and change management operations as well. This allows IT teams to move away from dual-stack environments. Many IT teams that have been investing in Kubernetes over the last few years created a full operational stack to manage, monitor, and operate Kubernetes beside the stack used for the virtualization environment. Running independent, separate stacks next to each other is a challenge by itself.

However, most modern application landscapes are not siloed in either one of these stacks. They are a mix of containers, virtual machines, and sometimes even functions. Getting the same view across multiple operational stacks is nearly impossible. Project Pacific provides a unified platform where developers and operations share the same concepts. Each team can see all the objects across the compute, storage, and network layers of the SDDC. The platform provides a universal view with common naming and organization methods while offering a unified view of the complete application landscape.

3: Namespaces Providing Developer Self-service and Simplifying Management

Historically, vSphere was designed with the administrator group in mind as the sole operator. By exposing the Kubernetes API, developers can now deploy and manage their applications directly. As mentioned earlier, modern applications are a collection of containers and VMs, and therefore the vSphere Kubernetes API has been extended to support virtual machines, allowing the developer to use the Kubernetes API to deploy and manage both containers and virtual machines.

To guide the deployment of applications by developers, project Pacific uses namespaces. Within Kubernetes, namespaces allow for resource allocation requirements and restrictions, and the grouping of objects such as containers and disks. Within project Pacific, it’s much more than that. In addition, these namespaces allow the IT ops team to apply policies as well. For example, in combination with Cloud-Native Storage (CNS), a storage policy can be attached to the namespace, providing persistent volumes with the appropriate service levels. For more info on CNS, check out Myles Gray’s session: HCI2763BU Technical Deep Dive on Cloud Native Storage for vSphere

Besides the benefits for developers, as the supervisor cluster is subdivided into namespaces, they become a unit of tenancy and isolation. In essence, they become a unit of management within vCenter, allowing IT ops to handle resource allocation, policy management, and diagnostics and troubleshooting at the namespace and workload level. As the namespace is now a native component within vCenter, it is intended to group every workload, whether VMs, containers, or guest clusters, and allow operators to manage it as a whole.

4: Guest Clusters

The supervisor cluster is meant to enrich vSphere, providing integrations with cloud-native storage and networking. However, the supervisor cluster is not an upstream conformant Kubernetes cluster. Guest clusters use the Kubernetes upstream cluster API for lifecycle management. It is an open system that’s going to work with the whole Kubernetes ecosystem.

5: vSphere Native Pods Providing Lightweight Containers with the Isolation of VMs

Just as we had almost squashed the incorrect belief that ESXi is a Linux OS, we are now stating that containers are first-class citizens. Is ESXi, after all, a Linux OS, since you need to run Linux to operate containers? No, ESXi is still not Linux; to run containers, project Pacific uses a new container runtime called CRX.

Extremely simplified, a vSphere Native Pod is a virtual machine. We took out all the unnecessary components and run a lightweight Linux kernel and a small container runtime (CRX). Drawing on our years of experience with paravirtualization, we optimized CRX in such a way that it outperforms containers running on traditional platforms. As Pat mentioned in the keynote, it is 30% faster than a traditional Linux VM and 8% faster than bare-metal Linux.

The beauty of using a VM construct is that these vSphere Native Pods are isolated at the hypervisor layer. Unlike pods that run on the same Linux host, which share the same Linux kernel and virtual hardware (CPU and memory), vSphere Native Pods have a completely separate Linux kernel and virtual hardware, and hence much stronger isolation from a security and resource consumption perspective. This simplifies security and ensures proper isolation models for multi-tenancy.

Modern IT Centers Around Flexibility

It’s all about using the right tool for the job. The current focus of the industry is on reaching cloud-native nirvana. However, cloud-native can be great for some products, while other applications benefit from a more monolithic approach. Most applications are a hybrid form of microservices mixed with stateful data collections. Project Pacific allows the customer to use the right tool for the job, all managed and operated from a single platform.

VMware Breakouts to Attend or Watch

HBI4937BU – The future of vSphere: What you need to know now by Kit Colbert. Monday, August 26, 01:00 PM – 02:00 PM | Moscone West, Level 3, Room 3022

More to follow

Where Can I Sign Up for a Beta?

We called this initiative a project as it is not tied to a particular release of vSphere. Because it’s in tech preview, we do not have a beta program running at the moment. As this project is a significant overhaul of the vSphere platform, we want to collect as much direct feedback from customers as we can. You can expect us to make a lot of noise when the beta program of Project Pacific starts.

Stay tuned!

Filed Under: Uncategorized

VMworld US 2019 – Know Before You Go Podcast

July 22, 2019 by frankdenneman

Last week I had the pleasure of connecting again with my friends and colleagues Pete Flecha, Duncan Epping, and amateur backup dancer to Pat Benatar, Mr. Ken Werneburg. During the podcast, we discussed the upcoming VMworld. As it is returning to San Francisco, it might be worthwhile to revisit your conference strategy.

Although Moscone Center has been rebuilt and expanded, I believe we are still using all three buildings: North, South, and West (located at Howard and 3rd). So take at least a jacket with you; SF summers can be treacherous.

For more tips about what to wear, what to bring, and which sessions to attend, listen to the episode below or search for it on Spotify. I hope you enjoy the show as much as I did.

Filed Under: Podcast
