
VMware Cloud on AWS on Virtually Speaking Podcast

April 9, 2019 by frankdenneman

Last week I had the pleasure of connecting again with my friends and colleagues Pete Flecha, a.k.a. PedroArrow, and eternal sunshine John Nicholson. During the podcast, we discussed the road to hybrid cloud, cloud mobility, multi-cloud operations, and whether or not apps need to be replatformed. It’s always fun hanging out with these guys, especially when talking about cool things. I hope you enjoy the show as much as I did.

Filed Under: VMware Tagged With: #VirtSpeaking, #VMWonAWS, VMware

AMD EPYC and vSphere vNUMA

February 19, 2019 by frankdenneman

AMD is gaining popularity in the server market with the EPYC CPU platform. The EPYC CPU platform provides a high core count and a large memory capacity. If you are familiar with previous AMD generations, you know AMD’s method of operation is different from Intel’s. For reference, take a look at the article I wrote in 2011 about the 12-core 6100 Opteron, code name Magny-Cours. EPYC increases the scale but builds on the previously introduced principles. Let’s review the EPYC architecture and see how it can impact your VM sizing and ESXi configuration. (Please note that this article is NOT intended as a good/bad comparison between AMD and Intel; I’m just describing the architectural differences.)

EPYC Architecture
The EPYC processor architecture is what AMD refers to as a Multi-Chip Module (MCM). EPYC is designed to provide a high core count by combining multiple silicon dies within a CPU package. A silicon die (named Zeppelin) is the piece of silicon that contains the circuitry; in simple terms, it’s the component that contains the CPU cores, memory cache, and various controllers. Regardless of the core count, an EPYC CPU package always contains four Zeppelin dies. Compare this to Intel Xeon: a Xeon CPU package is a single-chip design that consists of a single silicon die containing all components. The difference in chip design is interesting because it impacts the logical grouping of compute resources. The size of that logical group, better known as a NUMA node, impacts scheduling decisions made by the CPU scheduler of the operating system (both the hypervisor kernel and possibly the guest operating system). It might be necessary to change some of the default settings of the ESXi host to alter scheduling behavior; these settings are covered in the last part of the article. Let’s continue to explore the architecture of the EPYC CPU.

AMD EPYC – image courtesy of wccftech.com

Compute Complex
The photo above provides a clear overview of the structure of the CPU package. The CPU package houses four Zeppelin dies. In the current generation, a Zeppelin die provides a maximum of eight Zen cores. The cores are divided across two Compute Complexes (CCX). A Zeppelin of a 32-core EPYC contains 4 cores per CCX. When Simultaneous Multi-Threading (SMT) is enabled in the BIOS, a CCX offers eight threads.

Zeppelin CCX Layout of 32 Core EPYC

Each core has its own L1 caches (64 KB instruction and 32 KB data) and a private L2 cache (4 MB of L2 cache in total per Zeppelin). A Zeppelin has 16 MB of L3 cache. Interestingly enough, each CCX has its own L3 cache of 8 MB, in turn split up into four slices of 2 MB. The two CCXes within a Zeppelin die are connected to each other through an interconnect (Infinity Fabric), and adding hops to memory access is not beneficial to bandwidth and latency. Multiple tech sites have performed in-depth testing on cache performance, and to quote Anandtech.com:

“The local “inside the CCX” 8 MB L3-cache is accessed with very little latency. But once the core needs to access another L3-cache chunk – even on the same die – unloaded latency is pretty bad: it’s only slightly better than the DRAM access latency.” 

In essence, this means that you cannot think of the 64 MB of L3 cache as one single pool of cache capacity; it is better to approach it as eight pools of 8 MB. This is important to realize because if multiple workloads share the same data, the NUMA scheduler of ESXi attempts to place both workloads in the same NUMA node to optimize cache and memory performance for these workloads. It might happen that the L3 cache size is not sufficient. The option that impacts this behavior is called action affinity; more details about this setting can be found in the last part of the article.

Zeppelin Core Count
EPYC is offered in multiple SKUs. Next to the 32-core model, there are lower core-count models. Since the EPYC architecture always includes four Zeppelins, the difference in core count is created by disabling cores per CCX in a symmetrical way. For example, in a 24-core EPYC, a single Zeppelin die looks like this.

Zeppelin design of 24 Core EPYC

The table shows the core count per Zeppelin of the three largest EPYC CPUs. The total cores per Zeppelin can be used as a guideline for the vNUMA setting described later in this article.

Cores | Cores per CCX | Total Cores per Zeppelin | Zeppelin Count
32    | 4             | 8                        | 4
24    | 3             | 6                        | 4
16    | 2             | 4                        | 4

Infinity Fabric
The cores within a CCX communicate with memory (DIMMs) via an on-die memory controller through the Infinity Fabric. The Infinity Fabric is AMD’s proprietary system interconnect architecture that facilitates data and control transmission across all linked components. The Infinity Fabric consists of two communication planes: the Scalable Data Fabric (SDF) and the Scalable Control Fabric (SCF). The SCF is responsible for processing system control signals, such as thermal and power management. Although very important, we are more interested in the SDF, which is responsible for transmitting data within the system. The rest of the article zooms in on the SDF design and its impact on scheduling decisions.

Each CCX is connected to the SDF through the Cache-Coherent Master (CCM), which is responsible for sending coherent data traffic across CCXes. The SDF uses a Unified Memory Controller (UMC) to connect to DRAM memory modules. Each UMC provides a memory channel that can drive two DIMMs, giving each Zeppelin a memory capacity of four DIMMs in total.

Zeppelin CCX and SDF Architecture

How does this design impact VM sizing? A Zeppelin is a NUMA node that contains a maximum of 8 cores (16 threads) and the memory capacity of four DIMMs. As a result, a single EPYC CPU package presents four NUMA nodes to the operating system.

Server Memory Capacity and NUMA
Intel moved from a configuration of 4 channels with 3 DIMMs per channel (DPC) to a model with 6 channels, 2 DIMMs deep. This new model broke the capacity cadence. For example, using 16 GB DIMMs, you previously had either 64 GB, 128 GB, or 192 GB available per socket. With the Scalable architecture, it’s either 96 GB or 192 GB; that is, if you follow the high-performance best practice of populating all channels for maximum bandwidth availability. However, with the current DIMM pricing, a lot of customers cannot afford such a configuration.

With EPYC, every Zeppelin has two memory channels, and each memory channel can drive two DIMMs. For good performance, each Zeppelin should be equipped with at least 1 DPC. That means that a properly performing dual-socket EPYC system should be configured with at least 16 DIMMs. This configuration allows for a theoretical bandwidth of 42.6 GB/s per Zeppelin, while providing a (shallow) memory capacity of just the two DIMMs combined. If the minimum of 1 DPC is used, the NUMA node size can be too small, and overall performance suffers if the VM memory size exceeds the physical memory configuration of each Zeppelin. Servethehome published some benchmark tests about the performance difference between the different memory configurations of EPYC.
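
For reference, the bandwidth figure follows from the channel math (assuming DDR4-2666 DIMMs): 2666 MT/s x 8 bytes ≈ 21.3 GB/s per channel, x 2 channels per Zeppelin ≈ 42.6 GB/s. The inter-socket link described further below offers 8/9 of that, roughly 37.9 GB/s.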

1 EPYC CPU Package = 4 NUMA Nodes

With NUMA, it’s important to understand the boundaries of your local memory domain and your remote memory domain. Traditionally, the domains were easily demarcated by the CPU package: its core count and attached memory capacity. With EPYC, a distinction has to be made between the different remote memory access types: remote on-package memory access and remote socket memory access. The reason this distinction has to be made is the impact on performance and the consistency of application memory access. Having your VM and application span multiple NUMA nodes can introduce very inconsistent response times.

Local Memory Access 
Let’s start with the best and most consistent performance. When a core within a Zeppelin accesses local memory, the path is as follows:

Local Memory Access

The presentation “Zeppelin: An SoC for Multi-Chip Architectures” by AMD lists the latency of local memory access within the Zeppelin at 90 nanoseconds.

Remote Memory Access On Package
A core can access memory attached to a different Zeppelin within the same CPU package. This is called remote on-package memory access or “on-package die-to-die” memory access, meaning we are still using memory controllers within the same socket. In total, the EPYC CPU has eight memory channels, but only two are local to the Zeppelin. To access a “remote” on-package memory controller, the Infinity Fabric On-Package Controller (IFOP) sets up and coordinates the data communication.

Each Zeppelin has four IFOPs, but only three are needed since there are three other Zeppelins within the same CPU package.

To be more precise, the traffic traverses an additional component before hitting the IFOP. This component is called the Coherent AMD socKet Extender (CAKE). It facilitates die-to-die or socket-to-socket memory transactions, translating the request and response formats used by the SDF transport layer to and from the serialized format used by the IFOP. In other words, a few extra hops and CPU cycles are introduced when fetching data stored in DIMMs attached to other Zeppelins in the same package. AMD reports a latency of ~145 ns.

Remote Memory Access within EPYC CPU

Inter Package Remote Access
And then there is the chance that memory needs to be fetched from DIMMs attached to UMCs of a Zeppelin that is part of another EPYC CPU package within the system (a dual-socket system). Instead of routing the traffic across the IFOP, the traffic is routed across the Infinity Fabric InterSocket (IFIS) controller. Package-to-package traffic has 8/9 of the bandwidth of IFOP traffic, resulting in a theoretical bandwidth of 37.9 GB/s. The reduction in bandwidth increases the chance of experiencing inconsistent performance, and the increased path length adds latency. AMD reports a latency of ~200 ns.

Remote Access Across EPYC CPUs

Because there are two IFIS controllers per Zeppelin, not every Zeppelin within a dual-socket system is directly connected to every other Zeppelin. In the worst-case scenario, there are two hops: one hop from one package to the other, and an extra hop from that Zeppelin to the Zeppelin connected to the DIMM holding the data. Unfortunately, AMD has not shared latency data for this path.

Remote Access Inter-package, die-to-die communication

VM Sizing
The key is to keep memory access as local as possible. ESXi and most modern guest operating systems are optimized to deal with NUMA. However, as with most things in life, for optimal performance, reduce distance and reduce any form of variation. Applied to VM sizing, try to keep the vCPU count of a VM within the core count of a NUMA domain. The same applies to VM memory capacity: try to fit it within the capacity of the NUMA node. If the VM cannot fit inside a NUMA node, there is no need to stress; ESXi has the best NUMA scheduler in the business. To help ESXi optimize for the EPYC architecture, it might be necessary to adjust some advanced settings. As always, test these settings in a non-revenue-critical environment before applying them to production systems.

Virtual NUMA
Virtual NUMA (vNUMA) allows the operating system to understand the “physical” layout of the virtual machine. vNUMA presents the mapping of the VM vCPUs to the physical NUMA nodes of the ESXi host. For example, if a VM has 12 vCPUs and the physical core count within a single NUMA node is 10 cores, ESXi would present the guest OS a topology of 2 NUMA nodes, each containing 6 cores. ESXi would group 6 vCPUs into a NUMA client and schedule each client across the 10 CPU cores of a NUMA node.

When vNUMA was introduced, the highest core count of a CPU was 8 cores, so the VMware engineers introduced a vNUMA threshold of 9 (numa.vcpu.min=9), meaning that a VM needs to contain at least 9 vCPUs before a virtual NUMA topology is generated. Considering the highest core count of an EPYC system is eight cores per Zeppelin, you might want to adjust the vNUMA default threshold to resemble the physical layout of the EPYC model used.

For example, the EPYC 7401 contains 24 cores, 6 cores per Zeppelin and thus 6 cores per NUMA node. When using the default setting of numa.vcpu.min=9, an 8 vCPU VM is automatically configured like this.

Screenshot by @AartKenens

A VPD is the virtual NUMA client that is exposed to the guest OS, while a PPD is the NUMA client used by the VMkernel CPU scheduler. In this situation, the ESXi scheduler uses two physical NUMA nodes to satisfy CPU and memory requests, while the guest OS perceives the layout as a Uniform Memory Access (UMA) system. In a UMA system, the access time to a memory location is independent of which processor makes the request or which memory chip contains the data; i.e., pretty much the same latency and bandwidth throughout the system. However, as described above, this is not the case: accessing the cache of the other CCX (even within the same Zeppelin) and accessing remote on-package memory is slower than local access. By setting numa.vcpu.min=6, two VPDs are created, and thus the guest OS is made aware of the physical layout by the ESXi scheduler. The guest OS and the applications can then optimize memory operations to attain consistent performance.
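
As a minimal sketch, the threshold is a per-VM advanced setting, added via Edit Settings > VM Options > Advanced > Configuration Parameters or directly in the .vmx while the VM is powered off (the value below matches the EPYC 7401 example above):

numa.vcpu.min = "6"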

Action Affinity
When the ESXi scheduler detects multiple VMs communicating with each other, it can decide to place them together on the same NUMA node to increase intra-NUMA node communication. This behavior is called action affinity, and it can increase performance by up to 30%. However, with small NUMA nodes of at most 8 cores, it can also lead to a lot of cache thrashing and remote memory access if the configured memory of the VMs cannot fit inside a single NUMA node. If this is the case, it might be helpful to test disabling action affinity on the ESXi host. This is done by setting /Numa/LocalityWeightActionAffinity to 0 (KB 2097369).
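
One way to apply this from the ESXi shell is shown below (a sketch; note the current value first so it can be restored, and test on a non-production host):

esxcli system settings advanced list -o /Numa/LocalityWeightActionAffinity
esxcli system settings advanced set -o /Numa/LocalityWeightActionAffinity -i 0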

What if the VM Memory Config Exceeds the Memory Capacity of the Physical NUMA Node?
I wrote an article about this situation back in 2017, and it’s featured in the vSphere 6.5 Host Deep Dive book. But what happens if your VM memory configuration exceeds the physical capacity of a NUMA node? By default, the ESXi scheduler optimizes for local memory access and attempts to place as much memory as possible along with the vCPUs in the same NUMA node. Sometimes local memory access can be improved by creating multiple smaller NUMA clients.

For example, on an EPYC 7601 (32 cores), a NUMA node contains 8 cores, and this server is equipped with 256 GB using 16 x 16 GB DIMMs. A NUMA node has 4 DIMMs attached to it; thus, the NUMA node provides 8 cores and 64 GB. What happens if a VM is configured with 6 vCPUs and 96 GB? By default, the NUMA scheduler attempts to store 64 GB of VM memory inside the NUMA node, leaving 32 GB in a remote NUMA node. The VM advanced setting numa.consolidate = FALSE instructs the NUMA scheduler to distribute the VM configuration across the optimal number of NUMA nodes greater than 1. In this case, two NUMA clients are created, and 3 vCPUs are scheduled in each NUMA node.
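
As a sketch, the per-VM advanced setting for the 6 vCPU / 96 GB example looks like this (added via the Configuration Parameters dialog or the .vmx):

numa.consolidate = "FALSE"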

Now the performance and behavior of the application depend on its design. If you have a single-threaded application, this setting might not be helpful at all. However, if it’s a multi-threaded application, you might see some benefit. The only remaining step is to set numa.vcpu.min equal to the number of vCPUs per virtual NUMA client to expose the vNUMA architecture to the guest OS and the application. The following command helps you retrieve the NUMA configuration of the VM:

vmdumper -l | cut -d \/ -f 2-5 | while read path; do egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"; echo -e; done

Please bear in mind that the ESXi CPU and NUMA scheduler do not use the SRAT (System Resource Affinity Table) to determine the distance between the individual NUMA nodes. ESXi uses its own method to determine the latency between the different NUMA nodes in the system. It uses these latency numbers for initial placement and attempts to schedule the NUMA clients of a VM as close to each other as possible. However, the ESXi scheduler does not leverage this information during load-balancing operations; this is work in progress. Adding a new first-class metric to a heuristic is not a simple task, and knowing the CPU engineers, they want any new code to be thoroughly tested before it is introduced.

Increase NUMA Node Compute Sizing
For workloads that are memory-latency sensitive and have low processor utilization, you can alter the way the NUMA scheduler sizes the NUMA client of that particular VM. The VM advanced setting numa.vcpu.preferHT=TRUE allows the NUMA scheduler to count threads instead of cores when sizing the NUMA client. For example, an 8 vCPU VM that uses this advanced setting and runs on an EPYC 7401 system (6 cores, 12 threads per Zeppelin) is scheduled within a single Zeppelin.
If all workloads follow the same utilization pattern, you can alter the ESXi host setting by adding numa.PreferHT=1 to the ESXi host advanced configuration.
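
As a sketch, the two variants look like this (test before applying to production hosts):

Per-VM advanced setting: numa.vcpu.preferHT = "TRUE"
Host-wide, from the ESXi shell: esxcli system settings advanced set -o /Numa/PreferHT -i 1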

Channel-Pair Interleaving (1 NUMA node per socket)
The EPYC architecture can interleave the memory channels and thus present the cores of the four Zeppelins as a single NUMA node. This setting requires that every channel is populated with equal memory capacity. Some vendors use a different name for it; for example, Dell calls this setting “Memory Die Interleaving”. Little to no data can be found about the performance impact of this setting, but keep in mind that software settings do not change the physical layout (and thus physics). Typically, abstraction filters out the outliers and presents an average performance behavior. For NUMA benchmarking, please take a look at the article “AMD EPYC – STREAM, HPL, InfiniBand, and WRF Performance Study” on the Dell website.

Research Your Workload Requirements
ESXi can handle complex NUMA architectures as well as anything, but it’s always best to avoid complexity where possible. Determine whether your workload can fit in a minimal number of the small NUMA nodes the EPYC architecture provides. Can the workload handle inconsistent memory performance if it does exceed the NUMA node size of 8 cores? The EPYC architecture is an excellent way of adding scale to the server platform, but remember that for real-life workloads optimal performance is achieved when you take the NUMA configuration boundaries into account.

On Twitter, some asked what my thoughts are about the EPYC CPU architecture. For every tech challenge, there is a solution. Looking at the architecture, I think EPYC is an excellent solution for small and medium-sized workloads. I expect that larger monolithic apps that require consistent performance are better off looking at different architectures. (My opinion, not VMware’s!)

Filed Under: NUMA, VMware

Kubernetes, Swap and the VMware Balloon Driver

November 15, 2018 by frankdenneman

Kubernetes requires you to disable swap at the OS level. As stated in the 1.8 release changelog: The kubelet now fails if swap is enabled on a node.
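
On a typical Linux worker node, that means something like the following (the sed pattern is just one example; adjust it to your distribution’s /etc/fstab layout):

swapoff -a
sed -i '/\sswap\s/ s/^/#/' /etc/fstab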

Why disable swap?
Turning off swap doesn’t mean you are unable to create memory pressure. So why disable such a benevolent tool? Disabling swap doesn’t make any sense if you look at it from a single-workload, single-system perspective.
However, Kubernetes is a distributed system that is designed to operate at scale. When running a large number of containers on a vast fleet of machines, you want predictability and consistency, and disabling swap is the right approach. It’s better to kill a single container than to have multiple containers run on a machine at an unpredictable, probably slow, rate.

Therefore the kubelet is not designed to handle swap situations; the expectation is that workload demand fits within the memory of the host. On top of that, it is recommended to apply quality of service (QoS) settings to workloads that matter. Kubernetes provides three QoS classes for pods: Guaranteed, Burstable, and BestEffort.

Kubernetes provides the request construct to ensure the availability of resources, similar to reservations at the vSphere level. Guaranteed pods have a request configuration that is equal to the CPU and memory limit. All memory the container can consume is guaranteed, and therefore it should never need swap. With Burstable, a portion of the CPU and memory is protected by a request setting, while a BestEffort pod has no CPU or memory request and limit specified.
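
As a minimal sketch (pod name and image are illustrative), a pod lands in the Guaranteed class when requests equal limits for every container:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
EOF
# Verify the assigned QoS class
kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'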

Multi-level Resource Management
Resource management is difficult, especially when you deal with virtualized infrastructure. You have to ensure the workloads receive the resources they require. Furthermore, you want to drive the utilization of the infrastructure in an economically sound manner. Sometimes resources are limited, and not all workloads are equal, which adds another level of complexity: prioritization. Once you have solved that problem, you need to think about availability and serviceability.

The good news is that this is relatively easy with the boundaries introduced by the virtual machine configuration. You specify the size of the VM by assigning it CPU and memory resources, and placement becomes a bin-packing problem: given n items of different weights and bins each of capacity c, assign each item to a bin such that the total number of used bins is minimized.

A virtual machine is, in essence, a virtual hardware representation. You define the size of the box, with the number of CPUs and the amount of memory. This is a mandatory step in the virtual machine creation process.

With containers it’s a little bit different. In its default state, the most minimal configuration, a container inherits the attributes of the system it runs on. Depending on the workload, it is possible to consume the entire system. (A single-threaded application might detect all CPU cores available in the system, but its nature won’t allow it to run on more than a single core.) In essence, a container is a process running in the Linux OS.



For a detailed explanation, please (re)view our VMworld session, CNA1553BE.

This means that if you do not specify any limit, there is no restriction on how many resources a pod can use. Similar to vSphere admission control, you cannot overcommit reserved resources. Thus, if you commit to an IT policy that only allows the configuration of Guaranteed pods, you leverage Kubernetes admission control to avoid overcommitment of resources.

One of the questions to solve, either at a technical level or an organizational level, is how you are going to control pod configuration. At the technical level, you can solve this with Kubernetes admission control, but that is out of scope for this article.

Pod utilization is ultimately limited by the resources provided by the virtual machine, but you still want to provide a predictable and consistent service to all workloads deployed in containers. Guarantees are only as good as the underlying foundation they are built upon. So how do you make sure behavior remains consistent for pods?

Leveraging vSphere Resource Management Constructs
When running Kubernetes within virtual machines (as the majority of the global cloud providers do), you have to control the allocation of resources at multiple levels.

From the top down: the container is scheduled by Kubernetes onto a worker node. Linux is predominantly used in the Kubernetes world, so let’s use that as an example. The guest OS allocates the resources and schedules the container. Please remember that a container is just a set of processes that are isolated from the rest of the system; containers share the same operating system kernel, and thus it is the OS’s responsibility to manage and maintain resources. Lastly, the virtual machine runs on the hypervisor, and the VMkernel manages resource allocation.

VM-Level Reservation
To ensure resources for the virtual machine, two constructs can be used: VM-level reservations or resource pool reservations. With VM-level reservations, the physical resources of the ESXi host are dedicated to the virtual machine; once allocated by the guest OS, they are not shared with other virtual machines running on that ESXi host. This is the most deterministic way to allocate physical resources. However, this method impacts the virtual machine consolidation ratio. When using the vSphere HA admission control policy of Slot Policy, it can impact the VM consolidation ratio at the cluster level as well.

Resource Pool Reservation
A resource pool reservation is dynamic in nature. The resources backed by the reservation are distributed amongst the child objects of the resource pool based on usage and priority. If a Kubernetes worker node is inactive or running at a lower utilization rate, these resources are allocated to other (active) Kubernetes worker nodes within the same resource pool. Resource pools and Kubernetes are a great fit; however, the resource pool reservation must be adjusted when the Kubernetes cluster is scaled out with new workers. If the reservation is not adjusted, resources are allocated in an opportunistic manner from the cluster, possibly impacting the predictability and consistency of resource behavior.

Non-overcommitted Physical Resources
Some vSphere customers design and size their vSphere clusters to fully back virtual machine memory with physical memory. This can be quite costly, but it reduces operational overhead tremendously. The challenge is to keep the growth of the physical cluster aligned with the deployment of workloads.

Overcommitted Resources
But what if this strategy does not go as planned? What if, for some reason, resources are constrained within a host and the VMkernel applies one of its memory reclamation techniques? One of the features in the first line of defense is the balloon driver, designed to be as non-intrusive as possible to the applications running inside the VMs.

Balloon Driver
The balloon driver is installed in the guest OS as part of the VMware Tools package. When memory is overcommitted, the ESXi host reclaims memory by instructing the balloon driver to inflate, allocating pinned physical pages inside the guest OS. This causes memory pressure within the guest OS, which invokes its own native memory management techniques to reclaim memory. The balloon driver then communicates these physical pages to the VMkernel, which can reclaim the corresponding machine pages. Deflating the balloon releases the pinned pages and frees up memory for general use by the guest OS.

The interesting part is the dependency on the guest OS’s native memory management techniques. As a requirement, swap inside the guest OS needs to be disabled when you install Kubernetes; otherwise, the kubelet won’t start. The swap file is the main reason why the balloon driver is so non-intrusive: it allows the guest OS to select the memory pages it deems fit. Typically these are idle pages, and thus the working set of the application is not affected. So what happens if swap is disabled? Is the balloon driver disabled? The answer is no.

Let’s verify that swap is disabled by using the command cat /proc/swaps. Just to be sure, I used another command, swapon -s. Both outputs show no swap file.

The command vmware-toolbox-cmd stat balloon shows the balloon driver size. Just to be sure, I used another command, lsmod | grep -E 'vmmemctl|vmware_balloon', to show whether the balloon driver is loaded.
I created an overcommit scenario on the host, and soon enough the balloon driver kicked into action.

The command vmware-toolbox-cmd stat balloon confirmed the stats shown by vCenter: the balloon driver pinned 4 GB of memory within the guest.


4 GB of memory pinned, but top showed nothing in swap.

dmesg shows the kernel messages; one of them is the activity of the OOM killer. OOM stands for out of memory.
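
To see whether the OOM killer fired, the kernel log can be filtered, for example:

dmesg | grep -iE 'out of memory|oom-killer|killed process'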

According to one online description, it is the task of the OOM Killer to continue killing processes until enough memory is freed for the smooth functioning of the rest of the processes that the kernel is attempting to run.

The OOM Killer has to select the best process(es) to kill. Best here refers to that process which will free up the maximum memory upon killing and is also the least important to the system.

The primary goal is to kill the least number of processes that minimizes the damage done and at the same time maximizing the amount of memory freed.

Beauty is in the eye of the beholder, but I wouldn’t call CoreDNS the best process to kill in a Kubernetes system.

Guaranteed Scheduling For Critical Add-On Pods
In his (must-watch) presentation at KubeCon 2018, Michael Gasch provided some best practices from the field. One of them is to protect critical system pods, like DaemonSets, controllers, and master components.

In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on the control plane (master) nodes, there are a number of add-ons that run on worker nodes. Some of these add-ons are critical to a fully functional cluster, such as CoreDNS. A cluster may stop working properly if a critical add-on is evicted. Please take a look at the settings and recommendations listed in “Reserve Compute Resources for System Daemons”.
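
As an illustrative sketch (the values are arbitrary and would be added to the node’s existing kubelet arguments), the kubelet flags described in that document reserve resources for system daemons and Kubernetes components:

kubelet --system-reserved=cpu=500m,memory=1Gi --kube-reserved=cpu=500m,memory=1Gi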

Please keep in mind that the guest OS, the Linux kernel, is a shared resource. Kubernetes runs a lot of its services as containers; however, not everything is managed by Kubernetes. It is best to monitor these important Linux resources so that you don’t run out of them when using QoS classes other than Guaranteed.

Exploring the Kubernetes Landscape
For the vSphere admin who is just beginning to explore Kubernetes, we recommend keeping the resource management constructs aligned: use reservations at the vSphere level and use the Guaranteed QoS class for your pods at the Kubernetes level. Solely using the Guaranteed QoS class won’t allow for overcommitment, possibly impacting cluster utilization, but it gives you a nice safety net to learn Kubernetes without chasing weird behavior caused by processes such as the OOM killer.

Thanks to Michael Gasch for the invaluable feedback.

Filed Under: Kubernetes, VMware

Free vSphere Clustering Deep Dive Book at VMworld Europe

November 2, 2018 by frankdenneman

Last year Rubrik gave away hard copies of the vSphere Host Deep Dive book, this year they are doing it again with the vSphere 6.7 Clustering Deep Dive Book.

Come by the Rubrik Booth #P305 on Tuesday from 4:00 PM – 5:00 PM to get a signed, complimentary copy of vSphere 6.7 Clustering Deep Dive and meet the authors.

Last year we gave away a thousand copies, and they were gone within an hour. As most of you can remember, the line was insane. This year we have a similar amount, so make sure you’re on time.

Filed Under: VMware

Kubernetes at VMworld Europe

October 30, 2018 by frankdenneman

With only a few days left until VMworld Europe 2018 kicks off in Barcelona, I would like to highlight some of the many Kubernetes-focused sessions. I’ve selected a number of breakout sessions and meet-the-expert sessions based on my exposure to them at VMworld US or the quality of the speaker.
The content catalog has marked some sessions as “at capacity”, but experience has taught us that there are always a couple of no-shows. Plans change during VMworld: people register for a session they would like to attend but get pulled into an interesting conversation along the way, or sometimes suffer from information overload and want to catch a breather. In many cases, spots open up at sold-out sessions, so it’s always recommended to walk up to a sold-out session and try your luck.

Tuesday 06 November

11:00 – 12:00
[NET1285BE]
(Breakout Session)
The Future of Networking and Security with VMware NSX
This talk provides detailed insights into the architecture and capabilities of NSX-T. We’ll show how NSX-T addresses container workloads and integrates with frameworks like Kubernetes. We’ll also cover the multi-cloud networking and security capabilities that allow consistent networking policies across any cloud, public or private. Finally, we’ll look at how SD-WAN has become part of the NSX portfolio, enabling networking and security to be deployed from cloud to data center to edge.  More info.
By Bruce Davie, CTO, APJ, VMware
12:15 – 13:00
[MTE5044E]
(Expert Roundtable)
Selecting the Right Container Platform for Your Use Case with Patrick Daigle
There are a variety of containers and Kubernetes platforms out there in market today. Ever got confused and wanted some expert insight into what types of container or Kubernetes platforms are best suited to your use case?
By Patrick Daigle, Sr. Technical Marketing Architect, VMware
13:15 – 14:00
[MTE5209E]
(Expert Roundtable)
Cloud Native Applications and vSAN with Myles Gray
Learn how vSAN can autonomously provide storage for next-generation applications, including Kubernetes, PKS, or any K8s distribution, and how it moves the provisioning of storage from the admin into the hands of the developer.
By Myles Gray, Sr. Technical Marketing Architect, VMware
14:00 – 15:00
[CNA1553BE]
(Breakout Session)
Deep Dive: The Value of Running Kubernetes on vSphere
In this technical session, you will find out how VMware vSphere provides a lot of value, especially in large-scale Kubernetes deployments. With 20 years of engineering experience in kernel and distributed computing, VMware solved many challenges Kubernetes currently faces. Building on work done with enterprises running Kubernetes at scale, you will see a hypothetical customer scenario to illustrate the benefits of running Kubernetes on top of VMware vSphere and avoid the common pitfalls associated with running on bare metal. More info.
By Frank Denneman, Chief Technologist, VMware
Michael Gasch, Customer Success Architect – Application Platforms, VMware
15:30 – 16:30
[HCI1338BE]
(Breakout Session)
vSAN: An Ideal Storage Platform for Kubernetes-controlled Cloud-Native Apps
The session discusses how VMware’s HCI offering (vSphere and vSAN) is becoming a platform of choice for deploying, running and managing the data needs of Cloud-Native Applications (CNA). We will use real world examples to highlight the benefits of an HCI control plane for Kubernetes environments. More info.
By Christos Karamanolis, Fellow and CTO Storage & Availability, VMware
Cormac Hogan, Director and Chief Technologist, VMware

Wednesday 07 November

11:15 – 12:00
[MTE5057E]
(Expert Roundtable)
Next-Gen Apps on vSAN by expert Chen Wei
Are you planning to migrate your next-gen workload to a vSAN cluster? Attend this roundtable to talk to our vSAN Solutions Architect about different aspects of putting next-gen applications on vSAN, including deployment best practices, performance tuning, and the availability/performance trade-off. Bring your questions and let’s talk.
Chen Wei, Sr. Solutions Architect, VMware
12:30 – 13:30
[CNA1493BE]
(Breakout Session)
Run Docker on Existing Infrastructure with vSphere Integrated Containers
In this session, you will find out how to run Docker on vSphere with VMware vSphere Integrated Containers. See a live demo on how vSphere Integrated Containers leverage vSphere for isolation and scheduling. Find out how vSphere Integrated Containers are the ideal way to host containers on vSphere, providing a Docker-native experience for end users and a vSphere-native experience for IT. More info.
By Patrick Daigle, Sr. Technical Marketing Architect, VMware
Martijn Baecke, Cloud Evangelist, VMware
13:15 – 14:00
[MTE5116E]
(Expert Roundtable)
Function as a Service with Mark Peek
During this roundtable, we will discuss Dispatch, the VMware framework for deploying and managing serverless style applications.
By Mark Peek, Principal Engineer, VMware
15:30 – 16:30
[DC3845KE]
(Keynote)
Cloud and Developer Keynote: Public Clouds and Kubernetes at Scale
This session will cover VMware’s strategy to deliver an enterprise-grade Kubernetes platform while supporting the needs of DevOps and CloudOps teams. VMware’s Cloud and Developer keynote will outline how to deliver developers a consistent experience across native clouds while enabling operators with more flexibility and control for how they support next generation workloads. More info.
By Guido Appenzeller, CTO, VMware
Joseph Kinsella, Vice President and CTO, Products, CloudHealth, VMware
15:30 – 16:30
[CNA2755BE]
(Breakout Session)
Architecting PKS for Production: Lessons Learned from PKS Deployments
In this session, you will get a deep dive into PKS within the context of real-world customer deployment scenarios. The speakers will share the lessons learned from their successful PKS and NSX-T deployments, and show you how to architect PKS for a production environment.
Come and learn about the do’s, don’ts, and best practices. After this session, you will be better equipped to deploy and manage enterprise-grade Kubernetes in your infrastructure and use NSX-T to bridge the gap in network and security for container workloads.
By Romain Decker, Senior Solutions Architect, VMware
Dominic Foley, Senior Solutions Architect, VMware

Thursday 08 November

15:00 – 16:00
[NET1677BE]
(Breakout Session)
Kubernetes Container Networking with NSX-T Data Center Deep Dive
In this session, you will get technical details of how the NSX-T Data Center integration with Kubernetes in Pivotal Container Service (PKS), OpenShift, and upstream Kubernetes is implemented. Get a deep dive into each identified problem statement, find out how the solution was implemented with NSX-T Data Center, and see a demo of each of the solutions live on stage using PKS with NSX-T Data Center. More info.
By Dennis Breithaupt, Sr. Systems Engineer (NSX), VMware
Yasen Simeonov, Technical Product Manager, VMware

Product Preview

This year the UX team organizes design studios that allow you to provide feedback on a future product. The product you will see will blow your mind, but since it’s under NDA, I can’t tell 😉 Just show up and see for yourself!
Every day – multiple sessions available
[UX8011E]
(Design Studio)
Kubernetes on vSphere
Do you want to offer Kubernetes? Explore user interface concepts for managing containerized cloud native applications using vSphere together with other products such as PKS.
This session is part of the VMware Design Studio where you have the opportunity to participate in interactive sessions exploring technical previews and early design ideas. Because of the early nature of these designs, participants will be asked to sign a Non-Disclosure Agreement (NDA) to participate.
By Boaz Gurdin, User Experience Researcher, VMware
Pamel Shinh, Product Designer, VMware
Hope to see you there. Enjoy your VMworld!

Filed Under: Kubernetes, VMware
