
VM Service – Help Developers by Aligning Their Kubernetes Nodes to the Physical Infrastructure

May 3, 2021 by frankdenneman

The vSphere 7.0 U2a update, released on April 27th, introduces the new VM Service and VM Operator. Hidden away in what seems to be a trivial update is a collection of all-new functionalities. Myles Gray has written an extensive article about the new features. I want to highlight the administrative controls that the VM Service provides for VM classes.

VM Classes
What are VM classes, and how are they used? With the Tanzu Kubernetes Grid Service running in the Supervisor cluster, developers can deploy Kubernetes clusters without the help of the InfraOps team. Using their native tooling, they specify the size of the cluster control plane and worker nodes by using a specific VM class. The VM class configuration acts as a template used to define CPU and memory resources, and possibly reservations for these resources. These templates allow the InfraOps team to set guardrails for the consumption of cluster resources by these TKG clusters.
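For context, this is roughly what such a developer request looks like: a TanzuKubernetesCluster manifest that references a VM class by name for the control plane and the worker nodes. The sketch below uses the v1alpha1 API; the cluster name, namespace, counts, version, and storage policy are placeholders, not values from a specific environment.

# Minimal sketch of a TKG cluster request that references VM classes.
# All names, counts, the version, and the storage policy are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: demo-cluster              # hypothetical cluster name
  namespace: demo-namespace       # Supervisor namespace
spec:
  distribution:
    version: v1.18                # illustrative version shorthand
  topology:
    controlPlane:
      count: 3
      class: best-effort-small    # VM class for the control plane nodes
      storageClass: demo-storage-policy
    workers:
      count: 3
      class: best-effort-large    # VM class for the worker nodes
      storageClass: demo-storage-policy
EOF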

The supervisor cluster provides twelve predefined VM classes. They are derived from popular VM sizes used in the Kubernetes space. Two types of VM classes are provided: a best-effort class and a guaranteed class. The guaranteed class fully reserves its configured resources. That is, for such a class, the spec.policies.resources.requests settings match the spec.hardware settings. A best-effort class does not; it allows resources to be overcommitted. Let’s take a closer look at the default VM classes.

VM Class Type      | CPU Reservation      | Memory Reservation
Best-Effort-‘size’ | 0 MHz                | 0 GB
Guaranteed-‘size’  | Equal to CPU config  | Equal to memory config
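You can inspect one of the predefined classes to see this relationship for yourself. The trimmed output below is only a sketch of the relevant fields; exact quantities and formats may differ in your environment.

# Inspect a guaranteed class to see how spec.policies.resources.requests
# mirrors spec.hardware (output trimmed; values shown are illustrative).
kubectl get virtualmachineclass guaranteed-small -o yaml
# spec:
#   hardware:
#     cpus: 2
#     memory: 4Gi
#   policies:
#     resources:
#       requests:
#         cpu: ...        # matches the configured CPU resources
#         memory: 4Gi     # matches the configured memory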

There are eight default sizes available for both VM class types. All VM classes are configured with a 16GB disk.

VM Class Size | CPU Resources Configuration | Memory Resources Configuration
XSmall        | 2                           | 2 Gi
Small         | 2                           | 4 Gi
Medium        | 2                           | 8 Gi
Large         | 4                           | 16 Gi
XLarge        | 4                           | 32 Gi
2 XLarge      | 16                          | 128 Gi
4 XLarge      | 16                          | 128 Gi
8 XLarge      | 32                          | 128 Gi

Burstable Class
One of the first things you might notice if you are familiar with Kubernetes is that the default setup is missing a QoS class, the Burstable kind. Guaranteed and Best-Effort classes sit at both ends of the spectrum of reserved resources (all or nothing). The burstable class can be anywhere in the middle. That is, the VM class applies a partial reservation for memory and/or CPU. Typically, the burstable class is portrayed as a lower-cost option for workloads that do not have sustained high resource usage. Still, I think the class can play an essential role in no-chargeback cloud deployments.

To add burstable classes to the Supervisor Cluster, go to the Workload Management view, select the Services tab, and click on the manage option of the VM Service. Click on the “Create VM Class” option and enter the appropriate settings. In the example below, I entered 60% reservations for both CPU and memory resources, but you can set independent values for those resources. Interestingly enough, no disk size configuration is possible.

Although the VM class is created, you still have to add it to a namespace to make it available for self-service deployments.

Click on “Add VM Class” in the VM Service tile. I sorted the view by clicking on the vCPU column to find the different “small” VM classes and selected the three available classes.

After selecting the appropriate classes, click OK. The Namespace Summary overview shows that the namespace offers three VM classes.

The developer can view the VM classes assigned to the namespace by using the following command:

kubectl get virtualmachineclassbindings.vmoperator.vmware.com -n namespacename

I logged into the API server of the supervisor cluster, changed the context to the namespace “onlinebankapp”, and executed the command:

kubectl get virtualmachineclassbindings.vmoperator.vmware.com -n onlinebankapp

If you had used the command “kubectl get virtualmachineclass -n onlinebankapp”, you would have been presented with the list of VirtualMachineClasses available within the cluster.

Help Developers by Aligning Their Kubernetes Nodes to the Physical Infrastructure

With the new VM service and the customizable VM classes, you can help developers align their nodes to the infrastructure. Infrastructure details are not always visible at the Kubernetes layer, and maybe not all developers are keen to learn about the intricacies of your environment. The VM service allows you to publish only the VM classes you see fit for that particular application project. One of the reasons could be the avoidance of monster-VM deployments. Before this update, developers could deploy a six-worker-node Kubernetes cluster using the guaranteed 8XLarge class (each worker node equipped with 32 vCPUs and 128Gi, all reserved), provided the host configuration is sufficient. But restriction is only one angle to this situation. Long-lived relationships are typically symbiotic in nature, and power plays typically don’t help build the relationship between developers and the InfraOps team. What would be better is to align the VM classes with the NUMA configuration of the ESXi hosts within the cluster.

NUMA Alignment
I’ve published many articles on NUMA, but here is a short overview of the various NUMA configurations of VMs. When a virtual machine (VM) powers on, the NUMA scheduler creates one or more NUMA clients based on the VM CPU count and the physical NUMA topology of the ESXi host. For example, a VM with ten vCPUs powered on an ESXi host with ten cores per NUMA node (CPN2) is configured with a single NUMA client to maximize resource locality. This configuration is a narrow-VM configuration. Because all vCPUs have access to the same localized memory pool, this can be considered a Unified Memory Architecture (UMA).

Take the example of a VM with twelve vCPUs powered on the same host. The NUMA scheduler assigns two NUMA clients to this VM. The NUMA scheduler places both NUMA clients on different NUMA nodes, and each NUMA client contains six vCPUs to distribute the workload equally. This configuration is a wide-VM configuration. If simultaneous multithreading (SMT) is enabled, a VM can have as many vCPUs as there are logical CPUs within the system. The NUMA scheduler distributes the vCPUs across the available NUMA nodes and trusts the CPU scheduler to allocate the required resources. A 24-vCPU VM deployed on a 10 CPN2 host would be configured with two NUMA clients, each containing 12 vCPUs. This configuration is a high-density wide VM.
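To make the sizing rules above concrete, here is a simplified back-of-the-envelope sketch for a two-socket host with ten cores per NUMA node and SMT enabled. This is a rough illustration of the placement logic described above, not the actual ESXi scheduler implementation.

# Simplified illustration of NUMA client sizing (not the real scheduler code).
# Assumes a 2-socket host, 10 cores per NUMA node (CPN2), SMT enabled
# (20 logical CPUs per NUMA node, 20 physical cores in the system).
numa_clients() {
  vcpus=$1; cores_per_node=10; logical_per_node=20
  if [ "$vcpus" -le "$cores_per_node" ]; then
    echo "$vcpus vCPUs: narrow VM, 1 NUMA client (UMA)"
  elif [ "$vcpus" -le $(( 2 * cores_per_node )) ]; then
    clients=$(( (vcpus + cores_per_node - 1) / cores_per_node ))
    echo "$vcpus vCPUs: wide VM, $clients NUMA clients of $(( vcpus / clients )) vCPUs"
  else
    clients=$(( (vcpus + logical_per_node - 1) / logical_per_node ))
    echo "$vcpus vCPUs: high-density wide VM, $clients NUMA clients of $(( vcpus / clients )) vCPUs"
  fi
}
numa_clients 10   # 1 NUMA client
numa_clients 12   # 2 NUMA clients of 6 vCPUs
numa_clients 24   # 2 NUMA clients of 12 vCPUs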

A great use of the VM service is to create a new set of VM classes aligned with the various NUMA configurations. Using the dual ten-core system as an example, I would create the following VM classes and the associated CPU and memory resource reservations:

VM Class    | CPU | Memory | Best Effort | Burstable | Burstable Mem Optimized | Guaranteed
UMA-Small   | 2   | 16 GB  | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
UMA-Medium  | 4   | 32 GB  | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
UMA-Large   | 6   | 48 GB  | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
UMA-XLarge  | 8   | 64 GB  | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
NUMA-Small  | 12  | 96 GB  | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
NUMA-Medium | 14  | 128 GB | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
NUMA-Large  | 16  | 160 GB | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%
NUMA-XLarge | 18  | 196 GB | 0% / 0%     | 50% / 50% | 50% / 75%               | 100% / 100%

(Each class column lists the CPU reservation / memory reservation as a percentage of the configured resources.)

The advantage of curating VM classes is that you can align the Kubernetes nodes with a physical NUMA node’s boundaries at both the CPU level and the memory level. In the table above, I created four classes that remain within a NUMA node’s boundaries and allow the system to breathe. Instead of maxing out the vCPU count to what’s possible, I allowed for some headroom, avoiding a noisy neighbor within a single NUMA node and system-wide. Similarly, the UMA-sized (narrow-VM) classes have a memory configuration that does not exceed the physical NUMA boundary of 128GB, increasing the chance that the ESXi system can allocate memory from the local address range. The developer can now query the available VM classes and select the appropriate VM class with his or her knowledge about the application’s resource access patterns. Are you deploying a low-latency memory application with a moderate CPU footprint? Maybe a UMA-Medium or UMA-Large VM class helps to get the best performance. The custom VM class can transition the selection process from just a numbers game (how many vCPUs do I want?) to a more functional requirement exploration (how does it behave?). Of course, these are just examples, and these are not official VMware endorsements.

In addition, I created a new class, “Burstable Mem Optimized”, a class that reserves 25% more memory capacity than its sibling VM class “Burstable”. This could be useful for memory-bound applications that require the majority of memory to be reserved to provide consistent performance but do not require all of it. The beauty of custom VM classes is that you can design them to fit your environment and your workload. With your skillset and knowledge about the infrastructure, you can help the developer become more successful.

Filed Under: Kubernetes, NUMA

Kubernetes, Swap and the VMware Balloon Driver

November 15, 2018 by frankdenneman

Kubernetes requires swap to be disabled at the OS level. As stated in the 1.8 release changelog: the kubelet now fails if swap is enabled on a node.

Why disable swap?
Turning off swap doesn’t mean you are unable to create memory pressure. So why disable such a benevolent tool? Disabling swap doesn’t make any sense if you look at it from a single-workload, single-system perspective.
However, Kubernetes is a distributed system that is designed to operate at scale. When running a large number of containers on a vast fleet of machines, you want predictability and consistency, and disabling swap is the right approach. It’s better to kill a single container than to have multiple containers run on a machine at an unpredictable, and probably slow, rate.

Therefore, the kubelet is not designed to handle swap situations. It is expected that workload demand fits within the memory of the host. On top of that, it is recommended to apply quality of service (QoS) settings to workloads that matter. Kubernetes provides three QoS classes for pods: Guaranteed, Burstable, and BestEffort.

Kubernetes provides the request construct to ensure the availability of resources, similar to reservations at the vSphere level. Guaranteed pods have a request configuration that is equal to the CPU and memory limit. All memory the container can consume is guaranteed, and therefore it should never need swap. With Burstable, a portion of the CPU and memory is protected by a request setting, while a BestEffort pod has no CPU and memory request or limit specified at all.
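As a minimal sketch of how these classes are derived: the pod below lands in the Guaranteed class because its requests equal its limits. The pod name, image, and quantities are illustrative, not taken from a specific deployment.

# A pod whose requests equal its limits is classified as Guaranteed.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo              # illustrative name
spec:
  containers:
  - name: app
    image: nginx              # illustrative image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"           # requests == limits -> Guaranteed
        memory: "256Mi"
EOF
# Requests lower than limits would make the pod Burstable; omitting
# requests and limits entirely would make it BestEffort.
kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'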

Multi-level Resource Management
Resource management is difficult, especially when you deal with virtualized infrastructure. You have to ensure the workloads receive the resources they require. Furthermore, you want to drive the utilization of the infrastructure in an economically sound manner. Sometimes resources are limited, and not all workloads are equal, which adds another level of complexity: prioritization. Once you have solved that problem, you need to think about availability and serviceability.

Now the good news is that this is relatively easy with the boundaries introduced by the virtual machine configuration. That is, you specify the size of the VM by assigning it CPU and memory resources, and this becomes a bin packing problem: given n items of different weights and bins each of capacity c, assign each item to a bin such that the number of total used bins is minimized.

A virtual machine is, in essence, a virtual hardware representation. You define the size of the box, with the number of CPUs and the amount of memory. This is a mandatory step in the virtual machine creation process.

With containers, it’s a little bit different. In its default state, the most minimal configuration, a container inherits the attributes of the system it runs on. It is possible to consume the entire system, depending on the workload. (A single-threaded application might detect all CPU cores available in the system, but by its nature it won’t run on more than a single core.) In essence, a container is a process running in the Linux OS.



For a detailed explanation, please (re)view our VMworld session, CNA1553BE.

This means that if you do not specify any limit, there is no restriction on how many resources such a pod can use. Similar to vSphere admission control, you cannot overcommit reserved resources. Thus, if you commit to an IT policy that only allows the configuration of Guaranteed pods, you leverage Kubernetes admission control to avoid overcommitment of resources.

One of the questions to solve, either at a technical or an organizational level, is how you are going to control pod configuration. From a technical level, you can solve this by using Kubernetes admission control, but that is out of scope for this article.

Pod utilization is ultimately limited by the resources provided by the virtual machine, but you still want to provide a predictable and consistent service to all workloads deployed in containers. Guarantees are only as good as the underlying foundation they are built upon. So how do you make sure behavior remains consistent for pods?

Leveraging vSphere Resource Management Constructs
When running Kubernetes within virtual machines (like the majority of the global cloud providers), you have to control the allocation of resources on multiple levels.

From the top down, the container is scheduled by Kubernetes on a worker node. Predominantly, Linux is used in the Kubernetes world, so let’s use that as an example. The guest OS allocates the resources and schedules the container. Please remember that a container is just a set of processes that are isolated from the rest of the system. Containers share the same operating system kernel, and thus it’s the OS’s responsibility to manage and maintain resources. Lastly, the virtual machine runs on the hypervisor, and the VMkernel manages resource allocation.

VM-Level Reservation
To guarantee resources to the virtual machine, two constructs can be used: VM-level reservations or resource pool reservations. With a VM-level reservation, physical (ESXi host) resources are dedicated to the virtual machine; once allocated by the guest OS, they are not shared with other virtual machines running on that ESXi host. This is the most deterministic way to allocate physical resources. However, this method impacts the virtual machine consolidation ratio. When using the Slot Policy vSphere HA admission control policy, it can impact the VM consolidation ratio at the cluster level as well.

Resource Pool Reservation
A resource pool reservation is dynamic in nature. The resources backed by a reservation are distributed among the child objects of the resource pool by usage and priority. If a Kubernetes worker node is inactive or running at a lower utilization rate, these resources are allocated to other (active) Kubernetes worker nodes within the same resource pool. Resource pools and Kubernetes are a great fit together; however, the resource pool reservation must be adjusted when the Kubernetes cluster is scaled out with new workers. If the resource pool reservation is not adjusted, resources are allocated in an opportunistic manner from the cluster, possibly impacting the predictability and consistency of resource behavior.

Non-overcommitted Physical Resources
Some vSphere customers design and size their vSphere clusters to fully back virtual machine memory with physical memory. This can be quite costly, but it does reduce operational overhead tremendously. The challenge is to keep the growth of the physical cluster aligned with the deployment of workloads.

Overcommitted Resources
But what if this strategy does not go as planned? What if, for some reason, resources are constrained within a host and the VMkernel applies one of its resource reclamation techniques? One of the features in the first line of defense is the balloon driver, designed to be as non-intrusive as possible to the applications running inside the VMs.

Balloon Driver
The balloon driver is installed in the guest OS as part of the VMware Tools package. When memory is overcommitted, the ESXi host reclaims memory by instructing the balloon driver to inflate, allocating pinned physical pages inside the guest OS. This causes memory pressure within the guest OS, which invokes its own native memory management techniques to reclaim memory. The balloon driver then communicates these physical pages to the VMkernel, which can then reclaim the corresponding machine pages. Deflating the balloon releases the pinned pages and frees up memory for general use by the guest OS.

The interesting part is the dependency on the guest OS’s native memory management techniques. As a requirement, swap inside the guest OS needs to be disabled when you install Kubernetes; otherwise, the kubelet won’t start. The swap file is the main reason why the balloon driver is so non-intrusive: it allows the guest OS to select the memory pages it deems fit. Typically these are idle pages, and thus the working set of the application is not affected. What happens if the swap file is disabled? Is the balloon driver disabled? The answer is no.

Let’s verify that swap is disabled by using the command cat /proc/swaps. Just to be sure, I used another command, swapon -s. Both outputs show no swap file.

The command vmware-toolbox-cmd stat balloon shows the balloon driver size. Just to be sure, I used another command, lsmod | grep -E 'vmmemctl|vmware_balloon', to show whether the balloon driver is loaded.
I created an overcommit scenario on the host, and soon enough the balloon driver kicked into action.

The command vmware-toolbox-cmd stat balloon confirmed the stats shown by vCenter. The balloon driver pinned 4GB of memory within the guest.


4GB of memory pinned, but top showed nothing in swap.

dmesg shows the kernel messages; one of them is the activity of the OOM Killer. OOM stands for out of memory.

According to its online description, it is the task of the OOM Killer to continue killing processes until enough memory is freed for the smooth functioning of the rest of the processes that the kernel is attempting to run.

The OOM Killer has to select the best process(es) to kill. Best here refers to that process which will free up the maximum memory upon killing and is also the least important to the system.

The primary goal is to kill the smallest number of processes, minimizing the damage done while maximizing the amount of memory freed.
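If you want to verify whether the OOM Killer has been active inside a worker node, a couple of standard Linux commands help; the coredns process name below is just an example of a process you might inspect.

# Look for OOM Killer activity in the kernel log
dmesg -T | grep -i -E 'out of memory|oom-killer'
# Inspect the current "badness" score of a process, for example CoreDNS
cat /proc/$(pidof coredns)/oom_score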

Beauty is in the eye of the beholder, but I wouldn’t call CoreDNS the best process to kill in a Kubernetes system.

Guaranteed Scheduling For Critical Add-On Pods
In the (must-watch) presentation at KubeCon 2018, Michael Gasch provided some best practices from the field. One of them is to protect critical system pods, like DaemonSets, controllers, and master components.

In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on the control plane (master) nodes, there are a number of add-ons that run on worker nodes. Some of these add-ons are critical to a fully functional cluster, such as CoreDNS. A cluster may stop working properly if a critical add-on is evicted. Please take a look at the settings and recommendations listed in “Reserve Compute Resources for System Daemons”.
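As a hedged example of what those recommendations look like in practice, the kubelet can set aside capacity for system and Kubernetes daemons and define an eviction threshold. The values below are purely illustrative, not sizing advice.

# Illustrative kubelet flags that reserve capacity for non-pod processes
# and set a hard eviction threshold; tune the values for your own nodes.
kubelet \
  --system-reserved=cpu=500m,memory=1Gi \
  --kube-reserved=cpu=500m,memory=1Gi \
  --eviction-hard=memory.available<500Mi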

Please keep in mind that the guest OS, the Linux kernel, is a shared resource. Kubernetes runs a lot of its services as containers; however, not everything is managed by Kubernetes. For these services, it is best to monitor the important Linux resources so that you don’t run out of them if you are using QoS classes other than Guaranteed.

Exploring the Kubernetes Landscape
For the vSphere admin who is just beginning to explore Kubernetes, we recommend keeping the resource management constructs aligned: use reservations at the vSphere level and use the Guaranteed QoS class for your pods at the Kubernetes level. Solely using the Guaranteed QoS class won’t allow for overcommitment, possibly impacting cluster utilization, but it gives you a nice safety net to learn Kubernetes without chasing weird behavior caused by processes such as the OOM Killer.

Thanks to Michael Gasch for the invaluable feedback.

Filed Under: Kubernetes, VMware

Kubernetes at VMworld Europe

October 30, 2018 by frankdenneman

With only a few days left until VMworld Europe 2018 kicks off in Barcelona, I would like to highlight some of the many Kubernetes-focused sessions. I’ve selected a number of breakout sessions and meet-the-expert sessions based on my exposure to them at VMworld US or the quality of the speaker.
The content catalog has marked some sessions as “at capacity”, but experience has taught us that there are always a couple of no-shows. Plans change during VMworld. People register for a session they would like to attend but get pulled into an interesting conversation along the way. Or sometimes you suffer from information overload and want to catch a breather. In many cases, spots open up at sold-out sessions, so it’s always recommended to walk up to a sold-out session and try your luck.

Tuesday 06 November

11:00 – 12:00
[NET1285BE]
(Breakout Session)
The Future of Networking and Security with VMware NSX
This talk provides detailed insights into the architecture and capabilities of NSX-T. We’ll show how NSX-T addresses container workloads and integrates with frameworks like Kubernetes. We’ll also cover the multi-cloud networking and security capabilities that allow consistent networking policies across any cloud, public or private. Finally, we’ll look at how SD-WAN has become part of the NSX portfolio, enabling networking and security to be deployed from cloud to data center to edge.  More info.
By Bruce Davie, CTO, APJ, VMware
12:15 – 13:00
[MTE5044E]
(Expert Roundtable)
Selecting the Right Container Platform for Your Use Case with Patrick Daigle
There are a variety of containers and Kubernetes platforms out there in market today. Ever got confused and wanted some expert insight into what types of container or Kubernetes platforms are best suited to your use case?
By Patrick Daigle, Sr. Technical Marketing Architect, VMware
13:15 – 14:00
[MTE5209E]
(Expert Roundtable)
Cloud Native Applications and vSAN with Myles Gray
Learn how vSAN can provide storage for next generation applications autonomously, including Kubernetes, PKS or any K8S distribution and moves the provisioning of storage from the admin into the hands of the developer.
By Myles Gray, Sr. Technical Marketing Architect, VMware
14:00 – 15:00
[CNA1553BE]
(Breakout Session)
Deep Dive: The Value of Running Kubernetes on vSphere
In this technical session, you will find out how VMware vSphere provides a lot of value, especially in large-scale Kubernetes deployments. With 20 years of engineering experience in kernel and distributed computing, VMware solved many challenges Kubernetes currently faces. Building on work done with enterprises running Kubernetes at scale, you will see a hypothetical customer scenario to illustrate the benefits of running Kubernetes on top of VMware vSphere and avoid the common pitfalls associated with running on bare metal. More info.
By Frank Denneman, Chief Technologist, VMware
Michael Gasch, Customer Success Architect – Application Platforms, VMware
15:30 – 16:30
[HCI1338BE]
(Breakout Session)
vSAN: An Ideal Storage Platform for Kubernetes-controlled Cloud-Native Apps
The session discusses how VMware’s HCI offering (vSphere and vSAN) is becoming a platform of choice for deploying, running and managing the data needs of Cloud-Native Applications (CNA). We will use real world examples to highlight the benefits of an HCI control plane for Kubernetes environments. More info.
By Christos Karamanolis, Fellow and CTO Storage & Availability, VMware
Cormac Hogan, Director and Chief Technologist, VMware

Wednesday 07 November

11:15 – 12:00
[MTE5057E]
(Expert Roundtable)
Next-Gen Apps on vSAN by expert Chen Wei
Are you planning to migrate your next-gen workload to the vSAN cluster? Attend this roundtable to talk to our vSAN Solutions Architect about different aspects regarding putting Next-gen applications on vSAN. Those aspects include the next-gen application deployment best practices, performance tuning, availability/performance trade-off. Bring the questions and let’s talk.
Chen Wei, Sr. Solutions Architect, VMware
12:30 – 13:30
[CNA1493BE]
(Breakout Session)
Run Docker on Existing Infrastructure with vSphere Integrated Containers
In this session, you will find out how to run Docker on vSphere with VMware vSphere Integrated Containers. See a live demo on how vSphere Integrated Containers leverage vSphere for isolation and scheduling. Find out how vSphere Integrated Containers are the ideal way to host containers on vSphere, providing a Docker-native experience for end users and a vSphere-native experience for IT. More info.
By Patrick Daigle, Sr. Technical Marketing Architect, VMware
Martijn Baecke, Cloud Evangelist, VMware
13:15 – 14:00
[MTE5116E]
(Expert Roundtable)
Function as a Service with Mark Peek
During this roundtable, we will discuss Dispatch, the VMware framework for deploying and managing serverless style applications.
By Mark Peek, Principal Engineer, VMware
15:30 – 16:30
[DC3845KE]
(Keynote)
Cloud and Developer Keynote: Public Clouds and Kubernetes at Scale
This session will cover VMware’s strategy to deliver an enterprise-grade Kubernetes platform while supporting the needs of DevOps and CloudOps teams. VMware’s Cloud and Developer keynote will outline how to deliver developers a consistent experience across native clouds while enabling operators with more flexibility and control for how they support next generation workloads. More info.
By Guido Appenzeller, CTO, VMware
Joseph Kinsella, Vice President and CTO, Products, CloudHealth, VMware
15:30 – 16:30
[CNA2755BE]
(Breakout Session)
Architecting PKS for Production: Lessons Learned from PKS Deployments
In this session, you will get a deep dive into PKS within the context of real-world customer deployment scenarios. The speakers will share the lessons learned from their successful PKS and NSX-T deployments, and show you how to architect PKS for a production environment.
Come and learn about the do’s, don’ts, and best practices. After this session, you will be better equipped to deploy and manage enterprise-grade Kubernetes in your infrastructure and use NSX-T to bridge the gap in network and security for container workloads.
By Romain Decker, Senior Solutions Architect, VMware
Dominic Foley, Senior Solutions Architect, VMware

Thursday 08 November

15:00 – 16:00
[NET1677BE]
(Breakout Session)
Kubernetes Container Networking with NSX-T Data Center Deep Dive
In this session, you will get technical details of how the NSX-T Data Center integration with Kubernetes in Pivotal Container Service (PKS), OpenShift, and upstream Kubernetes is implemented. Get a deep dive into each identified problem statement, find out how the solution was implemented with NSX-T Data Center, and see a demo of each of the solutions live on stage using PKS with NSX-T Data Center. More info.
By Dennis Breithaupt, Sr. Systems Engineer (NSX), VMware
Yasen Simeonov, Technical Product Manager, VMware

Product Preview

This year the UX team is organizing design studios that allow you to provide feedback on a future product. The product you will see will blow your mind, but since it’s under NDA, I can’t tell 😉 Just show up and see for yourself!
Every day – multiple sessions available
[UX8011E]
(Design Studio)
Kubernetes on vSphere
Do you want to offer Kubernetes? Explore user interface concepts for managing containerized cloud native applications using vSphere together with other products such as PKS.
This session is part of the VMware Design Studio where you have the opportunity to participate in interactive sessions exploring technical previews and early design ideas. Because of the early nature of these designs, participants will be asked to sign a Non-Disclosure Agreement (NDA) to participate.
By Boaz Gurdin, User Experience Researcher, VMware
Pamel Shinh, Product Designer, VMware
Hope to see you there. Enjoy your VMworld!

Filed Under: Kubernetes, VMware
