
Kubernetes, Swap and the VMware Balloon Driver

November 15, 2018 by frankdenneman

Kubernetes requires swap to be disabled at the OS level. As stated in the 1.8 release changelog: "The kubelet now fails if swap is enabled on a node."
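If you're setting up a node, disabling swap is a one-time step. A minimal sketch for a typical Linux distribution (the sed pattern is one common way to keep swap off after a reboot; adjust for your setup):

    # Turn off all active swap devices immediately
    sudo swapoff -a
    # Keep swap off across reboots by commenting out swap entries in /etc/fstab
    sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab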

Why disable swap?
Turning off swap doesn't mean you are unable to create memory pressure. So why disable such a benevolent tool? Disabling swap doesn't make any sense if you look at it from a single-workload, single-system perspective.
However, Kubernetes is a distributed system that is designed to operate at scale. When running a large number of containers on a vast fleet of machines, you want predictability and consistency, and disabling swap is the right approach. It's better to kill a single container than to have multiple containers run on a machine at an unpredictable, and probably slow, rate.

Therefore, the kubelet is not designed to handle swap situations; workload demand is expected to fit within the memory of the host. On top of that, it is recommended to apply quality of service (QoS) settings to the workloads that matter. Kubernetes provides three QoS classes for pods: Guaranteed, Burstable, and BestEffort.

Kubernetes provides the request construct to ensure the availability of resources, similar to reservations at the vSphere level. Guaranteed pods have a request configuration that is equal to the CPU and memory limit. All memory the container can consume is guaranteed, and therefore it should never need swap. With Burstable, a portion of the CPU and memory is protected by a request setting, while a BestEffort pod has no CPU or memory request and limit specified at all.
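To make this concrete, here is a minimal, hypothetical pod manifest in which the requests equal the limits, placing the pod in the Guaranteed class:

    # guaranteed-demo.yaml (hypothetical): requests equal limits,
    # so Kubernetes assigns the Guaranteed QoS class
    apiVersion: v1
    kind: Pod
    metadata:
      name: guaranteed-demo
    spec:
      containers:
      - name: app
        image: nginx:1.15
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 256Mi

Apply it with kubectl apply -f guaranteed-demo.yaml, then confirm with kubectl get pod guaranteed-demo -o jsonpath='{.status.qosClass}', which should return Guaranteed.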

Multi-level Resource Management
Resource management is difficult, especially when you deal with virtualized infrastructure. You have to ensure the workloads receive the resources they require. Furthermore, you want to drive the utilization of the infrastructure in an economically sound manner. Sometimes resources are limited, and not all workloads are equal, which adds another level of complexity: prioritization. Once you have solved that problem, you need to think about availability and serviceability.

Now the good news is that this is relatively easy with the boundaries introduced by the virtual machine configuration. That is, you specify the size of the VM by assigning it CPU and memory resources, and placement becomes a bin packing problem: given n items of different weights and bins each of capacity c, assign each item to a bin such that the number of bins used is minimized.

A virtual machine is, in essence, a virtual hardware representation. You define the size of the box by specifying the number of CPUs and the amount of memory. This is a mandatory step in the virtual machine creation process.

With containers, it's a little bit different. In its default state, the most minimal configuration, a container inherits the attributes of the system it runs on. Depending on the workload, it is possible to consume the entire system. (A single-threaded application might detect all CPU cores available in the system, but its nature won't allow it to run on more than a single core.) In essence, a container is a process running in the Linux OS.



For a detailed explanation, please (re)view our VMworld session, CNA1553BE.

This means that if you do not specify any limit, there is no restriction on how many resources such a pod can use. Similar to vSphere admission control, you cannot overcommit reserved resources. Thus, if you commit to an IT policy that only allows the configuration of Guaranteed pods, you leverage Kubernetes admission control to avoid overcommitment of resources.

One of the questions to solve, either at a technical level or an organizational level, is how you are going to control pod configuration. At the technical level, you can solve this with Kubernetes admission control, but a full treatment of that is out of scope for this article.
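Still, a small sketch of that technical route: a LimitRange (the namespace and values below are hypothetical) gives every container a default request and limit, so no pod lands in BestEffort by accident:

    # limitrange.yaml (hypothetical): defaults applied to any container
    # that specifies no request or limit of its own
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-container-limits
      namespace: production
    spec:
      limits:
      - type: Container
        defaultRequest:
          cpu: 500m
          memory: 512Mi
        default:
          cpu: 500m
          memory: 512Mi

Because the default request equals the default limit, an unconfigured pod in this namespace even comes out as Guaranteed.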

Pod utilization is ultimately limited by the resources provided by the virtual machine, but you still want to provide a predictable and consistent service to all workloads deployed in containers. Guarantees are only as good as the underlying foundation they are built upon. So how do you make sure behavior remains consistent for pods?

Leveraging vSphere Resource Management Constructs
When running Kubernetes within virtual machines (as the majority of the global cloud providers do), you have to control the allocation of resources on multiple levels.

From the top down: the container is scheduled by Kubernetes onto a worker node. Linux is predominantly used in the Kubernetes world, so let's use that as an example. The guest OS allocates the resources and schedules the container. Please remember that a container is just a set of processes isolated from the rest of the system. Containers share the same operating system kernel, and thus it's the OS's responsibility to manage and maintain resources. Lastly, the virtual machine runs on the hypervisor, where the VMkernel manages resource allocation.

VM-Level Reservation
To ensure resources for the virtual machine, two constructs can be used: VM-level reservations or resource pool reservations. With a VM-level reservation, the physical resources of the ESXi host are dedicated to the virtual machine; once allocated by the guest OS, they are not shared with other virtual machines running on that ESXi host. This is the most deterministic way to allocate physical resources. However, this method impacts the virtual machine consolidation ratio, and when using the vSphere HA admission control Slot Policy, it can impact the VM consolidation ratio at the cluster level as well.

Resource Pool Reservation
A resource pool reservation is dynamic in nature. The resources backed by a reservation are distributed amongst the child objects of the resource pool by usage and priority. If a Kubernetes worker node is inactive or running at a lower utilization rate, these resources are allocated to other (active) Kubernetes worker nodes within the same resource pool. Resource pools and Kubernetes are a great fit; however, the resource pool reservation must be adjusted when the Kubernetes cluster is scaled out with new workers. If the reservation is not adjusted, resources are allocated opportunistically from the cluster, possibly impacting the predictability and consistency of resource behavior.

Non-overcommitted Physical Resources
Some vSphere customers design and size their vSphere clusters to fully back virtual machine memory with physical memory. This can be quite costly, but it does reduce operational overhead tremendously. The challenge is to keep the growth of the physical cluster aligned with the deployment of workloads.

Overcommitted Resources
But what if this strategy does not go as planned? What if, for some reason, resources are constrained within a host and the VMkernel applies one of its resource reclamation techniques? One of the features in the first line of defense is the balloon driver, designed to be as non-intrusive as possible to the applications running inside the VMs.

Balloon Driver
The balloon driver is installed within the guest VM as part of the VMware Tools package. When memory is overcommitted, the ESXi host reclaims memory by instructing the balloon driver to inflate, allocating pinned physical pages inside the guest OS. This causes memory pressure within the guest OS, which invokes its own native memory management techniques to reclaim memory. The balloon driver then communicates these physical pages to the VMkernel, which can reclaim the corresponding machine pages. Deflating the balloon releases the pinned pages and frees up memory for general use by the guest OS.

The interesting part is the dependency on the guest OS's native memory management techniques. As a requirement, swap inside the guest OS must be disabled when you install Kubernetes; otherwise, the kubelet won't start. The swap file is the main reason why the balloon driver is so non-intrusive: it allows the guest OS to select the memory pages it deems fit. Typically these are idle pages, and thus the working set of the application is not affected. But what happens if the swap file is disabled? Is the balloon driver disabled too? The answer is no.

Let's verify that swap is disabled by using the command cat /proc/swaps. Just to be sure, I used another command, swapon -s. Both outputs show no swap file.
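In shell form, with illustrative output:

    cat /proc/swaps
    # Filename   Type   Size   Used   Priority
    # (no entries below the header: swap is disabled)
    swapon -s
    # (empty output: no active swap devices)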

The command vmware-toolbox-cmd stat balloon shows the balloon driver size. Just to be sure, I used another command, lsmod | grep -E 'vmmemctl|vmware_balloon', to verify that the balloon driver module is loaded.
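Again in shell form, with illustrative output:

    # Report the current balloon size as seen by VMware Tools
    vmware-toolbox-cmd stat balloon
    # 0 MB  (illustrative: no ballooning while the host is unconstrained)
    # Verify the balloon driver kernel module is loaded
    lsmod | grep -E 'vmmemctl|vmware_balloon'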
I created an overcommit scenario on the host, and soon enough the balloon driver kicked into action.

The command vmware-toolbox-cmd stat balloon confirmed the stats shown by vCenter: the balloon driver pinned 4 GB of memory within the guest.


4 GB of memory pinned, but top showed nothing in swap.

dmesg shows the kernel messages; one of them is the activity of the OOM Killer. OOM stands for out of memory.
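A quick way to surface that activity (the sample output line illustrates the kernel's message format, not the exact log from this test):

    dmesg | grep -i 'out of memory'
    # Out of memory: Kill process 2093 (coredns) score 998 or sacrifice child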

According to the online description, it is the task of the OOM Killer to continue killing processes until enough memory is freed for the smooth functioning of the rest of the processes that the kernel is attempting to run.

The OOM Killer has to select the best process(es) to kill. Best here refers to the process that will free up the maximum amount of memory upon being killed and is also the least important to the system.

The primary goal is to kill the smallest number of processes, minimizing the damage done while maximizing the amount of memory freed.

Beauty is in the eye of the beholder, but I wouldn't call CoreDNS the best process to kill in a Kubernetes system.

Guaranteed Scheduling For Critical Add-On Pods
In his (must-watch) presentation at KubeCon 2018, Michael Gasch provided some best practices from the field. One of them is to protect critical system pods, such as DaemonSets, controllers, and master components.

In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on the control plane (master) nodes, there are a number of add-ons that run on worker nodes. Some of these add-ons are critical to a fully functional cluster, such as CoreDNS. A cluster may stop working properly if a critical add-on is evicted. Please take a look at the settings and recommendations listed in "Reserve Compute Resources for System Daemons"; a sketch of those kubelet settings follows below.
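As a rough illustration of those recommendations (the values below are placeholders, not sizing guidance), the kubelet can reserve capacity for Kubernetes and OS daemons and define a hard eviction threshold:

    # Illustrative kubelet flags; tune the values to your node size
    kubelet --kube-reserved=cpu=500m,memory=1Gi \
            --system-reserved=cpu=500m,memory=1Gi \
            --eviction-hard=memory.available<500Mi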

Please keep in mind that the guest OS, the Linux kernel, is a shared resource. Kubernetes runs a lot of its services as containers; however, not everything is managed by Kubernetes. For these services, it is best to monitor the important Linux resources so that you don't run out of them, especially if you are using QoS classes other than Guaranteed.
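A few standard Linux tools cover that node-level monitoring; a minimal example:

    # Watch node-level memory from inside the guest OS
    free -m                # overall usage, including pages pinned by the balloon driver
    vmstat 5               # ongoing memory and CPU pressure, sampled every 5 seconds
    dmesg -T | tail -n 20  # recent kernel events, such as OOM killer invocations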

Exploring the Kubernetes Landscape
For the vSphere admin who is just beginning to explore Kubernetes, we recommend keeping the resource management constructs aligned: use reservations at the vSphere level and the Guaranteed QoS class for your pods at the Kubernetes level. Solely using the Guaranteed QoS class won't allow for overcommitment, possibly impacting cluster utilization, but it gives you a nice safety net to learn Kubernetes without chasing weird behavior caused by processes such as the OOM killer.

Thanks to Michael Gasch for the invaluable feedback.

Filed Under: Kubernetes, VMware

Free vSphere Clustering Deep Dive Book at VMworld Europe

November 2, 2018 by frankdenneman

Last year Rubrik gave away hard copies of the vSphere Host Deep Dive book; this year they are doing it again with the vSphere 6.7 Clustering Deep Dive book.

Come by the Rubrik Booth #P305 on Tuesday from 4:00 PM – 5:00 PM to get a signed, complimentary copy of vSphere 6.7 Clustering Deep Dive and meet the authors.

Last year we gave away a thousand copies, and they were gone within an hour. As most of you will remember, the line was insane. This year we have a similar amount, so make sure you're on time.

Filed Under: VMware

Kubernetes at VMworld Europe

October 30, 2018 by frankdenneman

With only a few days left until VMworld Europe 2018 kicks off in Barcelona, I would like to highlight some of the many Kubernetes-focused sessions. I've selected a bunch of breakout sessions and meet-the-expert sessions based on my exposure to them at VMworld US or the quality of the speaker.
The content catalog has marked some sessions as "at capacity", but experience has taught us that there are always a couple of no-shows. Plans change during VMworld: people register for a session they would like to attend but get pulled into an interesting conversation along the way. Or sometimes you suffer from information overload and want to catch a breather. In many cases, spots open up at sold-out sessions, so it's always recommended to walk up to a sold-out session and try your luck.

Tuesday 06 November

11:00 – 12:00
[NET1285BE]
(Breakout Session)
The Future of Networking and Security with VMware NSX
This talk provides detailed insights into the architecture and capabilities of NSX-T. We’ll show how NSX-T addresses container workloads and integrates with frameworks like Kubernetes. We’ll also cover the multi-cloud networking and security capabilities that allow consistent networking policies across any cloud, public or private. Finally, we’ll look at how SD-WAN has become part of the NSX portfolio, enabling networking and security to be deployed from cloud to data center to edge.  More info.
By Bruce Davie, CTO, APJ, VMware
12:15 – 13:00
[MTE5044E]
(Expert Roundtable)
Selecting the Right Container Platform for Your Use Case with Patrick Daigle
There is a variety of container and Kubernetes platforms in the market today. Ever been confused and wanted some expert insight into what type of container or Kubernetes platform is best suited to your use case?
By Patrick Daigle, Sr. Technical Marketing Architect, VMware
13:15 – 14:00
[MTE5209E]
(Expert Roundtable)
Cloud Native Applications and vSAN with Myles Gray
Learn how vSAN can autonomously provide storage for next-generation applications, including Kubernetes, PKS, or any K8s distribution, and how it moves the provisioning of storage from the admin into the hands of the developer.
By Myles Gray, Sr. Technical Marketing Architect, VMware
14:00 – 15:00
[CNA1553BE]
(Breakout Session)
Deep Dive: The Value of Running Kubernetes on vSphere
In this technical session, you will find out how VMware vSphere provides a lot of value, especially in large-scale Kubernetes deployments. With 20 years of engineering experience in kernel and distributed computing, VMware solved many challenges Kubernetes currently faces. Building on work done with enterprises running Kubernetes at scale, you will see a hypothetical customer scenario to illustrate the benefits of running Kubernetes on top of VMware vSphere and avoid the common pitfalls associated with running on bare metal. More info.
By Frank Denneman, Chief Technologist, VMware
Michael Gasch, Customer Success Architect – Application Platforms, VMware
15:30 – 16:30
[HCI1338BE]
(Breakout Session)
vSAN: An Ideal Storage Platform for Kubernetes-controlled Cloud-Native Apps
The session discusses how VMware’s HCI offering (vSphere and vSAN) is becoming a platform of choice for deploying, running and managing the data needs of Cloud-Native Applications (CNA). We will use real world examples to highlight the benefits of an HCI control plane for Kubernetes environments. More info.
By Christos Karamanolis, Fellow and CTO Storage & Availability, VMware
Cormac Hogan, Director and Chief Technologist, VMware

Wednesday 07 November

11:15 – 12:00
[MTE5057E]
(Expert Roundtable)
Next-Gen Apps on vSAN by expert Chen Wei
Are you planning to migrate your next-gen workload to a vSAN cluster? Attend this roundtable to talk to our vSAN Solutions Architect about different aspects of putting next-gen applications on vSAN, including deployment best practices, performance tuning, and availability/performance trade-offs. Bring your questions and let's talk.
Chen Wei, Sr. Solutions Architect, VMware
12:30 – 13:30
[CNA1493BE]
(Breakout Session)
Run Docker on Existing Infrastructure with vSphere Integrated Containers
In this session, you will find out how to run Docker on vSphere with VMware vSphere Integrated Containers. See a live demo on how vSphere Integrated Containers leverage vSphere for isolation and scheduling. Find out how vSphere Integrated Containers are the ideal way to host containers on vSphere, providing a Docker-native experience for end users and a vSphere-native experience for IT. More info.
By Patrick Daigle, Sr. Technical Marketing Architect, VMware
Martijn Baecke, Cloud Evangelist, VMware
13:15 – 14:00
[MTE5116E]
(Expert Roundtable)
Function as a Service with Mark Peek
During this roundtable, we will discuss Dispatch, the VMware framework for deploying and managing serverless style applications.
By Mark Peek, Principal Engineer, VMware
15:30 – 16:30
[DC3845KE]
(Keynote)
Cloud and Developer Keynote: Public Clouds and Kubernetes at Scale
This session will cover VMware’s strategy to deliver an enterprise-grade Kubernetes platform while supporting the needs of DevOps and CloudOps teams. VMware’s Cloud and Developer keynote will outline how to deliver developers a consistent experience across native clouds while enabling operators with more flexibility and control for how they support next generation workloads. More info.
By Guido Appenzeller, CTO, VMware
Joseph Kinsella, Vice President and CTO, Products, CloudHealth, VMware
15:30 – 16:30
[CNA2755BE]
(Breakout Session)
Architecting PKS for Production: Lessons Learned from PKS Deployments
In this session, you will get a deep dive into PKS within the context of real-world customer deployment scenarios. The speakers will share the lessons learned from their successful PKS and NSX-T deployments, and show you how to architect PKS for a production environment.
Come and learn about the do’s, don’ts, and best practices. After this session, you will be better equipped to deploy and manage enterprise-grade Kubernetes in your infrastructure and use NSX-T to bridge the gap in network and security for container workloads.
By Romain Decker, Senior Solutions Architect, VMware
Dominic Foley, Senior Solutions Architect, VMware

Thursday 08 November

15:00 – 16:00
[NET1677BE]
(Breakout Session)
Kubernetes Container Networking with NSX-T Data Center Deep Dive
In this session, you will get technical details of how the NSX-T Data Center integration with Kubernetes in Pivotal Container Service (PKS), OpenShift, and upstream Kubernetes is implemented. Get a deep dive into each identified problem statement, find out how the solution was implemented with NSX-T Data Center, and see a demo of each of the solutions live on stage using PKS with NSX-T Data Center. More info.
By Dennis Breithaupt, Sr. Systems Engineer (NSX), VMware
Yasen Simeonov, Technical Product Manager, VMware

Product Preview

This year the UX team organizes design studios that allow you to provide feedback on a future product. The product you will see will blow your mind. But since it's under NDA, I can't tell 😉 Just show up and see for yourself!
Every day – multiple sessions available
[UX8011E]
(Design Studio)
Kubernetes on vSphere
Do you want to offer Kubernetes? Explore user interface concepts for managing containerized cloud native applications using vSphere together with other products such as PKS.
This session is part of the VMware Design Studio where you have the opportunity to participate in interactive sessions exploring technical previews and early design ideas. Because of the early nature of these designs, participants will be asked to sign a Non-Disclosure Agreement (NDA) to participate.
By Boaz Gurdin, User Experience Researcher, VMware
Pamel Shinh, Product Designer, VMware
Hope to see you there. Enjoy your VMworld!

Filed Under: Kubernetes, VMware

Repeat Session vSphere Clustering Deep Dive at VMworld Europe

October 22, 2018 by frankdenneman

Good news for the VMworld attendees who could no longer sign up for the vSphere Clustering Deep Dive session on Tuesday. I'm happy to announce that the VMworld team scheduled a repeat of the vSphere Clustering Deep Dive session on Thursday 08 November from 10:30 to 11:30.
Session Outline
In this session, Duncan and Frank will take you through the trenches of VMware vSphere Distributed Resource Scheduler (DRS) and vSphere High Availability (HA). Find out about options to optimize your DRS settings for your specific requirements and goals, such as whether you should load balance on active or consumed memory, as well as what has recently changed in the DRS algorithm and whether it will impact DRS behavior. For vSphere HA, you will learn when it restarts virtual machines (VMs), what kind of restart times to expect, and where you can find evidence that a VM (or several) has been restarted. You will find out about all of these items and more. Prepare to dive deep, as the basics will not be covered.
Don't wait too long to register; VMworld Europe room sizes max out at 400 people. We hope to see you there!

Filed Under: VMware

Compute Policy in VMware Cloud on AWS

October 19, 2018 by frankdenneman

The latest update of VMware Cloud on AWS introduced a new feature called compute policies. In its initial release, compute policies provide the ability to configure affinity rules and mobility control based on declarative policies and vSphere tags.
Management of affinity rules
Historically, affinity rules are part of the cluster configuration. Within VMware Cloud on AWS, cluster configuration is controlled by VMware, and thus customers cannot set affinity rules for virtual machines running within the SDDC. Instead of merely pulling the affinity rule configuration outside the cluster configuration, we decided to improve the affinity functionality and work towards a more uniform and consistent experience across multiple clouds.
The road to declarative policies
Within a declarative system, you describe what you want to happen. This is the opposite of imperative operations, where you specify actions. Declarative commands define state, and to some extent affinity rules are declarative statements. Let's take VM anti-affinity rules as an example. You want to keep VM1 and VM2 separated, in different fault domains. Instead of taking the imperative actions of pinning VM1 to host A and pinning VM2 to host B, you create an anti-affinity rule with VM1 and VM2 as members. You state that these two VMs should not run on the same ESXi host. vCenter (DRS) controls placement and takes the necessary actions to resolve any violations of this intent. We want to apply this model to other features.
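For contrast, this is roughly how the traditional, cluster-scoped version of that rule is created on-prem, sketched with the open-source govc CLI (the cluster and VM names are placeholders, and the flags reflect my understanding of govc's cluster.rule.create syntax):

    govc cluster.rule.create -cluster Cluster01 -name keep-vm1-vm2-apart \
      -enable -anti-affinity VM1 VM2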
Instead of logging into vCenter to deal with configuration issues and manually correcting the situation, we want vCenter to manage the functions on your behalf. The way you interact with vCenter, in this more declarative way, is with policies. Instead of specifying more detailed imperative actions, you declare your intent, and the only thing you need to monitor after that is whether the policy is compliant or not.
We had to start somewhere, so we concentrated on affinity rules (VM-VM and VM-Host) and anti-mobility (vMotion disabled) policies. This more abstract way of interacting with vCenter Server provides additional advantages, as abstraction allows for a more uniform and consistent experience across multiple clouds.
With today's on-prem setup, you configure your cluster for a particular workload, and this could inhibit the ability to move your workload to another cluster, on-prem or even to the cloud. To make sure you can easily burst out to VMware Cloud environments, you want this to be seamless. The direction we are going in is that you should not need configurations that are specific to on-prem clusters versus in-cloud or at-edge clusters. Ideally, you express what you want, and it is the job of the cloud control plane, such as vCenter, to push this configuration to whatever environment the workload is presently in, whether that is an on-prem cluster or an in-cloud cluster.
Compute policies are active at vCenter level
Due to this model, the rules are decoupled from the cluster level and are now managed at the vCenter level. If you configure a VM-VM anti-affinity rule and move the VMs to another cluster, the policy remains active.
At the time of writing, VMware Cloud on AWS allows the customer to create 10 clusters per SDDC. Clusters can span multiple AWS Availability Zones (AZs). The VM-Host affinity rule set allows customers to tag the hosts per AZ and tag the VMs that need to remain in that availability zone. You can move the VMs between hosts across clusters within the same AZ; the compute policy remains active, and vCenter ensures compliance with the rule.
Introduction of firm rules
An interesting fact is that the VM-Host rules are firm rules. Firm rules differ from the traditional soft (should run on) and hard (must run on) rules; they sit in between the two. DRS cannot violate these rules unless the host is placed in maintenance mode. This ensures that during normal operations the rules are never broken, while still providing VMware the ability to service the SDDC. The only time a host is placed into maintenance mode in VMware Cloud on AWS is during upgrades, which are handled by VMware and communicated well before the service window. This allows the customer to prepare a strategy for these virtual machines well ahead of the service window.
In the next article, I will go through the steps on how to create a compute policy.

Filed Under: VMware

