PROJECT MONTEREY AND THE NEED FOR NETWORK CYCLES OFFLOAD FOR ML WORKLOADS

VMworld has started, and that means a lot of new announcements. One of the most significant projects VMware is working on is Project Monterey. Project Monterey allows the use of SmartNICs, also known as Data Processing Units (DPUs), from various VMware partners within the vSphere platform. Today we use the CPU inside the ESXi host to run workloads and to process network operations. With the shift towards distributed applications, the CPUs inside the ESXi hosts need to spend more time processing network IO instead of application operations. This extra utilization impacts data center economics such as consolidation ratios and availability calculations. On top of this shift from monolithic to distributed applications comes the advent of machine learning-supported services in the enterprise data center.

MACHINE LEARNING INFRASTRUCTURE FROM A VSPHERE INFRASTRUCTURE PERSPECTIVE

For the last 18 months, I’ve been focusing on machine learning, especially on how customers can successfully deploy machine learning infrastructure on a vSphere infrastructure. This space is exciting as it has so many great angles to explore. Besides the model training, a lot happens with the data. Data is transformed, and data is moved. Data sets are often hundreds of gigabytes in size. Although that doesn’t sound like much compared to modern databases, these data sets are transformed and versioned. Where massive databases nest on an array, these data sets travel through pipelines that connect multiple systems with different architectures, from data lakes to in-memory key-value stores. As a data center architect, you need to think about the various components involved. Where is the compute horsepower needed? How do you deal with an explosion of data? Where do you place particular storage platforms? What kind of bandwidth is needed? And do you always need extreme low-latency systems in your ML infrastructure landscape?

CPU PINNING IS NOT AN EXCLUSIVE RIGHT TO A CPU CORE!

Katarina posted a very expressive tweet (https://twitter.com/MrsBrookfield/status/1402955235497287685) about her love/hate (mostly hate) relationship with CPU pinning, and lately I have been in conversations with customers contemplating whether they should use CPU pinning. The analogy that I typically use to describe CPU pinning is the story of the favorite parking space at your office parking lot. CPU pinning limits the compliant CPU “slots” that the vCPU can be scheduled on; it does not reserve them. So think about that CPU slot as the parking spot closest to the entrance of your office. You have decided that you only want to park in that spot. Every day of the year, that’s your spot and nowhere else. The problem is, this is not a company-wide directive. Anyone can park in that spot, but you just limited yourself to that spot only. So it can happen that Bob arrives at the office first and, lazy as he is, parks as close to the office entrance as he can. Right in your spot. Now the problem with your self-imposed rule is that you cannot and will not park anywhere else. So when you show up (late to the party), you notice that Bob’s car is in YOUR parking spot, and the only thing you can do is drive in circles in some holding pattern until Bob leaves the office again. The most absurd version of this: it’s Sunday, and you and Bob are the only ones doing some work. You’re out there in the parking lot, driving in circles waiting until Bob leaves again, while Bob is inside the empty building waiting for you to get started.
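The same non-exclusive behavior is easy to demonstrate outside of vSphere. Here is a minimal Python sketch using Linux process affinity from the standard library, which behaves analogously to vCPU scheduling affinity: it restricts where this one process may run, but reserves nothing for it.

```python
# Minimal Linux sketch of the parking-spot analogy: pinning restricts this
# process to one core, but grants it no exclusive right to that core.
import os

os.sched_setaffinity(0, {2})      # "I only park in spot 2" (this PID only)
print(os.sched_getaffinity(0))    # -> {2}

# Nothing stops other processes ("Bob") from running on core 2. If core 2
# is busy, this process waits for it, even while cores 0, 1 and 3 sit idle.
```

vCPU scheduling affinity in vSphere works the same way: the pinned vCPU competes for its pinned pCPUs like any other world, while the scheduler loses the freedom to move it to an idle core.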

VM SERVICE - HELP DEVELOPERS BY ALIGNING THEIR KUBERNETES NODES TO THE PHYSICAL INFRASTRUCTURE

The vSphere 7.0 U2a update released on April 27th introduces the new VM Service and VM operator. Hidden away in what seems to be a trivial update is a collection of all-new functionality. Myles Gray has written an extensive article about the new features. I want to highlight the administrative controls that the VM Service offers through VM classes.

VM Classes

What are VM classes, and how are they used? With the Tanzu Kubernetes Grid Service running in the Supervisor cluster, developers can deploy Kubernetes clusters without the help of the InfraOps team. Using their native tooling, they specify the size of the cluster control plane and worker nodes by referencing a specific VM class. The VM class configuration acts as a template that defines the CPU and memory resources and, possibly, reservations for these resources. These templates allow the InfraOps team to set guardrails for the consumption of cluster resources by these TKG clusters.
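As a quick illustration of the developer-facing side, here is a short Python sketch that lists the VM classes a developer could reference when sizing TKG cluster nodes. This is a sketch only: it assumes the Kubernetes Python client and the VM Operator CRD names (group vmoperator.vmware.com, version v1alpha1, plural virtualmachineclasses), which may differ per release.

```python
# Sketch: list the VM classes available in the Supervisor cluster so a
# developer can pick one when sizing TKG control plane and worker nodes.
# The group/version/plural are assumptions based on the VM Operator project.
from kubernetes import client, config

config.load_kube_config()  # assumes a Supervisor cluster kubeconfig context
api = client.CustomObjectsApi()

vm_classes = api.list_cluster_custom_object(
    group="vmoperator.vmware.com",
    version="v1alpha1",
    plural="virtualmachineclasses",
)
for vm_class in vm_classes["items"]:
    hardware = vm_class["spec"].get("hardware", {})
    print(vm_class["metadata"]["name"],
          hardware.get("cpus"), hardware.get("memory"))
```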

VSPHERE WITH TANZU VCENTER SERVER NETWORK CONFIGURATION OVERVIEW

I noticed quite a few network-related queries about installing vSphere with Tanzu together with vCenter Server networks (Distributed Switch and HA-Proxy). The whole install process can be a little overwhelming when it’s your first time dealing with Kubernetes and load-balancing services. Before installing Workload Management (the user interface designation for running Kubernetes inside vSphere), you have to set up HA-Proxy, a virtual appliance for provisioning load balancers. Cormac did a tremendous job describing the configuration steps of all the components. Still, when installing vSphere with Tanzu, I found myself mapping out the network topology to make sense of it all, since there seems to be a mismatch in terminology between the HA-Proxy and Workload Management configuration workflows.

WHAT'S YOUR FAVORITE TECH NOVEL?

Today I was discussing some great books to read with Duncan. Disconnecting fully from work is difficult for me, so typically, the books I read are tech-related. I have read some brilliant books that I want to share with you, but mostly I want to hear your recommendations for other excellent tech novels.

Cyberwarfare

Cyberwarfare intrigues me, so any book covering Operation Olympic Games or Stuxnet interests me. One of the best books on this topic is “Countdown to Zero Day” by Kim Zetter. The book is filled with footnotes and references to corresponding research.

NEW WHITEPAPER AVAILABLE ON VSPHERE 7 DRS LOAD BALANCING

vSphere 7 contains a new DRS algorithm that differs tremendously from the old one. The performance team has put the new algorithm through its paces and published a whitepaper presenting its findings. Read the white paper: Load Balancing Performance of DRS in vSphere 7.0.

VSPHERE 7 DRS SCALABLE SHARES DEEP DIVE

You are one tickbox away from completely overhauling the way you look at resource pools. Yes, you can still use them as folders (sigh), but with the newly introduced Scalable Shares option in vSphere 7, you can turn resource pools into, more or less, quality-of-service classes. Sounds interesting, right? Let’s first take a look at how resource pools traditionally work, the challenges they introduce, and how this new method of resource distribution works. To understand that, we first have to look at the basics of how DRS distributes unreserved resources.
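To make the difference concrete before the deep dive, here is a simplified Python model. It is my own illustration, not DRS code: it assumes full contention, all VMs active, and the documented default CPU share values (8000/4000/2000 per pool for high/normal/low, 2000/1000/500 per vCPU for VMs).

```python
# Compare per-VM CPU entitlement for sibling resource pools under static
# pool shares versus scalable shares (pool shares that grow with contents).
CLUSTER_MHZ = 20000  # assumed divisible CPU capacity of the cluster
STATIC_POOL_SHARES = {"high": 8000, "normal": 4000, "low": 2000}
VM_SHARES_PER_VCPU = {"high": 2000, "normal": 1000, "low": 500}

def per_vm_entitlement(pools, scalable):
    """pools: {name: (share_level, vm_count, vcpus_per_vm)}"""
    def pool_shares(level, vm_count, vcpus):
        if scalable:  # pool shares = sum of the shares of the VMs inside
            return vm_count * vcpus * VM_SHARES_PER_VCPU[level]
        return STATIC_POOL_SHARES[level]  # fixed, regardless of contents

    shares = {name: pool_shares(*cfg) for name, cfg in pools.items()}
    total = sum(shares.values())
    return {name: round(CLUSTER_MHZ * shares[name] / total / pools[name][1])
            for name in pools}  # MHz per VM

pools = {"gold": ("high", 20, 2), "bronze": ("low", 1, 2)}
print(per_vm_entitlement(pools, scalable=False))  # {'gold': 800, 'bronze': 4000}
print(per_vm_entitlement(pools, scalable=True))   # {'gold': 988, 'bronze': 247}
```

With static shares, the single bronze VM out-earns every gold VM. With scalable shares, the pool's shares grow with its contents, and the intended 4:1 per-VM ratio between the high and low share levels is restored.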

VSPHERE 7 VMOTION WITH ATTACHED REMOTE DEVICE

A lot of cool new features fly under the radar during a major release announcement. Even the new DRS algorithm didn’t get much air time. One thing that I discovered this week is that vSphere 7 allows a vMotion with a remote device attached. When using the VM Remote Console, you can attach ISOs stored on your local machine to the VM. It’s an incredibly useful tool that allows you to quickly configure a VM.

DRS MIGRATION THRESHOLD IN VSPHERE 7

DRS in vSphere 7 is equipped with a new algorithm. The old algorithm measured the load of each host in the cluster and tried to keep the difference in load between hosts within a specific range. That range, the target host load standard deviation, was tuned via the migration threshold. The new algorithm is designed to find an efficient host for the VM, one that can provide the resources the workload demands, while considering the potential behavior of other workloads. Throughout a series of articles, I will explain the actions of the algorithm in more detail.
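As a rough illustration of the old model only (a conceptual sketch, not the actual DRS implementation, and the numbers are made up), the pre-vSphere 7 balancing check can be thought of like this:

```python
# Conceptual sketch of the old DRS check: migrate VMs whenever the spread
# of host loads exceeds the target host load standard deviation that the
# migration threshold slider maps to. All values here are illustrative.
from statistics import pstdev

def cluster_needs_rebalancing(host_loads, target_stdev):
    """host_loads: normalized load (0.0 to 1.0) per host in the cluster."""
    return pstdev(host_loads) > target_stdev

loads = [0.85, 0.40, 0.45, 0.50]  # one hot host; stdev is roughly 0.18
print(cluster_needs_rebalancing(loads, target_stdev=0.2))  # conservative: False
print(cluster_needs_rebalancing(loads, target_stdev=0.1))  # aggressive: True
```

The new algorithm drops this cluster-wide deviation test and instead evaluates, per VM, which host can serve its demand most efficiently.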