
frankdenneman.nl


vSphere 8 and vSAN 8 Unexplored Territory Podcast Double Header

September 5, 2022 by frankdenneman

This week we released two episodes covering the vSphere 8 and vSAN 8 releases. Together with Feidhlim O’Leary, we discover all the new functions and features of the vSphere 8 platform. You can listen to this episode on Spotify, Apple, or on our website: unexploredterritory.tech

Pete Koehler repeats his stellar performance from last time and helps us understand the completely new architecture of vSAN 8. You can listen to this episode on Spotify, Apple, on our website: unexploredterritory.tech, or anywhere else you get your podcasts!

Filed Under: Podcast

Unexplored Territory – VMware Explore USA Special

September 2, 2022 by frankdenneman

This week Duncan and I attended VMware Explore to co-present the session “60 Minutes of Virtually Speaking Live: Accelerating Cloud Transformation” with William Lam and our buddies from the Virtually Speaking Podcast, Pete Flecha and John Nicholson. The recordings should be made available soon, or you can sign up for the session in Barcelona.

During the week, we caught up with many people and captured soundbites from people such as Kit Colbert, Chris Wolf, Stephen Foskett, Sazzala Reddy, and a few more. You can listen to this special VMware Explore episode on Spotify (spoti.fi/3cITI7p), Apple (apple.co/3q35dJJ), or our website: unexploredterritory.tech/episodes/

Filed Under: Podcast

New vSphere 8 Features for Consistent ML Workload Performance

August 30, 2022 by frankdenneman

vSphere 8 is full of enhancements. Go to blogs.vmware.com or yellow-bricks.com for more extensive overviews of the vSphere 8 release. In this article, I want to highlight two features of the new vSphere 8 version that will help machine learning (ML) workloads perform consistently and possibly faster than manually configured workload constructs. The two features that make this possible are the UI enhancements for the vNUMA topology and Device Groups.

Virtual Hardware Version 20 Scalability Enhancements

Before we dive into those features: vSphere 8 introduces a new virtual hardware version that allows us to introduce wonderful new things and push the boundaries of the platform again. With vSphere 8, the virtual hardware level advances to version 20, and it offers new capabilities for ML accelerators. Support for DirectPath I/O devices goes up from 16 to 32 per VM. We also worked with NVIDIA to increase vGPU support, and with vSphere 8, each VM can now support up to 8 vGPU devices.

These enhancements broaden the supported ML accelerator spectrum tremendously. With vGPU, the platform team, or in some cases the MLOps team, can create workload constructs (VMs, containers) that utilize fractional GPU resources, allowing data scientists to run some light testing, or compartmentalize GPUs for inference workloads. At the other end of the spectrum, we have the workhorse for training workloads: the multi-GPU configuration. We offer these technologies both host-local and remote with VMware Bitfusion technology, allowing fast attaching and detaching of hardware resources to workloads. In the diagram, the orange dots indicate the vSphere 7 maximums for supported devices; the blue dots indicate vSphere 8.

Simplified Virtual NUMA Configuration 

The device assignment functionality in the new vSphere 8 vNUMA topology UI helps VI-admins and MLOps teams assign the vCPUs and GPU of a VM to the same NUMA node. This feature improves the likelihood that the VM’s memory remains on the same NUMA node as the GPU. I wrote an extensive article about this in January 2020, “Machine Learning Workload and GPGPU NUMA Node Locality.” The idea behind that script is now properly codified in the official product, a personal highlight for me.

Device Groups

Device Groups is a brilliant new feature. But before we dive into device groups, we have to look at Dynamic DirectPath I/O. Before Dynamic DirectPath I/O, a VI-admin specified a GPU device by its PCI address. That meant the VI-admin had to track which ESXi hosts contain which devices and which VMs use those devices. Selecting that particular PCI address constrains the VM to run only on the host containing that particular device.

With the introduction of hardware labels, Dynamic DirectPath I/O allows a VI-admin to specify the kind of device to add to a VM. Niels Hagoort wrote a very informative article about Dynamic DirectPath I/O under its initial product name: “vSphere 7 – Assignable Hardware.”

The problem is that Dynamic DirectPath I/O covers only one device, but as I showed at the beginning of the article, we support the full spectrum of ML accelerator configurations. What if a data science team requires a multi-GPU configuration? Multi-GPU configuration is the infrastructure way of looking at this; data science teams call it distributed training or distributed deep learning. The workload distribution happens between GPUs within an ESXi host or across multiple ESXi hosts. That’s where device groups come into play.

With Device Groups, vSphere 8 allows the VI-admin or MLOps team to create a configuration for workloads requiring multiple GPUs connected by a high-speed link or devices that must be on the same PCI switch. 

Distributed workloads running across GPUs located on multiple ESXi hosts want the lowest possible latency. The interconnect between the ESXi hosts receives the most attention, but the path from the GPU to the external interconnect is also essential. To minimize latency, we have to take the NUMA locality of both the GPU and the NIC into account. Modern CPUs have PCI controllers baked into the CPU package; thus, NUMA PCI-Locality exists. To provide consistent performance, you must select devices connected to the same PCI controller or PCI switch (available in large systems). 

A high-speed interconnect between GPU accelerators allows for stable, consistently high bandwidth to get the most performance from the available local hardware. NVIDIA offers NVLink, a direct GPU-to-GPU interconnect. An A30 card offers one link per card. An A100 is equipped with three links, offering 150 GB/s of GPU-to-GPU bandwidth, as each link provides 50 GB/s of theoretical bandwidth. Device Groups allow VI-admins or MLOps teams to add these multiple devices as a single unit to a virtual machine.

More in-depth articles about these features will follow in the upcoming weeks.

Filed Under: AI & ML

Training vs Inference – Network Compression

August 26, 2022 by frankdenneman

This training versus inference workload series provides platform architects and owners insights into ML workload characteristics. Instead of treating deep neural networks as black-box workloads, ML architectures and techniques are covered with infrastructure experts in mind. A better comprehension of the workload opens up the dialog between infrastructure and data science teams, hopefully resulting in a better match between workload requirements and platform capabilities.

Part 3 of the series focused on the memory consumption of deep learning neural network architectures. It introduced the different types of operands (weights, activations, gradients) and how each consumes memory and requires computational power throughout the different stages and layers of the neural network. Part 4 showed how the choice of floating-point data type impacts a neural network’s memory consumption and computational power requirements. In this part of the training versus inference workload deep dive, I want to cover neural network compression. The goal of neural network compression is inference optimization, either to help fit and run a model on a constrained endpoint or to reduce the running costs of the inference infrastructure.

A data science team’s goal is to create a neural network model that provides the highest level of accuracy (performance, in data science terminology). To achieve high levels of accuracy, data science teams feed high-quality data sets to the ML platform and execute multiple training runs (epochs). The ML community builds newer, more complex, and more extensive neural networks to improve accuracy. The chart below shows the growth of parameters of state-of-the-art (SOTA) neural network architectures for image classification (orange line) and natural language processing (blue line).

If we deconstruct any neural network architecture, we can see that each neural network has different layers and operands, i.e., weights, activations, and gradients. These layers and operands impact a model’s performance and inference time. Data scientists select an appropriate floating-point data type to reduce the neural network model’s memory utilization and increase the processing speed.

Sometimes, the neural network size (the memory footprint) prohibits successful deployment to the target production infrastructure. For example, it can be an edge deployment onto a particular device or into a physical space with a restricted energy envelope. As a result, the data science team can optimize the network even further by performing quantization. Post-training quantization converts floating-point data points into integers. If done smartly, it can reduce the neural network memory footprint tremendously while retaining accuracy. An additional technique to improve the efficiency of the algorithm is pruning. Pruning and quantization go hand in hand; the CERN Large Hadron Collider team, for example, is exploring a quantization-aware pruning technique.
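
To make post-training quantization concrete, here is a minimal sketch in PyTorch. The tiny model is a hypothetical stand-in for a real trained network; dynamic quantization converts the weights of the listed layer types from 32-bit floats to 8-bit integers and quantizes activations on the fly.

```python
import os
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: Linear weights become int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="tmp.pt"):
    # Serialize the state dict to disk to compare on-disk footprints.
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"fp32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")
```

Static and quantization-aware variants exist as well, but the idea is the same: trade numerical precision for a smaller, faster model.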

Pruning

Pruning identifies the important connections within the neural network and removes either individual weight connections (unstructured pruning) or groups of weights, by disconnecting an entire channel or filter (structured pruning). The most popular frameworks, like TensorFlow (Keras) and PyTorch, contain standard modules to perform unstructured pruning on neural networks.
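
As a rough illustration of what such a standard module looks like, the sketch below uses PyTorch’s torch.nn.utils.prune on a single hypothetical layer; the layer shape and pruning fraction are arbitrary examples.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)  # hypothetical layer from some network

# Unstructured magnitude (L1) pruning: zero out the 20% of weights with the
# smallest absolute value. PyTorch applies a binary mask (weight_mask)
# instead of physically removing parameters.
prune.l1_unstructured(layer, name="weight", amount=0.2)
print([name for name, _ in layer.named_buffers()])              # ['weight_mask']
print(f"sparsity: {(layer.weight == 0).float().mean().item():.2f}")

# Folding the mask into the tensor keeps the zeroes but drops the mask.
prune.remove(layer, "weight")
```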

An interesting thing about pruning is that much of the online literature and many research papers use the terms “remove” or “delete” weights, while pruning does not actually change the neural network layout; the pruned weights are simply set to zero. See the screenshot of the Optimal Brain Damage research paper, or check out the PyTorch pruning tutorial.

Replacing trained parameters with zeroes introduces sparsity into the tensor (dense matrix) data structure. Sparsity is the proportion of zero-valued to non-zero weights. Algorithms can use sparsity to speed up execution or compress the footprint of the neural network.
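
A short sketch of both points, using an arbitrary thresholded tensor as a stand-in for a pruned weight matrix: the sparsity ratio is simply the share of zero-valued entries, and a sparse storage format keeps only the non-zero values plus their coordinates.

```python
import torch

w = torch.randn(1024, 1024)
w[w.abs() < 0.8] = 0.0                      # crude stand-in for pruning

sparsity = (w == 0).float().mean().item()   # proportion of zero-valued weights
print(f"sparsity: {sparsity:.1%}")

# Sparse (COO) storage keeps only the non-zero values and their indices.
w_sparse = w.to_sparse()
dense_bytes = w.numel() * w.element_size()
sparse_bytes = (w_sparse.values().numel() * w_sparse.values().element_size()
                + w_sparse.indices().numel() * w_sparse.indices().element_size())
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```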

Pruning is possible as many weights in a trained neural network end up close to the value of zero. Many researchers believe that most neural networks are over-parameterized. As a result, pruning has been a hot topic since the 90s. There have been some influential papers that are still referenced today and used as the starting point for research on new pruning techniques:

In the paper “Optimal Brain Damage,” LeCun et al. found that “reducing the size of a learning network” improved generalization (the neural network’s ability to adapt correctly to new, previously unseen data) and inference speed.

Fast forward to 2015: Han et al. published “Deep Compression,” combining pruning, trained quantization, and Huffman coding to reduce the neural network footprint for mobile and other low-power applications. The pruning mechanism is (unstructured) magnitude-based, the most common form today. Magnitude-based pruning assumes that the weights with the smallest values make the most negligible contribution to the neural network’s performance and removes those weights.
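
A minimal sketch of magnitude-based pruning applied network-wide, assuming a toy model as a stand-in for a trained network: PyTorch’s global_unstructured ranks all listed weights by absolute value and zeroes out the lowest 50% across the whole network rather than per layer.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a trained network (hypothetical).
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

parameters_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]

# Global magnitude-based pruning: the smallest 50% of weights across all
# listed layers are assumed to contribute the least and are set to zero.
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,
)
```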

Interestingly, if you train a network and then prune its connections, you end up with a neural network that retains its accuracy but has more than 50% fewer parameters. However, if you train that pruned architecture from scratch, it will not achieve the same high accuracy level.

In the paper “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” (2019), Jonathan Frankle and Michael Carbin asked why training a network with the topology of the pruned network yields worse performance. They conclude that within an extensive neural network, a smaller neural network exists that matches the performance of the larger one. This “winning lottery ticket” subnetwork exists, but you can only discover it if you reuse the same weight initializations as the original network. Therefore, it’s almost impossible to find that smaller network before training the larger one to completion.

As a platform operator or architect, this hypothesis makes you wonder about the approach data scientists use to optimize their models. Why not start training with a small neural network and slowly build more extensive networks (a constructive approach)? Getting a neural network with only a few parameters to react correctly to new data (to generalize) is complicated, and there seems to be little traction in the research space to investigate “constructive neural network learning” thoroughly.

Pruning Scheduling

Generally, the data science team takes a destructive approach to model development. Multiple epochs are used to train the complete neural network with all its parameters to achieve the highest accuracy possible. The next step is to apply a pruning method in which learned parameters are set to zero, and the connections are stripped away. 

The most popular pruning method is “train, prune, and fine-tune.” Pruning takes place after training. The data science team determines a pruning percentage, which indicates the share of learned parameters that will be set to zero across the neural network. Once the pruning is complete, the neural network is retrained to recover from the loss of parameters; these two steps form one iteration. Please keep in mind that one iteration can contain multiple epochs of training.

The data science team can choose to execute the pruning method in different ways that affect the time the accelerators are in use and when they are idling. There are two mainstream methods that I want to highlight:

  • One-shot Pruning
  • Iterative Pruning

One-shot pruning prunes the neural network to a target sparsity level in a single iteration. The Deep Compression paper showed that repeating pruning and retraining iteratively until the target sparsity threshold is met typically results in higher accuracy than one-shot pruning. It’s common to prune 20% of the lowest-magnitude weights during each iteration.
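
The iterative schedule roughly looks like the loop below, a sketch that assumes a trained model plus hypothetical train_one_epoch and evaluate helpers; each iteration prunes a slice of the lowest-magnitude weights and then fine-tunes for a few epochs to recover accuracy.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_prune(model, train_one_epoch, evaluate,
                    amount_per_iteration=0.2, iterations=5, finetune_epochs=2):
    # train_one_epoch and evaluate are hypothetical helpers supplied elsewhere.
    params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    for i in range(iterations):
        # Prune the lowest-magnitude weights across the network.
        prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                                  amount=amount_per_iteration)
        # Fine-tune to recover from the loss of parameters.
        for _ in range(finetune_epochs):
            train_one_epoch(model)
        print(f"iteration {i}: accuracy {evaluate(model):.3f}")
    return model
```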

Generally, iterative pruning is computationally intensive and time-consuming, especially if the global target sparsity is high and the number of parameters removed at each iteration is low. Pruning extends the use of accelerator time, but downtime occurs between iterations, so it can be a very “jerky” process from an accelerator utilization perspective. The iterative process introduces several decision criteria:

  • Pruning Strength: How many weights should be removed?
  • Saliency: Which weights should be pruned?
  • Pruning Stop Condition: When should the pruning process end?

Commonly, the data science team evaluates the progress between iterations. During this time, the accelerator is typically still assigned to a workload construct (VM, container), yet it is not productive. When reviewing accelerator use for pruning purposes, it’s not uncommon to see as many epochs spent on retraining as on the original training run. The paper “Retrain or Not Retrain? – Efficient Pruning Methods of Deep CNN Networks” shows a great example:

Retraining of the pre-trained ResNet-50 with a global sparsity of 20%. Retraining starts at epoch 104. Top-5 accuracy is marked in blue, and Top-1 in red.

Sparsity

Introducing sparsity allows for efficient compression. There is the machine learning definition of compression (i.e., sparsity) and the definition we have used since Robert Jung blessed us with ARJ in MS-DOS and WinZip burst onto the scene in 1991. Let’s use the old-school definition for now. A pruned neural network lends itself to efficient compression, as it has millions of zeroes floating around. Compression is perfect if you want to reduce the model file size for distribution to mobile devices. Mobile devices have limited memory, and as the Deep Compression paper points out, there is a significant power consumption difference between retrieving data from on-chip SRAM and off-chip DRAM. And for large retail organizations or telco companies dealing with countless edge locations, reducing the model size can significantly speed up the distribution of the model.
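
The effect on file size is easy to demonstrate with old-school compression; the sketch below gzips the serialized state dict of an untouched layer and of a heavily pruned one (both hypothetical stand-ins for a real model).

```python
import gzip
import io
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def gzipped_size(model):
    # Serialize the state dict in memory and gzip it.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return len(gzip.compress(buf.getvalue()))

dense = nn.Linear(1024, 1024)
print("dense layer:", gzipped_size(dense), "bytes gzipped")

pruned = nn.Linear(1024, 1024)
prune.l1_unstructured(pruned, "weight", amount=0.9)   # 90% of weights -> 0
prune.remove(pruned, "weight")
print("90% pruned :", gzipped_size(pruned), "bytes gzipped")  # runs of zeroes compress well
```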

But pruning is not easy, and it will not automatically guarantee success. Results vary widely, depending on the neural network type and task. Some architectures respond better to pruning than others. And then, of course, there is the hardware. Replacing a trained weight with a zero doesn’t automatically make the hardware’s job easier. We, as humans, know that multiplying by zero yields zero and that adding zero to a number leaves it unchanged, so we take shortcuts; a computer still has to execute that calculation unless we build this logic into an algorithm.

The focus of pruning is to retain similar accuracy while increasing sparsity. However, feeding a dense matrix data structure riddled with zeroes can cause irregular memory access patterns on accelerators, so a sparsified dense matrix data structure isn’t always faster. The data science team has to apply more optimizations, or choose structured pruning, to achieve a speedup on particular accelerator devices. But removing entire filters or layers dramatically changes the layout of the neural network structure and reduces the neural network’s accuracy.
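
You can see this for yourself with a quick, unscientific benchmark; depending on hardware, library, and sparsity level, the sparse multiply in the sketch below may well be slower than the dense one, even at roughly 90% sparsity.

```python
import time
import torch

w = torch.randn(2048, 2048)
w[torch.rand_like(w) < 0.9] = 0.0      # roughly 90% sparsity
x = torch.randn(2048, 256)
w_sparse = w.to_sparse()               # COO sparse representation

def bench(fn, runs=20):
    # Average wall-clock time over a few runs.
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs

print("dense matmul :", bench(lambda: w @ x))
print("sparse matmul:", bench(lambda: torch.sparse.mm(w_sparse, x)))
```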

Hardware vendors have also been researching this field for years, especially NVIDIA. The papers “Learning both weights and connections for efficient neural networks” and “Exploring the granularity of sparsity in convolutional neural networks” by Jeff Pool et al. (Senior Architect NVIDIA) are interesting reads. With the Ampere Architecture, NVIDIA introduced sparse tensor cores and Automatic Sparsity (ASP), a concept to generate the correct sparsity level for the hardware to accelerate. 

NVIDIA Sparsity Support

Part 5 – Numerical Precision showed the spec sheet of an NVIDIA A100 (Ampere architecture), which lists 624 TOPS for INT8 operations and 1248 TOPS with an asterisk. Those 1248 TOPS are sparse tensor core operations following the 2:4 structured-sparse matrix pattern prescribed by NVIDIA.

This predefined pattern means any pruned neural network can get the best performance out of an Ampere accelerator (A2, A30, A40, A100), as long as it follows the 2:4 structured-sparse matrix pattern: two values out of each contiguous block of four must be zeroed out. In the end, 50% of the trained values across the network are replaced by zeroes. The beauty of this method is that it uses metadata to track where those zeroes are stored. To some, this sounds like a sparse matrix data structure format, but the problem is that many machine learning libraries do not offer support for sparse matrices.
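
As an illustration of the pattern itself (not NVIDIA’s actual tooling, which handles this via ASP), the sketch below keeps the two largest-magnitude values in every contiguous block of four weights and zeroes the other two.

```python
import torch

def to_2_4_sparse(weight: torch.Tensor) -> torch.Tensor:
    """Enforce the 2:4 pattern: in every block of four consecutive values,
    keep the two with the largest magnitude and zero out the other two."""
    blocks = weight.reshape(-1, 4)
    keep = blocks.abs().topk(2, dim=1).indices     # positions to keep per block
    mask = torch.zeros_like(blocks, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (blocks * mask).reshape(weight.shape)

w = torch.randn(4, 8)
print(to_2_4_sparse(w))   # exactly 50% of the values are now zero
```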

So NVIDIA uses a compressed dense matrix format that contains the trained weights plus the metadata the sparse tensor cores need to exploit the sparsity. This format is fed into the sparse tensor cores to get that speedup. From an infrastructure perspective, this means your development and inference infrastructure need to be in lockstep: both have to run the Ampere architecture, for example an A100 for training and pruning with ASP and A2 accelerator cards for inference.

A 2:4 structured sparse matrix W and its compressed representation (source NVIDIA)

The smart part is the metadata, which stores the positions of the non-zero weights in the compressed matrix. To understand why, look at a standard General Matrix Multiplication (GEMM) operation. In a forward propagation operation, you have a matrix with weights (A), a matrix with activations (B), and you capture the output in matrix C. The inner dimensions of A and B match, so every weight maps onto an activation.

What happens with the sparse operation on the Ampere architecture? The non-zero weights sit in matrix A, which is now in a compressed state, so its dimensions no longer match matrix B. The sparse operation only needs to pull from B the activations that belong to the stored trained weights, and the metadata makes that possible: because the compressed matrix A contains only non-zero values, the metadata tells the algorithm precisely which activation to fetch from B to match each trained value in the compressed matrix.
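
A toy version of that bookkeeping, with made-up shapes and values: the compressed matrix stores only the non-zero weights per row, and the metadata stores their original column positions so the matching activations can be gathered from B.

```python
import torch

A = torch.tensor([[0.0, 1.5, 0.0, -2.0],
                  [3.0, 0.0, -1.0, 0.0]])   # 2:4-sparse weight matrix
B = torch.randn(4, 3)                        # activations

values = torch.tensor([[1.5, -2.0],          # compressed A: non-zero weights
                       [3.0, -1.0]])
meta = torch.tensor([[1, 3],                 # metadata: their column positions
                     [0, 2]])

# Gather only the activations that match the stored weights, then multiply:
# half the multiply-accumulate work of the dense GEMM, same result.
C = torch.einsum('rk,rkc->rc', values, B[meta])
assert torch.allclose(C, A @ B)
```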

What’s next for Pruning?

NVIDIA is looking into incorporating ASP into training operations, and I can imagine this will become a significant unique selling point. As it stands, data scientists spend a huge budget training the model before starting the activities to optimize the neural network for edge deployment. These activities are not always successful. Pruning requires long fine-tuning times that can exceed the original training time by a factor of three or sometimes even more. Pruning consumes tremendous amounts of resources, whether on-premises or as cloud OPEX, without proper guarantees. NVIDIA has proven that sparsity can be leveraged; it is just a matter of time before other solutions pop up.

Filed Under: AI & ML

How to Write a Book – Show Up Daily

August 1, 2022 by frankdenneman

During the Belgium VMUG, I talked with Jeffrey Kusters and the VMUG leadership team about the challenges of writing a book. Interestingly enough, since that VMUG, the question of how to start writing a book kept appearing regularly in my inbox, DMs, and LinkedIn messages. This morning, Michael Rebmann’s question convinced me that it’s book-writing season again, so it’s probably better to put my response in a central place.

I have some ideas for writing a book 📕 and I think now would be the best time to start with something like this. Any advice from #vExperts, the #vCommunity and existing #authors? Which publisher? What shall I prepare and think of first? Where do I start? 😱 pic.twitter.com/XysOqINJoC

— Michael Rebmann (@_michaelrebmann) July 30, 2022

There are millions of books and millions of authors, which makes me believe there are a million ways to write a book. Here is what works best for me. Hopefully, there is something in it that will work for you too.

Show up daily

The biggest thing you can do to help yourself succeed is to show up daily. I suppose this can be applied to anything in life, but it applies to writing books especially. The key is understanding what showing up means in terms of output. You simply cannot write thousands and thousands of words every day. You will have high-energy days and low-energy days. Days spent on research. Days spent questioning simple things that lead to rabbit holes such as this: “How many MB are there in a GB? What about gibibytes? And which one will I be using in the book? How many tables do I need to convert now?” You will lose much time and energy on things that will not show up in the book. That can have a demoralizing effect and give you the feeling that you are not making any progress, especially when you are still operating under the impression that you must write a couple of thousand words daily. And this is not true, at least not in my experience.

Showing up means doing something. This cartoon, full credits to @saraharnoldhall, says it all.

Add some words to your draft each day. It doesn’t have to be a lot. Sometimes you have good days. Sometimes you have bad days. People who have known me a bit longer and listened to my podcasts might know I suffer from migraines. Writing books while being a migraine patient is not an ideal combination. You cannot “send it” every day, so you have to work around the bad days. On my bad days, I try to do light work: I reorganize my system, clean out the whiteboard, or, if possible, remove the interrupts. Because when I have a good day, I don’t want to get interrupted.

Getting rid of interrupts

In 2019, I wrote an article about the three books that helped me focus: Getting Things Done, Essentialism, and KonMari. Getting rid of stuff, ensuring everything is in the right place, and only keeping the stuff you need helps you eliminate interruptions. Maybe you can’t do that for your entire household, but try to do it in your office. When you want to write, you do not get interrupted by things that need your attention; you can focus on the things you want to focus on. What also helps me retain focus is music, preferably without lyrics, as lyrics seem to distract me from writing. I’ve created a 33-hour Spotify playlist that helps me zone in, but I know it’s not everyone’s taste.

Tools

I use Evernote to store links to interesting articles and research papers and to organize my thoughts. Grammarly heavily corrects my English, and I use OmniGraffle for my diagrams. Staying properly organized is a long-term timesaver. You will go back to your notes often, and I have lost countless hours finding that one paragraph with that one data point that verified my thought. Save everything, and label everything correctly.

The last thing I want to say is: just do it. Write that book you have wanted to write. Cover that topic from your perspective. Make sure it is factually correct, but add some personal flavor to it. To quote Rick Rubin, “Make what you love, whatever it is, be your own audience. So make the thing you love for you, the audience.”

Filed Under: Miscellaneous
