VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE – USING DYNAMIC DIRECTPATH IO (PASSTHROUGH) WITH VMS

vSphere 7 and 8 offer two passthrough options: DirectPath I/O and Dynamic DirectPath I/O. Dynamic DirectPath I/O is the vSphere brand name for passing PCI devices through to virtual machines. It allows assigning a dedicated GPU to a VM with the lowest possible overhead. DirectPath I/O assigns a PCI passthrough device by identifying a specific physical device at a specific bus location on a specific ESXi host, using the Segment/Bus/Device/Function format. This configuration restricts the VM to that specific ESXi host.
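To make the addressing concrete, here is a minimal sketch (not a VMware API, just standard PCI notation) of how a Segment/Bus/Device/Function address such as "0000:3b:00.0" breaks down into the fields that pin a device to one physical slot on one host:

```python
# Illustrative helper: parse a PCI address in Segment/Bus/Device/Function
# (SBDF) notation, e.g. "0000:3b:00.0". The field layout follows standard
# PCI addressing; the example address is hypothetical.

from typing import NamedTuple

class PciAddress(NamedTuple):
    segment: int   # 16-bit PCI segment (domain)
    bus: int       # 8-bit bus number
    device: int    # 5-bit device number
    function: int  # 3-bit function number

def parse_sbdf(sbdf: str) -> PciAddress:
    """Split 'SSSS:BB:DD.F' into its numeric components (hex fields)."""
    segment, bus, dev_fn = sbdf.split(":")
    device, function = dev_fn.split(".")
    return PciAddress(int(segment, 16), int(bus, 16),
                      int(device, 16), int(function, 16))

addr = parse_sbdf("0000:3b:00.0")
```

Because a DirectPath I/O assignment resolves to exactly one slot on one host like this, the VM cannot move to another host; Dynamic DirectPath I/O instead selects a device by label/vendor at power-on, which is what removes that host affinity.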

#47 - HOW VMWARE ACCELERATES CUSTOMERS ACHIEVING THEIR NET ZERO CARBON EMISSIONS GOAL

In episode 047, we spoke with Varghese Philipose about VMware’s sustainability efforts and how they help our customers meet their sustainability goals. Features like the Green Score help many of our customers understand how they can lower their carbon emissions and, hopefully, reach net zero. Topics discussed:

Creating sustainability dashboards - https://blogs.vmware.com/management/2019/06/sustainability-dashboards-in-vrealize-operations-find-how-much-did-you-contribute-to-a-greener-planet.html
Sustainability dashboards in vROps 8.6 - https://blogs.vmware.com/management/2021/10/sustainability-dashboards-in-vrealize-operations-8-6.html
VMware Green Score - https://blogs.vmware.com/management/2022/11/vmware-green-score-in-aria-operations-formerly-vrealize-operations.html
Intrinsically green - https://news.vmware.com/esg/intrinsically-evergreen-vmware-earth-day-2023
Customer success story - https://blogs.vmware.com/customer-experience-and-success/2023/04/tam-partnerships-make-customers-the-hero.html

Follow the podcast on Twitter for updates and news about upcoming episodes: https://twitter.com/UnexploredPod.

VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE – ESXI HOST BIOS, VM, AND VCENTER SETTINGS

To deploy a virtual machine with a vGPU, whether a TKG worker node or a regular VM, you must enable some ESXi host-level and VM-level settings. All these settings relate to the isolation of GPU resources, memory-mapped I/O (MMIO), and the ability of the (v)CPU to engage with the GPU using native CPU instructions. MMIO provides the most consistent high performance possible. By default, vSphere assigns an MMIO region (an address range, not actual memory pages) of 32GB to each VM. However, modern GPUs are ever more demanding and introduce new technologies that require the ESXi host, VM, and GPU settings to be in sync. This article shows why you need to configure these settings, starting with an overview of the required settings.
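As a sketch of the VM-level side: when the GPUs assigned to a VM expose more memory than the default 32GB MMIO address range covers, the commonly documented VM advanced settings enlarge the 64-bit MMIO region. The size below is an example value; the usual guidance is to round the combined GPU memory up to the next power of two.

```
# VM advanced settings (.vmx) - example values, adjust to your GPUs
pciPassthru.use64bitMMIO = "TRUE"    # map device MMIO above the 4GB boundary
pciPassthru.64bitMMIOSizeGB = "128"  # e.g., two 40GB GPUs: 80GB rounded up to 128
```

These keys can be set in the vSphere Client under the VM's advanced configuration parameters; the VM must be powered off when changing them.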

VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE – NVIDIA AI ENTERPRISE SUITE

vSphere allows assigning GPU devices to a VM using VMware’s (Dynamic) DirectPath I/O technology (passthrough) or NVIDIA’s vGPU technology. The NVIDIA vGPU technology is a core part of the NVIDIA AI Enterprise suite (NVAIE). NVAIE is more than just the vGPU driver. It’s a complete technology stack that allows data scientists to run an end-to-end workflow on certified accelerated infrastructure. Let’s look at what NVAIE offers and how it works under the covers.

VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE - GPU DEVICE DIFFERENTIATORS

The last two parts reviewed the capabilities of the platform. vSphere can offer anything from fractional GPUs to multi-GPU setups, catering to the workload’s needs in every stage of its development life cycle. Let’s look at the features and functionality of each supported GPU device. Currently, the range of supported GPU devices is quite broad. In total, 29 GPU devices are supported, dating from 2016 to the latest release in 2023. A table at the end of the article includes links to each GPU’s product brief and datasheet. Although NVIDIA and VMware form a close partnership, the listed device support is not a complete match. This can lead to some interesting questions, typically answered with “it should work.” But as always, if you want bulletproof support, follow the guides to ensure more leisure time on weekends and nights.

#46 - VMWARE CLOUD FLEX COMPUTE TECH PREVIEW

We’re extending the VMware Cloud Services overview series with a tech preview of the VMware Cloud Flex Compute service. Frances Wong shares a lot of interesting use cases and details with us in this episode! In short, VMware Cloud Flex Compute is a new approach to the enterprise-grade VMware Cloud: instead of obtaining a full SDDC, it is sliced, diced, sold, and deployed in fractional SDDC increments in the global cloud.

VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE FOR DISTRIBUTED TRAINING - MULTI-GPU

The first part of the series reviewed the capabilities of the vSphere platform to assign fractional and full GPUs to workloads. This part zooms in on the multi-GPU capabilities of the platform. Let’s review the full spectrum of ML accelerators that vSphere offers today. In vSphere 8.0 Update 1, an ESXi host can assign up to 64 (Dynamic) DirectPath I/O (passthrough) full GPU devices to a single VM. With NVIDIA vGPU technology, vSphere supports up to 8 full vGPU devices per ESXi host, all of which can be assigned to a single VM.

VSPHERE ML ACCELERATOR DEEP DIVE - FRACTIONAL AND FULL GPUS

Many organizations are building a sovereign ML platform that aids their data science, software development, and operations teams. Although plenty of great ML platform services are available, many practitioners have discovered that a one-size-fits-all platform doesn’t suit their needs. There are plenty of reasons why an organization chooses to build its own ML platform; it can be as simple as control over maintenance windows, the ability to curate its own toolchain, reliance on a non-opinionated tech stack, or governance/regulatory reasons.

VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE SERIES

The number of machine learning workloads in on-prem data centers is increasing rapidly. These workloads arrive in different ways: either embedded within the application itself, or built by data science teams as solutions that incorporate machine learning models to generate predictions or influence actions when needed. Another significant influx of ML workloads comes from solutions previously prototyped in the cloud that are now moving into the on-prem environment, whether for data gravity, governance, economics, or infrastructure (maintenance) control reasons. TechCrunch recently published an interesting article on this phenomenon.

VSPHERE 8.0 UPDATE 1 ENHANCEMENTS FOR ACCELERATING MACHINE LEARNING WORKLOADS

vSphere 8 Update 1 was recently released, introducing excellent enhancements ranging from VM-level power consumption metrics to Okta Identity Federation for vCenter. In this article, I want to investigate the enhancements that accelerate machine learning workloads. If you want to hear about all the goodness provided by Update 1, I recommend listening to episode 40 of the Unexplored Territory Podcast with Féidhlim O’Leary (Spotify | Apple).

Machine learning is rapidly becoming an essential tool for organizations and businesses worldwide. The desire for accurate models is overwhelming; in many cases, the value of a model comes from its accuracy. The machine learning community strives to build more intelligent algorithms, but we still live in a world where processing more training data generates a more accurate model. A prime example is large language models (LLMs) such as ChatGPT: the more data you add, the more accurate they get.