VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE FOR DISTRIBUTED TRAINING - MULTI-GPU
The first part of the series reviewed the capabilities of the vSphere platform to assign fractional and full GPUs to workloads. This part zooms in on the multi-GPU capabilities of the platform. Let’s review the full spectrum of ML accelerators that vSphere offers today. In vSphere 8.0 Update 1, an ESXi host can assign up to 64 (Dynamic) DirectPath I/O (passthrough) full GPU devices to a single VM. With NVIDIA vGPU technology, vSphere supports up to 8 full vGPU devices per ESXi host, and all of these vGPU devices can be assigned to a single VM.
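As a rough illustration of what multi-device assignment looks like at the VM configuration level, a .vmx file for a VM with two NVIDIA vGPU devices attached typically contains entries along these lines. This is a minimal sketch: the profile name (grid_a100d-80c) and the MMIO sizing values are examples and depend on your GPU model, vGPU software release, and device count.

```
pciPassthru0.present = "TRUE"
pciPassthru0.vgpu = "grid_a100d-80c"
pciPassthru1.present = "TRUE"
pciPassthru1.vgpu = "grid_a100d-80c"
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "256"
```

In practice these settings are generated for you when you add PCI devices through the vSphere Client; the fragment is only meant to show that each vGPU device appears as its own pciPassthruN entry on the VM.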
VSPHERE ML ACCELERATOR DEEP DIVE - FRACTIONAL AND FULL GPUS
Many organizations are building a sovereign ML platform that aids their data science, software development, and operations teams. Although plenty of great ML platform services are available, many practitioners have discovered that a one-size-fits-all platform doesn’t suit their needs. There are plenty of reasons why an organization chooses to build its own ML platform; it can be as simple as control over maintenance windows, the ability to curate its own toolchain, reliance on a non-opinionated tech stack, or governance and regulatory requirements.
VSPHERE ML ACCELERATOR SPECTRUM DEEP DIVE SERIES
The number of machine learning workloads in on-prem data centers is increasing rapidly. They arrive in different ways: either embedded in the application itself, or built by data science teams as solutions that incorporate machine learning models to generate predictions or influence actions when needed. Another significant influx of ML workloads comes from ML solutions previously prototyped in the cloud that are now moved into the on-prem environment, whether for data gravity, governance, economics, or infrastructure (maintenance) control reasons. TechCrunch recently published an interesting article on this phenomenon.
VSPHERE 8.0 UPDATE 1 ENHANCEMENTS FOR ACCELERATING MACHINE LEARNING WORKLOADS
Recently, vSphere 8 Update 1 was released, introducing excellent enhancements ranging from VM-level power consumption metrics to Okta Identity Federation for vCenter. In this article, I want to investigate the enhancements that accelerate machine learning workloads. If you want to hear about all the goodness provided by Update 1, I recommend listening to episode 40 of the Unexplored Territory Podcast with Féidhlim O’Leary (Spotify | Apple). Machine learning is rapidly becoming an essential tool for organizations and businesses worldwide. The desire for accurate models is overwhelming; in many cases, the value of a model comes from its accuracy. The machine learning community strives to build more intelligent algorithms, but we still live in a world where processing more training data generates a more accurate model. A prime example is large language models (LLMs) such as ChatGPT: the more data you add, the more accurate they get.
VMWARE CLOUD SERVICES OVERVIEW PODCAST SERIES
Over the last year, we’ve interviewed many guests on the Unexplored Territory Podcast, and we wanted to provide a mini overview series of the VMware Cloud services. Today we released the latest episode, featuring Jeremiah Megie discussing the Azure VMware Solution. Azure VMware Solution: listen on Spotify or Apple. VMware Cloud on AWS: in episode 013, we talk to Adrian Roberts, Head of EMEA Solution Architecture for VMware Cloud on AWS at AWS. Adrian discusses the various reasons customers are looking to utilize VMware Cloud on AWS, some of the challenges, and the opportunities that arise when you have your VMware workloads close to native AWS services.
RESEARCH AND INNOVATION AT VMWARE WITH CHRIS WOLF
In episode 042 of the Unexplored Territory podcast, we talk to Chris Wolf, Chief Research and Innovation Officer at VMware, about innovation at VMware and exciting new research projects. Make sure to follow Chris on Twitter. Also, check out the following resources Chris mentioned in the episode: Cloud Native Security Inspector (Project Narrows), FATE, OpenFL, and WASM. Follow us on Twitter for updates and news about upcoming episodes: https://twitter.com/UnexploredPod. Last but not least, hit that subscribe button, rate wherever possible, and share the episode with your friends and colleagues!
MY PICKS FOR NVIDIA GTC SPRING 2023
This week, GTC Spring 2023 kicks off again. These are the sessions I'm looking forward to next week. Please leave a comment if you want to share a must-see session. MLOps. Title: Enterprise MLOps 101 [S51616]. The boom in AI has seen a rising demand for better AI infrastructure, both in the compute hardware layer and in AI framework optimizations that make optimal use of accelerated compute. Unfortunately, organizations often overlook the critical importance of a middle tier: infrastructure software that standardizes the machine learning (ML) life cycle, adding a common platform for teams of data scientists and researchers to standardize their approach and eliminate distracting DevOps work. This process of building the ML life cycle is known as MLOps, with end-to-end platforms being built to automate and standardize repeatable manual processes. Although dozens of MLOps solutions exist, adopting them can be confusing and cumbersome. What should you consider when employing MLOps? How can you build a robust MLOps practice? Join us as we dive into this emerging, exciting, and critically important space.
DISCOVER WHAT'S NEW IN VSPHERE 8.0 U1 AND VSAN 8.0 U1
We (the Unexplored Territory team) work with the vSphere release team to get you the latest information about the new releases as quickly as possible. This week we published two new episodes discussing what’s new with vSphere 8.0 U1 and vSAN 8.0 U1. To enjoy the content, you can listen to them using your favorite podcast apps, such as Apple or Spotify, or the embedded players below.
SIMULATING NUMA NODES FOR NESTED ESXI VIRTUAL APPLIANCES
To troubleshoot particular NUMA client behavior in a heterogeneous multi-cloud environment, I needed to set up an ESXi 7.0 environment. Currently, my lab runs ESXi 8.0, so I turned to William Lam's excellent repository of nested ESXi virtual appliances and downloaded a copy of the 7.0 U3k version. My physical ESXi hosts are equipped with Intel Xeon Gold 5218R CPUs, containing 20 cores per socket. The smallest ESXi host in the environment I need to simulate contains ten cores per socket. Therefore, I created a virtual ESXi host with 20 vCPUs and ensured that there were two virtual sockets (10 cores per socket).
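For reference, the vCPU topology described above maps to a couple of standard .vmx settings. A minimal sketch; the vNUMA setting on the last line is optional and only relevant if you want to force the virtual NUMA node size rather than rely on automatic sizing:

```
numvcpus = "20"
cpuid.coresPerSocket = "10"
numa.vcpu.maxPerVirtualNode = "10"
```

With 20 vCPUs and 10 cores per socket, the nested ESXi guest sees two sockets, matching the two-NUMA-node host I needed to simulate.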
SAPPHIRE RAPIDS MEMORY CONFIGURATION
The 4th generation of the Intel Xeon Scalable Processors (codenamed Sapphire Rapids) was released early this year, and I’ve been trying to wrap my head around what’s new, what’s good, and what’s challenging. Besides the new hardware-native accelerators, which a later blog post will cover, I noticed the return of different memory speeds when using multiple DIMMs per channel. Before the Scalable Processor architecture arrived in 2017, you faced the devil's triangle when configuring memory: cheap, high capacity, or fast; pick two. Those Xeons offered four memory channels per CPU package, and each memory channel could support up to three DIMMs. The memory speed decreased when a channel was equipped with three DIMMs (3 DPC).
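To put numbers on that trade-off: theoretical peak memory bandwidth per socket is simply channels × transfer rate × bus width. A small sketch with illustrative DDR4 transfer rates (the exact speeds at 1 DPC versus 3 DPC depend on the CPU generation and DIMM type, so treat the inputs below as examples):

```python
def peak_bandwidth_gbs(channels: int, transfer_rate_mts: int,
                       bus_width_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth per CPU socket in GB/s.

    bandwidth = channels * transfer rate (MT/s) * bytes per transfer.
    A standard (non-ECC-widened) DDR channel moves 64 bits = 8 bytes
    per transfer.
    """
    return channels * transfer_rate_mts * bus_width_bytes / 1000


# Illustrative four-channel Xeon: full speed at 1 DPC vs. a reduced
# clock at 3 DPC (example values).
print(peak_bandwidth_gbs(4, 2133))  # 68.256 GB/s
print(peak_bandwidth_gbs(4, 1600))  # 51.2 GB/s
```

The point of the arithmetic: populating a third DIMM per channel grows capacity, but if it forces the channel down from, say, 2133 MT/s to 1600 MT/s, you give up roughly a quarter of the socket's theoretical bandwidth.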