Search Results for: machine learning

vSphere ML Accelerator Deep Dive – Fractional and Full GPUs

May 10, 2023 by frankdenneman

Many organizations are building a sovereign ML platform that aids their data scientists, software developers, and operator teams. Although plenty of great ML platform services are available, many practitioners have discovered that a one-size-fits-all platform doesn’t suit their needs. There are plenty of reasons why an organization chooses to build its own ML platform; it can be as simple as control over maintenance windows, the ability to curate their own toolchain, reliance on a non-opinionated tech stack, or governance and regulatory reasons.

The first step is to determine the primary workload. Will this be only inference, training, or a mix of both? Getting some servers and a few GPU resources might sound like a good start, but understanding the workload in more detail allows you to get the right resources and create a relevant and valuable platform. 

If your organization plans to purchase ML-assisted products and services, the focus shifts towards deploying an “inference” workload. Inference workloads are production-ready machine learning models infused in services or applications that process unseen data and generate an action for the subsequent business process or a recommendation. These workloads require the appropriate hardware and orchestration services. If the models are vendor-proprietary, a monitoring suite focused on service availability could suffice.

If your organization builds models, the ML platform should focus on two distinct disciplines: model development and model deployment. A term often heard in this scenario is MLOps, DevOps for the machine learning ecosystem. The ML platform should provide an infrastructure and software platform that helps data scientists develop their models. Data scientists are highly skilled in calculus, linear algebra, and statistics. They are typically not hardcore developers, nor are they infrastructure-tech savvy. The unicorns are the ones who know enough to help themselves create their own world and develop their models. This blog series and the training vs. inference series intend to bring you closer to the data science team and help you understand some of the nuances of machine learning without going through a full-fledged linear algebra course.

ML Development Lifecycle

A machine learning model that is fully trained and deemed production ready must be deployed. It cannot run in thin air. It needs to be incorporated into a service or an application. A model is never a standalone feature; once the data scientist is done with this model version, you need developers ready to incorporate the model into a software system, deploy it, and scale it to serve the inference requests. Software tools are needed to build, test, release, deploy, and monitor these ML-assisted services. The model development lifecycle, or ML project workflow, is typically categorized into three broad areas:

  1. Build process
  2. Training process
  3. Deployment process

In the build process, the data science team determines what framework and algorithm to use during the concept phase. They explore what data is available, where the data lives, and how they can access it. They study the idea’s feasibility by running some tests using small data sets.

In the training process, the data science team narrows down the candidate algorithms and trains the models to learn from the data. Based on the training process results, the model is tuned and retrained. The training process can be cyclical and include various steps from the build process, such as finding more data, as the current dataset might not be satisfactory.

The deployment process is where the model is moved into production. It’s now available to the user and processes unseen data. Models facing human behavior tend to deteriorate over time at a much faster rate than models built to augment or support closed-loop mechanical systems. Human behavior simply changes over time, and thus the model slowly detects fewer of the patterns it was trained to recognize. For these models, a recurring training loop must be created, where production data is captured and prepared as new datasets to train the model, and the old model is replaced with a freshly trained one.

To successfully integrate, deploy, operate, monitor, retrain, and re-release models, you must create a platform that allows DevOps and machine learning teams to develop together. This is where MLOps platforms add tremendous value to the parties involved. Mix this with an ML-savvy VI-admin and operator team, and this group can help the organization achieve its goals. This series covers the features and functionalities of the ML accelerators available in vSphere and how to set them up in vSphere and Tanzu Kubernetes Grid Services. Articles about MLOps platforms are planned for later this year.

Understanding the three ML processes better is essential for the infrastructure-focused operator, as this translates to hardware requirements. Let’s look at what’s supported by vSphere first and then map these features and functionalities to the ML development lifecycle processes. vSphere and Tanzu Kubernetes Grid Services can assign ML accelerators (GPUs) to workloads. Three configurations are possible: a full GPU, a fractional GPU, and multiple GPUs assigned to a single VM. Fractional GPU functionality allows vSphere to split up a full GPU and assign the smaller GPUs to multiple VMs. With Multi-GPU, the ESXi host can assign multiple GPUs to a single VM. NVIDIA GPUDirect RDMA technology significantly improves communication between GPU-enabled VMs on different ESXi hosts. Throughout this series, we will continuously dive deeper into each technology.

Full GPUs

vSphere allows assigning a full GPU to a VM, either by using VMware’s (Dynamic) Direct Path I/O technology (passthrough) or NVIDIA’s vGPU technology. This full GPU is exclusively available to that workload; no sharing between VMs is possible. (Dynamic) Direct Path I/O provides VMs access to the physical functions of the GPU device with the help of memory-mapped I/O. One of the articles in this series covers this topic in detail. The difference between Dynamic Direct Path I/O and Direct Path I/O is the method of assigning the device to the VM. Direct Path I/O assigns the GPU device to the VM based on the PCIe address of the device. In contrast, Dynamic Direct Path I/O uses a key-value method using either custom or vendor device-generated labels. This allows vSphere to decouple the static relationship between VM and device and provides more flexibility for the initial placement processes used by DRS and HA. By default, vSphere 8 uses Dynamic Direct Path I/O with vendor-generated labels.

NVIDIA vGPU builds on Dynamic Direct Path I/O and installs the NVIDIA vGPU Manager in the kernel. It allows for creating fractional GPUs and makes vMotion possible. And this is where the choice between both technologies becomes interesting when assigning a full GPU.

|  | Direct Path I/O | Dynamic Direct Path I/O | NVIDIA vGPU |
|---|---|---|---|
| Failover HA | No | Yes | Yes |
| Initial Placement DRS | No | Yes | Yes |
| Load Balance DRS | No | No | No |
| vMotion | No | No | Yes |
| Host Maintenance Mode | Shutdown VM | Cold Migration | Manual vMotion |
| Snapshot | No | No | Yes |
| Suspend and Resume | No | No | Yes |
| Fractional GPUs | No | No | Yes |
| TKGS VMClass Support | No | Yes | Yes |

Both Dynamic Direct Path I/O and vGPU allow assigning a dedicated GPU to a VM, which can help the data science team achieve their goals in the build, train, or deploy process. But often, more elegant and more efficient technologies are available that create a suitable environment for the workloads while increasing overall resource availability within the platform, ready for the data science teams to utilize.

Fractional GPUs

Fractional GPUs enable multiple VMs to have simultaneous, direct access to a single physical GPU by partitioning a physical GPU device into multiple smaller GPU instances. This functionality is provided by NVIDIA Virtual GPU technology and is available on data center class GPUs and a subset of NVIDIA RTX GPU devices. A vGPU device supports multiple vGPU types that are optimized for specific workloads. The vGPU types applicable for machine learning workloads are C-series and Q-series. 

The C-series is optimized for compute-intensive workloads, which are pretty much the classical ML workloads. The Q-series type can do the same, but the key difference is that the C-type can only decode video streams, while the Q-type can also (hardware) encode video streams. This difference is essential to know if the data science team plans to deploy a vision AI model. If the model only generates an action or a warning after object/anomaly detection in a video stream, the video is not encoded, and thus only decoders are necessary; a C-series vGPU type is sufficient. However, if the video stream is encoded after being processed by the model, because human intervention or a second opinion is required, then a Q-series type is required.
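
To make the decision concrete, here is a tiny illustrative helper that captures the rule above; the function name and parameters are mine, not an NVIDIA or VMware API. Pick the Q-series only when the processed stream must be hardware re-encoded.

```python
# Illustrative only: encodes the C-series vs. Q-series decision described above.
def vgpu_series_for_vision_ai(reencode_output_stream: bool) -> str:
    """Return the vGPU series suited for a video inference pipeline."""
    return "Q-series" if reencode_output_stream else "C-series"

print(vgpu_series_for_vision_ai(False))  # detection-only pipeline -> C-series
print(vgpu_series_for_vision_ai(True))   # encoded output for human review -> Q-series
```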

NVIDIA vGPU offers two modes: the default time-sliced mode and the Multi-Instance GPU (MIG) mode, available from the Ampere architecture onwards. A vGPU is assigned to a VM by selecting a vGPU type (C- or Q-series) and a frame buffer size (GPU memory). When using MIG mode, the vGPU type also provides the ability to specify compute elements. A GPU device runs either in time-sliced mode or in MIG mode; there is no possibility of creating a heterogeneous vGPU environment where MIG and time-sliced profiles share the same physical GPU device. You can, however, deploy multiple GPU devices in one ESXi host and configure one GPU in time-sliced mode and another in MIG mode.

The number of C- and Q-series vGPU instances depends on the GPU type. For example, an A100 40GB allows ten time-sliced C-series instances with a 4GB frame buffer, while the A100 80GB allows twenty instances of the same configuration. The A30 and A100 only have video decoders onboard, not video encoders, which is why there is no Q-series vGPU type available for the A100. A time-sliced vGPU type provides exclusive use of the configured frame buffer until the VM is destroyed. Interesting to note is that the frame buffer cannot be over-allocated. Thus, a 40 GB GPU will only accept five VMs with an 8GB frame buffer; attempting to power on a sixth VM with an 8GB frame buffer fails, even if all the VMs are idle.
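
As a quick sanity check, that admission rule can be sketched in a few lines of Python (illustrative only, not a vSphere or NVIDIA API): the frame buffer is carved out of physical GPU memory, so the power-on check fails as soon as the memory is fully allocated, regardless of how idle the existing VMs are.

```python
# Illustrative sketch: time-sliced vGPU frame buffers cannot be over-allocated.
def max_vgpus(gpu_memory_gb: int, frame_buffer_gb: int) -> int:
    """Return how many vGPU-backed VMs with a given frame buffer fit on one GPU."""
    return gpu_memory_gb // frame_buffer_gb

def can_power_on(gpu_memory_gb: int, frame_buffer_gb: int, powered_on_vms: int) -> bool:
    """Power-on admission check: fails once the frame buffer is fully allocated,
    even if all currently powered-on VMs are idle."""
    return powered_on_vms < max_vgpus(gpu_memory_gb, frame_buffer_gb)

print(max_vgpus(40, 8))        # 5 -> a 40 GB GPU accepts five 8 GB vGPUs
print(can_power_on(40, 8, 5))  # False -> the sixth VM fails to power on
```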

The GPU best effort scheduler coordinates access to the GPU device, allowing active workloads to utilize all the compute architecture, such as decoders (NVDEC), encoders (NVENC), and copy engines (CE) on the GPU device. If multiple VMs access the GPU, the scheduler schedules these workloads serially. A time slice determines the time window in which a vGPU can generate workload on the GPU before it is preempted and access is granted to another VM. The time slice is based on the maximum number of vGPUs allowed for the vGPU type on that physical GPU, which in turn is determined by the total GPU memory and the assigned frame buffer per vGPU type. If the maximum number of vGPUs on that device is less than or equal to eight, the time slice is 2 ms; if it’s more than eight, the time slice is reduced to 1 ms. The scheduler round-robins the active workloads. Thus, if only one workload is active, the scheduler constantly assigns it a time slice; the moment another workload activates, the scheduler adjusts. Most ML applications appreciate a longer time slice as they require maximum throughput. NVIDIA allows for policy and time-slice adjustments. The following articles in this blog series cover the elements in the diagram (GPU Processing Clusters, Streaming Multiprocessors, MMIO space, BAR, etc.).
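
The default time-slice rule described above is easy to capture in a small sketch (again illustrative only; the real values come from the NVIDIA vGPU scheduler, and NVIDIA allows overriding them):

```python
# Illustrative sketch of the default best-effort scheduler time-slice rule
# described above: the slice depends on the maximum number of vGPUs the
# profile allows on the physical GPU.
def max_vgpus_per_gpu(gpu_memory_gb: int, frame_buffer_gb: int) -> int:
    return gpu_memory_gb // frame_buffer_gb

def default_time_slice_ms(gpu_memory_gb: int, frame_buffer_gb: int) -> int:
    """2 ms if the profile allows at most eight vGPUs on the GPU, otherwise 1 ms."""
    return 2 if max_vgpus_per_gpu(gpu_memory_gb, frame_buffer_gb) <= 8 else 1

print(default_time_slice_ms(40, 8))  # 5 vGPUs max  -> 2 ms
print(default_time_slice_ms(80, 4))  # 20 vGPUs max -> 1 ms
```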

Time-shared Fractional GPU use case – The Build Process

If we return to the ML model development cycle, a time-sliced vGPU during the build process might be an excellent fit for most teams. They study the idea’s feasibility by running some tests using small data sets. As a result, the team will run and test some code, with lots of idle time in between. The typical run time is seconds to minutes for these code tests.

In many cases, the CPU provides enough power to run these tests. Still, if the data science team wants to research the effect and behavior of the combination of the ML model and the GPU architecture, a vGPU can be beneficial.

When looking at the situation from a platform operator perspective, this moment is where pooling and abstraction, two core tenets of VMware’s DNA, come into play. We can consolidate the efforts of different data science teams in a centralized environment and offer fractional GPUs. Sometimes a full GPU makes sense in these situations. But that is up to the discretion of the teams and organization. Fractional GPU provides tremendous benefits when used in the proper context.

Multi-Instance GPU vGPU 

Multi-instance GPU functionality is also great for the build process. It can create up to seven separate GPU partitions called instances by isolating the frame buffer, GPU cores, compute engines, and decoders. Predictable and consistent performance is the outcome of this strict isolation, and therefore MIG vGPUs are typically deployed to accelerate inference production workloads. 

| Profile Name | Memory | SMs | Decoders | Copy Engines | Instances |
|---|---|---|---|---|---|
| MIG 1g.10gb | 1/8 | 1/7 | 0 NVDECs / 0 JPEG / 0 OFA | 1 | 7 |
| MIG 1g.10gb+me | 1/8 | 1/7 | 1 NVDEC / 1 JPEG / 1 OFA | 1 | 1 |
| MIG 1g.20gb | 1/8 | 1/7 | 1 NVDEC / 0 JPEG / 0 OFA | 1 | 4 |
| MIG 2g.20gb | 2/8 | 2/7 | 1 NVDEC / 0 JPEG / 0 OFA | 2 | 3 |
| MIG 3g.40gb | 4/8 | 3/7 | 2 NVDECs / 0 JPEG / 0 OFA | 3 | 2 |
| MIG 4g.40gb | 4/8 | 4/7 | 2 NVDECs / 0 JPEG / 0 OFA | 4 | 1 |
| MIG 7g.80gb | Full | Full | 5 NVDECs / 1 JPEG / 1 OFA | 7 | 1 |

MIG provides a composable configuration of GPU resources. Although the profiles are pre-configured and cannot be changed, users can isolate the correct elements for the job. An A100 80GB GPU device contains seven GPU Processing Clusters (GPCs). Each GPC contains 16 Streaming Multiprocessors (SMs). An SM contains L0 and L1 cache and four tensor cores that perform the needed FP16/FP32 mixed-precision fused multiply-add (FMA) operations and acceleration for all the data types (FP16, BF16, TF32, FP64, INT8, INT4). An A100 GPU contains ten memory controllers that offer access to five HBM2 stacks, which are logically grouped into eight GPU memory slices. There are seven NVDECs (video decoders), seven NVJPGs (image decoders), and one Optical Flow Accelerator (OFA). In total, there are seven copy engines, which are responsible for transferring data in and out of the GPU.

For example, the MIG vGPU profile MIG4g.40gb constructs a GPU instance from four “GPU slices.” A GPU slice includes a “Sys Pipe,” a GPC, an L2 cache slice, and a GPU memory slice. A GPU memory slice includes the L2 cache slices and the associated frame buffer. These are dedicated memory resources. An application consuming a GPU instance does not consume an L2 slice from another GPU instance. This partitioning ensures fault isolation, error containment, recovery, and QoS.

The sys pipe communicates with the CPU and is responsible for GPC task scheduling. MIG creates a separate and isolated data path through the entire system, from the crossbar parts all the way to the memory controllers and its DRAM address buses. It’s expected that if more GPU memory is assigned to a GPU instance, more data is copied between the ESXi host system and GPU memory. Thus, dedicated copy engines are assigned to the GPU instance. Additionally, a dedicated number of decoders are assigned per GPU instance. Returning to the MIG Instance example, the MIG vGPU profile MIG4g.40gb isolates four of the eight available memory slices, four GPCs, four copy engines, and two decoders.
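
A minimal sketch of this slice accounting, using a subset of the profile data from the table above (the dictionary and helper are mine, not an NVIDIA API), shows why some instance mixes fit on an A100 80GB and others do not:

```python
# Illustrative sketch: each MIG profile isolates a fixed share of memory slices,
# GPCs, copy engines, and decoders (values taken from the table above, A100 80GB).
MIG_PROFILES = {
    # name:         (memory slices of 8, GPCs of 7, copy engines, NVDECs)
    "MIG 1g.10gb":  (1, 1, 1, 0),
    "MIG 2g.20gb":  (2, 2, 2, 1),
    "MIG 3g.40gb":  (4, 3, 3, 2),
    "MIG 4g.40gb":  (4, 4, 4, 2),
    "MIG 7g.80gb":  (8, 7, 7, 5),
}

def fits_on_a100_80gb(requested: list[str]) -> bool:
    """Check whether a mix of MIG instances fits the 8-memory-slice / 7-GPC budget."""
    mem = sum(MIG_PROFILES[p][0] for p in requested)
    gpc = sum(MIG_PROFILES[p][1] for p in requested)
    return mem <= 8 and gpc <= 7

print(fits_on_a100_80gb(["MIG 4g.40gb", "MIG 3g.40gb"]))  # True  (8 slices, 7 GPCs)
print(fits_on_a100_80gb(["MIG 4g.40gb", "MIG 4g.40gb"]))  # False (would need 8 GPCs)
```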

MIG provides a defined Quality of Service (QoS) and enhanced security due to isolation. Performance remains consistent throughout the vSphere cluster, even if the VM is migrated to another ESXi host. The vGPU is never impacted if another workload saturates its GPU instance. In contrast, time-sliced vGPUs provide a more dynamic environment, sometimes leading to better performance if the other GPU tenants are idling. However, there is no prioritization mechanism to indicate that a particular workload requires priority over others, so performance can be inconsistent; it depends on the activity of the other tenants. Although MIG instances are hard-coded swimming lanes, and no other workload will dip into this pool of resources and internal pathways, the workload cannot go beyond its own swimming lane even if the other MIG slices are idle. So there is no clear winner: peak performance depends on the other GPU tenants, but if consistent performance is required, look no further than MIG technology. The Ampere and Hopper architectures provide MIG technology in specific data center GPUs. One of the following articles in the series depicts the availability of all the features in the supported vSphere and NVIDIA AI Enterprise (NVAIE) range.

MIG vGPU Fractional GPU use case – The Deployment Process

If we return to the ML model development cycle, the deployment phase requires consistent performance. Of course, there is no problem in assigning Full GPUs to the workload, but not every inference workload needs that many resources. MIG can offer the right amount of high-performing yet efficient technology. The training vs. inference series dove deep into both workload characteristics. For the typical inference workload, we notice a pattern of lightweight, latency-sensitive streaming data with lower computational needs than the training workload. 

|  | Training | Inference |
|---|---|---|
| Data Flow | Batch data | Streaming data |
| Storage Characteristics | Throughput-based | Latency-based, occasionally throughput |
| Batch Size | Many recommendations between 1-32. A smaller batch size reduces the memory footprint and increases algorithm performance (generalization); a larger batch size increases compute efficiency and parallelization (multi-GPU) | 1-4 |
| Data Access | Random access on a large data set; multiple batches are prefetched to keep the pipeline full. Fast storage medium recommended; fast storage and network recommended for distributed training | Streaming data |
| Memory Footprint | Large memory footprint: forward propagation pass, backpropagation pass, and model parameters. Activations are held for a long time and make up the bulk of the memory footprint | Smaller memory footprint: forward propagation pass and model parameters. Activations are short-lived (total memory footprint ≈ the two largest consecutive layers) |
| Numerical Precision | Higher precision required | Lower precision required |
| Data Type | FP32, BF16, Mixed Precision (FP16+FP32) | BF16, INT8, INT4 (not seen often) |
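
The PyTorch sketch below contrasts the two columns of the table above: a training step with a batch of 32 in FP16/FP32 mixed precision and a backward pass, versus an inference step with a small batch in half precision and no gradients. The model and tensor shapes are placeholders for illustration, not taken from the article.

```python
# Illustrative sketch of the training vs. inference characteristics above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()

# Training step: batch of 32, FP16/FP32 mixed precision, backward pass keeps activations.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(32, 512, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Inference step: batch of 4, half precision, no gradients or long-lived activations.
model_fp16 = model.half().eval()
with torch.inference_mode():
    preds = model_fp16(torch.randn(4, 512, device="cuda", dtype=torch.float16)).argmax(dim=1)
```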

Training Process

For training workloads, the approach is typically straightforward: present as many GPU resources to the training job as possible. The table shows that training is throughput-based and requires a large memory footprint, which often exceeds the memory capacity of a single GPU.

Many data science teams explore distributed training methods to speed up training jobs and reduce training time. With today’s large models and large datasets, it’s common to see training jobs of 150+ hours (nearly a week of continuous training). For these workloads, vSphere supports the latest and greatest technology available. vSphere 7 and 8 support assigning multiple physical GPUs to a single VM, and NVIDIA provides high-speed interconnect technology to speed up inter-GPU communication during training jobs. Part 2 dives into the ML accelerator spectrum for distributed training – Multi-GPU technology.
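
As a rough illustration of what such a job looks like from the data scientist’s side, here is a minimal PyTorch DistributedDataParallel sketch launched with torchrun, one process per assigned GPU. The model and data are placeholders, and NCCL handles the inter-GPU communication mentioned above.

```python
# Minimal multi-GPU training sketch (placeholder model and data).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced over NCCL
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 512, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

# Example launch on a VM with two assigned GPUs:
#   torchrun --nproc_per_node=2 train.py
```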

Other articles in this series:

  • vSphere ML Accelerator Spectrum Deep Dive Series
  • vSphere ML Accelerator Spectrum Deep Dive – Fractional and Full GPUs
  • vSphere ML Accelerator Spectrum Deep Dive – Multi-GPU for Distributed Training
  • vSphere ML Accelerator Spectrum Deep Dive – GPU Device Differentiators
  • vSphere ML Accelerator Spectrum Deep Dive – NVIDIA AI Enterprise Suite
  • vSphere ML Accelerator Spectrum Deep Dive – ESXi Host BIOS, VM, and vCenter Settings
  • vSphere ML Accelerator Spectrum Deep Dive – Using Dynamic DirectPath IO (Passthrough) with VMs
  • vSphere ML Accelerator Spectrum Deep Dive – NVAIE Cloud License Service Setup

Filed Under: AI & ML

vSphere ML Accelerator Spectrum Deep Dive Series

May 3, 2023 by frankdenneman

The number of machine learning workloads in on-prem data centers is increasing rapidly. They arrive in different ways: either within the application itself, or data science teams build solutions that incorporate machine learning models to generate predictions or influence actions when needed. Another significant influx of ML workloads comes from ML solutions previously prototyped in the cloud that are now moved into the on-prem environment, whether for data gravity, governance, economics, or infrastructure (maintenance) control reasons. TechCrunch recently published an interesting article on this phenomenon.

But as an operator stuck between the data scientist, developer, and infrastructure, you can be overwhelmed with the requirements that need to be met, the new software stack, and the new terminology. You’ll soon realize that a machine learning model does not run in a vacuum. It’s either integrated into an application or runs as a service. Training and running a model are just steps in applying machine learning to an organizational process. A software stack is required to develop the model, another to train it, and another to integrate it into a service or application and monitor its accuracy. Models aimed at human behavior tend to deteriorate over time; our world changes, and the model needs to adjust to that behavior. As a result, a continuous development cycle is introduced to retrain the model regularly.

It’s essential to understand the data science teams’ world to be successful as an operator. Building the hardware and software technology stack together with a data science team helps you get early traction with other data science teams in the organization. As machine learning can be a shadow IT monster, it is vital to discover the needs of the data science teams. Build the infrastructure from the ground up, starting with the proper hardware ready to satisfy the requirements for training and inference jobs, and provide the right self-service platform that allows data science teams to curate their own toolset to achieve their goals.

To create the proper foundation, you need to understand the workload. However, most machine learning content is geared toward data scientists and primarily focuses on solving an algorithmic challenge while using domain-specific terminology. I’ve written several articles about the training and inference workloads to overcome this gap.

  • Part 1: Focuses on the ML model development lifecycle
  • Part 2: Gives a brief overview of the pipeline structure
  • Part 3: Zooms into Training versus Inference Data Flow and Access Patterns
  • Part 4: Provides a deep dive into memory consumption by Neural Networks
  • Part 5: Provides a deep dive into Numerical Precision
  • Part 6: Explores network compression technology in detail, such as pruning and sparsity

Parts 3 to 6 offer detailed insights into the technical requirements of neural networks during training jobs and the inference process. They help to interpret GPU functionality and gauge the expected load on the platform.

To successfully accelerate the workload, I want to dive deeper into the available vSphere and Tanzu options in the upcoming series. It focuses on the available spectrum of machine learning accelerators the NVIDIA AI Enterprise suite offers. What hardware capabilities are available, and how do you configure the platform? Although this series focuses on GPUs, I want to note that CPUs are an excellent resource for light training and inference. And with the latest release of the Intel Sapphire Rapids CPU with its Advanced Matrix Extensions (AMX), the future of CPUs in the ML ecosystem looks bright. But I’ll save that topic for another blog post (series).

Articles in this series:

  • vSphere ML Accelerator Spectrum Deep Dive Series
  • vSphere ML Accelerator Spectrum Deep Dive – Fractional and Full GPUs
  • vSphere ML Accelerator Spectrum Deep Dive – Multi-GPU for Distributed Training
  • vSphere ML Accelerator Spectrum Deep Dive – GPU Device Differentiators
  • vSphere ML Accelerator Spectrum Deep Dive – NVIDIA AI Enterprise Suite
  • vSphere ML Accelerator Spectrum Deep Dive – ESXi Host BIOS, VM, and vCenter Settings
  • vSphere ML Accelerator Spectrum Deep Dive – Using Dynamic DirectPath IO (Passthrough) with VMs
  • vSphere ML Accelerator Spectrum Deep Dive – NVAIE Cloud License Service Setup

Filed Under: AI & ML

My Picks for NVIDIA GTC Spring 2023

March 21, 2023 by frankdenneman

This week GTC Spring 2023 kicks off again. These are the sessions I look forward to next week. Please leave a comment if you want to share a must-see session.


MLOps

Title: Enterprise MLOps 101 [S51616]

The boom in AI has seen a rising demand for better AI infrastructure — both in the compute hardware layer and AI framework optimizations that make optimal use of accelerated compute. Unfortunately, organizations often overlook the critical importance of a middle tier: infrastructure software that standardizes the machine learning (ML) life cycle, adding a common platform for teams of data scientists and researchers to standardize their approach and eliminate distracting DevOps work. This process of building the ML life cycle is known as MLOps, with end-to-end platforms being built to automate and standardize repeatable manual processes. Although dozens of MLOps solutions exist, adopting them can be confusing and cumbersome. What should you consider when employing MLOps? How can you build a robust MLOps practice? Join us as we dive into this emerging, exciting, and critically important space.

Michael Balint, Senior Manager, Product Architecture, NVIDIA

William Benton, Principal Product Architect, NVIDIA

Title: Solving MLOps: A First-Principles Approach to Machine Learning Production [S51116]

We love talking about deploying our machine learning models. One famous (but probably wrong) statement says that “87% of data science projects never make it to production.” But how can we get to the promised land of “Production” if we’re not even sure what “Production” even means? If we could define it, we could more easily build a framework to choose the tools and methods to support our journey. Learn a first-principles approach to thinking about deploying models to production and MLOps. I’ll present a mental framework to guide you through the process of solving the MLOps challenges and selecting the tools associated with machine learning deployments.

Dean Lewis Pleban, Co-Founder and CEO, DagsHub

Title: Deploying Hugging Face Models to Production at Scale with GPUs [S51553]

Seems like everyone’s using Hugging Face to simplify and reuse advanced models and work collectively as a community. But how do you deploy these models into real business environments, along with the required data and application logic? How do you serve them continuously, efficiently, and at scale? How do you manage their life cycle in production (deploy, monitor, retrain)? How do you leverage GPUs efficiently for your Hugging Face deep learning models? We’ll share MLOps orchestration best practices that’ll enable you to automate the continuous integration and deployment of your Hugging Face models, along with the application logic in production. Learn how to manage and monitor the application pipelines, at scale. We’ll show how to enable GPU sharing to maximize application performance while protecting your investment in AI infrastructure and share how to make the whole process efficient, effective, and collaborative.

Yaron Haviv, Co-Founder and CTO, Iguazio

Title: Democratizing ML Inference for the Metaverse [S51948]

In this talk, I will walk you through the Roblox ML Platform inference service. You will learn how we integrate the Triton inference service with Kubeflow and KServe. I will describe how we simplify deployment for our end users to serve models on both CPUs and GPUs. Finally, I will highlight a few of our current use cases, like game recommendation and other computer vision models.

Denis Goupil, Principal ML Engineer, Roblox


Data Center / Cloud

Title: Using NVIDIA GPUs in Financial Applications: Not Just for Machine Learning Applications [S52211]

Deploying GPUs to accelerate applications in the financial service industry has been widely accepted and the trend is growing rapidly, driven in large part by the increasing uptake of machine learning techniques. However, banks have been using NVIDIA GPUs for traditional risk calculations for much longer, and these workloads present some challenges due to their multi-tenancy requirements. We’ll explore the use of multiple GPUs on virtualized servers leveraging NVIDIA AI Enterprise to accelerate an application that uses Monte Carlo techniques for risk/pricing application in a large international bank. We’ll explore various combinations of the virtualized application on VMware to show how NVIDIA AI Enterprise software runs this application faster. We’ll also discuss process scheduling on the GPUs and explain interesting performance comparisons using different VM configs. We’ll also detail best practices for application deployments.

Manvender Rawat, Senior Manager, Product Management, NVIDIA

Justin Murray, Technical Marketing Architect, VMware

Richard Hayden, Executive Director and Head of the QR Analytics Team, JP Morgan Chase

Title: AI in the Clouds: Navigating the Hybrid Sky with Ease (Presented by Run:ai) [S52352]

We’ll focus on the different use cases of running AI workloads in hybrid cloud and multi-cloud environments, and the challenges that come along with that. NVIDIA’s Michael Balint and Run:ai’s Gijsbert Janssen van Doorn will discuss how organizations can successfully implement a hybrid cloud strategy for their AI workloads. Examples of use cases include leveraging the power of on-premises resources for sensitive data while utilizing the scalability of the cloud for compute-intensive tasks. We’ll also discuss potential challenges, such as data security and compliance, and how to navigate them. You’ll gain a deeper understanding of the various use cases of hybrid cloud for AI workloads, the challenges that may arise, and how to effectively implement them in your organization.

Michael Balint, Senior Manager, Product Architecture, NVIDIA

Gijsbert Janssen van Doorn, Director Technical Product Marketing, Run:ai

Title: vSphere on DPUs Behind the Scenes: A Technical Deep Dive (Presented by VMware Inc.) [S52382]

We’ll explore how vSphere on DPUs offloads traffic to the data processing unit (DPU), allowing for additional workload resources, zero-trust security, and enhanced performance. But what goes on behind the scenes that makes vSphere on DPUs so good at enhancing performance? Is it just adding a DPU? Join this session to find the answer and more technical nuggets to help you see the power of DPUs with vSphere on DPUs.

Dave Morera, Senior Technical Marketing Architect, VMware

Meghana Badrinath, Technical Product Manager, VMware

Title: Developer Breakout: What’s New in NVAIE 3.0 and vSphere 8 [SE52148]

NVIDIA and VMware have collaborated to unlock the power of AI for all enterprises by delivering an end-to-end enterprise platform optimized for AI workloads. This integrated platform delivers NVIDIA AI Enterprise, the best-in-class, end-to-end, secure, cloud-native suite of AI software running on VMware vSphere. With the recent launches of vSphere 8 and NVIDIA AI Enterprise 3.0, this platform’s ability to deliver AI solutions is greatly expanded. Let’s look at some of these state-of-the-art capabilities.

Jia Dai, Senior MLOps Solution Architect, NVIDIA

Veer Mehta, Solutions Architect, NVIDIA

Dan Skwara, Senior Solutions Architect, NVIDIA


Autonomous Vehicles

Title: From Tortoise to Hare: How AI Can Turn Any Driver into a Race Car Driver [S51328]

Performance driving on a racetrack is exciting, but it’s not widely accessible as it requires advanced driving skills honed over many years. Rimac’s Driver Coach enables any driver to learn from the onboard AI system, and enjoy performance driving on racetracks using full autonomous driving at very high speeds (over 350km/h). We’ll discuss how AI can be used to accelerate driver education and safely provide racing experiences at incredibly high speeds. We’ll dive deep into the overall development pipeline, from collecting data to training models to simulation testing using NVIDIA DRIVE Sim, and finally, implementing software on the NVIDIA DRIVE platform. Discover how AI technology can beat human professional race drivers.

Sacha Vrazic, Director – Autonomous Driving R&D, Rimac Technology


Deep Learning

Title: Scaling Deep Learning Training: Fast Inter-GPU Communication with NCCL [S51111]

Learn why fast inter-GPU communication is critical to accelerate deep learning training, and how to make sure your system has the right level of performance for your model. Discover NCCL, the inter-GPU communication library used by all deep learning frameworks for inter-GPU communication, and how it combines NVLink with high-speed networks like Infiniband to accelerate communication by an order of magnitude, allowing training to be run on hundreds, or even thousands, of GPUs. See how new technologies in Hopper GPUs and ConnectX-7 allow for NCCL performance to reach new highs on the latest generation of DGX and HGX systems. Finally, get updates on the latest improvements in NCCL, and what should come in the near future.

Sylvain Jeaugey, Principal Engineer, NVIDIA

Title: FP8 Mixed-Precision Training with Hugging Face Accelerate [S51370]

Accelerate is a library that allows you to run your raw PyTorch training loop on any kind of distributed setup with multiple speedup techniques. One of these techniques is mixed precision training, which can speed up training by a factor between 2 and 4. Accelerate recently integrated Nvidia Transformers FP8 mixed-precision training which can be even faster. In this session, we’ll dive into what mixed precision training exactly is, how to implement it in various floating point precisions and how Accelerate provides a unified API to use all of them.

Sylvain Gugger, Senior ML Open Source Engineer, Hugging Face


HPC

Title: Accelerating MPI and DNN Training Applications with BlueField DPUs [S51745]

Learn how NVIDIA Bluefield DPUs can accelerate the performance of HPC applications using message passing interface (MPI) libraries and deep neural network (DNN) training applications. Under the first direction, we highlight the features and performance of the MVAPICH2-DPU library in offloading non-blocking collective communication operations to the DPUs. Under the second direction, we demonstrate how some parts of computation in DNN training can be offloaded to the DPUs. We’ll present sample performance numbers of these designs on various computing platforms (x86 and AMD) and Bluefield adapters (HDR-100Gbps and HDR-200 Gbps), along with some initial results using the newly proposed cross-GVMI support with DPU.

Dhabaleswar K. (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University

Title: Tuning Machine Learning and HPC Workloads Performance in Virtualized Environments using GPUs [S51670]

Today’s machine learning (ML) and HPC applications run in containers. VMware vSphere runs containers in virtual machines (VMs) with VMware Tanzu for container orchestration and Kubernetes cluster management. This allows servers in the hybrid cloud to simultaneously host multi-tenant workloads like ML inference, virtual desktop infrastructure/graphics, and telco workloads that benefit from NVIDIA AI and VMware virtualization technologies. NVIDIA AI Enterprise software in VMware vSphere combines the outstanding virtualization benefits of vSphere with near-bare metal, or in HPC applications, better than bare-metal performance. NVIDIA AI Enterprise on vSphere supports NVLink and NVSwitch, which allows ML training, and HPC applications to maximize multi-GPU performance. We’ll describe these technologies in detail, and you’ll learn how to leverage and tune performance to achieve significant savings in total cost of ownership for your preferred cloud environment. We’ll highlight the performance of the latest NVIDIA GPUs in virtual environments.

Uday Kurkure, Staff Engineer, VMware

Lan Vu, Senior Member of the Technical Staff, VMware

Manvender Rawat, Senior Manager, Product Management, NVIDIA

Filed Under: AI & ML

ML Session at CTEX VMware Explore

November 4, 2022 by frankdenneman


Next week during VMware Explore, the VMware Office of the CTO is organizing the Customer Technical Exchange. I’m presenting the session “vSphere Infrastructure for Machine Learning workloads”. I will discuss how vSphere acts as a self-service platform for data science teams to easily and quickly deploy ML platforms with acceleration resources.

CTEX is happening on the 8th and 9th of November at the Fira Barcelona Gran Via in room CC4 4.2. This is an NDA event. Therefore, you will need to register via 

https://via.vmw.com/CTEXExploreEurope2022-Register.

Filed Under: AI & ML

vSphere 8 CPU Topology Device Assignment

October 25, 2022 by frankdenneman

There seems to be some misunderstanding about the new vSphere 8 CPU Topology Device Assignment feature, and I hope this article will help you understand (when to use) this feature. It defines the mapping of the virtual PCIe device to the vNUMA topology. The main purpose is guest OS and application optimization. This setting does not impact NUMA affinity or the scheduling of vCPU and memory locality at the physical resource layer; that is based on the VM placement policy (best effort). Let’s explore the settings and their effect on the virtual machine, starting with the basics. The feature is located in the VM Options menu of the virtual machine.

Click on CPU Topology 

By default, the Cores per Socket and NUMA Nodes settings are “assigned at power on,” which prevents you from assigning any PCI device to a NUMA node. To be able to assign PCI devices to NUMA nodes, you need to change the Cores per Socket setting. Immediately, a warning indicates that you need to know what you are doing, as incorrectly configuring the Cores per Socket can lead to performance degradation. Typically, we recommend aligning the cores per socket to the physical layout of the server. In my case, my ESXi host system is a dual-socket server, and each CPU package contains 20 cores. By default, the NUMA scheduler maps vCPUs to cores for NUMA client sizing; thus, this 24-vCPU VM configuration cannot fit inside a single physical NUMA node. The NUMA scheduler will distribute the vCPUs equally across two NUMA clients; thus, 12 vCPUs will be placed per NUMA node (socket). As a result, the Cores per Socket configuration should be 12, which informs ESXi to create two virtual sockets for that particular VM. For completeness’ sake, I specified two NUMA nodes as well. This is a per-VM setting; it is not NUMA nodes per socket. You can easily leave this at the default setting, as ESXi will create a vNUMA topology based on the Cores per Socket setting, unless you want to create some funky topology that your application absolutely requires. My recommendation: keep this one set to default as much as possible unless your application developer begs you otherwise.
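
The sizing logic above boils down to a small calculation (illustrative only, not a vSphere API): if the vCPU count exceeds the core count of a physical NUMA node, spread the vCPUs evenly and set Cores per Socket to the per-node share.

```python
# Illustrative sketch of the Cores per Socket sizing described above.
import math

def cores_per_socket(vcpus: int, cores_per_pnuma_node: int) -> tuple[int, int]:
    """Return (virtual sockets, cores per socket) aligned to the physical layout."""
    if vcpus <= cores_per_pnuma_node:
        return 1, vcpus                            # fits inside one NUMA node
    sockets = math.ceil(vcpus / cores_per_pnuma_node)
    return sockets, vcpus // sockets               # spread the vCPUs evenly

print(cores_per_socket(24, 20))  # (2, 12) -> two virtual sockets, 12 cores each
```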

This allows us to configure the PCIe devices. As you might have noticed, I’ve added a PCIe device. This device is an NVIDIA A30 GPU in Dynamic Direct Path I/O (Passthrough) mode. But before we dive into the details of this device, let’s look at the virtual machine’s configuration from within the guest OS. I’ve installed Ubuntu 22.04 LTS and used the command lstopo (install it using: sudo apt install hwloc).

You see the two NUMA nodes, each with twelve vCPUs (cores), and a separate PCI structure. This is the way a virtual motherboard is structured. Compare this to a physical machine, and you will notice that each PCI device is attached to a PCI controller located within a NUMA node.

And that is exactly what we can do with the device assignment feature in vSphere 8. We can provide more insights to the guest OS and the applications if they need this information. Typically, this optimization is not necessary, but for some specific network load-balancing algorithms or machine learning use cases, you want the application to understand the NUMA PCI locality of the PCIe devices. 
In the case of the A30, we need to understand its PCIe-NUMA locality. The easiest way to do this is to log on to the ESXi server through an SSH session and search for the device via the esxcli hardware pci list command. As I’m searching for an NVIDIA device, I can restrict the output of this command by using the following command: esxcli hardware pci list | grep "NVIDIA" -A 32 -B 6. This instructs grep to output the 32 lines after (-A) and the 6 lines before (-B) the NVIDIA line. The output shows us that the A30 card is managed by the PCI controller located in NUMA node 1 (third line from the bottom).

We can now adjust the device assignment accordingly and assign it to NUMA node 1. Please note the feature allows you to also assign it to NUMA node 0. You are on your own here. You can do silly things. But just because you can, doesn’t mean you should. Please understand that most PCIe slots on a server motherboard are directly connected to the CPU socket, and thus a direct physical connection exists between the NIC or the GPU and the CPU. You cannot logically change this within the ESXi schedulers. The only thing you can do is to map the virtual world as close to the physical world as possible to keep everything as clear and transparent as possible. I mapped PCI device 0 (the A30) to NUMA node 1.

Running lstopo in the virtual machine provided me this result:

Now the GPU is part of NUMA node 1. We can confirm this by taking the PCI device at address 04:00.0, shown in the small green box inside Package 1, and seeing that it is the same address given in the “esxcli hardware pci list” output for the GPU, on the line titled “Device Layer Bus Address.” Because the virtual GPU device is now part of NUMA node 1, the guest OS memory optimization can allocate memory within NUMA node 1 to store the dataset there, as close to the device as possible. The NUMA scheduler and the CPU and memory scheduler within the ESXi layer attempt to follow these instructions to the best of their ability. If you want to be absolutely sure, you can assign NUMA affinity and CPU affinity at the lowest layers, but we recommend starting at this layer and testing first before impacting the lowest scheduling algorithms.
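
Inside the guest, you can also confirm the reported PCI-NUMA locality without lstopo by reading the standard Linux sysfs attribute for the device, assuming the sysfs form of the address above is 0000:04:00.0:

```python
# Guest-side sketch: read the NUMA node Linux associates with a PCI device.
from pathlib import Path

def pci_numa_node(pci_address: str = "0000:04:00.0") -> int:
    """Return the NUMA node reported in sysfs for a PCI device (-1 means unknown)."""
    return int(Path(f"/sys/bus/pci/devices/{pci_address}/numa_node").read_text().strip())

print(pci_numa_node())  # expect 1 for the A30 mapped to NUMA node 1
```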

Filed Under: AI & ML, CPU, NUMA

