vSphere 8.0 Update 1 Enhancements for Accelerating Machine Learning Workloads

April 26, 2023 by frankdenneman

Recently vSphere 8 Update 1 was released, introducing excellent enhancements, ranging from VM-level power consumption metrics to Okta Identity Federation for vCenter. In this article, I want to investigate the enhancements that accelerate machine learning workloads. If you want to hear about all the goodness provided by Update 1, I recommend episode 40 of the Unexplored Territory Podcast with Féidhlim O’Leary (Spotify | Apple).

Machine learning is rapidly becoming an essential tool for organizations and businesses worldwide. The desire for accurate models is overwhelming; in many cases, the value of a model comes directly from its accuracy. The machine learning community strives to build more intelligent algorithms, but we still live in a world where processing more training data generates a more accurate model. A prime example is large language models (LLMs) such as ChatGPT: the more data you add, the more accurate they get.

Source: ChatGPT Statistics (2023) — The Key Facts and Figures

To train ChatGPT, they used textual data from 5 sources. 60% of the dataset was based on a filtered version of data from 8 years of web crawling. I was surprised that 22% of that dataset came from Reddit posts with three or more upvotes (WebText2). But I digress. Large datasets need computation power, and our customers are increasing their machine learning accelerator footprint in their data centers. vSphere 8 Update 1 caters to that need and provides the following enhancements focused on machine learning workloads.

  1. Increase of PCI Passthrough devices per VM
  2. Support for NVIDIA NVSwitch
  3. vGPU vMotion Improvements
  4. Heterogeneous GPU Profile Support

The spectrum of ML Accelerators in vSphere 8 Update 1

Update 1 again increases the maximum number of PCI passthrough devices for a VM. In vSphere 7.0, with hardware version 19, 16 passthrough devices were supported. In 8.0, with hardware version 20, a VM can contain up to 32 passthrough devices. With 8.0 Update 1 and hardware version 20, vSphere supports up to 64 PCIe passthrough devices per VM.

vSphere 8 Update 1 extends the spectrum of ML accelerators by supporting the NVIDIA NVSwitch architecture. NVIDIA NVSwitch is a technology that bolts onto the system’s motherboard and connects four to sixteen SXM form factor GPUs. Such systems are known as NVIDIA HGX systems. The Dell PowerEdge XE8545 (AMD, 4 x A100), the Dell PowerEdge XE9680 (Intel, 8 x A100/H100), and the HPE Apollo 6500 Gen10 Plus (AMD) are such systems. The HGX lineup consists of two platforms: the “Redstone” platform, which contains 4 x SXM4 A100 GPUs, and the “Delta” platform, which contains 8 x SXM4 A100 GPUs. With the introduction of the NVIDIA Hopper architecture, the HGX platforms are now called Redstone-Next and Delta-Next, containing SXM5 H100 GPUs. It is possible to connect two baseboards of a Delta(-Next) platform together via NVSwitch in a single server, directly connecting sixteen A100/H100 GPUs, but I haven’t seen a server SKU from the major server vendors offering that configuration.

If we open up an HGX machine, the first thing that sticks out is the SXM form factor GPU. It moves away from the PCIe physical interface. The SXM socket handles power delivery, eliminating the need for external power cables, but more importantly, it results in a better (horizontal) mounting position, allowing for better cooling options. As the GPUs are better cooled, the H100 SXM5 can run more cores (132 streaming multiprocessors (SMs)) vs. the H100 PCIe (113 SMs).

What is the benefit of SXM, NVLink, and NVSwitch?

Training machine learning models requires a lot of data, which the system has to move between components, such as from CPU to GPU and between GPUs. Distributed training uses multiple GPUs to provide enough onboard GPU memory capacity to either hold and execute the model parameters or to process the data set. If we dissect the data flow, this process has three major steps (a minimal code sketch of this loop follows the list below).

  1. Load the data from system memory on the GPUs
  2. Run the process (distributed training), which can initiate communication between GPUs
  3. Retrieve results from GPU to system memory.
  4. Rinse and repeat
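
To make those steps concrete, here is a minimal PyTorch sketch of that loop on a single GPU. The model, batch size, and random data are made up purely for illustration; the point is where the host-to-device and device-to-host copies happen.

import torch

# Minimal sketch of the host-to-device / compute / device-to-host loop.
# Assumes PyTorch with at least one CUDA-capable GPU; model and data are toy examples.
device = torch.device("cuda:0")

model = torch.nn.Linear(1024, 10).to(device)          # model parameters live in GPU memory
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):                               # "rinse and repeat"
    # 1. Load the data from system memory onto the GPU
    x = torch.randn(256, 1024).to(device, non_blocking=True)
    y = torch.randint(0, 10, (256,)).to(device)

    # 2. Run the process; in a multi-GPU job this is where gradient
    #    synchronization between GPUs happens (e.g., via NCCL)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # 3. Retrieve results from the GPU back to system memory
    print(step, loss.item())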

Internal data buses move data between components, significantly affecting the system’s overall throughput. The most common expansion bus standard is PCI Express (PCIe). Its latest iteration (PCIe 5.0) offers a theoretical bandwidth of 64 GB/s for an x16 slot. That is fast, but nothing compared to the onboard GPU memory bandwidth of an A100 (roughly 2 TB/s) or an H100 (over 3 TB/s). To benefit the most from that memory speed, you need a non-blocking interconnect between the GPUs. Going one level deeper: by creating a proprietary interconnect system, NVIDIA does not have to wait for the industry to develop and accept standards such as PCIe 6 or 7. It can develop and iterate much faster, attempting to match the interconnect speed to the high-bandwidth memory speed of the onboard GPU RAM.
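
A rough way to see this gap for yourself is to time a host-to-device copy against a device-to-device copy. The PyTorch sketch below assumes a CUDA GPU and pinned host memory, and the numbers it prints are crude, indicative figures only.

import time
import torch

def copy_bandwidth_gb_s(fn, nbytes, iters=20):
    # Time a copy function and return approximate throughput in GB/s.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return nbytes * iters / (time.perf_counter() - start) / 1e9

n = 256 * 1024 * 1024                                  # 1 GiB of float32 elements
host = torch.empty(n, dtype=torch.float32, pin_memory=True)
dev_a = torch.empty(n, dtype=torch.float32, device="cuda:0")
dev_b = torch.empty(n, dtype=torch.float32, device="cuda:0")

# Host-to-device copy crosses the PCIe bus; device-to-device copy stays in GPU memory.
print("host -> device (PCIe):", copy_bandwidth_gb_s(lambda: dev_a.copy_(host, non_blocking=True), n * 4))
print("device -> device (HBM):", copy_bandwidth_gb_s(lambda: dev_b.copy_(dev_a, non_blocking=True), n * 4))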

However, NVIDIA has to play well with others in the industry to connect the SXM socket to the CPU, and therefore the SXM4 (A100) connects to the CPU via a PCIe 4.0 x16 bus interface (source), and the SXM5 (H100) connects to the CPU via a PCIe 5.0 x16 interface (source). That means that during a host-to-device memory copy, the data flows from the system memory across the PCIe controller to the SXM socket at the matching PCIe bandwidth.

If you are a regular reader of my content, you might expect me to start deep-diving into PCIe NUMA locality and the challenges of having multiple GPUs connected in a dual-socket system. However, our engineers and NVIDIA’s engineers made the NVIDIA libraries aware of the home NUMA configuration. The libraries use CPU and PCIe information to guide the data traffic between the CPU and the PCIe interface. When the data arrives at the onboard GPU memory, communication remains between GPUs. All communication flows across the NVLink and NVSwitch fabrics, essentially keeping GPU-related traffic off the CPU interconnect (AMD Infinity Fabric, Intel UPI, ~40 GB/s theoretical bandwidth).
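
If you want to see how GPUs, NICs, and NUMA nodes are wired on your own system, the driver can print the interconnect matrix for you. A trivial Python wrapper around that command, assuming the NVIDIA driver is installed and nvidia-smi is on the PATH:

import subprocess

# Print the GPU/NIC interconnect matrix and the CPU/NUMA affinity per GPU.
# In the output, NV# indicates NVLink hops, PIX/PXB indicate PCIe switches,
# and NODE/SYS indicate traffic crossing the NUMA node or the CPU interconnect.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout)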

Please note that on the left side of the diagram, the NVLinks of three GPUs are greyed out to provide a better view of the NVLink connections of an individual GPU in an A100 HGX system.

GPU device-to-device communication occurs across NVLinks and NVSwitches. An A100 GPU includes twelve third-generation NVLinks to provide up to 600 GB/s of bandwidth. The H100 increases the NVLink count to eighteen, providing 900 GB/s, seven times the bandwidth of PCIe 5.0. With the help of vSphere device groups, the VI admin can configure virtual machines with various vGPU configurations; GPUs can be assigned in groups of two, four, or eight. If a device group selects a subset of the GPU devices in the HGX system, vSphere isolates these GPUs and disables the NVLink connections to the other GPUs, offering complete isolation between device groups.
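
From inside a VM that received such a device group, you can quickly verify that the guest sees direct peer-to-peer paths between the assigned GPUs. A small PyTorch sketch, assuming the NVIDIA driver and PyTorch are installed in the guest:

import torch

# Check whether each pair of visible GPUs can access each other's memory directly
# (peer-to-peer). In an NVLink-connected device group, this should report True
# for every GPU pair the group exposes to the VM.
count = torch.cuda.device_count()
for i in range(count):
    for j in range(count):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'enabled' if ok else 'not available'}")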

At this moment, the UI displays quite a cryptic name. If we look at the image, we see Nvidia:2@grid_a100x-40c%NVLink. This name means that this is a group of two A100s with a 40C type profile (the entire card) connected via NVLink. Although the system contains eight GPUs, vSphere does not only allow assigning multiple full GPUs to virtual machines and TKGS worker nodes; fractional GPU technologies, such as time-sliced vGPU or Multi-Instance GPU (MIG), are also available. A later article will provide a deep dive into NVIDIA NVSwitch functionality. The beauty of this solution is that it uses vGPU technology, and thus we can live-migrate workloads between different ESXi hosts if necessary. With each vSphere update, we introduce new enhancements to vGPU vMotion. vSphere 8 Update 1 offers two improvements that increase the utilization of high-bandwidth vMotion networks.

vGPU vMotion Improvements

This new update introduces improvements to the internals of the vMotion process. Update 1 does not present any new buttons or functionalities to the user, but the vMotion internals are now better aligned with high data loads and high-speed transports.

A vGPU vMotion is a lot more complex than a regular vMotion, which is still a magical thing in itself. With vGPU workloads, we have to deal with memory-mapped I/O and the fact that hundreds of GPU stream processors access vGPU memory regions that can completely change multiple times within a second. An article about MMIO and GPUs will be published soon.

To cope with this behavior, we stun the VM so we can drain the memory as quickly as possible. The vMotion team significantly improved this process by moving checkpoint data to a more efficient vMotion data channel that can leverage multiple threads and sockets. In the previous configuration, the channel for transferring checkpoint data was fixed at two connections, while the new setup can consume as many TCP connections as the network infrastructure permits.

Additional optimizations are made to the communication process between the source and destination host to reduce “CPU-driven copies.” A smarter method of sharing memory is applied, reducing the processes involved in getting the data from the source host to the destination host. With the help of vMotion’s multi-threaded stream architecture, vGPU vMotion can now saturate high-speed networks up to 80 Gbps.

Heterogeneous GPU Profile Support

This is not necessarily a machine learning workload enhancement, but it allows for a different method of GPU resource consumption, so it is worth mentioning. Before vSphere 8 Update 1, the first active vGPU workload determined the vGPU profile compatibility of the GPU device. For example, if a VM started with a C-type vGPU profile with a 12 GB frame buffer (12C) on an NVIDIA A40, the GPU would not accept any other virtual machine with a 12A or 12Q profile. Although each of these profiles consumes the same amount of onboard GPU memory (frame buffer), the GPU rejected those virtual machines. With Update 1, this is no longer the case. The GPU accepts different vGPU profile types as long as they have identical frame buffer sizes. This makes one of the compelling use cases, “VDI by day, compute by night,” even more attainable. This flexibility offers the ability to mix and match Q, C, and A workloads. The frame buffer size gap between the B profiles and the other profile types is too large to expect them to run together on the same physical GPU; the largest B profile contains a 2 GB frame buffer.

vGPU Profile Type   Optimal Workload
Q-Type              Virtual workstations for creative and technical professionals who require the performance and features of Quadro technology
C-Type              Compute-intensive server workloads, such as artificial intelligence (AI), deep learning, or high-performance computing (HPC)
B-Type              Virtual desktops for business professionals and knowledge workers
A-Type              App streaming or session-based solutions for virtual applications users

Source: Virtual GPU Software Documentation
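
If you want to check which vGPU profile types a physical GPU in your host offers, the vGPU manager’s nvidia-smi exposes a vgpu subcommand. A small wrapper sketch, assuming a host with the NVIDIA vGPU manager installed where “nvidia-smi vgpu -s” is available (you can, of course, also run the command directly in an SSH session):

import subprocess

# List the vGPU types each physical GPU supports (profile names such as A40-12Q
# or A40-12C). Assumes a host running the NVIDIA vGPU manager, whose nvidia-smi
# build exposes the "vgpu" subcommand.
out = subprocess.run(["nvidia-smi", "vgpu", "-s"],
                     capture_output=True, text=True, check=True).stdout
print(out)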

vSphere 8 introduces a tremendous step forward in accelerator resource scalability, from the ideation phase to training on big datasets to securely isolating production streams of unseen data on right-sized GPUs. The spectrum of machine learning accelerators available in vSphere 8 Update 1 allows organizations to cater to the needs of any data science team, regardless of where they are within the life cycle of their machine learning model development.

Filed Under: AI & ML

My Picks for NVIDIA GTC Spring 2023

March 21, 2023 by frankdenneman

This week GTC Spring 2023 kicks off again. These are the sessions I am looking forward to. Please leave a comment if you want to share a must-see session.


MLOps

Title: Enterprise MLOps 101 [S51616]

The boom in AI has seen a rising demand for better AI infrastructure — both in the compute hardware layer and AI framework optimizations that make optimal use of accelerated compute. Unfortunately, organizations often overlook the critical importance of a middle tier: infrastructure software that standardizes the machine learning (ML) life cycle, adding a common platform for teams of data scientists and researchers to standardize their approach and eliminate distracting DevOps work. This process of building the ML life cycle is known as MLOps, with end-to-end platforms being built to automate and standardize repeatable manual processes. Although dozens of MLOps solutions exist, adopting them can be confusing and cumbersome. What should you consider when employing MLOps? How can you build a robust MLOps practice? Join us as we dive into this emerging, exciting, and critically important space.

Michael Balint, Senior Manager, Product Architecture, NVIDIA

William Benton, Principal Product Architect, NVIDIA

Title: Solving MLOps: A First-Principles Approach to Machine Learning Production [S51116]

We love talking about deploying our machine learning models. One famous (but probably wrong) statement says that “87% of data science projects never make it to production.” But how can we get to the promised land of “Production” if we’re not even sure what “Production” even means? If we could define it, we could more easily build a framework to choose the tools and methods to support our journey. Learn a first-principles approach to thinking about deploying models to production and MLOps. I’ll present a mental framework to guide you through the process of solving the MLOps challenges and selecting the tools associated with machine learning deployments.

Dean Lewis Pleban, Co-Founder and CEO, DagsHub

Title: Deploying Hugging Face Models to Production at Scale with GPUs [S51553]

Seems like everyone’s using Hugging Face to simplify and reuse advanced models and work collectively as a community. But how do you deploy these models into real business environments, along with the required data and application logic? How do you serve them continuously, efficiently, and at scale? How do you manage their life cycle in production (deploy, monitor, retrain)? How do you leverage GPUs efficiently for your Hugging Face deep learning models? We’ll share MLOps orchestration best practices that’ll enable you to automate the continuous integration and deployment of your Hugging Face models, along with the application logic in production. Learn how to manage and monitor the application pipelines, at scale. We’ll show how to enable GPU sharing to maximize application performance while protecting your investment in AI infrastructure and share how to make the whole process efficient, effective, and collaborative.

Yaron Haviv, Co-Founder and CTO, Iguazio

Title: Democratizing ML Inference for the Metaverse [S51948]

In this talk, I will drive you through the Roblox ML Platform inference service. You will learn how we integrate the Triton inference server with Kubeflow and KServe. I will describe how we simplify deployment for our end users to serve models on both CPUs and GPUs. Finally, I will highlight a few of our current use cases, like game recommendation and other computer vision models.

Denis Goupil, Principal ML Engineer, Roblox


Data Center / Cloud

Title: Using NVIDIA GPUs in Financial Applications: Not Just for Machine Learning Applications [S52211]

Deploying GPUs to accelerate applications in the financial service industry has been widely accepted and the trend is growing rapidly, driven in large part by the increasing uptake of machine learning techniques. However, banks have been using NVIDIA GPUs for traditional risk calculations for much longer, and these workloads present some challenges due to their multi-tenancy requirements. We’ll explore the use of multiple GPUs on virtualized servers leveraging NVIDIA AI Enterprise to accelerate an application that uses Monte Carlo techniques for risk/pricing application in a large international bank. We’ll explore various combinations of the virtualized application on VMware to show how NVIDIA AI Enterprise software runs this application faster. We’ll also discuss process scheduling on the GPUs and explain interesting performance comparisons using different VM configs. We’ll also detail best practices for application deployments.

Manvender Rawat, Senior Manager, Product Management, NVIDIA

Justin Murray, Technical Marketing Architect, VMware

Richard Hayden, Executive Director and Head of the QR Analytics Team, JP Morgan Chase
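
As an aside, the Monte Carlo pricing pattern mentioned in this abstract maps naturally onto GPUs. Below is a toy sketch of my own (not anything from the session) that prices a European call under geometric Brownian motion in PyTorch, purely to illustrate the workload shape.

import math
import torch

# Toy GPU Monte Carlo pricing of a European call under geometric Brownian motion.
# Purely illustrative of the workload pattern, not a real risk engine.
device = "cuda" if torch.cuda.is_available() else "cpu"
paths = 10_000_000
s0, strike, rate, vol, t = 100.0, 105.0, 0.03, 0.2, 1.0

z = torch.randn(paths, device=device)                          # one normal draw per path
s_t = s0 * torch.exp((rate - 0.5 * vol ** 2) * t + vol * math.sqrt(t) * z)
payoff = torch.clamp(s_t - strike, min=0.0)                    # call payoff at maturity
price = math.exp(-rate * t) * payoff.mean().item()             # discounted expectation
print(f"Estimated call price: {price:.4f}")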

Title: AI in the Clouds: Navigating the Hybrid Sky with Ease (Presented by Run:ai) [S52352]

We’ll focus on the different use cases of running AI workloads in hybrid cloud and multi-cloud environments, and the challenges that come along with that. NVIDIA’s Michael Balint and Run:ai’s Gijsbert Janssen van Doorn will discuss how organizations can successfully implement a hybrid cloud strategy for their AI workloads. Examples of use cases include leveraging the power of on-premises resources for sensitive data while utilizing the scalability of the cloud for compute-intensive tasks. We’ll also discuss potential challenges, such as data security and compliance, and how to navigate them. You’ll gain a deeper understanding of the various use cases of hybrid cloud for AI workloads, the challenges that may arise, and how to effectively implement them in your organization.

Michael Balint, Senior Manager, Product Architecture, NVIDIA

Gijsbert Janssen van Doorn, Director Technical Product Marketing, Run:ai

Title: vSphere on DPUs Behind the Scenes: A Technical Deep Dive (Presented by VMware Inc.) [S52382]

We’ll explore how vSphere on DPUs offloads traffic to the data processing unit (DPU), allowing for additional workload resources, zero-trust security, and enhanced performance. But what goes on behind the scenes that makes vSphere on DPUs so good at enhancing performance? Is it just adding a DPU? Join this session to find the answer and more technical nuggets to help you see the power of DPUs with vSphere on DPUs.

Dave Morera, Senior Technical Marketing Architect, VMware

Meghana Badrinath, Technical Product Manager, VMware

Title: Developer Breakout: What’s New in NVAIE 3.0 and vSphere 8 [SE52148]

NVIDIA and VMware have collaborated to unlock the power of AI for all enterprises by delivering an end-to-end enterprise platform optimized for AI workloads. This integrated platform delivers NVIDIA AI Enterprise, the best-in-class, end-to-end, secure, cloud-native suite of AI software running on VMware vSphere. With the recent launches of vSphere 8 and NVIDIA AI Enterprise 3.0, this platform’s ability to deliver AI solutions is greatly expanded. Let’s look at some of these state-of-the-art capabilities.

Jia Dai, Senior MLOps Solution Architect, NVIDIA

Veer Mehta, Solutions Architect, NVIDIA

Dan Skwara, Senior Solutions Architect, NVIDIA


Autonomous Vehicles

Title: From Tortoise to Hare: How AI Can Turn Any Driver into a Race Car Driver [S51328]

Performance driving on a racetrack is exciting, but it’s not widely accessible as it requires advanced driving skills honed over many years. Rimac’s Driver Coach enables any driver to learn from the onboard AI system, and enjoy performance driving on racetracks using full autonomous driving at very high speeds (over 350km/h). We’ll discuss how AI can be used to accelerate driver education and safely provide racing experiences at incredibly high speeds. We’ll dive deep into the overall development pipeline, from collecting data to training models to simulation testing using NVIDIA DRIVE Sim, and finally, implementing software on the NVIDIA DRIVE platform. Discover how AI technology can beat human professional race drivers.

Sacha Vrazic, Director – Autonomous Driving R&D, Rimac Technology


Deep Learning

Title: Scaling Deep Learning Training: Fast Inter-GPU Communication with NCCL [S51111]

Learn why fast inter-GPU communication is critical to accelerate deep learning training, and how to make sure your system has the right level of performance for your model. Discover NCCL, the inter-GPU communication library used by all deep learning frameworks for inter-GPU communication, and how it combines NVLink with high-speed networks like Infiniband to accelerate communication by an order of magnitude, allowing training to be run on hundreds, or even thousands, of GPUs. See how new technologies in Hopper GPUs and ConnectX-7 allow for NCCL performance to reach new highs on the latest generation of DGX and HGX systems. Finally, get updates on the latest improvements in NCCL, and what should come in the near future.

Sylvain Jeaugey, Principal Engineer, NVIDIA
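
For context, the inter-GPU communication this session covers is what you use (mostly indirectly) every time a framework issues a collective. A minimal torch.distributed sketch over the NCCL backend, assuming one process per GPU launched with torchrun:

import os
import torch
import torch.distributed as dist

# Minimal NCCL all-reduce across all GPUs in the job. Assumes one process per GPU,
# launched with e.g.:  torchrun --nproc_per_node=8 allreduce_demo.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor; after all_reduce every rank holds the sum.
# NCCL routes this over NVLink/NVSwitch inside the node and over the network
# (e.g., InfiniBand) between nodes.
x = torch.ones(1024, device="cuda") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: sum element = {x[0].item()}")
dist.destroy_process_group()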

Title: FP8 Mixed-Precision Training with Hugging Face Accelerate [S51370]

Accelerate is a library that allows you to run your raw PyTorch training loop on any kind of distributed setup with multiple speedup techniques. One of these techniques is mixed precision training, which can speed up training by a factor between 2 and 4. Accelerate recently integrated Nvidia Transformers FP8 mixed-precision training which can be even faster. In this session, we’ll dive into what mixed precision training exactly is, how to implement it in various floating point precisions and how Accelerate provides a unified API to use all of them.

Sylvain Gugger, Senior ML Open Source Engineer, Hugging Face
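
If you have never used Accelerate’s mixed-precision switch, it really is a one-argument change to the Accelerator object. A minimal sketch, assuming the accelerate and torch packages; the fp8 path additionally needs NVIDIA Transformer Engine on supported (Hopper-class) hardware.

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

# Minimal sketch of enabling mixed precision with Hugging Face Accelerate.
# "bf16"/"fp16" work on most recent GPUs; "fp8" requires Transformer Engine.
accelerator = Accelerator(mixed_precision="bf16")

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(640, 512), torch.randn(640, 512)), batch_size=64)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)   # Accelerate handles the casting/scaling details
    optimizer.step()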


HPC

Title: Accelerating MPI and DNN Training Applications with BlueField DPUs [S51745]

Learn how NVIDIA Bluefield DPUs can accelerate the performance of HPC applications using message passing interface (MPI) libraries and deep neural network (DNN) training applications. Under the first direction, we highlight the features and performance of the MVAPICH2-DPU library in offloading non-blocking collective communication operations to the DPUs. Under the second direction, we demonstrate how some parts of computation in DNN training can be offloaded to the DPUs. We’ll present sample performance numbers of these designs on various computing platforms (x86 and AMD) and Bluefield adapters (HDR-100Gbps and HDR-200 Gbps), along with some initial results using the newly proposed cross-GVMI support with DPU.

Dhabaleswar K. (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University

Title: Tuning Machine Learning and HPC Workloads Performance in Virtualized Environments using GPUs [S51670]

Today’s machine learning (ML) and HPC applications run in containers. VMware vSphere runs containers in virtual machines (VMs) with VMware Tanzu for container orchestration and Kubernetes cluster management. This allows servers in the hybrid cloud to simultaneously host multi-tenant workloads like ML inference, virtual desktop infrastructure/graphics, and telco workloads that benefit from NVIDIA AI and VMware virtualization technologies. NVIDIA AI Enterprise software in VMware vSphere combines the outstanding virtualization benefits of vSphere with near-bare metal, or in HPC applications, better than bare-metal performance. NVIDIA AI Enterprise on vSphere supports NVLink and NVSwitch, which allows ML training, and HPC applications to maximize multi-GPU performance. We’ll describe these technologies in detail, and you’ll learn how to leverage and tune performance to achieve significant savings in total cost of ownership for your preferred cloud environment. We’ll highlight the performance of the latest NVIDIA GPUs in virtual environments.

Uday Kurkure, Staff Engineer, VMware

Lan Vu, Senior Member of the Technical Staff, VMware

Manvender Rawat, Senior Manager, Product Management, NVIDIA

Filed Under: AI & ML

ML Session at CTEX VMware Explore

November 4, 2022 by frankdenneman

Next week during VMware Explore, the VMware Office of the CTO is organizing the Customer Technical Exchange. I’m presenting the session “vSphere Infrastructure for Machine Learning workloads”. I will discuss how vSphere acts as a self-service platform for data science teams to easily and quickly deploy ML platforms with acceleration resources.

CTEX is happening on the 8th and 9th of November at the Fira Barcelona Gran Via in room CC4 4.2. This is an NDA event. Therefore, you will need to register via:

https://via.vmw.com/CTEXExploreEurope2022-Register.

Filed Under: AI & ML

vSphere 8 CPU Topology Device Assignment

October 25, 2022 by frankdenneman

There seems to be some misunderstanding about the new vSphere 8 CPU Topology Device Assignment feature, and I hope this article will help you understand this feature and when to use it. The feature defines the mapping of virtual PCIe devices to the vNUMA topology. Its main purpose is to optimize guest OS and application behavior. This setting does not impact NUMA affinity or the scheduling of vCPU and memory locality at the physical resource layer; that remains based on the VM placement policy (best effort). Let’s explore the settings and their effect on the virtual machine, starting with the basics. The feature is located in the VM Options menu of the virtual machine.

Click on CPU Topology 

By default, the Cores per Socket and NUMA Nodes settings are “assigned at power on,” and this prevents you from assigning any PCI device to any NUMA node. To be able to assign PCI devices to NUMA nodes, you need to change the Cores per Socket setting. Immediately, a warning indicates that you need to know what you are doing, as incorrectly configuring the Cores per Socket setting can lead to performance degradation. Typically, we recommend aligning the cores per socket to the physical layout of the server. In my case, my ESXi host system is a dual-socket server, and each CPU package contains 20 cores. By default, the NUMA scheduler maps vCPUs to cores for NUMA client sizing; thus, this 24 vCPU configuration cannot fit inside a single physical NUMA node. The NUMA scheduler will distribute the vCPUs equally across two NUMA clients; thus, 12 vCPUs will be placed per NUMA node (socket). As a result, the configuration should be 12 Cores per Socket, which will inform ESXi to create two virtual sockets for that particular VM.

For completeness’ sake, I specified two NUMA nodes as well. This is a per-VM setting; it is not NUMA nodes per socket. You can easily leave this at the default, as ESXi will create a vNUMA topology based on the Cores per Socket setting, unless you want to create some funky topology that your application absolutely requires. My recommendation: keep this one at the default as much as possible unless your application developer begs you otherwise.

This allows us to configure the PCIe devices. As you might have noticed, I’ve added a PCIe device. This device is an NVIDIA A30 GPU in Dynamic DirectPath I/O (passthrough) mode. But before we dive into the details of this device, let’s look at the virtual machine’s configuration from within the guest OS. I’ve installed Ubuntu 22.04 LTS and used the command lstopo (install using: sudo apt install hwloc).

You see the two NUMA nodes, each with twelve vCPUs (cores), and a separate PCI structure. This is how a virtual motherboard is structured. Compare this to a physical machine, and you will notice that each PCI device is attached to a PCI controller located within a NUMA node.
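
If you prefer to check this layout without installing hwloc, the same information can be read straight from sysfs in a Linux guest. A small Python sketch:

from pathlib import Path

# Print each NUMA node the Linux guest sees and the vCPUs assigned to it,
# straight from sysfs. This is the same information lstopo visualizes.
for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    print(f"{node.name}: vCPUs {cpulist}")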

And that is exactly what we can do with the device assignment feature in vSphere 8. We can provide more insights to the guest OS and the applications if they need this information. Typically, this optimization is not necessary, but for some specific network load-balancing algorithms or machine learning use cases, you want the application to understand the NUMA PCI locality of the PCIe devices. 
In the case of the A30, we need to understand its PCIe-NUMA locality. The easiest way to do this is to log on to the ESXi server through an SSH session and search for the device via the esxcli hardware pci list command. As I’m searching for an NVIDIA device, I can restrict the output by using the following command: esxcli hardware pci list | grep "NVIDIA" -A 32 -B 6. This instructs the grep command to output 32 lines after (-A) and 6 lines before (-B) the NVIDIA line. The output shows us that the A30 card is managed by the PCI controller located in NUMA node 1 (third line from the bottom).

We can now adjust the device assignment accordingly and assign it to NUMA node 1. Please note that the feature also allows you to assign it to NUMA node 0. You are on your own here; you can do silly things, but just because you can doesn’t mean you should. Please understand that most PCIe slots on a server motherboard are directly connected to a CPU socket, and thus a direct physical connection exists between the NIC or the GPU and the CPU. You cannot logically change this within the ESXi schedulers. The only thing you can do is map the virtual world as closely to the physical world as possible to keep everything clear and transparent. I mapped PCI device 0 (the A30) to NUMA node 1.

Running lstopo in the virtual machine provided me this result:

Now the GPU is part of NUMA node 1. We can confirm this by taking the PCI device at address 04:00.0, shown in the small green box inside Package 1, and checking that it is the same address reported for the GPU by “esxcli hardware pci list” on the line titled “Device Layer Bus Address” in that esxcli output. Because the virtual GPU device is now part of NUMA node 1, the guest OS memory optimizations can allocate memory within NUMA node 1 to store the dataset as close to the device as possible. The NUMA scheduler and the CPU and memory scheduler within the ESXi layer attempt to follow these instructions to the best of their ability. If you want to be absolutely sure, you can assign NUMA affinity and CPU affinity at the lowest layers, but we recommend starting at this layer and testing first before impacting the lowest scheduling algorithms.
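
You can also confirm the device’s NUMA placement from inside the guest by reading the numa_node attribute of the PCI device in sysfs. A small Python sketch, assuming a Linux guest and the 04:00.0 address from the example above (adjust it for your VM):

from pathlib import Path

# Confirm which NUMA node the guest associates with the passthrough GPU.
# 0000:04:00.0 is the device address from this example's lstopo/esxcli output;
# a numa_node value of -1 means no NUMA affinity is exposed for the device.
dev = Path("/sys/bus/pci/devices/0000:04:00.0")
print("vendor/device:", (dev / "vendor").read_text().strip(), (dev / "device").read_text().strip())
print("numa_node:", (dev / "numa_node").read_text().strip())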

Filed Under: AI & ML, CPU, NUMA

Could not initialize plugin ‘libnvidia-vgx.so’ – Check SR-IOV in the BIOS

October 18, 2022 by frankdenneman

I was building a new lab with some NVIDIA A30 GPUs in a few hosts, and after installing the NVIDIA driver onto the ESXi host, I got the following error when powering up a VM with a vGPU profile:

Typically, that error means one of three things:

  1. Shared Direct passthrough is not enabled on the GPU
  2. ECC memory is enabled
  3. The VM memory reservation was not set to protect its full memory range

But shared direct passthrough was enabled, and because I was using a C-type profile and an NVIDIA A30 GPU, I did not have to disable ECC memory, according to the NVIDIA Virtual GPU Software documentation (section 3.4, Disabling and Enabling ECC Memory).

“Reserve all guest memory (All locked)” was enabled, and this setting is recommended. If someone changes the memory setting of the VM at a later stage, the memory reservation is automatically updated, and no errors will emerge.

I discovered that my systems did not have SR-IOV enabled in the BIOS. By enabling “SR-IOV Global Enable,” I could finally boot a VM with a vGPU profile.

SR-IOV is also required if you want to use vGPU Multi-Instance GPU, so please check for this setting when setting up your ESXi hosts.

But for completeness’ sake, let’s go over shared direct passthrough and GPU ECC memory configurations and see how to check both settings:

Shared Direct Passthrough

Step 1: Select the ESXi host with the GPU in the inventory view in vCenter

Step 2: Select Configure in the menu shown on the right side of the screen

Step 3: Select Graphics in the Hardware section

Step 4: Select the GPU and click on Edit – the Edit Graphics Device Settings window opens

Step 5: If you are going to change a setting, ensure that the ESXi host is in maintenance mode

Step 6: Select Shared Direct and click on OK

Disabling ECC Memory on the GPU Device

To disable ECC memory on the GPU device, you must use the nvidia-smi command, which you need to run from the ESXi host shell. Ensure you have SSH enabled on the host (select the ESXi host, go to Configure, System, Services, select SSH, and click on Start).

Open an ssh session to the host and enter the following command:
nvidia-smi --query-gpu=ecc.mode.current --format=csv
If you want to disable ECC on your GPU (you do not need to if you use C-type vGPU profiles for ML workloads), run the following command. Please ensure your ESXi host is in maintenance mode when you change a setting on it.
nvidia-smi -e 0
You can now reboot your host, or if you first want to verify whether the setting has been changed, enter the following command:
nvidia-smi --query-gpu=ecc.mode.pending --format=csv
Now reboot your host, and ECC will be disabled once it is powered on.
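
If you prefer to script this check, for example from a guest VM or a bare-metal host with the NVIDIA driver, the NVML bindings expose the same ECC mode query. A small sketch assuming the nvidia-ml-py (pynvml) package is installed; on the ESXi host itself, the nvidia-smi commands above remain the practical route.

import pynvml

# Report current and pending ECC mode for every NVIDIA GPU the driver exposes.
# Requires the nvidia-ml-py package and access to the NVIDIA driver.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):
            name = name.decode()
        try:
            current, pending = pynvml.nvmlDeviceGetEccMode(handle)
            print(f"{name}: ECC current={current} pending={pending}  (1=enabled, 0=disabled)")
        except pynvml.NVMLError:
            print(f"{name}: ECC not supported")
finally:
    pynvml.nvmlShutdown()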

Filed Under: AI & ML
