vSphere ML Accelerator Spectrum Deep Dive - GPU Device Differentiators

The last two parts reviewed the capabilities of the platform. vSphere can offer anything from fractional GPUs to multi-GPU setups, catering to the workload’s needs in every stage of its development life cycle. Let’s look at the features and functionality of each supported GPU device. Currently, the range of supported GPU devices is quite broad. In total, 29 GPU devices are supported, ranging from 2016 to the latest release in 2023. A table at the end of the article includes links to each GPU’s product brief and datasheet. Although NVIDIA and VMware form a close partnership, their lists of supported devices do not match completely. This can lead to some interesting questions, typically answered with “it should work.” But as always, if you want bulletproof support, follow the guides to ensure more leisure time on weekends and nights.

VMware HCL and NVAIE Support

The first overview shows the GPU device spectrum and whether NVAIE supports each device. The VMware HCL supports every device listed in this overview, but NVIDIA decided not to put some of the older devices through its NVAIE certification program. As this is a series about Machine Learning, the diagram shows support for the device in combination with a C-series vGPU type. The VMware compatibility guide has an AI/ML column, listed as Compute, for a specific certification program that tests these capabilities. If the driver offers a C-series type, the device can run GPU-assisted applications; therefore, I’m listing some older GPU devices that customers still use. With some other devices, VMware hasn’t tested the compute capabilities, but NVIDIA has, so there might be some discrepancies between the VMware HCL and the NVAIE supportability matrix. For the newer models, the supportability matrices are aligned. Review the table and follow the GPU device HCL page link to view the supported NVIDIA driver version for your vSphere release.

The Y-axis shows the device interface type and possible slot consumption. This allows for easy analysis of whether a device is the “right fit” for edge locations. Where space is constrained, single-slot PCIe cards allow for denser or smaller configurations. Although every NVIDIA device supported by NVAIE can provide time-shared fractional GPUs, not all provide spatial MIG functionality. A subdivision is made on the Y-axis to show that distinction. The X-axis represents the GPU memory available per device. It allows for easier selection if you know the workload’s technical requirements.

The Ampere A16 is the only device that is listed twice in these overviews. The A16 uses a dual-slot PCIe interface to offer four distinct GPUs on a single PCB card. The card contains 64GB of GPU memory, but vSphere reports four devices, each offering 16GB of GPU memory. I thought listing it twice was the best way to avoid confusion or remarks that the A16 was omitted, as some architects like to calculate the overall available GPU memory capacity per PCIe slot.

NVLink Support

If you plan to create a platform that supports distributed training using multi-GPU technology, this overview shows the available and supported NVLink bandwidth capabilities. Not all GPU devices include NVLink support, and the ones with support can differ widely. The MIG capability is omitted as MIG technology does not support NVLink.

NVIDIA Encoder Support

The GPU decodes the video file before running it through an ML model. Whether the video must be encoded again and replayed to a display depends on the process that follows the model prediction. With some models, the action required after, for example, an anomaly detection is to generate a warning event. But if a human needs to look at the video for verification, a hardware encoder must be available on the GPU. The Q-series vGPU type is required to utilize the encoders. What may surprise most readers is that most high-end datacenter GPUs do not have encoders. This can affect the GPU selection process if you want to create isolated media streams at the edge using MIG technology. In that case, other GPU devices might be a better choice, or you can investigate the performance impact of CPU encoding.

NVIDIA Decoder Support

Every GPU has at least one decoder, but many have more. With MIG, you can assign and isolate decoders to a specific workload. When a GPU is time-sliced, the active workload utilizes all GPU decoders available. Please note that the A16 is listed with eight decoders, but each distinct GPU on the A16 exposes two decoders to the workload.
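The A16 figures mentioned in this article (four GPUs on one card, 64GB of memory, eight decoders) can be broken down per GPU with a few lines of Python. This is only a minimal sketch for capacity planning; the `MultiGpuCard` class and its field names are hypothetical and not part of any VMware or NVIDIA tooling, and it assumes resources are split evenly across the GPUs on the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiGpuCard:
    """Hypothetical helper describing a PCIe card that carries several GPUs."""
    name: str
    gpus: int             # distinct GPUs on the card
    total_memory_gb: int  # memory listed for the whole card
    total_decoders: int   # decoders listed for the whole card

    def memory_per_gpu_gb(self) -> int:
        # Assumption: memory is divided evenly across the GPUs on the card.
        return self.total_memory_gb // self.gpus

    def decoders_per_gpu(self) -> int:
        # Assumption: decoders are divided evenly across the GPUs on the card.
        return self.total_decoders // self.gpus


# Values taken from the article text for the A16.
a16 = MultiGpuCard(name="A16", gpus=4, total_memory_gb=64, total_decoders=8)

print(
    f"{a16.name}: {a16.gpus} GPUs, "
    f"{a16.memory_per_gpu_gb()}GB and {a16.decoders_per_gpu()} decoders each"
)
```

The result matches what vSphere reports for the card: four devices with 16GB each, and two decoders available to a workload running on any one of them.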

GPUDirect RDMA Support

GPUDirect RDMA is supported on all time-sliced and MIG-backed C-series vGPUs on GPU devices that support single root I/O virtualization (SR-IOV). Please note that Linux is the only supported Guest OS for GPUDirect technology. Unfortunately, MS Windows isn’t supported.
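The conditions in the paragraph above can be written down as a simple pre-flight check. The sketch below is an illustration of the stated rules only; the function name and its parameters are hypothetical, and it does not query vSphere or the NVIDIA driver.

```python
def gpudirect_rdma_supported(
    vgpu_series: str,
    device_supports_sriov: bool,
    guest_os: str,
) -> bool:
    """Return True when the three conditions named in the article hold.

    Hypothetical helper: the inputs are supplied by the caller, nothing is
    detected automatically.
    """
    is_c_series = vgpu_series.strip().upper() == "C"
    is_linux_guest = guest_os.strip().lower() == "linux"
    return is_c_series and device_supports_sriov and is_linux_guest


# A C-series vGPU on an SR-IOV capable device with a Linux guest qualifies.
print(gpudirect_rdma_supported("C", True, "Linux"))
# The same setup with a Windows guest does not.
print(gpudirect_rdma_supported("C", True, "Windows"))
```

Because the rule applies to both time-sliced and MIG-backed C-series vGPUs, the sharing mode is deliberately not an input to the check.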

Power Consumption

When deploying at an edge location, power consumption can be a constraint. This table lists the specified power consumption of each GPU device.

Supported GPUs Overview

The table contains all the GPUs depicted in the diagrams above. Instead of repeating non-descriptive labels like “webpage” or “PDF”, the table shows the GPU release date while linking to its product brief. The label for the datasheet indicates the amount of GPU memory, allowing for easy GPU selection if you want to compare specific GPU devices. Please note that if the HCL column indicates No, VMware has not conducted C-series vGPU type tests on the device. However, the NVIDIA driver does support the C-series vGPU type.

Architecture	GPU Device	HCL/ML Support	NVAIE 3.0 Support	Product Brief	Datasheet
Pascal	Tesla P100	No	No	October 2016	16GB
Pascal	Tesla P6	No	No	March 2017	16GB
Volta	Tesla V100	No	Yes	September 2017	16GB
Turing	T4	No	Yes	October 2018	16GB
Ampere	A2	Yes	No	November 2021	16GB
Pascal	P40	No	Yes	November 2016	24GB
Turing	RTX 6000 passive	No	Yes	December 2019	24GB
Ampere	RTX A5000	No	Yes	April 2021	24GB
Ampere	RTX A5500	N/A	Yes	March 2022	24GB
Ampere	A30	Yes	Yes	March 2021	24GB
Ampere	A30X	Yes	Yes	March 2021	24GB
Ampere	A10	Yes	Yes	March 2021	24GB
Ada Lovelace	L4	Yes	Yes	March 2023	24GB
Volta	Tesla V100(S)	No	Yes	March 2018	32GB
Ampere	A100 (HGX)	N/A	Yes	September 2020	40GB
Turing	RTX 8000 passive	No	Yes	December 2019	48GB
Ampere	A40	Yes	Yes	May 2020	48GB
Ampere	RTX A6000	No	Yes	December 2022	48GB
Ada Lovelace	RTX 6000 Ada	N/A	Yes	December 2022	48GB
Ada Lovelace	L40	Yes	Yes	October 2020	48GB
Ampere	A16	Yes	Yes	June 2021	64GB
Ampere	A100	Yes	Yes	June 2021	80GB
Ampere	A100X	Yes	Yes	June 2021	80GB
Ampere	A100 HGX	N/A	Yes	November 2020	80GB
Hopper	H100	Yes	Yes	September 2022	80GB
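If you know the memory requirement of a workload, the table can be filtered programmatically. The following is a minimal sketch that copies a handful of rows from the table above into Python; the `Gpu` tuple and the `candidates` function are hypothetical names chosen for illustration, and only a subset of the devices is included to keep the example short.

```python
from typing import NamedTuple


class Gpu(NamedTuple):
    architecture: str
    device: str
    hcl_ml: str      # "HCL/ML Support" column
    nvaie: str       # "NVAIE 3.0 Support" column
    memory_gb: int   # taken from the "Datasheet" column label


# A subset of rows copied from the table above.
GPUS = [
    Gpu("Turing", "T4", "No", "Yes", 16),
    Gpu("Ampere", "A2", "Yes", "No", 16),
    Gpu("Ampere", "A30", "Yes", "Yes", 24),
    Gpu("Ada Lovelace", "L4", "Yes", "Yes", 24),
    Gpu("Ampere", "A40", "Yes", "Yes", 48),
    Gpu("Ada Lovelace", "L40", "Yes", "Yes", 48),
    Gpu("Ampere", "A100", "Yes", "Yes", 80),
]


def candidates(gpus, min_memory_gb, require_hcl_ml=True, require_nvaie=True):
    """Return device names that meet the memory and support requirements."""
    return [
        g.device
        for g in gpus
        if g.memory_gb >= min_memory_gb
        and (not require_hcl_ml or g.hcl_ml == "Yes")
        and (not require_nvaie or g.nvaie == "Yes")
    ]


# Devices with at least 24GB that are listed as supported in both columns.
print(candidates(GPUS, min_memory_gb=24))
# → ['A30', 'L4', 'A40', 'L40', 'A100']
```

Relaxing `require_hcl_ml` to `False` brings back devices such as the T4, where VMware has not run the C-series tests but NVAIE support is listed.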
