Which Multi-GPU Configurations Are You Planning to Deploy?

September 2, 2025 by frankdenneman

During VMware Explore, numerous conversations highlighted that most customers plan to deploy systems with two or more GPUs. The next challenge is deciding which type of multi-GPU configuration to adopt — a choice that depends on intra-node communication, inter-node interconnects, and cooling strategies.

To better understand where organizations are heading, I’ve created a short survey. The diagram below illustrates the options available in the NVIDIA-certified systems portfolio, which I use as a reference point in the questions. Your feedback will help map out how different configurations are being considered and provide valuable input as we align our product strategy with customer needs.

How to Read the Diagram

The diagram is intended to illustrate the spectrum of multi-GPU configurations, ranging from PCIe-based systems to NVLink bridge topologies and NVSwitch-enabled SXM platforms. To make this more tangible, I’ve used Dell’s AI server portfolio as an example, allowing you to check out the exact systems on Dell’s website to view their specifications. As an additional detail, the boxes are color-coded:

  • White boxes represent air-cooled servers
  • Blue boxes represent liquid-cooled servers

PCIe Connected GPUs

PCIe-connected systems support 1–16 GPUs as standard add-in cards. Communication between GPUs relies on PCIe lanes or, in some cases, NVLink bridges. Some NVIDIA datacenter GPUs, such as the L4 and the highly anticipated RTX PRO 6000, do not offer NVLink capabilities. H100 and H200 GPUs can be deployed without an NVLink Bridge, resulting in a theoretical GPU-to-GPU bandwidth of between 64 GB/s and 128 GB/s, depending on the PCIe specification of the device.
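To sanity-check those figures, here is a small back-of-the-envelope calculation in Python (my own illustration, not NVIDIA tooling): a PCIe x16 link delivers roughly 32 GB/s per direction on Gen4 and 64 GB/s on Gen5, which is where the ~64 GB/s and ~128 GB/s bidirectional numbers come from.

# Back-of-the-envelope PCIe x16 bandwidth, per direction and bidirectional.
# Gen4 signals at 16 GT/s per lane, Gen5 at 32 GT/s, both with 128b/130b encoding.

def pcie_x16_bandwidth_gbps(gt_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-way bandwidth in GB/s for a PCIe x16 link."""
    encoding_efficiency = 128 / 130            # 128b/130b line encoding
    bits_per_second = gt_per_s * 1e9 * lanes * encoding_efficiency
    return bits_per_second / 8 / 1e9           # bits -> bytes -> GB/s

for gen, rate in {"PCIe Gen4": 16.0, "PCIe Gen5": 32.0}.items():
    one_way = pcie_x16_bandwidth_gbps(rate)
    print(f"{gen} x16: ~{one_way:.0f} GB/s per direction, ~{2 * one_way:.0f} GB/s bidirectional")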

NVLink

An NVLink bridge is a direct, high-bandwidth connection between GPUs, designed to bypass the limitations of PCIe and enable GPUs to share memory and exchange data at significantly higher speeds. Unlike PCIe, which is a general-purpose bus, NVLink is purpose-built for GPU-to-GPU communication.

2-way NVLink Bridge (H100 PCIe): In a 2-way setup, two GPUs are directly linked, creating a fast point-to-point connection. On the NVIDIA H100 PCIe, each GPU can communicate with its peer at up to 600 GB/s of bandwidth (NVLink 4.0), compared to just 128 GB/s over PCIe Gen5. This is typically used in dual-GPU or paired configurations.

4-way NVLink Bridge (H200 PCIe): In a 4-way setup, four GPUs are interconnected in a mesh topology. Each GPU can communicate with multiple peers simultaneously, thereby improving bandwidth across the group. On the NVIDIA H200 PCIe, each GPU supports up to 900 GB/s of GPU-to-GPU bandwidth (NVLink 4.0). This enables stronger scaling for 4-GPU systems, though it doesn’t provide the full all-to-all fabric that NVSwitch delivers.
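If you want to verify how the GPUs in an existing server are actually wired up, nvidia-smi topo -m prints the GPU-to-GPU topology matrix: NV# entries mean the two GPUs are connected over that many NVLink links, while PIX/PXB/PHB/SYS mean traffic crosses PCIe (and possibly the CPU interconnect). The small Python sketch below simply wraps that command; it assumes the NVIDIA driver and nvidia-smi are installed on the host.

# Minimal sketch: print the GPU interconnect topology matrix reported by the driver.
import shutil
import subprocess

def print_gpu_topology() -> None:
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found; is the NVIDIA driver installed?")
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],   # topology matrix: NVLink vs. PCIe paths
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

if __name__ == "__main__":
    print_gpu_topology()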

NVSwitch with SXM GPUs

SXM is NVIDIA’s server GPU module format; instead of a PCIe add-in card, the GPU is mounted directly onto the motherboard using an SXM socket. This allows for higher power delivery, denser designs, and direct integration with high-bandwidth GPU interconnects.

SXM GPUs can be deployed in two main HGX configurations:

4-GPU HGX (NVLink): In 4-GPU systems, SXM GPUs are interconnected using NVLink without an NVSwitch fabric. On the H100 SXM, each GPU provides up to 900 GB/s of GPU-to-GPU bandwidth (NVLink 4.0), enabling strong scaling within the node while avoiding PCIe bottlenecks.

8-GPU HGX (NVSwitch): In larger systems with eight SXM GPUs, NVSwitch components are integrated on the motherboard to create a fully connected all-to-all fabric. Every GPU can communicate with every other GPU at NVLink speeds, eliminating peer-to-peer bottlenecks. On both the H100 SXM and H200 SXM, each GPU supports up to 900 GB/s of GPU-to-GPU bandwidth when connected through NVSwitch. The B200 HGX server supports 1.8 TB/s of GPU-to-GPU bandwidth. This design enables GPUs to operate effectively as a single, unified pool of compute and memory.

Call to Action

I’d like to ask you to take a few minutes to answer the six questions in this survey. The goal is to gain a better understanding of how organizations plan their future GPU server configurations — from retrofitting versus new systems, to GPU scale, interconnects, and cooling.

Your input will help me build a clearer picture of the distribution of server choices across the industry. I’ll take this insight back into our discussions to help align VMware’s product strategy with the direction customers are heading.

Filed Under: AI & ML

Talking VCF 9 and Private AI Foundation on the Unexplored Territory Podcast

September 2, 2025 by frankdenneman

Just before VMware Explore, I joined the Unexplored Territory Podcast to talk about the enhancements in VMware Cloud Foundation 9 and the Private AI Foundation with NVIDIA. We covered new functionality, such as Agent Builder, and walked through the broader enhancements for AI workloads. We also highlighted a few must-attend sessions at Explore.

You can listen to the full episode here:

  • Apple Podcasts
  • Spotify

During Explore, many people told me this episode was a great starting point to wrap their heads around VMware Private AI Foundation. If you’re looking for a concise way to catch up, this is a good place to begin.

If you prefer shorter clips, here are a few highlights:

Because Great AI Starts With Great Infrastructure!

What is Dynamic DirectPath I/O and why is it useful for AI workloads?

What are Deep Learning VM templates and how do they help developers?

What about HA and DRS when it comes to GPUs and AI workloads?

If you’ve had a chance to listen or watch, I’d love to hear what stood out most about the changes in VCF 9 or the Private AI Foundation.

Filed Under: AI & ML

VMware Private AI Foundation – Privacy and Security Best Practices white paper

July 1, 2024 by frankdenneman

I’m excited to announce the release of my latest white paper, “VMware Private AI Foundation – Privacy and Security Best Practices.” As many of you know, the world of artificial intelligence is rapidly evolving, and with that comes a new set of challenges, particularly around privacy and security.

This white paper is not just about theory. It’s a practical guide introducing the foundational concepts, frameworks, and models underpinning private AI security. It’s a deep dive into the critical aspects of privacy and security in the context of AI, providing you with the tools to implement these principles in your own work. You’ll learn about the principle of shared responsibility, threat modeling for Gen-AI applications, and the CIA triad – confidentiality, integrity, and availability – as a guiding model for information security.

And that’s not all. In the near future, we’ll be taking these concepts to the next level with a follow-up white paper focused on VMware Cloud Foundation (VCF) settings. This paper will be your go-to resource for detailed guidance on configuring and optimizing VCF to establish a robust and secure private AI environment.

Stay tuned for this next installment, where we’ll bridge the gap between theory and practice, empowering you to build and deploy private AI solutions confidently.

Thank you for your continued support, and I look forward to hearing your feedback!

Filed Under: AI & ML

The misconception of self-learning capabilities of Large Language Models during Production

February 22, 2024 by frankdenneman

At VMware Explore, I enjoyed engaging with many customers about bringing Gen-AI to the on-prem data center. Many customers want to keep their data and IP within the four walls of their organization, and rightly so.

With VMware Private AI Foundation, we aim to utilize foundation models and build upon the great work of many smart data scientists, with models such as Llama 2, StarCoder, and Mistral 7B. Instead of building and training a large language model (LLM) from the ground up, which can be time-consuming and computationally expensive, organizations can leverage foundation models pre-trained on a massive dataset of text and code. If necessary, organizations can further fine-tune a foundation model on specific tasks and data in a short period of time.

At VMware, we believe in using Vector DBs with Retrieval Augmented Generation (RAG) to decouple the IP and data from the foundation model. Using RAG, you offload the knowledge updates to another system so the Gen-AI application is always up to date. The vector DB is used for memorizing facts, while the foundation model is used for reasoning functionality. If necessary, the foundation model can be replaced by a newer version. Typically, this doesn’t happen, but if a data science team thinks it will improve the Gen-AI application’s reasoning or generative capability, they can do that without losing any IP.
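As a toy illustration of that split (my own sketch: the keyword-based retriever and the placeholder generate() call stand in for a real vector DB and foundation model), the knowledge lives in the database while the model only reasons over whatever is retrieved, so swapping the model never touches the data:

# Toy RAG sketch: facts live in the (vector) database, reasoning lives in the model.
# Replacing the model does not touch the knowledge base.

knowledge_base = {
    "vcf": "VMware Cloud Foundation 9 adds Private AI Foundation enhancements.",
    "nvlink": "NVLink provides direct high-bandwidth GPU-to-GPU communication.",
}

def retrieve(question: str) -> str:
    """Stand-in for a vector DB similarity search (keyword match for brevity)."""
    return next((text for key, text in knowledge_base.items() if key in question.lower()), "")

def generate(prompt: str) -> str:
    """Placeholder for a foundation model call; the model can be swapped at any time."""
    return "..."

def answer(question: str) -> str:
    context = retrieve(question)        # facts come from the database
    return generate(f"Context: {context}\n\nQuestion: {question}")   # reasoning from the model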

This particular point, not losing any IP by replacing the model, got me some pushback from the people I spoke to. Digging into the topic a bit more, I discovered that many people hold misconceptions about the learning ability of a neural network (LLM) model.

When you use an LLM, i.e., asking it a question (prompt), the model does NOT learn from your question. LLMs have no self-learning mechanisms during the deployment phase (inference). Let’s dive a bit deeper into the difference between inference and training.

Inference: When you ask a model a question, you ask it to make a prediction. The input data is fed through the network and its weights (parameters) to compute the output. This sequence is also called a forward pass.

Data scientists freeze the parameters when the neural network is accurate enough, and the same frozen parameters are used for every question during inference.
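To make that concrete, here is a minimal PyTorch-style sketch (a toy stand-in for an LLM, not production code): an inference-only forward pass runs with gradients disabled and leaves every parameter exactly as it was.

# Minimal sketch (PyTorch): inference is a forward pass over frozen parameters.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)        # stand-in for a real LLM checkpoint
model.eval()                    # inference mode

before = [p.clone() for p in model.parameters()]

with torch.no_grad():           # no gradients are computed, so nothing can be learned
    prompt = torch.randn(1, 16)
    prediction = model(prompt)

after = list(model.parameters())
print(all(torch.equal(b, a) for b, a in zip(before, after)))   # True: weights unchanged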

Training: When training a neural network, the first step is called the forward pass, which is the same as inference. The forward pass calculates the model’s prediction for the input data. After the forward pass, the loss is calculated by comparing the prediction to the expected result. The loss is used to calculate a gradient, and the gradient guides the framework to increase or decrease the value of each parameter. The backpropagation pass adjusts the parameters layer by layer to minimize the loss.

The training process repeats the forward pass and backpropagation until the model converges, meaning the loss no longer decreases. Once the model converges, a checkpoint is created, and the parameters are frozen. The Gen-AI application uses this version of the model to generate answers.
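For contrast, a minimal training loop (again a toy sketch) runs the same forward pass, but follows it with a loss calculation, backpropagation, and an optimizer step that actually changes the parameters, and ends by saving a checkpoint and freezing them.

# Minimal sketch (PyTorch): training = forward pass + loss + backpropagation,
# repeated until the loss stops improving, after which the parameters are frozen.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

inputs, targets = torch.randn(64, 16), torch.randn(64, 4)

for step in range(100):
    prediction = model(inputs)             # forward pass (identical to inference)
    loss = loss_fn(prediction, targets)    # compare prediction to the expected result
    optimizer.zero_grad()
    loss.backward()                        # backpropagation: compute gradients
    optimizer.step()                       # adjust parameters to minimize the loss

torch.save(model.state_dict(), "checkpoint.pt")   # freeze: this version serves inference
for p in model.parameters():
    p.requires_grad_(False)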

I believe the misconception about self-learning capabilities stems from thinking about recommender systems (Netflix proposing which series to watch next, or Amazon telling you other customers bought item X along with item Y). A recommender system, however, uses a converged model (frozen weights) together with an online feature store. An online feature store provides real-time access to essential features for generating accurate and personalized recommendations. Amazon, for example, uses an online feature store to store features about its products and users.

  • Product features: These describe the products themselves, such as their price, category, popularity, brand, color, size, rating, and reviews.
  • User features: These describe the users, such as their past purchase history, demographics (age, gender, location), interests, and browsing behavior.

Suppose an Amazon customer has purchased many books in the past. In that case, the recommender system might recommend other books to the user based on collaborative or content-based filtering. With collaborative filtering, the algorithm identifies users with similar tastes to the target user and recommends items that those users have liked. With content-based filtering, the algorithm recommends items similar to items the target user has liked in the past. It may seem like the model is learning while you use it; in reality, it calculates predictions using its frozen parameters and data (features) from the online feature store (a database).
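A toy sketch of that split (the feature names and weights below are invented for illustration): the scoring model stays frozen, and all the apparent learning about you happens in the feature store.

# Sketch: a recommender "learns about you" through the feature store, not the model.
# The scoring weights are frozen; only the stored features change over time.

online_feature_store = {                 # stand-in for a real online feature store
    "user_42": {"books_bought": 37, "avg_rating_given": 4.2},
    "item_sci_fi_novel": {"popularity": 0.91, "category_books": 1.0},
}

FROZEN_WEIGHTS = {"books_bought": 0.02, "popularity": 1.5, "category_books": 0.8}

def score(user_id: str, item_id: str) -> float:
    """Prediction = frozen weights applied to fresh features from the store."""
    features = {**online_feature_store[user_id], **online_feature_store[item_id]}
    return sum(FROZEN_WEIGHTS.get(name, 0.0) * value for name, value in features.items())

print(score("user_42", "item_sci_fi_novel"))   # changes when the features change, not the model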

The context window is another example of why a Gen-AI application like ChatGPT seems to be learning while you use it. Models can appear to “learn” during inference within the same context, but only because that information is still part of the context window. The best analogy for an LLM context window is the working memory of a human. For example, suppose I ask Llama 2, “Who is the CEO of VMware?” and the answer it returns is “Pat Gelsinger.” I respond with, “No, it’s Raghu,” and it replies, “Yes, that is correct, the CEO of VMware is Raghu Raghuram.” If I then ask it, “Who is the CEO of VMware?” it will respond with “Raghu Raghuram,” but only because the same context contains the answer. Google’s Bard supports up to 2048 tokens, and ChatGPT supports up to 4096 tokens. By default, Llama 2 has a context window of 4096 tokens, but Huge Llama 2 supports up to 32K tokens. A token is a unit of text that the model uses to process and generate text. Tokens can be words, subwords, or characters.
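The apparent memory in a chat session is implemented by the application, not the model: each request simply replays the accumulated conversation, truncated to fit the context window. In the sketch below, generate() is a placeholder for whatever inference endpoint you use; the history list is the only thing that remembers your correction.

# Sketch: the chat application is stateful, the model is not.
MAX_CONTEXT_TOKENS = 4096   # e.g. Llama 2's default context window

def generate(prompt: str) -> str:
    """Placeholder for a real model call (e.g. a local Llama 2 endpoint)."""
    return "..."

history: list[str] = []

def ask(question: str) -> str:
    history.append(f"User: {question}")
    prompt = "\n".join(history)
    prompt = prompt[-MAX_CONTEXT_TOKENS * 4:]   # crude truncation, ~4 characters per token
    answer = generate(prompt)
    history.append(f"Assistant: {answer}")
    return answer

ask("Who is the CEO of VMware?")
ask("No, it's Raghu.")              # the correction lives only in `history`
ask("Who is the CEO of VMware?")    # ...which is why the model now appears to "know" it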

Everything comes at a price: the larger the context window, the more memory and compute inference consumes. Outside of the same conversational context, models are completely stateless and cannot permanently “learn” anything in inference mode. What most LLM-as-a-service (LLMaaS) providers do is capture the user prompts, which might contain sensitive data, and use them to form training data sets. On top of that, they use a technique called Reinforcement Learning from Human Feedback (RLHF), which allows the model to be updated more frequently. Chip Huyen published an excellent article describing RLHF in detail while remaining understandable for non-data-scientists.

Why does this distinction about self-learning matter for non-data scientists? By understanding that default foundation models do not alter their state during use, VI admins, architects, and application developers can design infrastructure and applications that offer high availability along with a proper lifecycle management strategy for the Gen-AI application.

For example, the checkpoint of a model can be loaded and exposed by an inference framework running in separate VMs on separate accelerated ESXi hosts, with an anti-affinity rule that ensures the model API endpoints do not share the same physical infrastructure, reducing the blast radius. A load balancer in front of the inference frameworks offers the flexibility to take one inference framework offline during maintenance without any change in behavior. As the model is frozen, no model instance will have learned more than another during its service time.

We are currently building and developing VMware Private AI Foundation to allow VMware customers to deploy LLMs on-prem securely and keep the data and Gen-AI application access secure. Private data should be kept private, and together with using state-of-the-art foundation models, organizations can safely build their Gen-AI applications to support their business goals.

A special thanks goes out to @Steve Liang for making this article more informative.

Filed Under: AI & ML

Gen AI Sessions at Explore Barcelona 2023

November 1, 2023 by frankdenneman

I’m looking forward to next week’s VMware Explore conference in Barcelona. It’s going to be a busy week. Hopefully, I will meet many old friends, make new friends, and talk about Gen AI all week. I’m presenting a few sessions, listed below, and meeting with customers to talk about VMware Private AI Foundation. If you see me walking by, come and have a talk with me.

Hopefully, I will see you at one of the following sessions:

Monday, Nov 6:
Executive Summit
For invite only
Time: 11:00 AM – 1:00 PM CET

For VMware Certified Instructors only: VCI Forum Keynote.
Time: 2:15 PM – 3:00 PM CET
Location: Las Arenas II (2) Hotel Porta Fira, Pl. d’Europa, 45, 08908 L’Hospitalet de Llobregat, Barcelona, Spain (15 min walk from the conference, 3 min by taxi)

Meet the Experts Sessions
Machine Learning Accelerator Deep Dive [CEIM1199BCN]
Time: 4:30 PM – 5:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 4

Tuesday, Nov 7:
Meet the Experts Sessions
Machine Learning Accelerator Deep Dive [CEIM1199BCN]
Time: 11:30 AM – 12:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 1

Wednesday, Nov 8:
AI and ML Accelerator Deep Dive [CEIB1197BCN]
Time: 9:00 AM – 9:45 AM CET
Location: Hall 8.0, Room 31

CTEX: Building a LLM Deployment Architecture: Five Lessons Learned
Speakers: Shawn Kelly & Frank Denneman
Time: 4:00 PM – 5:00 PM CET
Location: Room CC5 501
Register here: https://lnkd.in/eiTbZusU

Recommended AI Sessions:
Monday, Nov 6:
‘Til the Last Drop of GPU: Run Large Language Models Efficiently [VIT2101BCN] (Tutorial)
Speaker: Agustin Malanco – Triple VCDX
Time: 11:00 AM – 12:30 PM CET
Location: Hall 8.0, Room 18

Tuesday, Nov 7:
Using VMware Private AI for Large Language Models and Generative AI on VMware Cloud Foundation and VMware vSphere [CEIB2050BCN]
Speakers: Justin Murray, Shawn Kelly
Time: 10:30 AM – 11:15 AM CET 
Location: Hall 8.0, Room 12

Empowering Business Growth with Generative AI [VIB2368BCN]
Speakers: Robbie Jerrom, Serge Palaric, Shobhit Bhutani
Time: 2:15 PM – 3:00 PM CET 
Location: Hall 8.0, Room 14

Why do I need AI in my Data Center? Does this help me become a differentiator?
Speaker: Gareth Edwards.
Time: Nov 7, 3:30 – 4:40 PM 
Location: CTEX: Room CC5 501

Meet the Experts:
ML/AI and Large Language Models – Implications for VMware Infrastructure [CEIM2282BCN]
Expert: Justin Murray
Time: 12:30 PM – 1:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 2

Wednesday, Nov 8
AI Without GPUs: Run AI/ML Workloads on Intel AMX CPUs with vSphere 8 and Tanzu [MAPB2367BCN]
Speakers: Chris J Gully, Earl Ruby
Time: 12:45 PM – 1:30 PM CET
Location: Hall 8.0, Room 33

Meet the Experts:
ML/AI and Large Language Models – Implications for VMware Infrastructure [CEIM2282BCN]
Expert: Justin Murray
Time: 4:30 PM – 5:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 4

Filed Under: Uncategorized

