UNDERSTANDING ACTIVATION MEMORY IN MIXTURE OF EXPERTS MODELS

In my previous article, The Dynamic World of LLM Runtime Memory, I focused on KV-cache as the primary driver of runtime memory pressure. Today, as inference workloads move toward long-context and agentic execution, activation memory has emerged as an equally important and often overlooked constraint. Long-context inference, once niche, is now expected as models handle tens of thousands of tokens in lengthy prefill phases. Agentic inference introduces variable execution, including reasoning, tool calls, pauses, and uneven token generation. These patterns put sustained pressure on both KV-cache and intermediate activations.
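As a rough illustration of why both terms matter, the back-of-the-envelope sketch below estimates how KV-cache and per-layer hidden states grow with context length. The model dimensions (layer count, KV heads, head size, hidden size) and the FP16/BF16 precision are assumed example values, not figures from the article.

# Back-of-the-envelope sketch (assumed model dimensions, not measurements).

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache grows linearly with sequence length: one K and one V entry per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

def prefill_activation_bytes(seq_len, hidden_dim, bytes_per_elem=2):
    """Very rough lower bound: one hidden-state representation per token
    held in memory while a transformer block is being computed."""
    return seq_len * hidden_dim * bytes_per_elem

# Hypothetical decoder-only configuration (illustrative values)
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, hidden_dim=4096)

for tokens in (8_000, 32_000, 128_000):
    kv = kv_cache_bytes(tokens, cfg["num_layers"], cfg["num_kv_heads"], cfg["head_dim"])
    act = prefill_activation_bytes(tokens, cfg["hidden_dim"])
    print(f"{tokens:>7} tokens: KV-cache ~{kv / 2**30:.1f} GiB, "
          f"hidden states ~{act / 2**30:.2f} GiB per layer step")

Even in this simplified view, the KV-cache dominates at long context lengths, while the intermediate activations add steady pressure during the prefill phase.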

THE DYNAMIC WORLD OF LLM RUNTIME MEMORY

When meeting with customers and architectural teams, we often perform a specific exercise to separate a model’s static consumption (its weights) from its dynamic runtime consumption. In the unpredictable world of production AI, where concurrent users, complex system prompts, and varying RAG content create constant flux, it is easy to view memory as an elusive target. This article is designed to help you move your service from probabilistic to deterministic concurrency. To make this accessible to those managing the hardware, I have intentionally used language common to system administrators rather than data scientists. Instead of focusing on the mathematical constructs of vectors and matrices, we will use the term representations to highlight the actual memory consumption of these data structures.
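To make the static-versus-dynamic split concrete, here is a minimal sketch of that exercise using assumed numbers: the weights give a fixed footprint, the KV-cache gives a per-request footprint, and dividing the remaining memory by the per-request cost yields a deterministic concurrency figure. The GPU capacity, model size, context length, and the 10% overhead reserve are all illustrative assumptions.

# Minimal sketch of the static-versus-dynamic exercise (assumed numbers).

GIB = 2**30

def weight_bytes(num_params_billion, bytes_per_param=2):
    """Static footprint: parameter count times precision (FP16/BF16 = 2 bytes)."""
    return num_params_billion * 1e9 * bytes_per_param

def kv_bytes_per_request(max_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Dynamic footprint per sequence at its maximum context length."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * max_tokens

gpu_memory  = 80 * GIB                                   # assumed accelerator capacity
static      = weight_bytes(8)                            # hypothetical 8B-parameter model in BF16
per_request = kv_bytes_per_request(32_000, num_layers=32, num_kv_heads=8, head_dim=128)

headroom = gpu_memory * 0.90 - static                    # keep ~10% for framework overhead (assumption)
max_concurrent = int(headroom // per_request)
print(f"Static weights: {static / GIB:.1f} GiB, "
      f"KV-cache per request: {per_request / GIB:.2f} GiB, "
      f"deterministic concurrency: {max_concurrent} requests")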

TALKING VCF 9 AND PRIVATE AI FOUNDATION ON THE UNEXPLORED TERRITORY PODCAST

Just before VMware Explore, I joined the Unexplored Territory Podcast to talk about the enhancements in VMware Cloud Foundation 9 and the Private AI Foundation with NVIDIA. We covered new functionality, such as Agent Builder, and walked through the broader enhancements for AI workloads. We also highlighted a few must-attend sessions at Explore. You can listen to the full episode here: Apple Podcasts | Spotify. During Explore, many people told me this episode was a great starting point to wrap their heads around VMware Private AI Foundation. If you’re looking for a concise way to catch up, this is a good place to begin.

WHICH MULTI-GPU CONFIGURATIONS ARE YOU PLANNING TO DEPLOY?

During VMware Explore, numerous conversations highlighted that most customers plan to deploy systems with two or more GPUs. The next challenge is deciding which type of multi-GPU configuration to adopt — a choice that depends on intra-node communication, inter-node interconnects, and cooling strategies. To better understand where organizations are heading, I’ve created a short survey. The diagram below illustrates the options available in the NVIDIA-certified systems portfolio, which I use as a reference point in the questions. Your feedback will help map out how different configurations are being considered and provide valuable input as we align our product strategy with customer needs.

ENHANCED VMOTION FOR VGPU VMS IN VCF 9.0

VMware’s latest release of Cloud Foundation 9.0 introduces an important new feature for managing AI infrastructure: Enhanced vMotion for vGPU VMs. This new feature substantially improves the management of large language models (LLMs) in virtualized environments. For an in-depth technical overview, please read Justin Murray’s detailed article on the subject.

The Power of vMotion

Traditionally, vMotion has been a cornerstone of VMware’s value proposition, enabling two critical benefits: infrastructure maintenance without workload disruption, and maintenance without coordination with workload owners. These capabilities have allowed vAdmins to perform updates and maintenance with minimal impact on running services, a crucial advantage in today’s always-on digital landscape.

BUILDING AN EFFICIENT AI INGESTION PIPELINE: DATA INGESTION STRATEGIES

Traditionally, deploying applications is a straightforward process that moves from development to production. For instance, enterprise apps usually work with databases to perform standard tasks, which makes resource management and maintenance predictable. Generative AI (Gen-AI) applications, however, are more flexible and complex. They need to adapt quickly, since they work with constantly changing data and must handle a wide range of demands. Gen-AI apps, especially those using Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), don’t follow the same linear path as traditional workloads. Instead, they move through a circular, adaptive lifecycle with two main stages: research and production.

HOW TO BUILD AN EFFICIENT AI INGESTION PIPELINE

On-premises AI deployments are becoming increasingly important, but infrastructure administrators and architects often face a steep learning curve due to unfamiliar terminology. While much AI information is tailored to data scientists, there’s a growing need for resources to clarify how these workloads impact infrastructure.

Understanding the RAG Ingestion Pipeline for AI Workloads

When planning for AI, you’ll often hear terms like “embedding models,” “vector embeddings,” and “vector databases,” especially in Retrieval-Augmented Generation (RAG) frameworks that support scalable, responsive AI. But what actually happens when you run a RAG pipeline? How much computing power do you need? How much I/O is involved when processing a 60 MB dataset? These questions, along with key scaling factors, are important for sizing your infrastructure and planning for changes.
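As a rough illustration of what an ingestion run actually does, the sketch below chunks documents, embeds each chunk, and stores the resulting representations. The library choice (sentence-transformers), the chunk size, and the in-memory NumPy stand-in for a vector database are assumptions made for the example, not recommendations from the article.

# Illustrative ingestion sketch: chunk -> embed -> store.
from sentence_transformers import SentenceTransformer
import numpy as np

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with a small overlap."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# Embedding model: every chunk becomes a fixed-length representation.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["...your source documents go here..."]
chunks = [c for doc in documents for c in chunk_text(doc)]

# encode() runs the embedding model; this is where most of the compute and I/O goes.
embeddings = model.encode(chunks)              # shape: (num_chunks, embedding_dim)

# A vector database would index these; a NumPy array stands in for it here.
vector_store = np.asarray(embeddings)
print(f"Ingested {len(chunks)} chunks into a {vector_store.shape} vector store")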

VMWARE PRIVATE AI FOUNDATION - PRIVACY AND SECURITY BEST PRACTICES WHITE PAPER

I’m excited to announce the release of my latest white paper, “VMware Private AI Foundation - Privacy and Security Best Practices.” As many of you know, the world of artificial intelligence is rapidly evolving, and with that comes a new set of challenges, particularly around privacy and security. This white paper is not just about theory. It’s a practical guide introducing the foundational concepts, frameworks, and models underpinning private AI security. It’s a deep dive into the critical aspects of privacy and security in the context of AI, providing you with the tools to implement these principles in your own work. You’ll learn about the principle of shared responsibility, threat modeling for Gen-AI applications, and the CIA triad – confidentiality, integrity, and availability – as a guiding model for information security.

RAG ARCHITECTURE DEEP DIVE

Retrieval Augmented Generation (RAG) is a way to enhance Large Language Models (LLMs) by giving them access to external data. In a typical Gen-AI setup, the LLM answers questions using only what it learned during training. It does not look up new information beyond its training data. RAG changes this by combining retrieval and generation. It uses a retriever to find relevant information from a large text collection, called a corpus, which is stored in a vector database. The generative part, powered by the LLM, then uses this information to create responses.
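To make the retrieve-then-generate flow concrete, here is a minimal sketch that embeds a question, finds the most similar chunks by cosine similarity, and assembles an augmented prompt. It assumes a corpus already embedded into a vector store (for instance by the ingestion sketch earlier); the generate() call at the end is a hypothetical stand-in for whichever LLM endpoint you use.

# Minimal retrieve-then-generate sketch (assumes an existing vector store).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question: str, chunks: list[str], vector_store: np.ndarray, top_k: int = 3) -> list[str]:
    """Embed the question and return the top-k most similar chunks (cosine similarity)."""
    q = model.encode([question])[0]
    sims = vector_store @ q / (np.linalg.norm(vector_store, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Augment the prompt with the retrieved context before generation."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# prompt = build_prompt("What is RAG?", retrieve("What is RAG?", chunks, vector_store))
# answer = generate(prompt)   # hypothetical call to the generative LLM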

THE MISCONCEPTION OF SELF-LEARNING CAPABILITIES OF LARGE LANGUAGE MODELS DURING PRODUCTION

I enjoyed engaging with many customers about bringing Gen-AI to the on-prem data center at VMware Explore. Many customers want to keep their data and IP within the four walls of their organization, and rightly so. With VMware Private AI Foundation, we aim to utilize foundation models, such as Llama 2, StarCoder, and Mistral 7B, and build upon the great work of many smart data scientists. Instead of building and training a large language model (LLM) from the ground up, which can be time-consuming and computationally expensive, organizations can leverage foundation models pre-trained on a massive dataset of text and code. If necessary, organizations can further fine-tune a foundation model on specific tasks and data in a short period of time.