Architecting AI Infrastructure

The Architecting AI Infrastructure series examines how AI workloads behave across different systems and how that behavior shapes platform design choices. The articles break down the fundamentals of placement, scheduling, and resource use, and the accompanying tools show how these ideas play out in practice. Tools in this series include the vGPU Silo Capacity Calculator (how profile catalogs influence long-term deployable capacity under placement limits) and Same-size vs Mixed-size Placement (how Same-size and Mixed-size modes behave as placement decisions accumulate over time).

Series overview
  1. Architecting AI Infrastructure - Part 1 In earlier articles, I looked at how modern AI models use GPU resources. I covered dynamic memory consumption, activation patterns, and how designs like mixture-of-experts …
  2. Architecting AI Infrastructure - Part 2 The previous article covered GPU placement as part of the platform’s lifecycle, not just a scheduling step. These choices affect what the platform can handle as workloads evolve. …
  3. Architecting AI Infrastructure - Part 3 In the first two articles, I looked at GPU consumption models and how AI workloads state their accelerator needs. In vSphere, these models take shape through virtual machine …
  4. Architecting AI Infrastructure - Part 4 In the last article, we tracked a GPU-backed VM from resource configuration to host selection. DRS evaluated the cluster, Assignable Hardware filtered hosts for GPU compatibility, …
  5. Architecting AI Infrastructure - Part 5 In the previous article, we looked at how GPUs are placed within an ESXi host and how GPU modes and assignment policies determine which physical GPU a workload uses. These …
  6. Architecting AI Infrastructure - Part 6 Last time, I looked at how Same Size vGPU mode works with different assignment policies and how right-sizing profiles can make placement more flexible. The main point was that both …
  7. How Same-size and Mixed-size vGPU placement behavior evolves at cluster scale and how profile strategy influences deployable capacity over time.
  8. A deep dive into MIG partitioning, placement geometry, and stranded capacity in GPU infrastructure for AI workloads.
  9. Why distributed inference turns GPU communication into part of the critical path, and why topology-aware scheduling is required when models span multiple GPUs.