How Same Size VGPU Mode and Right-Sizing Shape GPU Placement Efficiency

Architecting AI Infrastructure - Part 5

In the previous article, we looked at how workloads are placed on the GPUs within an ESXi host and how GPU modes and assignment policies determine which physical GPU a workload uses. These decisions impact more than just the initial placement of workloads. They also shape how GPU capacity evolves over time, affecting fragmentation, consolidation, and how easily new workloads can be scheduled. In this article, we will look at workloads that use fractional vGPU profiles and how their sizing choices impact overall platform efficiency.

AI platforms typically run a wide range of workloads. Alongside large language models that consume full GPUs, span several GPUs, or even span multiple hosts, most setups also include smaller but important services. Embedding models support retrieval pipelines, small language models handle tasks like moderation and security analysis, and data science notebooks support experimentation and model development. As mentioned in the earlier article about same-size and mixed-size deployments, these workloads often use fractional vGPU profiles. To get the most out of the platform, you need to choose the right vGPU profile and vGPU mode. It can be tempting to size profiles as closely as possible to actual memory use, but sizing too tightly can strand GPU capacity and lower overall efficiency.

Slim-fitting vGPU Profiles to Workload Memory Footprint

The following examples show a representative selection of common AI workloads, their typical GPU memory consumption, and the vGPU profile each would receive when sized as closely as possible to its observed memory footprint. In contrast to right-sizing, which we will get to later, let's call this approach slim-fitting.

| Workload Type | Model | Static Memory | Dynamic Memory | Total GPU Memory | Slim-Fit Profile |
| --- | --- | --- | --- | --- | --- |
| Embedding | BGE-M3 (560M) | ~1.1GB (FP16) | 2–3GB (activations + batch) | 3–4GB | 4GB |
| Embedding | Qwen3-Embedding-8B | ~16GB (FP16) | 4–6GB (larger batches + activations) | 20–22GB | 20GB (lean) or 40GB (production) |
| Small LLM | GLM-4.7-Flash (30B MoE) | 12–15GB (Q3–INT4) | 3–5GB (KV cache, low concurrency) | 18–20GB | 20GB |
| Small LLM | GPT-OSS-20B (MXFP4) | ~10GB (MXFP4 weights) | 3–5GB (KV cache + activations) | 13–15GB | 16GB |
| Full-slot LLM | GPT-OSS-120B (heuristic/security scanning) | Full model footprint | Minimal concurrency, low context | ~80GB | 80GB |
| Notebook / Data Science | Jupyter workloads | Variable | Variable | Typically <20GB | 10GB |
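To make the slim-fit approach concrete, the sketch below picks the smallest catalog profile that covers an estimated footprint of static weights plus dynamic headroom. The profile catalog, the memory estimates, and the `slim_fit_profile` helper are illustrative assumptions, not a sizing tool.

```python
# A minimal sketch of slim-fitting: pick the smallest catalog profile that
# covers an estimated footprint (static weights plus dynamic headroom).
# The catalog, memory estimates, and helper are illustrative assumptions.
PROFILE_CATALOG_GB = [4, 10, 16, 20, 40, 80]

def slim_fit_profile(static_gb: float, dynamic_gb: float) -> int:
    """Return the smallest vGPU profile size that fits the estimated footprint."""
    total = static_gb + dynamic_gb
    for profile in sorted(PROFILE_CATALOG_GB):
        if profile >= total:
            return profile
    raise ValueError(f"No single-GPU profile fits {total:.1f} GB")

# BGE-M3: ~1.1GB of FP16 weights plus roughly 2.5GB of activations and batch state
print(slim_fit_profile(1.1, 2.5))    # -> 4
# GPT-OSS-20B: ~10GB of MXFP4 weights plus ~5GB of KV cache and activations
print(slim_fit_profile(10.0, 5.0))   # -> 16
```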

Theoretical Slot Capacity

Before we look at placement behavior, it's useful to understand the cluster's theoretical capacity. With six 80GB GPUs, the number of possible slots depends on the vGPU profile size. The table below shows the maximum slot counts when every GPU in the cluster carries the same profile type. In real setups that mix profile sizes, available slots are much harder to predict because they depend on how profiles end up distributed across the GPUs.

| Profile Size | Slots per 80GB GPU | Total Slots (6 GPUs) |
| --- | --- | --- |
| 4GB | 20 | 120 |
| 10GB | 8 | 48 |
| 16GB | 5 | 30 |
| 20GB | 4 | 24 |
| 40GB | 2 | 12 |
| 80GB | 1 | 6 |
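The slot counts above follow from simple integer division of the 80GB framebuffer by the profile size. The short sketch below reproduces them, assuming no framebuffer is reserved for overhead.

```python
# Reproduce the theoretical slot-capacity table: slots per GPU is the
# framebuffer divided by the profile size (ignoring reserved overhead).
GPU_MEMORY_GB = 80
NUM_GPUS = 6

for profile_gb in [4, 10, 16, 20, 40, 80]:
    slots_per_gpu = GPU_MEMORY_GB // profile_gb
    print(f"{profile_gb:>2}GB profile: {slots_per_gpu:>2} slots per GPU, "
          f"{slots_per_gpu * NUM_GPUS:>3} slots across {NUM_GPUS} GPUs")
```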

These numbers show the highest possible density. In reality, factors like vGPU mode, assignment policy, profile variety, and how workloads start will decide how much of this capacity you can actually use over time.

Choosing the vGPU Mode

The slim-fit exercise points to a catalog of six profiles: 4GB, 10GB, 16GB, 20GB, 40GB, and 80GB. But defining the catalog is just the first step. Next, you need to decide how these profiles can share physical GPUs, which depends on the chosen vGPU mode and assignment policy. To see how these choices affect real-world behavior, let's look at practical examples of how the GPU modes behave when workloads start up.

Before we go through the distribution example, it’s important to know the default behavior. By default, DRS uses Best Performance mode, which evenly distributes workloads across hosts rather than consolidating them. There is an advanced setting to favor consolidation, but it mainly affects full-GPU and multi-GPU setups; fractional vGPU profiles are not considered during initial placement. The example below uses an empty cluster to make the behavior clear. In real environments with running workloads, host selection can change based on cluster balance and VM scoring, which can affect where new workloads are placed.

Same Size vGPU Mode – Best Performance Assignment Policy

With Same Size mode and the Best Performance assignment policy, vSphere spreads workloads across available GPUs to balance load. In a self-service environment where startup order is unpredictable, smaller workloads often land first, causing GPUs to lock to different profile sizes early in the process.

In this example, embedding models, notebooks, and small LLMs start before the larger 40GB and 80GB workloads. Because placement spreads these workloads across hosts and GPUs rather than grouping identical profiles together, GPUs quickly lock into 4GB, 10GB, 16GB, and 20GB profiles throughout the cluster. When the larger workloads try to start, there are no GPUs left that can take a 40GB or 80GB profile, even though there is still enough total memory in the environment.

Once the first workloads start, each GPU becomes locked to a specific profile size. At this stage, the cluster stops acting as a shared pool of GPU memory. Now, capacity is only available within the profile domains set during startup.

| Profile Size | Available Slots in Cluster |
| --- | --- |
| 4GB | 38 |
| 10GB | 14 |
| 16GB | 4 |
| 20GB | 3 |
| 40GB | 0 |
| 80GB | 0 |
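The locking rule behind these numbers can be captured in a few lines. The sketch below is a simplified model, not the actual vGPU scheduler logic: it assumes an 80GB framebuffer and expresses only the Same Size constraint that an unlocked GPU can accept any profile, while a locked GPU only offers additional slots of its own size.

```python
# Minimal sketch of the Same Size locking rule: once a GPU carries a fractional
# profile, only additional slots of that exact size fit on it. The 80GB
# framebuffer and the helper below are assumptions for illustration.
GPU_MEMORY_GB = 80

def available_slots(locked_profile_gb, used_slots, requested_profile_gb):
    """Slots a single GPU can still offer for the requested profile size."""
    if locked_profile_gb is None:                   # untouched GPU: any profile fits
        return GPU_MEMORY_GB // requested_profile_gb
    if locked_profile_gb != requested_profile_gb:   # locked to another size: nothing fits
        return 0
    return GPU_MEMORY_GB // locked_profile_gb - used_slots

print(available_slots(None, 0, 40))   # 2  -> an unlocked GPU can still host 40GB profiles
print(available_slots(4, 1, 4))       # 19 -> a 4GB-locked GPU keeps serving 4GB slots
print(available_slots(4, 1, 40))      # 0  -> but it can no longer host a 40GB profile
```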

Same Size vGPU Mode – Consolidation Assignment Policy

Consolidation does not affect which host is chosen; it only changes how GPUs within the selected host are used.

If you use the same workload mix with Same Size mode but switch to the Consolidation assignment policy, the outcome changes a lot. Host selection still uses round-robin, but within each host, workloads are grouped onto compatible GPUs instead of being spread out. This means fewer GPUs get locked early, leaving some GPUs free for larger profiles later on.

As a result, the 40GB workload can use an unlocked GPU, and a full-slot 80GB workload can still be placed because at least one GPU is left completely free. The total resources in the cluster stay the same, but the assignment strategy decides whether capacity stays usable or becomes fragmented.

The earlier placement failure was not caused by a lack of GPU memory, but by profile fragmentation introduced by the random startup order.

Available Slot Capacity After Consolidation

The consolidation example shows that placement policy does not increase total GPU capacity, but it does affect how much of that capacity you can actually use.

| Profile Size | Spread (Best Performance) | Group (Consolidation) |
| --- | --- | --- |
| 4GB | 38 | 18 |
| 10GB | 14 | 6 |
| 16GB | 4 | 3 |
| 20GB | 3 | 2 |
| 40GB | 0 | 1 |
| 80GB | 0 | 0 |

In Group mode, an empty GPU is available for any profile because it is not locked yet.

The main difference from the spread example is not total memory, but flexibility. Consolidation leaves fewer profile domains active, so larger profiles still have valid placement options.
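One way to see this flexibility difference is to run the same hypothetical startup order through both policies. The sketch below reuses the locking rule from earlier and stands in for the real assignment logic with two simple heuristics: spread prefers the GPU with the most remaining slots, while consolidation prefers GPUs already locked to the requested profile and touches unlocked GPUs last. Host selection is ignored entirely, so treat the output as an illustration of the locking effect, not as actual scheduler behavior.

```python
# Simplified comparison of the spread and consolidation assignment policies
# under Same Size locking. Heuristics, GPU count, and startup order are
# illustrative assumptions; host selection is not modeled.
GPU_MEMORY_GB = 80

def free_slots(gpu, profile):
    locked, used = gpu
    if locked is None:
        return GPU_MEMORY_GB // profile
    return (GPU_MEMORY_GB // profile - used) if locked == profile else 0

def place(gpus, profile, consolidate):
    """Place one workload; return True if any GPU can accept the profile."""
    eligible = [i for i, g in enumerate(gpus) if free_slots(g, profile) > 0]
    if not eligible:
        return False
    if consolidate:
        # Group: prefer GPUs already locked to this profile, touch unlocked ones last.
        eligible.sort(key=lambda i: gpus[i][0] is None)
    else:
        # Spread: prefer the GPU with the most remaining slots (unlocked GPUs win).
        eligible.sort(key=lambda i: -free_slots(gpus[i], profile))
    chosen = eligible[0]
    gpus[chosen] = (profile, gpus[chosen][1] + 1)   # first placement locks the GPU
    return True

startup_order = [4, 10, 16, 20, 4, 10, 40, 80]      # small workloads arrive first
for consolidate in (False, True):
    gpus = [(None, 0)] * 6                          # six 80GB GPUs, all unlocked
    placed = [place(gpus, p, consolidate) for p in startup_order]
    failed = [p for p, ok in zip(startup_order, placed) if not ok]
    policy = "consolidation" if consolidate else "spread"
    print(f"{policy}: could not place {failed or 'nothing'}")
```

With spread, the 40GB and 80GB requests fail even though plenty of memory remains; with consolidation, every request in the same order finds a compatible GPU.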

Re-evaluating Right-sizing

Right-sizing is therefore not only a workload decision but also a platform decision. The goal is not simply to match a workload's minimum memory requirement (slim-fit), but to choose profile sizes that improve the opportunities for cluster-wide capacity allocation (right-size). The table below summarizes the slim-fit results: how many workloads of each profile size were placed and how many slots remained available under each assignment policy.

| Profile Size | Spread – Placed | Spread – Available | Group – Placed | Group – Available |
| --- | --- | --- | --- | --- |
| 4GB | 2 | 38 | 2 | 18 |
| 10GB | 3 | 13 | 3 | 5 |
| 16GB | 1 | 4 | 1 | 4 |
| 20GB | 2 | 2 | 2 | 2 |
| 40GB | 0 | 0 | 1 | 1 |
| 80GB | 0 | 0 | 1 | 0 |

Right-sizing GPUs = workload fit + cluster efficiency
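Before re-running the experiment, it helps to spell out one possible right-sizing rule. The sketch below assumes a reduced catalog of 10GB, 20GB, 40GB, and 80GB, matching the profiles used in the scenarios that follow, and simply rounds each slim-fit size up to the nearest catalog entry; the catalog and the rounding rule are assumptions for illustration, not the only way to right-size.

```python
# One plausible right-sizing rule: collapse the six slim-fit sizes onto a
# reduced catalog so fewer profile domains can ever get locked.
# The reduced catalog and the round-up rule are illustrative assumptions.
RIGHT_SIZED_CATALOG_GB = [10, 20, 40, 80]

def right_size(slim_fit_gb: int) -> int:
    """Round a slim-fit profile up to the nearest profile in the reduced catalog."""
    return min(p for p in RIGHT_SIZED_CATALOG_GB if p >= slim_fit_gb)

for slim in [4, 10, 16, 20, 40, 80]:
    print(f"{slim:>2}GB slim-fit -> {right_size(slim):>2}GB right-sized")
```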

Same Size vGPU Mode – Best Performance (Right-sized Profiles)

In the right-sized spread scenario, the cluster converges toward only two active profile sizes: 10GB and 20GB. While this reduces profile diversity compared to the slim-fit experiment, the spreading behavior still distributes workloads across hosts and GPUs, leading to early profile locking. As a result, only a subset of the workload types can be placed, and the larger profiles never become active. The table below shows how the cluster ends up operating with just two profile domains despite having sufficient raw GPU capacity.

| Profile Size | Placed | Available |
| --- | --- | --- |
| 10GB | 5 | 27 |
| 20GB | 3 | 5 |
| 40GB | 0 | 0 |
| 80GB | 0 | 0 |

Same Size vGPU Mode – Consolidation (Right-sized Profiles)

With consolidation enabled, the same workload mix produces a very different outcome. All four profile sizes become active, meaning every workload type in the catalog is successfully deployed. Because compatible workloads are grouped onto GPUs that are already in use, one GPU remains completely uncommitted. This fully open device represents flexible capacity that can still accept any profile size, including large or full-slot workloads. The table below highlights how consolidation preserves both workload coverage and future placement flexibility.

| Profile Size | Spread – Placed | Spread – Available | Group – Placed | Group – Available |
| --- | --- | --- | --- | --- |
| 10GB | 5 | 27 | 5 | 19 |
| 20GB | 3 | 5 | 3 | 5 |
| 40GB | 0 | 0 | 1 | 3 |
| 80GB | 0 | 0 | 1 | 1 |

Consolidation increases the benefits of right-sizing because having fewer profiles means there are fewer lock boundaries.

Observation

Right-sizing profiles does not add more total GPU capacity. Instead, it helps make better use of what you have. With fewer profile sizes, consolidation works better, fragmentation goes down, and the cluster acts more like a shared resource pool instead of a set of isolated profile domains.

Looking Ahead: Mixed Size Mode

This article focused on Same Size mode, where GPUs lock to a profile after the first workload is placed. In this setup, right-sizing helps efficiency by reducing the number of profile domains and keeping placement flexible. Mixed Size mode changes things completely. Profile locking goes away, but a new challenge appears: placement IDs and how profile sizes fit into GPU memory ranges. In the next part of this series, we will repeat the experiment using Mixed Size mode to show how placement changes, why fragmentation still happens, and why right-sizing still matters even without GPU locking.

Same Size mode fragments GPUs across different profiles, while Mixed Size mode can cause fragmentation within placement ranges.