UNDERSTANDING ACTIVATION MEMORY IN MIXTURE OF EXPERTS MODELS

In my previous article, The Dynamic World of LLM Runtime Memory, I focused on KV-cache as the primary driver of runtime memory pressure. Today, as inference workloads move toward long-context and agentic execution, activation memory has emerged as an equally important and often overlooked constraint. Long-context inference, once niche, is now expected as models handle tens of thousands of tokens in lengthy prefill phases. Agentic inference introduces variable execution, including reasoning, tool calls, pauses, and uneven token generation. These patterns put sustained pressure on both KV-cache and intermediate activations.
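As a rough illustration of why both terms matter, the back-of-the-envelope sketch below estimates how KV-cache and per-layer hidden states grow with context length. The model dimensions (layer count, KV heads, head size, hidden size) and the FP16/BF16 precision are assumed example values, not figures from the article.

# Back-of-the-envelope sketch (assumed model dimensions, not measurements).

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache grows linearly with sequence length: one K and one V entry per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

def prefill_activation_bytes(seq_len, hidden_dim, bytes_per_elem=2):
    """Very rough lower bound: one hidden-state representation per token
    held in memory while a transformer block is being computed."""
    return seq_len * hidden_dim * bytes_per_elem

# Hypothetical decoder-only configuration (illustrative values)
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, hidden_dim=4096)

for tokens in (8_000, 32_000, 128_000):
    kv = kv_cache_bytes(tokens, cfg["num_layers"], cfg["num_kv_heads"], cfg["head_dim"])
    act = prefill_activation_bytes(tokens, cfg["hidden_dim"])
    print(f"{tokens:>7} tokens: KV-cache ~{kv / 2**30:.1f} GiB, "
          f"hidden states ~{act / 2**30:.2f} GiB per layer step")

Even in this simplified view, the KV-cache dominates at long context lengths, while the intermediate activations add steady pressure during the prefill phase.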

THE DYNAMIC WORLD OF LLM RUNTIME MEMORY

When meeting with customers and architectural teams, we often perform a specific exercise to separate a model’s static consumption (its weights) from its dynamic runtime consumption. In the unpredictable world of production AI, where concurrent users, complex system prompts, and varying RAG content create constant flux, it is easy to view memory as an elusive target. This article is designed to help you move your service from probabilistic to deterministic concurrency. To make this accessible to those managing the hardware, I have intentionally used language common to system administrators rather than data scientists. Instead of focusing on the mathematical constructs of vectors and matrices, we will use the term representations to highlight the actual memory consumption of these data structures.
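To make the static-versus-dynamic split concrete, here is a minimal sketch of that exercise using assumed numbers: the weights give a fixed footprint, the KV-cache gives a per-request footprint, and dividing the remaining memory by the per-request cost yields a deterministic concurrency figure. The GPU capacity, model size, context length, and the 10% overhead reserve are all illustrative assumptions.

# Minimal sketch of the static-versus-dynamic exercise (assumed numbers).

GIB = 2**30

def weight_bytes(num_params_billion, bytes_per_param=2):
    """Static footprint: parameter count times precision (FP16/BF16 = 2 bytes)."""
    return num_params_billion * 1e9 * bytes_per_param

def kv_bytes_per_request(max_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Dynamic footprint per sequence at its maximum context length."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * max_tokens

gpu_memory  = 80 * GIB                                   # assumed accelerator capacity
static      = weight_bytes(8)                            # hypothetical 8B-parameter model in BF16
per_request = kv_bytes_per_request(32_000, num_layers=32, num_kv_heads=8, head_dim=128)

headroom = gpu_memory * 0.90 - static                    # keep ~10% for framework overhead (assumption)
max_concurrent = int(headroom // per_request)
print(f"Static weights: {static / GIB:.1f} GiB, "
      f"KV-cache per request: {per_request / GIB:.2f} GiB, "
      f"deterministic concurrency: {max_concurrent} requests")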

TALKING VCF 9 AND PRIVATE AI FOUNDATION ON THE UNEXPLORED TERRITORY PODCAST

Just before VMware Explore, I joined the Unexplored Territory Podcast to talk about the enhancements in VMware Cloud Foundation 9 and the Private AI Foundation with NVIDIA. We covered new functionality, such as Agent Builder, and walked through the broader enhancements for AI workloads. We also highlighted a few must-attend sessions at Explore. You can listen to the full episode here: Apple Podcasts | Spotify. During Explore, many people told me this episode was a great starting point to wrap their heads around VMware Private AI Foundation. If you’re looking for a concise way to catch up, this is a good place to begin.

WHICH MULTI-GPU CONFIGURATIONS ARE YOU PLANNING TO DEPLOY?

During VMware Explore, numerous conversations highlighted that most customers plan to deploy systems with two or more GPUs. The next challenge is deciding which type of multi-GPU configuration to adopt — a choice that depends on intra-node communication, inter-node interconnects, and cooling strategies. To better understand where organizations are heading, I’ve created a short survey. The diagram below illustrates the options available in the NVIDIA-certified systems portfolio, which I use as a reference point in the questions. Your feedback will help map out how different configurations are being considered and provide valuable input as we align our product strategy with customer needs.

ENHANCED VMOTION FOR VGPU VMS IN VCF 9.0

VMware’s latest release of Cloud Foundation 9.0 introduces an important new feature for managing AI infrastructure: Enhanced vMotion for vGPU VMs. This new feature substantially improves the management of large language models (LLMs) in virtualized environments. For an in-depth technical overview, please read Justin Murray’s detailed article on the subject.

The Power of vMotion

Traditionally, vMotion has been a cornerstone of VMware’s value proposition, enabling two critical benefits: infrastructure maintenance without workload disruption, and maintenance without coordination with workload owners. These capabilities have allowed vAdmins to perform updates and maintenance with minimal impact on running services, a crucial advantage in today’s always-on digital landscape.

BUILDING AN EFFICIENT AI INGESTION PIPELINE: DATA INGESTION STRATEGIES

Traditionally, deploying applications is a straightforward process that moves from development to production. For instance, enterprise apps usually work with databases to perform standard tasks, which makes resource management and maintenance predictable. Generative AI (Gen-AI) applications, however, are more flexible and complex. They need to adapt quickly, since they work with constantly changing data and must handle a wide range of demands. Gen-AI apps, especially those using Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), don’t follow the same linear path as traditional workloads. Instead, they move through a circular, adaptive lifecycle with two main stages: research and production.

HOW TO BUILD AN EFFICIENT AI INGESTION PIPELINE

On-premises AI deployments are becoming increasingly important, but infrastructure administrators and architects often face a steep learning curve due to unfamiliar terminology. While much AI information is tailored to data scientists, there’s a growing need for resources to clarify how these workloads impact infrastructure.

Understanding the RAG Ingestion Pipeline for AI Workloads

When planning for AI, you’ll often hear terms like “embedding models,” “vector embeddings,” and “vector databases,” especially in Retrieval-Augmented Generation (RAG) frameworks that support scalable, responsive AI. But what actually happens when you run a RAG pipeline? How much computing power do you need? How much I/O is involved when processing a 60 MB dataset? These questions, along with key scaling factors, are important for sizing your infrastructure and planning for changes.
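As a rough illustration of what an ingestion run actually does, the sketch below chunks documents, embeds each chunk, and stores the resulting representations. The library choice (sentence-transformers), the chunk size, and the in-memory NumPy stand-in for a vector database are assumptions made for the example, not recommendations from the article.

# Illustrative ingestion sketch: chunk -> embed -> store.
from sentence_transformers import SentenceTransformer
import numpy as np

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with a small overlap."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# Embedding model: every chunk becomes a fixed-length representation.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["...your source documents go here..."]
chunks = [c for doc in documents for c in chunk_text(doc)]

# encode() runs the embedding model; this is where most of the compute and I/O goes.
embeddings = model.encode(chunks)              # shape: (num_chunks, embedding_dim)

# A vector database would index these; a NumPy array stands in for it here.
vector_store = np.asarray(embeddings)
print(f"Ingested {len(chunks)} chunks into a {vector_store.shape} vector store")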

VMWARE PRIVATE AI FOUNDATION - PRIVACY AND SECURITY BEST PRACTICES WHITE PAPER

I’m excited to announce the release of my latest white paper, “VMware Private AI Foundation - Privacy and Security Best Practices.” As many of you know, the world of artificial intelligence is rapidly evolving, and with that comes a new set of challenges, particularly around privacy and security. This white paper is not just about theory. It’s a practical guide introducing the foundational concepts, frameworks, and models underpinning private AI security. It’s a deep dive into the critical aspects of privacy and security in the context of AI, providing you with the tools to implement these principles in your own work. You’ll learn about the principle of shared responsibility, threat modeling for Gen-AI applications, and the CIA triad – confidentiality, integrity, and availability – as a guiding model for information security.

RAG ARCHITECTURE DEEP DIVE

Retrieval Augmented Generation (RAG) is a way to enhance Large Language Models (LLMs) by giving them access to external data. In a typical Gen-AI setup, the LLM answers questions using only what it learned during training. It does not look up new information beyond its training data. RAG changes this by combining retrieval and generation. It uses a retriever to find relevant information from a large text collection, called a corpus, which is stored in a vector database. The generative part, powered by the LLM, then uses this information to create responses.
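To make the retrieve-then-generate flow concrete, here is a minimal sketch that embeds a question, finds the most similar chunks by cosine similarity, and assembles an augmented prompt. It assumes a corpus already embedded into a vector store (for instance by the ingestion sketch earlier); the generate() call at the end is a hypothetical stand-in for whichever LLM endpoint you use.

# Minimal retrieve-then-generate sketch (assumes an existing vector store).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question: str, chunks: list[str], vector_store: np.ndarray, top_k: int = 3) -> list[str]:
    """Embed the question and return the top-k most similar chunks (cosine similarity)."""
    q = model.encode([question])[0]
    sims = vector_store @ q / (np.linalg.norm(vector_store, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Augment the prompt with the retrieved context before generation."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# prompt = build_prompt("What is RAG?", retrieve("What is RAG?", chunks, vector_store))
# answer = generate(prompt)   # hypothetical call to the generative LLM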

THE MISCONCEPTION OF SELF-LEARNING CAPABILITIES OF LARGE LANGUAGE MODELS DURING PRODUCTION

I enjoyed engaging with many customers about bringing Gen-AI to the on-prem data center at VMware Explore. Many customers want to keep their data and IP within the four walls of their organization, and rightly so. With VMware Private AI Foundation, we aim to utilize foundation models, such as Llama 2, StarCoder, and Mistral 7B, and build upon the great work of many smart data scientists. Instead of building and training a large language model (LLM) from the ground up, which can be time-consuming and computationally expensive, organizations can leverage foundation models pre-trained on a massive dataset of text and code. If necessary, organizations can further fine-tune a foundation model on specific tasks and data in a short period of time.