Understanding AI Memory

The Understanding AI Memory series examines how memory behaves while AI models run and how this affects platform design choices. The articles break down the main runtime memory components that shape capacity planning, concurrency, and infrastructure design.

Series overview

3 parts • Latest: Part 3 - Durable Agentic AI Sessions in GPU Memory

  1. Part 1 - The Dynamic World of LLM Runtime Memory

    Explains how KV cache and context length drive LLM runtime memory growth and how this determines predictable GPU concurrency during inference workloads.

  2. Part 2 - Understanding Activation Memory in Mixture of Experts Models

    Explains how activation memory behaves in Mixture of Experts models and why long-context and agentic inference introduce unpredictable activation peaks during prefill phases.

  3. Part 3 - Durable Agentic AI Sessions in GPU Memory

    Explains how agentic AI workloads accumulate KV cache across reasoning steps and tool calls, and why this changes GPU memory planning for on-premises infrastructure.
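As a quick intuition for the KV-cache growth that Part 1 explores, here is a back-of-the-envelope sketch of per-sequence cache size. The layer count, head count, head dimension, and fp16 element size below are illustrative assumptions (roughly a 7B-class dense transformer), not figures from the series:

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 32,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size in bytes for one sequence.

    Each token stores a key and a value vector (hence the factor of 2)
    of num_kv_heads * head_dim elements, per layer.
    Assumed defaults approximate a 7B-class dense model in fp16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len


per_token = kv_cache_bytes(1)      # 524288 bytes, i.e. 0.5 MiB per token
full_ctx = kv_cache_bytes(4096)    # 2147483648 bytes, i.e. 2 GiB per sequence
print(per_token, full_ctx)
```

Because the cache scales linearly with context length and with the number of concurrent sequences, a handful of long-context sessions can consume more GPU memory than the model weights themselves, which is why the series treats runtime memory as a first-class capacity-planning input.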